# Backend Production Readiness & Code Quality Improvements

## Overview

Implement critical production-ready features, refactor architectural issues, and improve code quality for the Tercul backend. The codebase uses Go 1.25, follows DDD/CQRS patterns, GraphQL API, and clean architecture principles.

## Critical Issues to Resolve

### 1. Implement Full-Text Search Service (P0 - Critical)

**Problem**: The search service in `internal/app/search/service.go` is a stub that returns empty results. This is a core feature that users depend on.

**Current State**:

- `Search()` method returns empty results (line 31-39)
- `IndexWork()` is partially implemented but search logic missing
- Weaviate client exists but not utilized for search
- Search filters are defined but not applied

**Affected Files**:

- `internal/app/search/service.go` - Main search service (stub implementation)
- `internal/platform/search/weaviate_wrapper.go` - Weaviate client wrapper
- `internal/domain/search/search.go` - Search domain interfaces
- GraphQL resolvers that use search service

**Solution**:

1. Implement full Weaviate search query in `Search()` method:
   - Query Weaviate for works, translations, and authors
   - Apply search filters (language, type, date range, tags, authors)
   - Support multi-language search (Russian, English, Tatar)
   - Implement relevance ranking
   - Add pagination support
   - Handle special characters and diacritics
2. Enhance indexing:
   - Index work titles, content, and metadata
   - Index translation content with language tags
   - Index author names and biographies
   - Add incremental indexing on create/update operations
   - Create background job for bulk indexing existing content
3. Add search result transformation:
   - Map Weaviate results to domain entities
   - Include relevance scores
   - Handle empty results gracefully
   - Add search analytics/metrics

**Acceptance Criteria**:

- Search returns relevant results ranked by relevance
- Supports filtering by language, category, tags, authors, date ranges
- Search response time < 200ms for 95th percentile
- Handles multi-language queries correctly
- All existing tests pass
- Integration tests with real Weaviate instance

### 2. Refactor Global Configuration Singleton (P1 - High Priority)

**Problem**: The application uses a global singleton `config.Cfg` which violates dependency injection principles and makes testing difficult.

**Current State**:

- `internal/platform/config/config.go` has global `var Cfg *Config`
- `config.Cfg` is accessed directly in multiple places:
  - `internal/platform/search/bleve_client.go` (line 13)
  - Various other packages

**Affected Files**:

- `internal/platform/config/config.go` - Global config singleton
- `internal/platform/search/bleve_client.go` - Uses `config.Cfg`
- `cmd/api/main.go` - Loads config but also sets global
- `cmd/worker/main.go` - Similar pattern
- Any other files accessing `config.Cfg` directly

**Solution**:

1. Remove global `Cfg` variable from config package
2. Refactor `LoadConfig()` to return config without setting global
3. Pass `*config.Config` as dependency to all constructors:
   - Update `NewBleveClient()` to accept config parameter
   - Update all repository constructors to accept config
   - Update application service constructors
   - Update platform service constructors
4. Update main entry points:
   - `cmd/api/main.go` - Pass config to all dependencies
   - `cmd/worker/main.go` - Pass config to all dependencies
   - `cmd/tools/enrich/main.go` - Pass config to dependencies
5. Make configuration more flexible:
   - Make migration path configurable (currently hardcoded)
   - Make metrics server port configurable
   - Add validation for required config values
   - Add config struct tags for better documentation

**Acceptance Criteria**:

- No global `config.Cfg` usage anywhere in codebase
- All dependencies receive config via constructor injection
- Tests can easily mock/inject different configs
- Configuration validation on startup
- Backward compatible (same environment variables work)

### 3. Enhance Observability: Distributed Tracing (P0 - Critical)

**Problem**: Tracing is implemented but only exports to stdout. Need production-ready tracing with OTLP exporter and proper instrumentation.

**Current State**:

- `internal/observability/tracing.go` uses `stdouttrace` exporter
- Basic tracer provider exists but not production-ready
- Missing instrumentation in many places

**Affected Files**:

- `internal/observability/tracing.go` - Only stdout exporter
- HTTP middleware - May need tracing instrumentation
- GraphQL resolvers - Need span creation
- Database queries - Need query tracing
- Application services - Need business logic spans

**Solution**:

1. Replace stdout exporter with OTLP exporter:
   - Add OTLP exporter configuration
   - Support both gRPC and HTTP OTLP endpoints
   - Add environment-based configuration (dev vs prod)
   - Add trace sampling strategy (100% dev, 10% prod)
2. Enhance instrumentation:
   - Add automatic HTTP request tracing in middleware
   - Instrument all GraphQL resolvers with spans
   - Add database query spans via GORM callbacks
   - Create custom spans for slow operations (>100ms)
   - Add span attributes (user_id, work_id, etc.)
3. Add trace context propagation:
   - Ensure trace IDs propagate through all layers
   - Add trace ID to structured logs
   - Support distributed tracing across services
4. Configuration:

   ```go
   type TracingConfig struct {
       Enabled      bool
       ServiceName  string
       OTLPEndpoint string
       SamplingRate float64
       Environment  string
   }
   ```

**Acceptance Criteria**:

- Traces exported to OTLP collector (Jaeger/Tempo compatible)
- All HTTP requests have spans
- All GraphQL resolvers traced
- Database queries have spans
- Trace IDs in logs
- Sampling configurable per environment

### 4. Enhance Observability: Prometheus Metrics (P0 - Critical)

**Problem**: Basic metrics exist but need enhancement for production monitoring and alerting.

**Current State**:

- `internal/observability/metrics.go` has basic HTTP and DB metrics
- Missing business metrics, GraphQL-specific metrics
- No Grafana dashboards or alerting rules

**Affected Files**:

- `internal/observability/metrics.go` - Basic metrics
- GraphQL resolvers - Need resolver metrics
- Application services - Need business metrics
- Background jobs - Need job metrics

**Solution**:

1. Add GraphQL-specific metrics:
   - `graphql_resolver_duration_seconds{operation, resolver}`
   - `graphql_errors_total{operation, error_type}`
   - `graphql_operations_total{operation, status}`
2. Add business metrics:
   - `works_created_total{language}`
   - `searches_performed_total{type}`
   - `user_registrations_total`
   - `translations_created_total{language}`
   - `likes_total{entity_type}`
3. Enhance existing metrics:
   - Add more labels to HTTP metrics (status code as number)
   - Add query type labels to DB metrics
   - Add connection pool metrics
   - Add cache hit/miss metrics
4. Create observability package structure:
   - Move metrics to `internal/observability/metrics/`
   - Add metric collection helpers
   - Document metric naming conventions

**Acceptance Criteria**:

- All critical paths have metrics
- GraphQL operations fully instrumented
- Business metrics tracked
- Metrics exposed on `/metrics` endpoint
- Metric labels follow Prometheus best practices

### 5. Implement Read Models (DTOs) for Efficient Queries (P1 - High Priority)

**Problem**: Application queries return full domain entities, which is inefficient and leaks domain logic to API layer.

**Current State**:

- Queries in `internal/app/*/queries.go` return domain entities
- GraphQL resolvers receive full entities with all fields
- No optimization for list vs detail views

**Affected Files**:

- `internal/app/work/queries.go` - Returns `domain.Work`
- `internal/app/translation/queries.go` - Returns `domain.Translation`
- `internal/app/author/queries.go` - Returns `domain.Author`
- GraphQL resolvers - Receive full entities

**Solution**:

1. Create DTO packages:
   - `internal/app/work/dto` - WorkListDTO, WorkDetailDTO
   - `internal/app/translation/dto` - TranslationListDTO, TranslationDetailDTO
   - `internal/app/author/dto` - AuthorListDTO, AuthorDetailDTO
2. Define optimized DTOs:

   ```go
   // WorkListDTO - For list views (minimal fields)
   type WorkListDTO struct {
       ID               uint
       Title            string
       AuthorName       string
       AuthorID         uint
       Language         string
       CreatedAt        time.Time
       ViewCount        int
       LikeCount        int
       TranslationCount int
   }

   // WorkDetailDTO - For single work view (all fields)
   type WorkDetailDTO struct {
       *WorkListDTO
       Content      string
       Description  string
       Tags         []string
       Translations []TranslationSummaryDTO
       Author       AuthorSummaryDTO
   }
   ```

3. Refactor queries to return DTOs:
   - Update query methods to use optimized SQL
   - Use joins to avoid N+1 queries
   - Map domain entities to DTOs
   - Update GraphQL resolvers to use DTOs
4. Add benchmarks comparing old vs new approach

**Acceptance Criteria**:

- List queries return optimized DTOs
- Detail queries return full DTOs
- No N+1 query problems
- Payload size reduced by 30-50%
- Query response time improved by 20%
- No breaking changes to GraphQL schema

### 6. Improve Structured Logging (P1 - High Priority)

**Problem**: Logging exists but lacks request context, user IDs, and trace correlation.

**Current State**:

- `internal/platform/log` uses zerolog
- Basic logging but missing context
- No request ID propagation
- No user ID in logs
- No trace/span ID correlation

**Affected Files**:

- `internal/platform/log/logger.go` - Basic logger
- HTTP middleware - Needs request ID injection
- All application services - Need context logging

**Solution**:

1. Enhance HTTP middleware:
   - Generate request ID for each request
   - Inject request ID into context
   - Add user ID from JWT to context
   - Add trace/span IDs to context
2. Update logger to use context:
   - Extract request ID, user ID, trace ID from context
   - Add to all log entries automatically
   - Create helper: `log.FromContext(ctx).WithRequestID().WithUserID()`
3. Add structured logging fields:
   - Define field name constants
   - Ensure consistent field names across codebase
   - Add sensitive data redaction
4. Implement log sampling:
   - Sample high-volume endpoints (e.g., health checks)
   - Configurable sampling rates
   - Always log errors regardless of sampling

**Acceptance Criteria**:

- All logs include request ID
- Authenticated request logs include user ID
- All logs include trace/span IDs
- Consistent log format across codebase
- Sensitive data excluded from logs
- Log sampling for high-volume endpoints

### 7. Refactor Caching with Decorator Pattern (P1 - High Priority)

**Problem**: Current caching implementation uses bespoke cached repositories. Should use decorator pattern for better maintainability.

**Current State**:

- `internal/data/cache` has custom caching logic
- Cached repositories are separate implementations
- Not following decorator pattern

**Affected Files**:

- `internal/data/cache/*` - Current caching implementation
- Repository interfaces - Need to support decorators

**Solution**:

1. Implement decorator pattern:
   - Create `CachedWorkRepository` decorator
   - Create `CachedAuthorRepository` decorator
   - Create `CachedTranslationRepository` decorator
   - Decorators wrap base repositories
2. Implement cache-aside pattern:
   - Check cache on read, populate on miss
   - Invalidate cache on write operations
   - Add cache key versioning strategy
3. Add cache configuration:
   - TTL per entity type
   - Cache size limits
   - Cache warming strategies
4. Add cache metrics:
   - Hit/miss rates
   - Cache size
   - Eviction counts

**Acceptance Criteria**:

- Decorator pattern implemented
- Cache hit rate > 70% for reads
- Automatic cache invalidation on updates
- Cache failures don't break application
- Metrics for cache performance

### 8. Complete API Documentation (P1 - High Priority)

**Problem**: API documentation is incomplete. Need comprehensive GraphQL API documentation.

**Current State**:

- GraphQL schema exists but lacks descriptions
- No example queries
- No API guide for consumers

**Affected Files**:

- GraphQL schema files - Need descriptions
- `api/README.md` - Needs comprehensive guide
- All resolver implementations - Need documentation

**Solution**:

1. Add descriptions to GraphQL schema:
   - Document all types, queries, mutations
   - Add field descriptions
   - Document input validation rules
   - Add deprecation notices where applicable
2. Create comprehensive API documentation:
   - `api/README.md` - Complete API guide
   - `api/EXAMPLES.md` - Query examples
   - Document authentication requirements
   - Document rate limiting
   - Document error responses
3. Enhance GraphQL Playground:
   - Pre-populate with example queries
   - Add query templates
   - Document schema changes

**Acceptance Criteria**:

- All 80+ GraphQL resolvers documented
- Example queries for each operation
- Input validation rules documented
- Error response examples
- Authentication requirements clear
- API changelog maintained

### 9. Refactor Testing Utilities (P2 - Medium Priority)

**Problem**: Tests depend on live database connections, making them slow and unreliable.

**Current State**:

- `internal/testutil/testutil.go` has database connection logic
- Integration tests require live database
- Tests are slow and may be flaky

**Affected Files**:

- `internal/testutil/testutil.go` - Database connection logic
- All integration tests - Depend on live DB

**Solution**:

1. Decouple tests from live database:
   - Remove database connection from testutil
   - Use test containers for integration tests
   - Use mocks for unit tests
2. Improve test utilities:
   - Create test data builders
   - Add fixtures for common scenarios
   - Improve test isolation
3. Add parallel test execution:
   - Enable `-parallel` flag where safe
   - Use test-specific database schemas
   - Clean up test data properly

**Acceptance Criteria**:

- Unit tests run without database
- Integration tests use test containers
- Tests run in parallel where possible
- Test execution time < 5 seconds for unit tests
- Clear separation between unit and integration tests

### 10. Implement Analytics Features (P2 - Medium Priority)

**Problem**: Analytics service exists but some metrics are stubs (like, comment, bookmark counting).

**Current State**:

- `internal/jobs/linguistics/work_analysis_service.go` has TODO comments:
  - Line 184: ViewCount TODO
  - Line 185: LikeCount TODO
  - Line 186: CommentCount TODO
  - Line 187: BookmarkCount TODO
  - Line 188: TranslationCount TODO
  - Line 192: PopularTranslations TODO

**Affected Files**:

- `internal/jobs/linguistics/work_analysis_service.go` - Stub implementations
- `internal/app/analytics/*` - Analytics services

**Solution**:

1. Implement counting services:
   - Like counting service
   - Comment counting service
   - Bookmark counting service
   - Translation counting service
   - View counting service
2. Implement popular translations calculation:
   - Calculate based on likes, comments, bookmarks
   - Cache results for performance
   - Update periodically via background job
3. Add analytics to work analysis:
   - Integrate counting services
   - Update WorkAnalytics struct
   - Ensure data is accurate and up-to-date

**Acceptance Criteria**:

- All analytics metrics implemented
- Popular translations calculated correctly
- Analytics updated in real-time or near-real-time
- Performance optimized (cached where appropriate)
- Tests for all analytics features

## Implementation Guidelines

1. **Architecture First**: Maintain clean architecture, DDD, and CQRS patterns
2. **Backward Compatibility**: Ensure API contracts remain consistent
3. **Code Quality**:
   - Follow Go best practices and idioms
   - Use interfaces for testability
   - Maintain separation of concerns
   - Add comprehensive error handling
4. **Testing**: Write tests for all new features and refactorings
5. **Documentation**: Add GoDoc comments for all public APIs
6. **Performance**: Optimize for production workloads
7. **Observability**: Instrument all critical paths

## Expected Outcome

- Production-ready search functionality
- Proper dependency injection (no globals)
- Full observability (tracing, metrics, logging)
- Optimized queries with DTOs
- Comprehensive API documentation
- Fast, reliable test suite
- Complete analytics features
- Improved code maintainability

## Files to Prioritize

1. `internal/app/search/service.go` - Core search implementation (P0)
2. `internal/platform/config/config.go` - Configuration refactoring (P1)
3. `internal/observability/*` - Observability enhancements (P0)
4. `internal/app/*/queries.go` - DTO implementation (P1)
5. `internal/platform/log/*` - Logging improvements (P1)
6. `api/README.md` - API documentation (P1)

## Notes

- Codebase uses Go 1.25
- Follows DDD/CQRS/Clean Architecture patterns
- GraphQL API with gqlgen
- PostgreSQL with GORM
- Weaviate for vector search
- Redis for caching and job queue
- Docker for local development
- Existing tests should continue to pass
- Follow existing code style and patterns