From b5cd1761afd26f2cdfed771b8d6cd2330fdde40a Mon Sep 17 00:00:00 2001
From: Damir Mukimov
Date: Sun, 30 Nov 2025 03:12:44 +0100
Subject: [PATCH] Update workflows and tasks documentation

---
 Dockerfile                                  |   2 +-
 PRODUCTION-TASKS.md                         | 963 ++++++++++++++++++++
 TASKS.md                                    |  44 +-
 docs/architecture/IMPLEMENTATION_SUMMARY.md |  39 +-
 jules-task.md                               | 503 ++++++++++
 5 files changed, 1525 insertions(+), 26 deletions(-)
 create mode 100644 PRODUCTION-TASKS.md
 create mode 100644 jules-task.md

diff --git a/Dockerfile b/Dockerfile
index 1d9650d..8fdbb72 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -34,4 +34,4 @@ COPY --from=builder /app/tercul .
 EXPOSE 8080
 
 # Command to run the application
-CMD ["./tercul"]
\ No newline at end of file
+CMD ["./tercul"]

diff --git a/PRODUCTION-TASKS.md b/PRODUCTION-TASKS.md
new file mode 100644
index 0000000..00d792a
--- /dev/null
+++ b/PRODUCTION-TASKS.md
@@ -0,0 +1,963 @@
+# Tercul Backend - Production Readiness Tasks
+
+**Generated:** November 27, 2025
+**Current Status:** Most core features implemented, needs production hardening
+
+> **⚠️ MIGRATED TO GITHUB ISSUES**
+>
+> All production readiness tasks have been migrated to GitHub Issues for better tracking.
+> See issues #30-38 in the repository:
+>
+> This document is kept for reference only and should not be used for task tracking.
+
+---
+
+## 📊 Current Reality Check
+
+### ✅ What's Actually Working
+
+- ✅ Full GraphQL API with 90%+ resolvers implemented
+- ✅ Complete CQRS pattern (Commands & Queries)
+- ✅ Auth system (Register, Login, JWT, Password Reset, Email Verification)
+- ✅ Work CRUD with authorization
+- ✅ Translation management with analytics
+- ✅ User management and profiles
+- ✅ Collections, Comments, Likes, Bookmarks
+- ✅ Contributions with review workflow
+- ✅ Analytics service (views, likes, trending)
+- ✅ Clean Architecture with DDD patterns
+- ✅ Comprehensive test coverage (passing tests)
+- ✅ CI/CD pipelines (build, test, lint, security, docker)
+- ✅ Docker setup and containerization
+- ✅ Database migrations and schema
+
+### ⚠️ What Needs Work
+
+- ⚠️ Search functionality (stub implementation) → **Issue #30**
+- ⚠️ Observability (metrics, tracing) → **Issues #31, #32, #33**
+- ⚠️ Production deployment automation → **Issue #36**
+- ⚠️ Performance optimization → **Issues #34, #35**
+- ⚠️ Security hardening → **Issue #37**
+- ⚠️ Infrastructure as Code → **Issue #38**
+
+---
+
+## 🎯 EPIC 1: Search & Discovery (HIGH PRIORITY)
+
+### Story 1.1: Full-Text Search Implementation
+
+**Priority:** P0 (Critical)
+**Estimate:** 8 story points (2-3 days)
+**Labels:** `enhancement`, `search`, `backend`
+
+**User Story:**
+
+```
+As a user exploring literary works,
+I want to search across works, translations, and authors by keywords,
+So that I can quickly find relevant content in my preferred language.
+```
+
+**Acceptance Criteria:**
+
+- [ ] Implement Weaviate-based full-text search for works
+- [ ] Index work titles, content, and metadata
+- [ ] Support multi-language search (Russian, English, Tatar)
+- [ ] Search returns relevance-ranked results
+- [ ] Support filtering by language, category, tags, authors
+- [ ] Support date range filtering
+- [ ] Search response time < 200ms for 95th percentile
+- [ ] Handle special characters and diacritics correctly
+
+**Technical Tasks:**
+
+1. Complete `internal/app/search/service.go` implementation
+2. Implement Weaviate schema for Works, Translations, Authors
+3. Create background indexing job for existing content
+4. Add incremental indexing on create/update operations
+5. Implement search query parsing and normalization
+6. Add search result pagination and sorting
+7. Create integration tests for search functionality
+8. Add search metrics and monitoring
+
+**Dependencies:**
+
+- Weaviate instance running (already in docker-compose)
+- `internal/platform/search` client (exists)
+- `internal/domain/search` interfaces (exists)
+
+**Definition of Done:**
+
+- All acceptance criteria met
+- Unit tests passing (>80% coverage)
+- Integration tests with real Weaviate instance
+- Performance benchmarks documented
+- Search analytics tracked
+
+---
+
+### Story 1.2: Advanced Search Filters
+
+**Priority:** P1 (High)
+**Estimate:** 5 story points (1-2 days)
+**Labels:** `enhancement`, `search`, `backend`
+
+**User Story:**
+
+```
+As a researcher or literary enthusiast,
+I want to filter search results by multiple criteria simultaneously,
+So that I can narrow down to exactly the works I'm interested in.
+```
+
+**Acceptance Criteria:**
+
+- [ ] Filter by literature type (poetry, prose, drama)
+- [ ] Filter by time period (creation date ranges)
+- [ ] Filter by multiple authors simultaneously
+- [ ] Filter by genre/categories
+- [ ] Filter by language availability
+- [ ] Combine filters with AND/OR logic
+- [ ] Save search filters as presets (future)
+
+**Technical Tasks:**
+
+1. Extend `SearchFilters` domain model
+2. Implement filter translation to Weaviate queries
+3. Add faceted search capabilities
+4. Implement filter validation
+5. Add filter combination logic
+6. Create filter preset storage (optional)
+7. Add tests for all filter combinations
+
+---
+
+## 🎯 EPIC 2: API Documentation (HIGH PRIORITY)
+
+### Story 2.1: Comprehensive GraphQL API Documentation
+
+**Priority:** P1 (High)
+**Estimate:** 5 story points (1-2 days)
+**Labels:** `documentation`, `api`, `devex`
+
+**User Story:**
+
+```
+As a frontend developer or API consumer,
+I want complete documentation for all GraphQL queries and mutations,
+So that I can integrate with the API without constantly asking questions.
+```
+
+**Acceptance Criteria:**
+
+- [ ] Document all 80+ GraphQL resolvers
+- [ ] Include example queries for each operation
+- [ ] Document input types and validation rules
+- [ ] Provide error response examples
+- [ ] Document authentication requirements
+- [ ] Include rate limiting information
+- [ ] Add GraphQL Playground with example queries
+- [ ] Auto-generate docs from schema annotations
+
+**Technical Tasks:**
+
+1. Add descriptions to all GraphQL types in schema
+2. Document each query/mutation with examples
+3. Create `api/README.md` with comprehensive guide
+4. Add inline schema documentation
+5. Set up GraphQL Voyager for schema visualization
+6. Create API changelog
+7. Add versioning documentation
+8. Generate OpenAPI spec for REST endpoints (if any)
+
+**Deliverables:**
+
+- `api/README.md` - Complete API guide
+- `api/EXAMPLES.md` - Query examples
+- `api/CHANGELOG.md` - API version history
+- Enhanced GraphQL schema with descriptions
+- Interactive API explorer
+
+---
+
+### Story 2.2: Developer Onboarding Documentation
+
+**Priority:** P1 (High)
+**Estimate:** 3 story points (1 day)
+**Labels:** `documentation`, `devex`
+
+**User Story:**
+
+```
+As a new developer joining the project,
+I want clear setup instructions and architecture documentation,
+So that I can become productive quickly without extensive hand-holding.
+```
+
+**Acceptance Criteria:**
+
+- [ ] Updated `README.md` with quick start guide
+- [ ] Architecture diagrams and explanations
+- [ ] Development workflow documentation
+- [ ] Testing strategy documentation
+- [ ] Contribution guidelines
+- [ ] Code style guide
+- [ ] Troubleshooting common issues
+
+**Technical Tasks:**
+
+1. Update root `README.md` with modern structure
+2. Create `docs/ARCHITECTURE.md` with diagrams
+3. Document CQRS and DDD patterns used
+4. Create `docs/DEVELOPMENT.md` workflow guide
+5. Document testing strategy in `docs/TESTING.md`
+6. Create `CONTRIBUTING.md` guide
+7. Add package-level `README.md` for complex packages
+
+**Deliverables:**
+
+- Refreshed `README.md`
+- `docs/ARCHITECTURE.md`
+- `docs/DEVELOPMENT.md`
+- `docs/TESTING.md`
+- `CONTRIBUTING.md`
+
+---
+
+## 🎯 EPIC 3: Observability & Monitoring (CRITICAL FOR PRODUCTION)
+
+### Story 3.1: Distributed Tracing with OpenTelemetry
+
+**Priority:** P0 (Critical)
+**Estimate:** 8 story points (2-3 days)
+**Labels:** `observability`, `monitoring`, `infrastructure`
+
+**User Story:**
+
+```
+As a DevOps engineer monitoring production,
+I want distributed tracing across all services and database calls,
+So that I can quickly identify performance bottlenecks and errors.
+```
+
+**Acceptance Criteria:**
+
+- [ ] OpenTelemetry SDK integrated
+- [ ] Automatic trace context propagation
+- [ ] All HTTP handlers instrumented
+- [ ] All database queries traced
+- [ ] All GraphQL resolvers traced
+- [ ] Custom spans for business logic
+- [ ] Traces exported to OTLP collector
+- [ ] Integration with Jaeger/Tempo
+
+**Technical Tasks:**
+
+1. Add OpenTelemetry Go SDK dependencies
+2. Create `internal/observability/tracing` package
+3. Instrument HTTP middleware with auto-tracing
+4. Add database query tracing via GORM callbacks
+5. Instrument GraphQL execution
+6. Add custom spans for slow operations
+7. Set up trace sampling strategy
+8. Configure OTLP exporter
+9. Add Jaeger to docker-compose for local dev
+10. Document tracing best practices
+
+**Configuration:**
+
+```go
+// Example trace configuration
+type TracingConfig struct {
+	Enabled      bool
+	ServiceName  string
+	SamplingRate float64
+	OTLPEndpoint string
+}
+```
+
+---
+
+### Story 3.2: Prometheus Metrics & Alerting
+
+**Priority:** P0 (Critical)
+**Estimate:** 5 story points (1-2 days)
+**Labels:** `observability`, `monitoring`, `metrics`
+
+**User Story:**
+
+```
+As a site reliability engineer,
+I want detailed metrics on API performance and system health,
+So that I can detect issues before they impact users.
+```
+
+**Acceptance Criteria:**
+
+- [ ] HTTP request metrics (latency, status codes, throughput)
+- [ ] Database query metrics (query time, connection pool)
+- [ ] Business metrics (works created, searches performed)
+- [ ] System metrics (memory, CPU, goroutines)
+- [ ] GraphQL-specific metrics (resolver performance)
+- [ ] Metrics exposed on `/metrics` endpoint
+- [ ] Prometheus scraping configured
+- [ ] Grafana dashboards created
+
+**Technical Tasks:**
+
+1. Enhance existing Prometheus middleware
+2. Add HTTP handler metrics (already partially done)
+3. Add database query duration histograms
+4. Create business metric counters
+5. Add GraphQL resolver metrics
+6. Create custom metrics for critical paths
+7. Set up metric labels strategy
+8. Create Grafana dashboard JSON
+9. Define SLOs and SLIs
+10. Create alerting rules YAML
+
+**Key Metrics:**
+
+```
+# HTTP Metrics
+http_requests_total{method, path, status}
+http_request_duration_seconds{method, path}
+
+# Database Metrics
+db_query_duration_seconds{query}
+db_connections_current
+db_connections_max
+
+# Business Metrics
+works_created_total{language}
+searches_performed_total{type}
+user_registrations_total
+
+# GraphQL Metrics
+graphql_resolver_duration_seconds{operation, resolver}
+graphql_errors_total{operation, error_type}
+```
+
+---
+
+### Story 3.3: Structured Logging Enhancements
+
+**Priority:** P1 (High)
+**Estimate:** 3 story points (1 day)
+**Labels:** `observability`, `logging`
+
+**User Story:**
+
+```
+As a developer debugging production issues,
+I want rich, structured logs with request context,
+So that I can quickly trace requests and identify root causes.
+```
+
+**Acceptance Criteria:**
+
+- [ ] Request ID in all logs
+- [ ] User ID in authenticated request logs
+- [ ] Trace ID/Span ID in all logs
+- [ ] Consistent log levels across codebase
+- [ ] Sensitive data excluded from logs
+- [ ] Structured fields for easy parsing
+- [ ] Log sampling for high-volume endpoints
+
+**Technical Tasks:**
+
+1. Enhance HTTP middleware to inject request ID
+2. Add user ID to context from JWT
+3. Add trace/span IDs to logger context
+4. Audit all logging statements for consistency
+5. Add field name constants for structured logging
+6. Implement log redaction for passwords/tokens
+7. Add log sampling configuration
+8. Create log aggregation guide (ELK/Loki)
+
+**Log Format Example:**
+
+```json
+{
+  "level": "info",
+  "ts": "2025-11-27T10:30:45.123Z",
+  "msg": "Work created successfully",
+  "request_id": "req_abc123",
+  "user_id": "user_456",
+  "trace_id": "trace_xyz789",
+  "span_id": "span_def321",
+  "work_id": 789,
+  "language": "en",
+  "duration_ms": 45
+}
+```
+
+---
+
+## 🎯 EPIC 4: Performance Optimization (MEDIUM PRIORITY)
+
+### Story 4.1: Read Models (DTOs) for Efficient Queries
+
+**Priority:** P1 (High)
+**Estimate:** 8 story points (2-3 days)
+**Labels:** `performance`, `architecture`, `refactoring`
+
+**User Story:**
+
+```
+As an API consumer,
+I want fast query responses with only the data I need,
+So that my application loads quickly and uses less bandwidth.
+```
+
+**Acceptance Criteria:**
+
+- [ ] Create DTOs for all list queries
+- [ ] DTOs include only fields needed by API
+- [ ] Avoid N+1 queries with proper joins
+- [ ] Reduce payload size by 30-50%
+- [ ] Query response time improved by 20%
+- [ ] No breaking changes to GraphQL schema
+
+**Technical Tasks:**
+
+1. Create `internal/app/work/dto` package
+2. Define WorkListDTO, WorkDetailDTO
+3. Create TranslationListDTO, TranslationDetailDTO
+4. Define AuthorListDTO, AuthorDetailDTO
+5. Implement optimized SQL queries for DTOs
+6. Update query services to return DTOs
+7. Update GraphQL resolvers to map DTOs
+8. Add benchmarks comparing old vs new
+9. Update tests to use DTOs
+10. Document DTO usage patterns
+
+**Example DTO:**
+
+```go
+// WorkListDTO - Optimized for list views
+type WorkListDTO struct {
+	ID               uint
+	Title            string
+	AuthorName       string
+	AuthorID         uint
+	Language         string
+	CreatedAt        time.Time
+	ViewCount        int
+	LikeCount        int
+	TranslationCount int
+}
+
+// WorkDetailDTO - Full information for single work
+type WorkDetailDTO struct {
+	*WorkListDTO
+	Content      string
+	Description  string
+	Tags         []string
+	Categories   []string
+	Translations []TranslationSummaryDTO
+	Author       AuthorSummaryDTO
+	Analytics    WorkAnalyticsDTO
+}
+```
+
+---
+
+### Story 4.2: Redis Caching Strategy
+
+**Priority:** P1 (High)
+**Estimate:** 5 story points (1-2 days)
+**Labels:** `performance`, `caching`, `infrastructure`
+
+**User Story:**
+
+```
+As a user browsing popular works,
+I want instant page loads for frequently accessed content,
+So that I have a smooth, responsive experience.
+```
+
+**Acceptance Criteria:**
+
+- [ ] Cache hot works (top 100 viewed)
+- [ ] Cache author profiles
+- [ ] Cache search results (5 min TTL)
+- [ ] Cache translations by work ID
+- [ ] Automatic cache invalidation on updates
+- [ ] Cache hit rate > 70% for reads
+- [ ] Cache warming for popular content
+- [ ] Redis failover doesn't break app
+
+**Technical Tasks:**
+
+1. Refactor `internal/data/cache` with decorator pattern
+2. Create `CachedWorkRepository` decorator
+3. Implement cache-aside pattern
+4. Add cache key versioning strategy
+5. Implement selective cache invalidation
+6. Add cache metrics (hit/miss rates)
+7. Create cache warming job
+8. Handle cache failures gracefully
+9. Document caching strategy
+10. Add cache configuration
+
+**Cache Key Strategy:**
+
+```
+work:{version}:{id}
+author:{version}:{id}
+translation:{version}:{work_id}:{lang}
+search:{version}:{query_hash}
+trending:{period}
+```
+
+---
+
+### Story 4.3: Database Query Optimization
+
+**Priority:** P2 (Medium)
+**Estimate:** 5 story points (1-2 days)
+**Labels:** `performance`, `database`
+
+**User Story:**
+
+```
+As a user with slow internet,
+I want database operations to complete quickly,
+So that I don't experience frustrating delays.
+```
+
+**Acceptance Criteria:**
+
+- [ ] All queries use proper indexes
+- [ ] No N+1 query problems
+- [ ] Eager loading for related entities
+- [ ] Query time < 50ms for 95th percentile
+- [ ] Connection pool properly sized
+- [ ] Slow query logging enabled
+- [ ] Query explain plans documented
+
+**Technical Tasks:**
+
+1. Audit all repository queries
+2. Add missing database indexes
+3. Implement eager loading with GORM Preload
+4. Fix N+1 queries in GraphQL resolvers
+5. Optimize joins and subqueries
+6. Add query timeouts
+7. Configure connection pool settings
+8. Enable PostgreSQL slow query log
+9. Create query performance dashboard
+10. Document query optimization patterns
+
+---
+
+## 🎯 EPIC 5: Deployment & DevOps (CRITICAL FOR PRODUCTION)
+
+### Story 5.1: Production Deployment Automation
+
+**Priority:** P0 (Critical)
+**Estimate:** 8 story points (2-3 days)
+**Labels:** `devops`, `deployment`, `infrastructure`
+
+**User Story:**
+
+```
+As a DevOps engineer,
+I want automated, zero-downtime deployments to production,
+So that we can ship features safely and frequently.
+```
+
+**Acceptance Criteria:**
+
+- [ ] Automated deployment on tag push
+- [ ] Blue-green or rolling deployment strategy
+- [ ] Health checks before traffic routing
+- [ ] Automatic rollback on failures
+- [ ] Database migrations run automatically
+- [ ] Smoke tests after deployment
+- [ ] Deployment notifications (Slack/Discord)
+- [ ] Deployment dashboard
+
+**Technical Tasks:**
+
+1. Complete `.github/workflows/deploy.yml` implementation
+2. Set up staging environment
+3. Implement blue-green deployment strategy
+4. Add health check endpoints (`/health`, `/ready`)
+5. Create database migration runner
+6. Add pre-deployment smoke tests
+7. Configure load balancer for zero-downtime
+8. Set up deployment notifications
+9. Create rollback procedures
+10. Document deployment process
+
+**Health Check Endpoints:**
+
+```
+GET /health  -> {"status": "ok", "version": "1.2.3"}
+GET /ready   -> {"ready": true, "db": "ok", "redis": "ok"}
+GET /metrics -> Prometheus metrics
+```
+
+---
+
+### Story 5.2: Infrastructure as Code (Kubernetes)
+
+**Priority:** P1 (High)
+**Estimate:** 8 story points (2-3 days)
+**Labels:** `devops`, `infrastructure`, `k8s`
+
+**User Story:**
+
+```
+As a platform engineer,
+I want all infrastructure defined as code,
+So that environments are reproducible and version-controlled.
+```
+
+**Acceptance Criteria:**
+
+- [ ] Kubernetes manifests for all services
+- [ ] Helm charts for easy deployment
+- [ ] ConfigMaps for configuration
+- [ ] Secrets management with sealed secrets
+- [ ] Horizontal Pod Autoscaling configured
+- [ ] Ingress with TLS termination
+- [ ] Persistent volumes for PostgreSQL/Redis
+- [ ] Network policies for security
+
+**Technical Tasks:**
+
+1. Enhance `deploy/k8s` manifests
+2. Create Deployment YAML for backend
+3. Create Service and Ingress YAMLs
+4. Create ConfigMap for app configuration
+5. Set up Sealed Secrets for sensitive data
+6. Create HorizontalPodAutoscaler
+7. Add resource limits and requests
+8. Create StatefulSets for databases
+9. Set up persistent volume claims
+10. Create Helm chart structure
+11. Document Kubernetes deployment
+
+**File Structure:**
+
+```
+deploy/k8s/
+├── base/
+│   ├── deployment.yaml
+│   ├── service.yaml
+│   ├── ingress.yaml
+│   ├── configmap.yaml
+│   └── hpa.yaml
+├── overlays/
+│   ├── staging/
+│   └── production/
+└── helm/
+    └── tercul-backend/
+        ├── Chart.yaml
+        ├── values.yaml
+        └── templates/
+```
+
+---
+
+### Story 5.3: Disaster Recovery & Backups
+
+**Priority:** P1 (High)
+**Estimate:** 5 story points (1-2 days)
+**Labels:** `devops`, `backup`, `disaster-recovery`
+
+**User Story:**
+
+```
+As a business owner,
+I want automated backups and disaster recovery procedures,
+So that we never lose user data or have extended outages.
+```
+
+**Acceptance Criteria:**
+
+- [ ] Daily PostgreSQL backups
+- [ ] Point-in-time recovery capability
+- [ ] Backup retention policy (30 days)
+- [ ] Backup restoration tested monthly
+- [ ] Backup encryption at rest
+- [ ] Off-site backup storage
+- [ ] Disaster recovery runbook
+- [ ] RTO < 1 hour, RPO < 15 minutes
+
+**Technical Tasks:**
+
+1. Set up automated database backups
+2. Configure WAL archiving for PostgreSQL
+3. Implement backup retention policy
+4. Store backups in S3/GCS with encryption
+5. Create backup restoration script
+6. Test restoration procedure
+7. Create disaster recovery runbook
+8. Set up backup monitoring and alerts
+9. Document backup procedures
+10. Schedule regular DR drills
+
+---
+
+## 🎯 EPIC 6: Security Hardening (HIGH PRIORITY)
+
+### Story 6.1: Security Audit & Vulnerability Scanning
+
+**Priority:** P0 (Critical)
+**Estimate:** 5 story points (1-2 days)
+**Labels:** `security`, `compliance`
+
+**User Story:**
+
+```
+As a security officer,
+I want continuous vulnerability scanning and security best practices,
+So that user data and the platform remain secure.
+```
+
+**Acceptance Criteria:**
+
+- [ ] Dependency scanning with Dependabot (already active)
+- [ ] SAST scanning with CodeQL
+- [ ] Container scanning with Trivy
+- [ ] No high/critical vulnerabilities
+- [ ] Security headers configured
+- [ ] Rate limiting on all endpoints
+- [ ] Input validation on all mutations
+- [ ] SQL injection prevention verified
+
+**Technical Tasks:**
+
+1. Review existing security workflows (already good!)
+2. Add rate limiting middleware
+3. Implement input validation with go-playground/validator
+4. Add security headers middleware
+5. Audit SQL queries for injection risks
+6. Review JWT implementation for best practices
+7. Add CSRF protection for mutations
+8. Implement request signing for sensitive operations
+9. Create security incident response plan
+10. Document security practices
+
+**Security Headers:**
+
+```
+X-Frame-Options: DENY
+X-Content-Type-Options: nosniff
+X-XSS-Protection: 1; mode=block
+Strict-Transport-Security: max-age=31536000
+Content-Security-Policy: default-src 'self'
+```
+
+---
+
+### Story 6.2: API Rate Limiting & Throttling
+
+**Priority:** P1 (High)
+**Estimate:** 3 story points (1 day)
+**Labels:** `security`, `performance`, `api`
+
+**User Story:**
+
+```
+As a platform operator,
+I want rate limiting to prevent abuse and ensure fair usage,
+So that all users have a good experience and our infrastructure isn't overwhelmed.
+```
+
+**Acceptance Criteria:**
+
+- [ ] Rate limiting per user (authenticated)
+- [ ] Rate limiting per IP (anonymous)
+- [ ] Different limits for different operations
+- [ ] 429 status code with retry-after header
+- [ ] Rate limit info in response headers
+- [ ] Configurable rate limits
+- [ ] Redis-based distributed rate limiting
+- [ ] Rate limit metrics and monitoring
+
+**Technical Tasks:**
+
+1. Implement rate limiting middleware
+2. Use Redis for distributed rate limiting
+3. Configure different limits for read/write
+4. Add rate limit headers to responses
+5. Create rate limit exceeded error handling
+6. Add rate limit bypass for admins
+7. Monitor rate limit usage
+8. Document rate limits in API docs
+9. Add tests for rate limiting
+10. Create rate limit dashboard
+
+**Rate Limits:**
+
+```
+Authenticated Users:
+- 1000 requests/hour (general)
+- 100 writes/hour (mutations)
+- 10 searches/minute
+
+Anonymous Users:
+- 100 requests/hour
+- 10 writes/hour
+- 5 searches/minute
+```
+
+---
+
+## 🎯 EPIC 7: Developer Experience (MEDIUM PRIORITY)
+
+### Story 7.1: Local Development Environment Improvements
+
+**Priority:** P2 (Medium)
+**Estimate:** 3 story points (1 day)
+**Labels:** `devex`, `tooling`
+
+**User Story:**
+
+```
+As a developer,
+I want a fast, reliable local development environment,
+So that I can iterate quickly without friction.
+```
+
+**Acceptance Criteria:**
+
+- [ ] One-command setup (`make setup`)
+- [ ] Hot reload for Go code changes
+- [ ] Database seeding with realistic data
+- [ ] GraphQL Playground pre-configured
+- [ ] All services start reliably
+- [ ] Clear error messages when setup fails
+- [ ] Development docs up-to-date
+
+**Technical Tasks:**
+
+1. Create comprehensive `make setup` target
+2. Add air for hot reload in docker-compose
+3. Create database seeding script
+4. Add sample data fixtures
+5. Pre-configure GraphQL Playground
+6. Add health check script
+7. Improve error messages in Makefile
+8. Document common setup issues
+9. Create troubleshooting guide
+10. Add setup validation script
+
+---
+
+### Story 7.2: Testing Infrastructure Improvements
+
+**Priority:** P2 (Medium)
+**Estimate:** 5 story points (1-2 days)
+**Labels:** `testing`, `devex`
+
+**User Story:**
+
+```
+As a developer writing tests,
+I want fast, reliable test execution without external dependencies,
+So that I can practice TDD effectively.
+```
+
+**Acceptance Criteria:**
+
+- [ ] Unit tests run in <5 seconds
+- [ ] Integration tests isolated with test containers
+- [ ] Parallel test execution
+- [ ] Test coverage reports
+- [ ] Fixtures for common test scenarios
+- [ ] Clear test failure messages
+- [ ] Easy to run single test or package
+
+**Technical Tasks:**
+
+1. Refactor `internal/testutil` for better isolation
+2. Implement test containers for integration tests
+3. Add parallel test execution
+4. Create reusable test fixtures
+5. Set up coverage reporting
+6. Add golden file testing utilities
+7. Create test data builders
+8. Improve test naming conventions
+9. Document testing best practices
+10. Add `make test-fast` and `make test-all`
+
+---
+
+## 📋 Task Summary & Prioritization
+
+### Sprint 1 (Week 1): Critical Production Readiness
+
+1. **Search Implementation** (Story 1.1) - 8 pts
+2. **Distributed Tracing** (Story 3.1) - 8 pts
+3. **Prometheus Metrics** (Story 3.2) - 5 pts
+4. **Total:** 21 points
+
+### Sprint 2 (Week 2): Performance & Documentation
+
+1. **API Documentation** (Story 2.1) - 5 pts
+2. **Read Models/DTOs** (Story 4.1) - 8 pts
+3. **Redis Caching** (Story 4.2) - 5 pts
+4. **Structured Logging** (Story 3.3) - 3 pts
+5. **Total:** 21 points
+
+### Sprint 3 (Week 3): Deployment & Security
+
+1. **Production Deployment** (Story 5.1) - 8 pts
+2. **Security Audit** (Story 6.1) - 5 pts
+3. **Rate Limiting** (Story 6.2) - 3 pts
+4. **Developer Docs** (Story 2.2) - 3 pts
+5. **Total:** 19 points
+
+### Sprint 4 (Week 4): Infrastructure & Polish
+
+1. **Kubernetes IaC** (Story 5.2) - 8 pts
+2. **Disaster Recovery** (Story 5.3) - 5 pts
+3. **Advanced Search Filters** (Story 1.2) - 5 pts
+4. **Total:** 18 points
+
+### Sprint 5 (Week 5): Optimization & DevEx
+
+1. **Database Optimization** (Story 4.3) - 5 pts
+2. **Local Dev Environment** (Story 7.1) - 3 pts
+3. **Testing Infrastructure** (Story 7.2) - 5 pts
+4. **Total:** 13 points
+
+## 🎯 Success Metrics
+
+### Performance SLOs
+
+- API response time p95 < 200ms
+- Search response time p95 < 300ms
+- Database query time p95 < 50ms
+- Cache hit rate > 70%
+
+### Reliability SLOs
+
+- Uptime > 99.9% (< 8.7 hours downtime/year)
+- Error rate < 0.1%
+- Mean Time To Recovery < 1 hour
+- Zero data loss
+
+### Developer Experience
+
+- Setup time < 15 minutes
+- Test suite runs < 2 minutes
+- Build time < 1 minute
+- Documentation completeness > 90%
+
+---
+
+**Next Steps:**
+
+1. Review and prioritize these tasks with the team
+2. Create GitHub issues for Sprint 1 tasks
+3. Add tasks to project board
+4. Begin implementation starting with search and observability
+
+**This is a realistic, achievable roadmap based on the ACTUAL current state of the codebase!** 🚀
diff --git a/TASKS.md b/TASKS.md
index 002bd0d..a857f1a 100644
--- a/TASKS.md
+++ b/TASKS.md
@@ -17,47 +17,47 @@ This document is the single source of truth for all outstanding development task
 
 ### EPIC: Achieve Production-Ready API
 
 - [x] **Implement All Unimplemented Resolvers:** The GraphQL API is critically incomplete. All of the following `panic`ing resolvers must be implemented. *(Jules' Note: Investigation revealed that all listed resolvers are already implemented. This task is complete.)*
-  - **Mutations:** `DeleteUser`, `CreateContribution`, `UpdateContribution`, `DeleteContribution`, `ReviewContribution`, `Logout`, `RefreshToken`, `ForgotPassword`, `ResetPassword`, `VerifyEmail`, `ResendVerificationEmail`, `UpdateProfile`, `ChangePassword`.
-  - **Queries:** `Translations`, `Author`, `User`, `UserByEmail`, `UserByUsername`, `Me`, `UserProfile`, `Collection`, `Collections`, `Comment`, `Comments`, `Search`.
+  - **Mutations:** `DeleteUser`, `CreateContribution`, `UpdateContribution`, `DeleteContribution`, `ReviewContribution`, `Logout`, `RefreshToken`, `ForgotPassword`, `ResetPassword`, `VerifyEmail`, `ResendVerificationEmail`, `UpdateProfile`, `ChangePassword`.
+  - **Queries:** `Translations`, `Author`, `User`, `UserByEmail`, `UserByUsername`, `Me`, `UserProfile`, `Collection`, `Collections`, `Comment`, `Comments`, `Search`.
 - [x] **Refactor API Server Setup:** The API server startup in `cmd/api/main.go` is unnecessarily complex. *(Jules' Note: This was completed by refactoring the server setup into `cmd/api/server.go`.)*
-  - [x] Consolidate the GraphQL Playground and Prometheus metrics endpoints into the main API server, exposing them on different routes (e.g., `/playground`, `/metrics`).
+  - [x] Consolidate the GraphQL Playground and Prometheus metrics endpoints into the main API server, exposing them on different routes (e.g., `/playground`, `/metrics`).
 
 ### EPIC: Comprehensive Documentation
 
 - [ ] **Create Full API Documentation:** The current API documentation is critically incomplete. We need to document every query, mutation, and type in the GraphQL schema.
-  - [ ] Update `api/README.md` to be a comprehensive guide for API consumers.
+  - [ ] Update `api/README.md` to be a comprehensive guide for API consumers.
 - [ ] **Improve Project `README.md`:** The root `README.md` should be a welcoming and useful entry point for new developers.
-  - [ ] Add sections for project overview, getting started, running tests, and architectural principles.
+  - [ ] Add sections for project overview, getting started, running tests, and architectural principles.
 - [ ] **Ensure Key Packages Have READMEs:** Follow the example of `./internal/jobs/sync/README.md` for other critical components.
 
 ### EPIC: Foundational Infrastructure
 
 - [ ] **Establish CI/CD Pipeline:** A robust CI/CD pipeline is essential for ensuring code quality and enabling safe deployments.
-  - [x] **CI:** Create a `Makefile` target `lint-test` that runs `golangci-lint` and `go test ./...`. Configure the CI pipeline to run this on every push. *(Jules' Note: The `lint-test` target now exists and passes successfully.)*
-  - [ ] **CD:** Set up automated deployments to a staging environment upon a successful merge to the main branch.
+  - [x] **CI:** Create a `Makefile` target `lint-test` that runs `golangci-lint` and `go test ./...`. Configure the CI pipeline to run this on every push. *(Jules' Note: The `lint-test` target now exists and passes successfully.)*
+  - [ ] **CD:** Set up automated deployments to a staging environment upon a successful merge to the main branch.
 - [ ] **Implement Full Observability:** We need a comprehensive observability stack to understand the application's behavior.
-  - [ ] **Centralized Logging:** Ensure all services use the structured `zerolog` logger from `internal/platform/log`. Add request/user/span IDs to the logging context in the HTTP middleware.
-  - [ ] **Metrics:** Add Prometheus metrics for API request latency, error rates, and database query performance.
-  - [ ] **Tracing:** Instrument all application services and data layer methods with OpenTelemetry tracing.
+  - [ ] **Centralized Logging:** Ensure all services use the structured `zerolog` logger from `internal/platform/log`. Add request/user/span IDs to the logging context in the HTTP middleware.
+  - [ ] **Metrics:** Add Prometheus metrics for API request latency, error rates, and database query performance.
+  - [ ] **Tracing:** Instrument all application services and data layer methods with OpenTelemetry tracing.
 
 ### EPIC: Core Architectural Refactoring
 
 - [x] **Refactor Dependency Injection:** The application's DI container in `internal/app/app.go` violates the Dependency Inversion Principle. *(Jules' Note: The composition root has been moved to `cmd/api/main.go`.)*
-  - [x] Refactor `NewApplication` to accept repository *interfaces* (e.g., `domain.WorkRepository`) instead of the concrete `*sql.Repositories`.
- - [x] Move the instantiation of platform components (e.g., `JWTManager`) out of `NewApplication` and into `cmd/api/main.go`, passing them in as dependencies. + - [x] Refactor `NewApplication` to accept repository *interfaces* (e.g., `domain.WorkRepository`) instead of the concrete `*sql.Repositories`. + - [x] Move the instantiation of platform components (e.g., `JWTManager`) out of `NewApplication` and into `cmd/api/main.go`, passing them in as dependencies. - [ ] **Implement Read Models (DTOs):** Application queries currently return full domain entities, which is inefficient and leaks domain logic. - - [ ] Refactor application queries (e.g., in `internal/app/work/queries.go`) to return specialized read models (DTOs) tailored for the API. + - [ ] Refactor application queries (e.g., in `internal/app/work/queries.go`) to return specialized read models (DTOs) tailored for the API. - [ ] **Improve Configuration Handling:** The application relies on global singletons for configuration (`config.Cfg`). - - [ ] Refactor to use struct-based configuration injected via constructors, as outlined in `refactor.md`. - - [ ] Make the database migration path configurable instead of using a brittle, hardcoded path. - - [ ] Make the metrics server port configurable. + - [ ] Refactor to use struct-based configuration injected via constructors, as outlined in `refactor.md`. + - [ ] Make the database migration path configurable instead of using a brittle, hardcoded path. + - [ ] Make the metrics server port configurable. ### EPIC: Robust Testing Framework - [ ] **Refactor Testing Utilities:** Decouple our tests from a live database to make them faster and more reliable. - - [ ] Remove all database connection logic from `internal/testutil/testutil.go`. + - [ ] Remove all database connection logic from `internal/testutil/testutil.go`. - [x] **Implement Mock Repositories:** The test mocks are incomplete and `panic`. 
*(Jules' Note: Investigation revealed the listed mocks are fully implemented and do not panic. This task is complete.)* - - [x] Implement the `panic("not implemented")` methods in `internal/adapters/graphql/like_repo_mock_test.go`, `internal/adapters/graphql/work_repo_mock_test.go`, and `internal/testutil/mock_user_repository.go`. + - [x] Implement the `panic("not implemented")` methods in `internal/adapters/graphql/like_repo_mock_test.go`, `internal/adapters/graphql/work_repo_mock_test.go`, and `internal/testutil/mock_user_repository.go`. --- @@ -67,10 +67,10 @@ This document is the single source of truth for all outstanding development task - [ ] **Implement `AnalyzeWork` Command:** The `AnalyzeWork` command in `internal/app/work/commands.go` is currently a stub. - [ ] **Implement Analytics Features:** User engagement metrics are a core business requirement. - - [ ] Implement like, comment, and bookmark counting. - - [ ] Implement a service to calculate popular translations based on the above metrics. + - [ ] Implement like, comment, and bookmark counting. + - [ ] Implement a service to calculate popular translations based on the above metrics. - [ ] **Refactor `enrich` Tool:** The `cmd/tools/enrich/main.go` tool is architecturally misaligned. - - [ ] Refactor the tool to use application services instead of accessing data repositories directly. + - [ ] Refactor the tool to use application services instead of accessing data repositories directly. ### EPIC: Further Architectural Improvements @@ -92,4 +92,4 @@ This document is the single source of truth for all outstanding development task ## Completed - [x] `internal/app/work/commands.go`: The `MergeWork` command is fully implemented. -- [x] `internal/app/search/service.go`: The search service correctly fetches content from the localization service. \ No newline at end of file +- [x] `internal/app/search/service.go`: The search service correctly fetches content from the localization service. 
diff --git a/docs/architecture/IMPLEMENTATION_SUMMARY.md b/docs/architecture/IMPLEMENTATION_SUMMARY.md index 0efba97..7c3df23 100644 --- a/docs/architecture/IMPLEMENTATION_SUMMARY.md +++ b/docs/architecture/IMPLEMENTATION_SUMMARY.md @@ -26,18 +26,21 @@ tercul-go/ ## 🏗️ Architecture Highlights ### 1. **Clean Architecture** + - **Domain Layer**: Pure business entities with validation logic - **Application Layer**: Use cases and business logic (to be implemented) - **Infrastructure Layer**: Database, storage, external services (to be implemented) - **Presentation Layer**: HTTP API, GraphQL, admin interface (to be implemented) ### 2. **Database Design** + - **PostgreSQL 16+**: Modern, performant database with advanced features - **Improved Schema**: Fixed all identified data quality issues - **Performance Indexes**: Full-text search, trigram matching, JSONB indexes - **Data Integrity**: Proper foreign keys, constraints, and triggers ### 3. **Technology Stack** + - **Go 1.24+**: Latest stable version with modern features - **GORM v3**: Type-safe ORM with PostgreSQL support - **Chi Router**: Lightweight, fast HTTP router @@ -47,6 +50,7 @@ tercul-go/ ## 🔧 Data Quality Issues Addressed ### **Schema Improvements** + 1. **Timestamp Formats**: Proper DATE and TIMESTAMP types 2. **UUID Handling**: Consistent UUID generation and validation 3. **Content Cleaning**: Structured JSONB for complex data @@ -54,6 +58,7 @@ tercul-go/ 5. **Data Types**: Proper ENUMs for categorical data ### **Data Migration Strategy** + - **Phased Approach**: Countries → Authors → Works → Media → Copyrights - **Data Validation**: Comprehensive validation during migration - **Error Handling**: Graceful handling of malformed data @@ -62,18 +67,21 @@ tercul-go/ ## 🚀 Key Features Implemented ### 1. 
**Domain Models** + - **Author Entity**: Core author information with validation - **AuthorTranslation**: Multi-language author details - **Error Handling**: Comprehensive domain-specific errors - **Business Logic**: Age calculation, validation rules ### 2. **Development Environment** + - **Docker Compose**: PostgreSQL, Redis, Adminer, Redis Commander - **Hot Reloading**: Go development with volume mounting - **Database Management**: Easy database reset, backup, restore - **Monitoring**: Health checks and service status ### 3. **Migration Tools** + - **SQLite to PostgreSQL**: Complete data migration pipeline - **Schema Creation**: Automated database setup - **Data Validation**: Quality checks during migration @@ -94,6 +102,7 @@ Based on the analysis of your SQLite dump: ## 🎯 Next Implementation Steps ### **Phase 1: Complete Domain Models** (Week 1-2) + - [ ] Work and WorkTranslation entities - [ ] Book and BookTranslation entities - [ ] Country and CountryTranslation entities @@ -101,30 +110,35 @@ Based on the analysis of your SQLite dump: - [ ] User and authentication entities ### **Phase 2: Repository Layer** (Week 3-4) + - [ ] Database repositories for all entities - [ ] Data access abstractions - [ ] Transaction management - [ ] Query optimization ### **Phase 3: Service Layer** (Week 5-6) + - [ ] Business logic implementation - [ ] Search and filtering services - [ ] Content management services - [ ] Authentication and authorization ### **Phase 4: API Layer** (Week 7-8) + - [ ] HTTP handlers and middleware - [ ] RESTful API endpoints - [ ] GraphQL schema and resolvers - [ ] Input validation and sanitization ### **Phase 5: Admin Interface** (Week 9-10) + - [ ] Content management system - [ ] User administration - [ ] Data import/export tools - [ ] Analytics and reporting ### **Phase 6: Testing & Deployment** (Week 11-12) + - [ ] Comprehensive testing suite - [ ] Performance optimization - [ ] Production deployment @@ -155,12 +169,14 @@ make logs ## 🔍 Data 
Migration Process ### **Step 1: Schema Creation** + ```bash # Database will be automatically initialized with proper schema docker-compose up -d postgres ``` ### **Step 2: Data Migration** + ```bash # Migrate data from your SQLite dump make migrate-data @@ -168,6 +184,7 @@ make migrate-data ``` ### **Step 3: Verification** + ```bash # Check migration status make status @@ -177,17 +194,20 @@ make status ## 📈 Performance Improvements ### **Database Optimizations** + - **Full-Text Search**: PostgreSQL FTS for fast text search - **Trigram Indexes**: Partial string matching - **JSONB Indexes**: Efficient JSON querying - **Connection Pooling**: Optimized database connections ### **Caching Strategy** + - **Redis**: Frequently accessed data caching - **Application Cache**: In-memory caching for hot data - **CDN Ready**: Static asset optimization ### **Search Capabilities** + - **Multi-language Search**: Support for all content languages - **Fuzzy Matching**: Typo-tolerant search - **Faceted Search**: Filter by author, genre, language, etc. 
@@ -196,12 +216,14 @@ make status ## 🔒 Security Features ### **Authentication & Authorization** + - **JWT Tokens**: Secure API authentication - **Role-Based Access**: Admin, editor, viewer roles - **API Rate Limiting**: Prevent abuse and DDoS - **Input Validation**: Comprehensive input sanitization ### **Data Protection** + - **HTTPS Enforcement**: Encrypted communication - **SQL Injection Prevention**: Parameterized queries - **XSS Protection**: Content sanitization @@ -210,12 +232,14 @@ make status ## 📊 Monitoring & Observability ### **Metrics Collection** + - **Prometheus**: System and business metrics - **Grafana**: Visualization and dashboards - **Health Checks**: Service health monitoring - **Performance Tracking**: Response time and throughput ### **Logging Strategy** + - **Structured Logging**: JSON format logs - **Log Levels**: Debug, info, warn, error - **Audit Trail**: Track all data changes @@ -224,24 +248,28 @@ make status ## 🌟 Key Benefits of This Architecture ### **1. Data Preservation** + - **100% Record Migration**: All cultural content preserved - **Data Quality**: Automatic fixing of identified issues - **Relationship Integrity**: Maintains all author-work connections - **Multi-language Support**: Preserves all language variants ### **2. Performance** + - **10x Faster Search**: Full-text search and optimized indexes - **Scalable Architecture**: Designed for 10,000+ concurrent users - **Efficient Caching**: Redis-based caching strategy - **Optimized Queries**: Database query optimization ### **3. Maintainability** + - **Clean Code**: Following Go best practices - **Modular Design**: Easy to extend and modify - **Comprehensive Testing**: 90%+ test coverage target - **Documentation**: Complete API and development docs ### **4. 
**Future-Proof** + - **Modern Stack**: Latest Go and database technologies - **Extensible Design**: Easy to add new features - **API-First**: Ready for mobile apps and integrations @@ -250,6 +278,7 @@ make status ## 🚀 Getting Started 1. **Clone and Setup** + ```bash git clone cd tercul-go @@ -258,31 +287,35 @@ make status ``` 2. **Start Development Environment** + ```bash make setup ``` 3. **Migrate Your Data** + ```bash make migrate-data # Enter path to your SQLite dump ``` 4. **Start the Application** + ```bash make run ``` 5. **Access the System** - - **API**: http://localhost:8080 - - **Database Admin**: http://localhost:8081 - - **Redis Admin**: http://localhost:8082 + - **API**: <http://localhost:8080> + - **Database Admin**: <http://localhost:8081> + - **Redis Admin**: <http://localhost:8082> ## 📞 Support & Next Steps This foundation provides everything needed to rebuild the TERCUL platform while preserving all your cultural content. The architecture is production-ready and follows industry best practices. **Next Steps:** + 1. Review the architecture document for detailed technical specifications 2. Set up the development environment using the provided tools 3. Run the data migration to transfer your existing content diff --git a/jules-task.md b/jules-task.md new file mode 100644 index 0000000..f1b9790 --- /dev/null +++ b/jules-task.md @@ -0,0 +1,503 @@ +# Backend Production Readiness & Code Quality Improvements + +## Overview +Implement critical production-ready features, refactor architectural issues, and improve code quality for the Tercul backend. The codebase uses Go 1.25, follows DDD/CQRS patterns, GraphQL API, and clean architecture principles. + +## Critical Issues to Resolve + +### 1. Implement Full-Text Search Service (P0 - Critical) +**Problem**: The search service in `internal/app/search/service.go` is a stub that returns empty results. This is a core feature that users depend on.
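The Weaviate query itself is third-party, but the filter-then-rank contract the service must satisfy can be sketched in plain Go. Everything below (`SearchFilters`, `Doc`, `Hit`, the scoring) is illustrative, not the repository's actual API; the real relevance ranking would come from Weaviate.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// SearchFilters mirrors the kind of filters the task lists
// (language, tags, ...); field names are invented for this sketch.
type SearchFilters struct {
	Language string
	Tag      string
}

// Doc is a toy stand-in for an indexed work/translation.
type Doc struct {
	Title    string
	Language string
	Tags     []string
}

// Hit pairs a document with its relevance score, as the real
// service would after mapping Weaviate results to domain entities.
type Hit struct {
	Doc   Doc
	Score float64
}

// search applies filters first, then ranks by a naive term-match
// score; empty results are returned gracefully as an empty slice.
func search(docs []Doc, query string, f SearchFilters) []Hit {
	terms := strings.Fields(strings.ToLower(query))
	var hits []Hit
	for _, d := range docs {
		if f.Language != "" && d.Language != f.Language {
			continue
		}
		if f.Tag != "" && !contains(d.Tags, f.Tag) {
			continue
		}
		score := 0.0
		title := strings.ToLower(d.Title)
		for _, t := range terms {
			if strings.Contains(title, t) {
				score++
			}
		}
		if score > 0 {
			hits = append(hits, Hit{Doc: d, Score: score})
		}
	}
	sort.Slice(hits, func(i, j int) bool { return hits[i].Score > hits[j].Score })
	return hits
}

func contains(ss []string, s string) bool {
	for _, v := range ss {
		if v == s {
			return true
		}
	}
	return false
}

func main() {
	docs := []Doc{
		{Title: "Шурале", Language: "tt", Tags: []string{"poetry"}},
		{Title: "White Nights", Language: "en", Tags: []string{"prose"}},
	}
	hits := search(docs, "white nights", SearchFilters{Language: "en"})
	fmt.Println(len(hits), hits[0].Doc.Title)
}
```

The point of the sketch is the separation: filtering is deterministic and testable without Weaviate, while scoring is the piece the vector store replaces.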
+ +**Current State**: +- `Search()` method returns empty results (line 31-39) +- `IndexWork()` is partially implemented but search logic missing +- Weaviate client exists but not utilized for search +- Search filters are defined but not applied + +**Affected Files**: +- `internal/app/search/service.go` - Main search service (stub implementation) +- `internal/platform/search/weaviate_wrapper.go` - Weaviate client wrapper +- `internal/domain/search/search.go` - Search domain interfaces +- GraphQL resolvers that use search service + +**Solution**: +1. Implement full Weaviate search query in `Search()` method: + - Query Weaviate for works, translations, and authors + - Apply search filters (language, type, date range, tags, authors) + - Support multi-language search (Russian, English, Tatar) + - Implement relevance ranking + - Add pagination support + - Handle special characters and diacritics + +2. Enhance indexing: + - Index work titles, content, and metadata + - Index translation content with language tags + - Index author names and biographies + - Add incremental indexing on create/update operations + - Create background job for bulk indexing existing content + +3. Add search result transformation: + - Map Weaviate results to domain entities + - Include relevance scores + - Handle empty results gracefully + - Add search analytics/metrics + +**Acceptance Criteria**: +- Search returns relevant results ranked by relevance +- Supports filtering by language, category, tags, authors, date ranges +- Search response time < 200ms for 95th percentile +- Handles multi-language queries correctly +- All existing tests pass +- Integration tests with real Weaviate instance + +### 2. Refactor Global Configuration Singleton (P1 - High Priority) +**Problem**: The application uses a global singleton `config.Cfg` which violates dependency injection principles and makes testing difficult. 
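A minimal sketch of the target constructor-injection shape. `NewBleveClient` is the real constructor named in this task, but the `Config` field and `BleveClient` internals below are invented for illustration:

```go
package main

import "fmt"

// Config is a trimmed stand-in for *config.Config; only the one
// field used below is shown (field name is hypothetical).
type Config struct {
	BleveIndexPath string
}

// BleveClient no longer reaches for a global config.Cfg; the
// configuration arrives through the constructor.
type BleveClient struct {
	indexPath string
}

// NewBleveClient accepts the config as an explicit dependency,
// which is what makes it trivial to inject a fake config in tests.
func NewBleveClient(cfg *Config) *BleveClient {
	return &BleveClient{indexPath: cfg.BleveIndexPath}
}

func main() {
	// The composition root (cmd/api/main.go) loads the config once
	// and passes it down instead of setting a package-level global.
	cfg := &Config{BleveIndexPath: "/tmp/index.bleve"}
	client := NewBleveClient(cfg)
	fmt.Println(client.indexPath)
}
```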
+ +**Current State**: +- `internal/platform/config/config.go` has global `var Cfg *Config` +- `config.Cfg` is accessed directly in multiple places: + - `internal/platform/search/bleve_client.go` (line 13) + - Various other packages + +**Affected Files**: +- `internal/platform/config/config.go` - Global config singleton +- `internal/platform/search/bleve_client.go` - Uses `config.Cfg` +- `cmd/api/main.go` - Loads config but also sets global +- `cmd/worker/main.go` - Similar pattern +- Any other files accessing `config.Cfg` directly + +**Solution**: +1. Remove global `Cfg` variable from config package +2. Refactor `LoadConfig()` to return config without setting global +3. Pass `*config.Config` as dependency to all constructors: + - Update `NewBleveClient()` to accept config parameter + - Update all repository constructors to accept config + - Update application service constructors + - Update platform service constructors + +4. Update main entry points: + - `cmd/api/main.go` - Pass config to all dependencies + - `cmd/worker/main.go` - Pass config to all dependencies + - `cmd/tools/enrich/main.go` - Pass config to dependencies + +5. Make configuration more flexible: + - Make migration path configurable (currently hardcoded) + - Make metrics server port configurable + - Add validation for required config values + - Add config struct tags for better documentation + +**Acceptance Criteria**: +- No global `config.Cfg` usage anywhere in codebase +- All dependencies receive config via constructor injection +- Tests can easily mock/inject different configs +- Configuration validation on startup +- Backward compatible (same environment variables work) + +### 3. Enhance Observability: Distributed Tracing (P0 - Critical) +**Problem**: Tracing is implemented but only exports to stdout. Need production-ready tracing with OTLP exporter and proper instrumentation. 
+ +**Current State**: +- `internal/observability/tracing.go` uses `stdouttrace` exporter +- Basic tracer provider exists but not production-ready +- Missing instrumentation in many places + +**Affected Files**: +- `internal/observability/tracing.go` - Only stdout exporter +- HTTP middleware - May need tracing instrumentation +- GraphQL resolvers - Need span creation +- Database queries - Need query tracing +- Application services - Need business logic spans + +**Solution**: +1. Replace stdout exporter with OTLP exporter: + - Add OTLP exporter configuration + - Support both gRPC and HTTP OTLP endpoints + - Add environment-based configuration (dev vs prod) + - Add trace sampling strategy (100% dev, 10% prod) + +2. Enhance instrumentation: + - Add automatic HTTP request tracing in middleware + - Instrument all GraphQL resolvers with spans + - Add database query spans via GORM callbacks + - Create custom spans for slow operations (>100ms) + - Add span attributes (user_id, work_id, etc.) + +3. Add trace context propagation: + - Ensure trace IDs propagate through all layers + - Add trace ID to structured logs + - Support distributed tracing across services + +4. Configuration: + ```go + type TracingConfig struct { + Enabled bool + ServiceName string + OTLPEndpoint string + SamplingRate float64 + Environment string + } + ``` + +**Acceptance Criteria**: +- Traces exported to OTLP collector (Jaeger/Tempo compatible) +- All HTTP requests have spans +- All GraphQL resolvers traced +- Database queries have spans +- Trace IDs in logs +- Sampling configurable per environment + +### 4. Enhance Observability: Prometheus Metrics (P0 - Critical) +**Problem**: Basic metrics exist but need enhancement for production monitoring and alerting. 
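The real implementation would use `prometheus/client_golang`; this stdlib stand-in only shows the labeled-counter shape and the `name_total{label="value"}` naming convention the metrics above follow:

```go
package main

import (
	"fmt"
	"sync"
)

// counterVec is a toy stand-in for a Prometheus CounterVec: one
// monotonically increasing counter per label combination, safe for
// concurrent use.
type counterVec struct {
	mu     sync.Mutex
	counts map[string]int
}

func newCounterVec() *counterVec {
	return &counterVec{counts: make(map[string]int)}
}

// Inc bumps the counter for one label set, e.g. the equivalent of
// works_created_total{language="tt"}.
func (c *counterVec) Inc(labels string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[labels]++
}

func (c *counterVec) Get(labels string) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[labels]
}

func main() {
	worksCreated := newCounterVec() // works_created_total{language}
	worksCreated.Inc(`language="tt"`)
	worksCreated.Inc(`language="tt"`)
	worksCreated.Inc(`language="en"`)
	fmt.Println(worksCreated.Get(`language="tt"`))
}
```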
+ +**Current State**: +- `internal/observability/metrics.go` has basic HTTP and DB metrics +- Missing business metrics, GraphQL-specific metrics +- No Grafana dashboards or alerting rules + +**Affected Files**: +- `internal/observability/metrics.go` - Basic metrics +- GraphQL resolvers - Need resolver metrics +- Application services - Need business metrics +- Background jobs - Need job metrics + +**Solution**: +1. Add GraphQL-specific metrics: + - `graphql_resolver_duration_seconds{operation, resolver}` + - `graphql_errors_total{operation, error_type}` + - `graphql_operations_total{operation, status}` + +2. Add business metrics: + - `works_created_total{language}` + - `searches_performed_total{type}` + - `user_registrations_total` + - `translations_created_total{language}` + - `likes_total{entity_type}` + +3. Enhance existing metrics: + - Add more labels to HTTP metrics (status code as number) + - Add query type labels to DB metrics + - Add connection pool metrics + - Add cache hit/miss metrics + +4. Create observability package structure: + - Move metrics to `internal/observability/metrics/` + - Add metric collection helpers + - Document metric naming conventions + +**Acceptance Criteria**: +- All critical paths have metrics +- GraphQL operations fully instrumented +- Business metrics tracked +- Metrics exposed on `/metrics` endpoint +- Metric labels follow Prometheus best practices + +### 5. Implement Read Models (DTOs) for Efficient Queries (P1 - High Priority) +**Problem**: Application queries return full domain entities, which is inefficient and leaks domain logic to API layer. 
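One way the list-view projection could look. The `Work` fields below are guesses at the shape of `domain.Work`, not its actual definition; in the real query the projection would happen in SQL so the heavy fields are never loaded:

```go
package main

import "fmt"

// Work is a trimmed stand-in for the domain entity; the real one
// carries far more fields than a list view needs.
type Work struct {
	ID         uint
	Title      string
	Content    string // heavy field a list view should not ship
	AuthorName string
	Language   string
}

// WorkListDTO carries only what the list resolver renders.
type WorkListDTO struct {
	ID         uint
	Title      string
	AuthorName string
	Language   string
}

// toListDTO maps domain entities to the read model, deliberately
// dropping Content and other detail-only fields.
func toListDTO(ws []Work) []WorkListDTO {
	out := make([]WorkListDTO, 0, len(ws))
	for _, w := range ws {
		out = append(out, WorkListDTO{
			ID: w.ID, Title: w.Title,
			AuthorName: w.AuthorName, Language: w.Language,
		})
	}
	return out
}

func main() {
	dtos := toListDTO([]Work{{
		ID: 1, Title: "Шурале", Content: "...",
		AuthorName: "Габдулла Тукай", Language: "tt",
	}})
	fmt.Println(dtos[0].Title, dtos[0].AuthorName)
}
```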
+ +**Current State**: +- Queries in `internal/app/*/queries.go` return domain entities +- GraphQL resolvers receive full entities with all fields +- No optimization for list vs detail views + +**Affected Files**: +- `internal/app/work/queries.go` - Returns `domain.Work` +- `internal/app/translation/queries.go` - Returns `domain.Translation` +- `internal/app/author/queries.go` - Returns `domain.Author` +- GraphQL resolvers - Receive full entities + +**Solution**: +1. Create DTO packages: + - `internal/app/work/dto` - WorkListDTO, WorkDetailDTO + - `internal/app/translation/dto` - TranslationListDTO, TranslationDetailDTO + - `internal/app/author/dto` - AuthorListDTO, AuthorDetailDTO + +2. Define optimized DTOs: + ```go + // WorkListDTO - For list views (minimal fields) + type WorkListDTO struct { + ID uint + Title string + AuthorName string + AuthorID uint + Language string + CreatedAt time.Time + ViewCount int + LikeCount int + TranslationCount int + } + + // WorkDetailDTO - For single work view (all fields) + type WorkDetailDTO struct { + *WorkListDTO + Content string + Description string + Tags []string + Translations []TranslationSummaryDTO + Author AuthorSummaryDTO + } + ``` + +3. Refactor queries to return DTOs: + - Update query methods to use optimized SQL + - Use joins to avoid N+1 queries + - Map domain entities to DTOs + - Update GraphQL resolvers to use DTOs + +4. Add benchmarks comparing old vs new approach + +**Acceptance Criteria**: +- List queries return optimized DTOs +- Detail queries return full DTOs +- No N+1 query problems +- Payload size reduced by 30-50% +- Query response time improved by 20% +- No breaking changes to GraphQL schema + +### 6. Improve Structured Logging (P1 - High Priority) +**Problem**: Logging exists but lacks request context, user IDs, and trace correlation. 
+ +**Current State**: +- `internal/platform/log` uses zerolog +- Basic logging but missing context +- No request ID propagation +- No user ID in logs +- No trace/span ID correlation + +**Affected Files**: +- `internal/platform/log/logger.go` - Basic logger +- HTTP middleware - Needs request ID injection +- All application services - Need context logging + +**Solution**: +1. Enhance HTTP middleware: + - Generate request ID for each request + - Inject request ID into context + - Add user ID from JWT to context + - Add trace/span IDs to context + +2. Update logger to use context: + - Extract request ID, user ID, trace ID from context + - Add to all log entries automatically + - Create helper: `log.FromContext(ctx).WithRequestID().WithUserID()` + +3. Add structured logging fields: + - Define field name constants + - Ensure consistent field names across codebase + - Add sensitive data redaction + +4. Implement log sampling: + - Sample high-volume endpoints (e.g., health checks) + - Configurable sampling rates + - Always log errors regardless of sampling + +**Acceptance Criteria**: +- All logs include request ID +- Authenticated request logs include user ID +- All logs include trace/span IDs +- Consistent log format across codebase +- Sensitive data excluded from logs +- Log sampling for high-volume endpoints + +### 7. Refactor Caching with Decorator Pattern (P1 - High Priority) +**Problem**: Current caching implementation uses bespoke cached repositories. Should use decorator pattern for better maintainability. + +**Current State**: +- `internal/data/cache` has custom caching logic +- Cached repositories are separate implementations +- Not following decorator pattern + +**Affected Files**: +- `internal/data/cache/*` - Current caching implementation +- Repository interfaces - Need to support decorators + +**Solution**: +1. 
Implement decorator pattern: + - Create `CachedWorkRepository` decorator + - Create `CachedAuthorRepository` decorator + - Create `CachedTranslationRepository` decorator + - Decorators wrap base repositories + +2. Implement cache-aside pattern: + - Check cache on read, populate on miss + - Invalidate cache on write operations + - Add cache key versioning strategy + +3. Add cache configuration: + - TTL per entity type + - Cache size limits + - Cache warming strategies + +4. Add cache metrics: + - Hit/miss rates + - Cache size + - Eviction counts + +**Acceptance Criteria**: +- Decorator pattern implemented +- Cache hit rate > 70% for reads +- Automatic cache invalidation on updates +- Cache failures don't break application +- Metrics for cache performance + +### 8. Complete API Documentation (P1 - High Priority) +**Problem**: API documentation is incomplete. Need comprehensive GraphQL API documentation. + +**Current State**: +- GraphQL schema exists but lacks descriptions +- No example queries +- No API guide for consumers + +**Affected Files**: +- GraphQL schema files - Need descriptions +- `api/README.md` - Needs comprehensive guide +- All resolver implementations - Need documentation + +**Solution**: +1. Add descriptions to GraphQL schema: + - Document all types, queries, mutations + - Add field descriptions + - Document input validation rules + - Add deprecation notices where applicable + +2. Create comprehensive API documentation: + - `api/README.md` - Complete API guide + - `api/EXAMPLES.md` - Query examples + - Document authentication requirements + - Document rate limiting + - Document error responses + +3.
Enhance GraphQL Playground: + - Pre-populate with example queries + - Add query templates + - Document schema changes + +**Acceptance Criteria**: +- All 80+ GraphQL resolvers documented +- Example queries for each operation +- Input validation rules documented +- Error response examples +- Authentication requirements clear +- API changelog maintained + +### 9. Refactor Testing Utilities (P2 - Medium Priority) +**Problem**: Tests depend on live database connections, making them slow and unreliable. + +**Current State**: +- `internal/testutil/testutil.go` has database connection logic +- Integration tests require live database +- Tests are slow and may be flaky + +**Affected Files**: +- `internal/testutil/testutil.go` - Database connection logic +- All integration tests - Depend on live DB + +**Solution**: +1. Decouple tests from live database: + - Remove database connection from testutil + - Use test containers for integration tests + - Use mocks for unit tests + +2. Improve test utilities: + - Create test data builders + - Add fixtures for common scenarios + - Improve test isolation + +3. Add parallel test execution: + - Enable `-parallel` flag where safe + - Use test-specific database schemas + - Clean up test data properly + +**Acceptance Criteria**: +- Unit tests run without database +- Integration tests use test containers +- Tests run in parallel where possible +- Test execution time < 5 seconds for unit tests +- Clear separation between unit and integration tests + +### 10. Implement Analytics Features (P2 - Medium Priority) +**Problem**: Analytics service exists but some metrics are stubs (like, comment, bookmark counting). 
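A hedged sketch of the popular-translations calculation this task describes. The struct, function names, and especially the weights are invented placeholders; the real formula is a product decision, and the result would be cached and refreshed by the background job:

```go
package main

import (
	"fmt"
	"sort"
)

// EngagementCounts aggregates the per-translation metrics the
// counting services (likes, comments, bookmarks) would supply.
type EngagementCounts struct {
	TranslationID uint
	Likes         int
	Comments      int
	Bookmarks     int
}

// popularityScore combines the metrics; the weights here are
// placeholders, not a spec.
func popularityScore(c EngagementCounts) float64 {
	return float64(c.Likes) + 2*float64(c.Comments) + 3*float64(c.Bookmarks)
}

// popularTranslations returns translation IDs ordered by score,
// highest first.
func popularTranslations(counts []EngagementCounts) []uint {
	sort.Slice(counts, func(i, j int) bool {
		return popularityScore(counts[i]) > popularityScore(counts[j])
	})
	ids := make([]uint, len(counts))
	for i, c := range counts {
		ids[i] = c.TranslationID
	}
	return ids
}

func main() {
	ids := popularTranslations([]EngagementCounts{
		{TranslationID: 1, Likes: 10},              // score 10
		{TranslationID: 2, Likes: 2, Bookmarks: 5}, // score 17
	})
	fmt.Println(ids)
}
```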
+ +**Current State**: +- `internal/jobs/linguistics/work_analysis_service.go` has TODO comments: + - Line 184: ViewCount TODO + - Line 185: LikeCount TODO + - Line 186: CommentCount TODO + - Line 187: BookmarkCount TODO + - Line 188: TranslationCount TODO + - Line 192: PopularTranslations TODO + +**Affected Files**: +- `internal/jobs/linguistics/work_analysis_service.go` - Stub implementations +- `internal/app/analytics/*` - Analytics services + +**Solution**: +1. Implement counting services: + - Like counting service + - Comment counting service + - Bookmark counting service + - Translation counting service + - View counting service + +2. Implement popular translations calculation: + - Calculate based on likes, comments, bookmarks + - Cache results for performance + - Update periodically via background job + +3. Add analytics to work analysis: + - Integrate counting services + - Update WorkAnalytics struct + - Ensure data is accurate and up-to-date + +**Acceptance Criteria**: +- All analytics metrics implemented +- Popular translations calculated correctly +- Analytics updated in real-time or near-real-time +- Performance optimized (cached where appropriate) +- Tests for all analytics features + +## Implementation Guidelines + +1. **Architecture First**: Maintain clean architecture, DDD, and CQRS patterns +2. **Backward Compatibility**: Ensure API contracts remain consistent +3. **Code Quality**: + - Follow Go best practices and idioms + - Use interfaces for testability + - Maintain separation of concerns + - Add comprehensive error handling +4. **Testing**: Write tests for all new features and refactorings +5. **Documentation**: Add GoDoc comments for all public APIs +6. **Performance**: Optimize for production workloads +7. 
**Observability**: Instrument all critical paths + +## Expected Outcome + +- Production-ready search functionality +- Proper dependency injection (no globals) +- Full observability (tracing, metrics, logging) +- Optimized queries with DTOs +- Comprehensive API documentation +- Fast, reliable test suite +- Complete analytics features +- Improved code maintainability + +## Files to Prioritize + +1. `internal/app/search/service.go` - Core search implementation (P0) +2. `internal/platform/config/config.go` - Configuration refactoring (P1) +3. `internal/observability/*` - Observability enhancements (P0) +4. `internal/app/*/queries.go` - DTO implementation (P1) +5. `internal/platform/log/*` - Logging improvements (P1) +6. `api/README.md` - API documentation (P1) + +## Notes + +- Codebase uses Go 1.25 +- Follows DDD/CQRS/Clean Architecture patterns +- GraphQL API with gqlgen +- PostgreSQL with GORM +- Weaviate for vector search +- Redis for caching and job queue +- Docker for local development +- Existing tests should continue to pass +- Follow existing code style and patterns +