Tercul Backend - Production Readiness Tasks
Generated: November 27, 2025
Current Status: Most core features implemented; needs production hardening
⚠️ MIGRATED TO GITHUB ISSUES
All production readiness tasks have been migrated to GitHub Issues for better tracking. See issues #30-38 in the repository: https://github.com/SamyRai/backend/issues
This document is kept for reference only and should not be used for task tracking.
📊 Current Reality Check
✅ What's Actually Working
- ✅ Full GraphQL API with 90%+ resolvers implemented
- ✅ Complete CQRS pattern (Commands & Queries)
- ✅ Auth system (Register, Login, JWT, Password Reset, Email Verification)
- ✅ Work CRUD with authorization
- ✅ Translation management with analytics
- ✅ User management and profiles
- ✅ Collections, Comments, Likes, Bookmarks
- ✅ Contributions with review workflow
- ✅ Analytics service (views, likes, trending)
- ✅ Clean Architecture with DDD patterns
- ✅ Comprehensive test coverage (passing tests)
- ✅ CI/CD pipelines (build, test, lint, security, docker)
- ✅ Docker setup and containerization
- ✅ Database migrations and schema
⚠️ What Needs Work
- ⚠️ Search functionality (stub implementation) → Issue #30
- ⚠️ Observability (metrics, tracing) → Issues #31, #32, #33
- ⚠️ Production deployment automation → Issue #36
- ⚠️ Performance optimization → Issues #34, #35
- ⚠️ Security hardening → Issue #37
- ⚠️ Infrastructure as Code → Issue #38
🎯 EPIC 1: Search & Discovery (HIGH PRIORITY)
Story 1.1: Full-Text Search Implementation
Priority: P0 (Critical)
Estimate: 8 story points (2-3 days)
Labels: enhancement, search, backend
User Story:
As a user exploring literary works,
I want to search across works, translations, and authors by keywords,
So that I can quickly find relevant content in my preferred language.
Acceptance Criteria:
- Implement Weaviate-based full-text search for works
- Index work titles, content, and metadata
- Support multi-language search (Russian, English, Tatar)
- Search returns relevance-ranked results
- Support filtering by language, category, tags, authors
- Support date range filtering
- Search response time < 200ms for 95th percentile
- Handle special characters and diacritics correctly
Technical Tasks:
- Complete `internal/app/search/service.go` implementation
- Implement Weaviate schema for Works, Translations, Authors
- Create background indexing job for existing content
- Add incremental indexing on create/update operations
- Implement search query parsing and normalization (see the sketch after this list)
- Add search result pagination and sorting
- Create integration tests for search functionality
- Add search metrics and monitoring
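For the query parsing and normalization task, a minimal sketch of diacritic folding built on `golang.org/x/text` (the package choice and function shape are assumptions; the real service may delegate more of this to Weaviate's tokenizer):

```go
package search

import (
	"strings"
	"unicode"

	"golang.org/x/text/runes"
	"golang.org/x/text/transform"
	"golang.org/x/text/unicode/norm"
)

// NormalizeQuery trims, lowercases, and strips combining diacritical marks
// so that "café" and "cafe" hit the same index entries. Note that NFD-based
// stripping also folds Cyrillic ё→е and й→и, which is usually acceptable
// for Russian search but should be verified for Tatar.
func NormalizeQuery(q string) string {
	q = strings.ToLower(strings.TrimSpace(q))
	t := transform.Chain(norm.NFD, runes.Remove(runes.In(unicode.Mn)), norm.NFC)
	folded, _, err := transform.String(t, q)
	if err != nil {
		return q // fall back to the un-folded query
	}
	return folded
}
```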
Dependencies:
- Weaviate instance running (already in docker-compose)
- `internal/platform/search` client (exists)
- `internal/domain/search` interfaces (exists)
Definition of Done:
- All acceptance criteria met
- Unit tests passing (>80% coverage)
- Integration tests with real Weaviate instance
- Performance benchmarks documented
- Search analytics tracked
Story 1.2: Advanced Search Filters
Priority: P1 (High)
Estimate: 5 story points (1-2 days)
Labels: enhancement, search, backend
User Story:
As a researcher or literary enthusiast,
I want to filter search results by multiple criteria simultaneously,
So that I can narrow down to exactly the works I'm interested in.
Acceptance Criteria:
- Filter by literature type (poetry, prose, drama)
- Filter by time period (creation date ranges)
- Filter by multiple authors simultaneously
- Filter by genre/categories
- Filter by language availability
- Combine filters with AND/OR logic
- Save search filters as presets (future)
Technical Tasks:
- Extend `SearchFilters` domain model (see the sketch after this list)
- Implement filter translation to Weaviate queries
- Add faceted search capabilities
- Implement filter validation
- Add filter combination logic
- Create filter preset storage (optional)
- Add tests for all filter combinations
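A hedged sketch of what the extended `SearchFilters` model and its validation could look like; every field name here is illustrative, not the actual domain model:

```go
package search

import (
	"fmt"
	"time"
)

// SearchFilters combines the criteria above; zero values mean "no filter".
type SearchFilters struct {
	Types       []string // poetry, prose, drama
	AuthorIDs   []uint
	Categories  []string
	Languages   []string
	CreatedFrom *time.Time
	CreatedTo   *time.Time
	Conjunctive bool // true = AND all filter groups together, false = OR
}

// Validate rejects impossible combinations before they reach Weaviate.
func (f SearchFilters) Validate() error {
	if f.CreatedFrom != nil && f.CreatedTo != nil && f.CreatedTo.Before(*f.CreatedFrom) {
		return fmt.Errorf("invalid date range: from %s to %s", f.CreatedFrom, f.CreatedTo)
	}
	return nil
}
```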
🎯 EPIC 2: API Documentation (HIGH PRIORITY)
Story 2.1: Comprehensive GraphQL API Documentation
Priority: P1 (High)
Estimate: 5 story points (1-2 days)
Labels: documentation, api, devex
User Story:
As a frontend developer or API consumer,
I want complete documentation for all GraphQL queries and mutations,
So that I can integrate with the API without constantly asking questions.
Acceptance Criteria:
- Document all 80+ GraphQL resolvers
- Include example queries for each operation
- Document input types and validation rules
- Provide error response examples
- Document authentication requirements
- Include rate limiting information
- Add GraphQL Playground with example queries
- Auto-generate docs from schema annotations
Technical Tasks:
- Add descriptions to all GraphQL types in schema
- Document each query/mutation with examples
- Create `api/README.md` with comprehensive guide
- Add inline schema documentation
- Set up GraphQL Voyager for schema visualization
- Create API changelog
- Add versioning documentation
- Generate OpenAPI spec for REST endpoints (if any)
Deliverables:
- `api/README.md` - Complete API guide
- `api/EXAMPLES.md` - Query examples
- `api/CHANGELOG.md` - API version history
- Enhanced GraphQL schema with descriptions
- Interactive API explorer
Story 2.2: Developer Onboarding Documentation
Priority: P1 (High)
Estimate: 3 story points (1 day)
Labels: documentation, devex
User Story:
As a new developer joining the project,
I want clear setup instructions and architecture documentation,
So that I can become productive quickly without extensive hand-holding.
Acceptance Criteria:
- Updated `README.md` with quick start guide
- Architecture diagrams and explanations
- Development workflow documentation
- Testing strategy documentation
- Contribution guidelines
- Code style guide
- Troubleshooting common issues
Technical Tasks:
- Update root `README.md` with modern structure
- Create `docs/ARCHITECTURE.md` with diagrams
- Document CQRS and DDD patterns used
- Create `docs/DEVELOPMENT.md` workflow guide
- Document testing strategy in `docs/TESTING.md`
- Create `CONTRIBUTING.md` guide
- Add package-level `README.md` for complex packages
Deliverables:
- Refreshed `README.md`
- `docs/ARCHITECTURE.md`
- `docs/DEVELOPMENT.md`
- `docs/TESTING.md`
- `CONTRIBUTING.md`
🎯 EPIC 3: Observability & Monitoring (CRITICAL FOR PRODUCTION)
Story 3.1: Distributed Tracing with OpenTelemetry
Priority: P0 (Critical)
Estimate: 8 story points (2-3 days)
Labels: observability, monitoring, infrastructure
User Story:
As a DevOps engineer monitoring production,
I want distributed tracing across all services and database calls,
So that I can quickly identify performance bottlenecks and errors.
Acceptance Criteria:
- OpenTelemetry SDK integrated
- Automatic trace context propagation
- All HTTP handlers instrumented
- All database queries traced
- All GraphQL resolvers traced
- Custom spans for business logic
- Traces exported to OTLP collector
- Integration with Jaeger/Tempo
Technical Tasks:
- Add OpenTelemetry Go SDK dependencies
- Create `internal/observability/tracing` package
- Instrument HTTP middleware with auto-tracing
- Add database query tracing via GORM callbacks
- Instrument GraphQL execution
- Add custom spans for slow operations
- Set up trace sampling strategy
- Configure OTLP exporter
- Add Jaeger to docker-compose for local dev
- Document tracing best practices
Configuration:
```go
// Example trace configuration
type TracingConfig struct {
	Enabled      bool
	ServiceName  string
	SamplingRate float64
	OTLPEndpoint string
}
```
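A minimal initialization sketch for this configuration, assuming the standard OpenTelemetry Go SDK with the OTLP/gRPC exporter (exact wiring and package versions will depend on the codebase):

```go
package tracing

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
)

// Init builds a TracerProvider from TracingConfig and registers it globally.
// The returned shutdown func flushes pending spans on exit.
func Init(ctx context.Context, cfg TracingConfig) (func(context.Context) error, error) {
	if !cfg.Enabled {
		return func(context.Context) error { return nil }, nil
	}
	exp, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint(cfg.OTLPEndpoint),
		otlptracegrpc.WithInsecure(), // assumes a local or side-car collector
	)
	if err != nil {
		return nil, err
	}
	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exp),
		sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.TraceIDRatioBased(cfg.SamplingRate))),
		sdktrace.WithResource(resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceNameKey.String(cfg.ServiceName),
		)),
	)
	otel.SetTracerProvider(tp)
	return tp.Shutdown, nil
}
```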
Story 3.2: Prometheus Metrics & Alerting
Priority: P0 (Critical)
Estimate: 5 story points (1-2 days)
Labels: observability, monitoring, metrics
User Story:
As a site reliability engineer,
I want detailed metrics on API performance and system health,
So that I can detect issues before they impact users.
Acceptance Criteria:
- HTTP request metrics (latency, status codes, throughput)
- Database query metrics (query time, connection pool)
- Business metrics (works created, searches performed)
- System metrics (memory, CPU, goroutines)
- GraphQL-specific metrics (resolver performance)
- Metrics exposed on `/metrics` endpoint
- Prometheus scraping configured
- Grafana dashboards created
Technical Tasks:
- Enhance existing Prometheus middleware
- Add HTTP handler metrics (already partially done)
- Add database query duration histograms
- Create business metric counters
- Add GraphQL resolver metrics
- Create custom metrics for critical paths
- Set up metric labels strategy
- Create Grafana dashboard JSON
- Define SLOs and SLIs
- Create alerting rules YAML
Key Metrics:
```text
# HTTP Metrics
http_requests_total{method, path, status}
http_request_duration_seconds{method, path}

# Database Metrics
db_query_duration_seconds{query}
db_connections_current
db_connections_max

# Business Metrics
works_created_total{language}
searches_performed_total{type}
user_registrations_total

# GraphQL Metrics
graphql_resolver_duration_seconds{operation, resolver}
graphql_errors_total{operation, error_type}
```
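As a concrete anchor for the HTTP histogram above, a sketch with prometheus/client_golang (label set matches the table; buckets are a placeholder to be tuned against real traffic):

```go
package metrics

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var httpRequestDuration = promauto.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "http_request_duration_seconds",
		Help:    "HTTP request latency by method and route.",
		Buckets: prometheus.DefBuckets, // revisit once real latencies are known
	},
	[]string{"method", "path"},
)

// Instrument records latency for a handler. Pass a route template
// (e.g. /works/:id), never the raw URL, to keep label cardinality bounded.
func Instrument(path string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		next.ServeHTTP(w, r)
		httpRequestDuration.WithLabelValues(r.Method, path).Observe(time.Since(start).Seconds())
	})
}
```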
Story 3.3: Structured Logging Enhancements
Priority: P1 (High)
Estimate: 3 story points (1 day)
Labels: observability, logging
User Story:
As a developer debugging production issues,
I want rich, structured logs with request context,
So that I can quickly trace requests and identify root causes.
Acceptance Criteria:
- Request ID in all logs
- User ID in authenticated request logs
- Trace ID/Span ID in all logs
- Consistent log levels across codebase
- Sensitive data excluded from logs
- Structured fields for easy parsing
- Log sampling for high-volume endpoints
Technical Tasks:
- Enhance HTTP middleware to inject request ID
- Add user ID to context from JWT
- Add trace/span IDs to logger context
- Audit all logging statements for consistency
- Add field name constants for structured logging
- Implement log redaction for passwords/tokens
- Add log sampling configuration
- Create log aggregation guide (ELK/Loki)
Log Format Example:
```json
{
  "level": "info",
  "ts": "2025-11-27T10:30:45.123Z",
  "msg": "Work created successfully",
  "request_id": "req_abc123",
  "user_id": "user_456",
  "trace_id": "trace_xyz789",
  "span_id": "span_def321",
  "work_id": 789,
  "language": "en",
  "duration_ms": 45
}
```
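A sketch of the request-ID middleware behind that format, using the standard library's log/slog (the project's actual logger may differ; github.com/google/uuid is an assumed dependency):

```go
package middleware

import (
	"context"
	"log/slog"
	"net/http"

	"github.com/google/uuid"
)

type loggerKey struct{}

// WithRequestLogger attaches a request-scoped logger carrying request_id.
// Trace and span IDs would be added the same way once tracing is wired in.
func WithRequestLogger(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		reqID := r.Header.Get("X-Request-ID")
		if reqID == "" {
			reqID = uuid.NewString()
		}
		logger := slog.Default().With("request_id", reqID)
		ctx := context.WithValue(r.Context(), loggerKey{}, logger)
		w.Header().Set("X-Request-ID", reqID) // echo back for client-side correlation
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

// Logger returns the request-scoped logger, falling back to the default.
func Logger(ctx context.Context) *slog.Logger {
	if l, ok := ctx.Value(loggerKey{}).(*slog.Logger); ok {
		return l
	}
	return slog.Default()
}
```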
🎯 EPIC 4: Performance Optimization (MEDIUM PRIORITY)
Story 4.1: Read Models (DTOs) for Efficient Queries
Priority: P1 (High)
Estimate: 8 story points (2-3 days)
Labels: performance, architecture, refactoring
User Story:
As an API consumer,
I want fast query responses with only the data I need,
So that my application loads quickly and uses less bandwidth.
Acceptance Criteria:
- Create DTOs for all list queries
- DTOs include only fields needed by API
- Avoid N+1 queries with proper joins
- Reduce payload size by 30-50%
- Query response time improved by 20%
- No breaking changes to GraphQL schema
Technical Tasks:
- Create `internal/app/work/dto` package
- Define WorkListDTO, WorkDetailDTO
- Create TranslationListDTO, TranslationDetailDTO
- Define AuthorListDTO, AuthorDetailDTO
- Implement optimized SQL queries for DTOs
- Update query services to return DTOs
- Update GraphQL resolvers to map DTOs
- Add benchmarks comparing old vs new
- Update tests to use DTOs
- Document DTO usage patterns
Example DTO:
```go
// WorkListDTO - Optimized for list views
type WorkListDTO struct {
	ID               uint
	Title            string
	AuthorName       string
	AuthorID         uint
	Language         string
	CreatedAt        time.Time
	ViewCount        int
	LikeCount        int
	TranslationCount int
}

// WorkDetailDTO - Full information for single work
type WorkDetailDTO struct {
	*WorkListDTO
	Content      string
	Description  string
	Tags         []string
	Categories   []string
	Translations []TranslationSummaryDTO
	Author       AuthorSummaryDTO
	Analytics    WorkAnalyticsDTO
}
```
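One way the list query behind WorkListDTO might be written with GORM, projecting straight into the DTO instead of hydrating full aggregates (table and column names are assumptions about the schema; the counter columns are omitted for brevity):

```go
import (
	"context"

	"gorm.io/gorm"
)

// ListWorks fills WorkListDTO rows with a single joined query.
func ListWorks(ctx context.Context, db *gorm.DB, limit, offset int) ([]WorkListDTO, error) {
	var dtos []WorkListDTO
	err := db.WithContext(ctx).
		Table("works").
		Select(`works.id, works.title, works.language, works.created_at,
		        authors.id AS author_id, authors.name AS author_name`).
		Joins("JOIN authors ON authors.id = works.author_id").
		Limit(limit).Offset(offset).
		Scan(&dtos).Error
	return dtos, err
}
```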
Story 4.2: Redis Caching Strategy
Priority: P1 (High)
Estimate: 5 story points (1-2 days)
Labels: performance, caching, infrastructure
User Story:
As a user browsing popular works,
I want instant page loads for frequently accessed content,
So that I have a smooth, responsive experience.
Acceptance Criteria:
- Cache hot works (top 100 viewed)
- Cache author profiles
- Cache search results (5 min TTL)
- Cache translations by work ID
- Automatic cache invalidation on updates
- Cache hit rate > 70% for reads
- Cache warming for popular content
- Redis failover doesn't break app
Technical Tasks:
- Refactor `internal/data/cache` with decorator pattern
- Create `CachedWorkRepository` decorator
- Implement cache-aside pattern
- Add cache key versioning strategy
- Implement selective cache invalidation
- Add cache metrics (hit/miss rates)
- Create cache warming job
- Handle cache failures gracefully
- Document caching strategy
- Add cache configuration
Cache Key Strategy:
```text
work:{version}:{id}
author:{version}:{id}
translation:{version}:{work_id}:{lang}
search:{version}:{query_hash}
trending:{period}
```
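A compressed sketch of the CachedWorkRepository decorator with go-redis v9, following the cache-aside pattern and the key scheme above (the repository interface and Work entity below are stand-ins for the real domain types):

```go
package cache

import (
	"context"
	"encoding/json"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

// Work stands in for the real domain entity defined elsewhere.
type Work struct {
	ID    uint
	Title string
}

// WorkRepository mirrors the domain port this decorator wraps.
type WorkRepository interface {
	GetByID(ctx context.Context, id uint) (*Work, error)
}

type CachedWorkRepository struct {
	inner WorkRepository
	rdb   *redis.Client
	ttl   time.Duration
}

func (r *CachedWorkRepository) GetByID(ctx context.Context, id uint) (*Work, error) {
	key := fmt.Sprintf("work:v1:%d", id)
	if data, err := r.rdb.Get(ctx, key).Bytes(); err == nil {
		var w Work
		if json.Unmarshal(data, &w) == nil {
			return &w, nil // cache hit
		}
	}
	// Cache miss or Redis failure: always fall through to the source of truth.
	w, err := r.inner.GetByID(ctx, id)
	if err != nil {
		return nil, err
	}
	if data, err := json.Marshal(w); err == nil {
		r.rdb.Set(ctx, key, data, r.ttl) // best-effort write-back
	}
	return w, nil
}
```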
Story 4.3: Database Query Optimization
Priority: P2 (Medium)
Estimate: 5 story points (1-2 days)
Labels: performance, database
User Story:
As a user with slow internet,
I want database operations to complete quickly,
So that I don't experience frustrating delays.
Acceptance Criteria:
- All queries use proper indexes
- No N+1 query problems
- Eager loading for related entities
- Query time < 50ms for 95th percentile
- Connection pool properly sized
- Slow query logging enabled
- Query explain plans documented
Technical Tasks:
- Audit all repository queries
- Add missing database indexes
- Implement eager loading with GORM Preload (see the sketch after this list)
- Fix N+1 queries in GraphQL resolvers
- Optimize joins and subqueries
- Add query timeouts
- Configure connection pool settings
- Enable PostgreSQL slow query log
- Create query performance dashboard
- Document query optimization patterns
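For the eager-loading task referenced above, the GORM Preload pattern that removes the classic N+1 (association names are assumptions about the models):

```go
import (
	"context"

	"gorm.io/gorm"
)

// recentWorks loads works plus their authors and translations in three
// queries total, instead of 1 + 2N separate lookups.
func recentWorks(ctx context.Context, db *gorm.DB) ([]Work, error) {
	var works []Work
	err := db.WithContext(ctx).
		Preload("Author").       // one query for all authors
		Preload("Translations"). // one query for all translations
		Order("created_at DESC").
		Limit(20).
		Find(&works).Error
	return works, err
}
```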
🎯 EPIC 5: Deployment & DevOps (CRITICAL FOR PRODUCTION)
Story 5.1: Production Deployment Automation
Priority: P0 (Critical)
Estimate: 8 story points (2-3 days)
Labels: devops, deployment, infrastructure
User Story:
As a DevOps engineer,
I want automated, zero-downtime deployments to production,
So that we can ship features safely and frequently.
Acceptance Criteria:
- Automated deployment on tag push
- Blue-green or rolling deployment strategy
- Health checks before traffic routing
- Automatic rollback on failures
- Database migrations run automatically
- Smoke tests after deployment
- Deployment notifications (Slack/Discord)
- Deployment dashboard
Technical Tasks:
- Complete `.github/workflows/deploy.yml` implementation
- Set up staging environment
- Implement blue-green deployment strategy
- Add health check endpoints (`/health`, `/ready`)
- Create database migration runner
- Add pre-deployment smoke tests
- Configure load balancer for zero-downtime
- Set up deployment notifications
- Create rollback procedures
- Document deployment process
Health Check Endpoints:
```text
GET /health  -> {"status": "ok", "version": "1.2.3"}
GET /ready   -> {"ready": true, "db": "ok", "redis": "ok"}
GET /metrics -> Prometheus metrics
```
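A sketch of the /ready handler behind that contract, checking both dependencies with a short timeout (the dependency handles follow the JSON shape above):

```go
package health

import (
	"context"
	"database/sql"
	"encoding/json"
	"net/http"
	"time"

	"github.com/redis/go-redis/v9"
)

// Ready reports readiness only when both PostgreSQL and Redis respond.
func Ready(db *sql.DB, rdb *redis.Client) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
		defer cancel()

		status := map[string]any{"ready": true, "db": "ok", "redis": "ok"}
		if err := db.PingContext(ctx); err != nil {
			status["ready"], status["db"] = false, err.Error()
		}
		if err := rdb.Ping(ctx).Err(); err != nil {
			status["ready"], status["redis"] = false, err.Error()
		}
		w.Header().Set("Content-Type", "application/json")
		if status["ready"] == false {
			w.WriteHeader(http.StatusServiceUnavailable)
		}
		json.NewEncoder(w).Encode(status)
	}
}
```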
Story 5.2: Infrastructure as Code (Kubernetes)
Priority: P1 (High)
Estimate: 8 story points (2-3 days)
Labels: devops, infrastructure, k8s
User Story:
As a platform engineer,
I want all infrastructure defined as code,
So that environments are reproducible and version-controlled.
Acceptance Criteria:
- Kubernetes manifests for all services
- Helm charts for easy deployment
- ConfigMaps for configuration
- Secrets management with sealed secrets
- Horizontal Pod Autoscaling configured
- Ingress with TLS termination
- Persistent volumes for PostgreSQL/Redis
- Network policies for security
Technical Tasks:
- Enhance `deploy/k8s` manifests
- Create Deployment YAML for backend
- Create Service and Ingress YAMLs
- Create ConfigMap for app configuration
- Set up Sealed Secrets for sensitive data
- Create HorizontalPodAutoscaler
- Add resource limits and requests
- Create StatefulSets for databases
- Set up persistent volume claims
- Create Helm chart structure
- Document Kubernetes deployment
File Structure:
```text
deploy/k8s/
├── base/
│   ├── deployment.yaml
│   ├── service.yaml
│   ├── ingress.yaml
│   ├── configmap.yaml
│   └── hpa.yaml
├── overlays/
│   ├── staging/
│   └── production/
└── helm/
    └── tercul-backend/
        ├── Chart.yaml
        ├── values.yaml
        └── templates/
```
Story 5.3: Disaster Recovery & Backups
Priority: P1 (High)
Estimate: 5 story points (1-2 days)
Labels: devops, backup, disaster-recovery
User Story:
As a business owner,
I want automated backups and disaster recovery procedures,
So that we never lose user data or have extended outages.
Acceptance Criteria:
- Daily PostgreSQL backups
- Point-in-time recovery capability
- Backup retention policy (30 days)
- Backup restoration tested monthly
- Backup encryption at rest
- Off-site backup storage
- Disaster recovery runbook
- RTO < 1 hour, RPO < 15 minutes
Technical Tasks:
- Set up automated database backups
- Configure WAL archiving for PostgreSQL (see the snippet after this list)
- Implement backup retention policy
- Store backups in S3/GCS with encryption
- Create backup restoration script
- Test restoration procedure
- Create disaster recovery runbook
- Set up backup monitoring and alerts
- Document backup procedures
- Schedule regular DR drills
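For the WAL archiving task above, the relevant postgresql.conf settings look roughly like this; the archive destination is an assumption, and in practice a tool such as pgBackRest or WAL-G would own the archive_command:

```text
# postgresql.conf - continuous archiving (sketch)
wal_level = replica
archive_mode = on
archive_timeout = 300   # force a segment switch every 5 min, bounding RPO
archive_command = 'test ! -f /backups/wal/%f && cp %p /backups/wal/%f'
```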
🎯 EPIC 6: Security Hardening (HIGH PRIORITY)
Story 6.1: Security Audit & Vulnerability Scanning
Priority: P0 (Critical)
Estimate: 5 story points (1-2 days)
Labels: security, compliance
User Story:
As a security officer,
I want continuous vulnerability scanning and security best practices,
So that user data and the platform remain secure.
Acceptance Criteria:
- Dependency scanning with Dependabot (already active)
- SAST scanning with CodeQL
- Container scanning with Trivy
- No high/critical vulnerabilities
- Security headers configured
- Rate limiting on all endpoints
- Input validation on all mutations
- SQL injection prevention verified
Technical Tasks:
- Review existing security workflows (already good!)
- Add rate limiting middleware
- Implement input validation with go-playground/validator
- Add security headers middleware
- Audit SQL queries for injection risks
- Review JWT implementation for best practices
- Add CSRF protection for mutations
- Implement request signing for sensitive operations
- Create security incident response plan
- Document security practices
Security Headers:
```text
X-Frame-Options: DENY
X-Content-Type-Options: nosniff
X-XSS-Protection: 1; mode=block
Strict-Transport-Security: max-age=31536000
Content-Security-Policy: default-src 'self'
```
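All five headers fit in one small middleware; a standard-library sketch (the CSP value should be tightened to the real frontend's needs):

```go
package middleware

import "net/http"

// SecurityHeaders applies the baseline headers listed above to every response.
func SecurityHeaders(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		h := w.Header()
		h.Set("X-Frame-Options", "DENY")
		h.Set("X-Content-Type-Options", "nosniff")
		h.Set("X-XSS-Protection", "1; mode=block")
		h.Set("Strict-Transport-Security", "max-age=31536000")
		h.Set("Content-Security-Policy", "default-src 'self'")
		next.ServeHTTP(w, r)
	})
}
```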
Story 6.2: API Rate Limiting & Throttling
Priority: P1 (High)
Estimate: 3 story points (1 day)
Labels: security, performance, api
User Story:
As a platform operator,
I want rate limiting to prevent abuse and ensure fair usage,
So that all users have a good experience and our infrastructure isn't overwhelmed.
Acceptance Criteria:
- Rate limiting per user (authenticated)
- Rate limiting per IP (anonymous)
- Different limits for different operations
- 429 status code with retry-after header
- Rate limit info in response headers
- Configurable rate limits
- Redis-based distributed rate limiting
- Rate limit metrics and monitoring
Technical Tasks:
- Implement rate limiting middleware
- Use Redis for distributed rate limiting
- Configure different limits for read/write
- Add rate limit headers to responses
- Create rate limit exceeded error handling
- Add rate limit bypass for admins
- Monitor rate limit usage
- Document rate limits in API docs
- Add tests for rate limiting
- Create rate limit dashboard
Rate Limits:
```text
Authenticated Users:
- 1000 requests/hour (general)
- 100 writes/hour (mutations)
- 10 searches/minute

Anonymous Users:
- 100 requests/hour
- 10 writes/hour
- 5 searches/minute
```
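A minimal distributed limiter sketch using a fixed window in Redis (go-redis v9). A production limiter would more likely use a sliding window or a library such as go-redis/redis_rate, so treat this as illustrative:

```go
package ratelimit

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

// Allow increments a per-subject counter (user ID or IP) and permits the
// request while the count stays within limit for the current window.
func Allow(ctx context.Context, rdb *redis.Client, subject string, limit int64, window time.Duration) (bool, error) {
	// Bucket key rotates once per window (fixed-window strategy).
	key := fmt.Sprintf("rl:%s:%d", subject, time.Now().Unix()/int64(window.Seconds()))
	n, err := rdb.Incr(ctx, key).Result()
	if err != nil {
		return true, err // fail open so a Redis outage doesn't take the API down
	}
	if n == 1 {
		rdb.Expire(ctx, key, window) // first hit in the window sets the TTL
	}
	return n <= limit, nil
}
```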
🎯 EPIC 7: Developer Experience (MEDIUM PRIORITY)
Story 7.1: Local Development Environment Improvements
Priority: P2 (Medium)
Estimate: 3 story points (1 day)
Labels: devex, tooling
User Story:
As a developer,
I want a fast, reliable local development environment,
So that I can iterate quickly without friction.
Acceptance Criteria:
- One-command setup (`make setup`)
- Hot reload for Go code changes
- Database seeding with realistic data
- GraphQL Playground pre-configured
- All services start reliably
- Clear error messages when setup fails
- Development docs up-to-date
Technical Tasks:
- Create comprehensive `make setup` target
- Add air for hot reload in docker-compose
- Create database seeding script
- Add sample data fixtures
- Pre-configure GraphQL Playground
- Add health check script
- Improve error messages in Makefile
- Document common setup issues
- Create troubleshooting guide
- Add setup validation script
Story 7.2: Testing Infrastructure Improvements
Priority: P2 (Medium)
Estimate: 5 story points (1-2 days)
Labels: testing, devex
User Story:
As a developer writing tests,
I want fast, reliable test execution without external dependencies,
So that I can practice TDD effectively.
Acceptance Criteria:
- Unit tests run in <5 seconds
- Integration tests isolated with test containers
- Parallel test execution
- Test coverage reports
- Fixtures for common test scenarios
- Clear test failure messages
- Easy to run single test or package
Technical Tasks:
- Refactor `internal/testutil` for better isolation
- Implement test containers for integration tests
- Add parallel test execution (see the sketch after this list)
- Create reusable test fixtures
- Set up coverage reporting
- Add golden file testing utilities
- Create test data builders
- Improve test naming conventions
- Document testing best practices
- Add `make test-fast` and `make test-all` targets
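For the parallel-execution convention, a table-driven skeleton; the function under test, slugify, is a hypothetical placeholder:

```go
package testutil

import "testing"

// slugify is a stand-in for whatever pure function is under test.
func slugify(s string) string { return s }

func TestSlugify(t *testing.T) {
	t.Parallel() // lets this test run alongside other parallel tests

	cases := []struct {
		name, in, want string
	}{
		{"identity", "war-and-peace", "war-and-peace"},
		{"unicode", "шүрәле", "шүрәле"},
	}
	for _, tc := range cases {
		tc := tc // capture the range variable for the parallel closure
		t.Run(tc.name, func(t *testing.T) {
			t.Parallel()
			if got := slugify(tc.in); got != tc.want {
				t.Fatalf("slugify(%q) = %q, want %q", tc.in, got, tc.want)
			}
		})
	}
}
```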
📋 Task Summary & Prioritization
Sprint 1 (Week 1): Critical Production Readiness
- Search Implementation (Story 1.1) - 8 pts
- Distributed Tracing (Story 3.1) - 8 pts
- Prometheus Metrics (Story 3.2) - 5 pts
- Total: 21 points
Sprint 2 (Week 2): Performance & Documentation
- API Documentation (Story 2.1) - 5 pts
- Read Models/DTOs (Story 4.1) - 8 pts
- Redis Caching (Story 4.2) - 5 pts
- Structured Logging (Story 3.3) - 3 pts
- Total: 21 points
Sprint 3 (Week 3): Deployment & Security
- Production Deployment (Story 5.1) - 8 pts
- Security Audit (Story 6.1) - 5 pts
- Rate Limiting (Story 6.2) - 3 pts
- Developer Docs (Story 2.2) - 3 pts
- Total: 19 points
Sprint 4 (Week 4): Infrastructure & Polish
- Kubernetes IaC (Story 5.2) - 8 pts
- Disaster Recovery (Story 5.3) - 5 pts
- Advanced Search Filters (Story 1.2) - 5 pts
- Total: 18 points
Sprint 5 (Week 5): Optimization & DevEx
- Database Optimization (Story 4.3) - 5 pts
- Local Dev Environment (Story 7.1) - 3 pts
- Testing Infrastructure (Story 7.2) - 5 pts
- Total: 13 points
🎯 Success Metrics
Performance SLOs
- API response time p95 < 200ms
- Search response time p95 < 300ms
- Database query time p95 < 50ms
- Cache hit rate > 70%
Reliability SLOs
- Uptime > 99.9% (< 8.7 hours downtime/year)
- Error rate < 0.1%
- Mean Time To Recovery < 1 hour
- Zero data loss
Developer Experience
- Setup time < 15 minutes
- Test suite runs < 2 minutes
- Build time < 1 minute
- Documentation completeness > 90%
Next Steps:
- Review and prioritize these tasks with the team
- Create GitHub issues for Sprint 1 tasks
- Add tasks to project board
- Begin implementation starting with search and observability
This is a realistic, achievable roadmap based on the ACTUAL current state of the codebase! 🚀