Tercul Backend - Production Readiness Tasks
Generated: November 27, 2025
Current Status: Most core features implemented; needs production hardening
⚠️ MIGRATED TO GITHUB ISSUES
All production readiness tasks have been migrated to GitHub Issues for better tracking. See issues #30-38 in the repository: https://github.com/SamyRai/backend/issues
This document is kept for reference only and should not be used for task tracking.
📊 Current Reality Check
✅ What's Actually Working
- ✅ Full GraphQL API with 90%+ resolvers implemented
- ✅ Complete CQRS pattern (Commands & Queries)
- ✅ Auth system (Register, Login, JWT, Password Reset, Email Verification)
- ✅ Work CRUD with authorization
- ✅ Translation management with analytics
- ✅ User management and profiles
- ✅ Collections, Comments, Likes, Bookmarks
- ✅ Contributions with review workflow
- ✅ Analytics service (views, likes, trending)
- ✅ Clean Architecture with DDD patterns
- ✅ Comprehensive test coverage (passing tests)
- ✅ CI/CD pipelines (build, test, lint, security, docker)
- ✅ Docker setup and containerization
- ✅ Database migrations and schema
⚠️ What Needs Work
- ⚠️ Search functionality (stub implementation) → Issue #30
- ⚠️ Observability (metrics, tracing) → Issues #31, #32, #33
- ⚠️ Production deployment automation → Issue #36
- ⚠️ Performance optimization → Issues #34, #35
- ⚠️ Security hardening → Issue #37
- ⚠️ Infrastructure as Code → Issue #38
🎯 EPIC 1: Search & Discovery (HIGH PRIORITY)
Story 1.1: Full-Text Search Implementation
Priority: P0 (Critical)
Estimate: 8 story points (2-3 days)
Labels: enhancement, search, backend
User Story:
As a user exploring literary works,
I want to search across works, translations, and authors by keywords,
So that I can quickly find relevant content in my preferred language.
Acceptance Criteria:
- Implement Weaviate-based full-text search for works
- Index work titles, content, and metadata
- Support multi-language search (Russian, English, Tatar)
- Search returns relevance-ranked results
- Support filtering by language, category, tags, authors
- Support date range filtering
- Search response time < 200ms for 95th percentile
- Handle special characters and diacritics correctly
Technical Tasks:
- Complete `internal/app/search/service.go` implementation
- Implement Weaviate schema for Works, Translations, Authors
- Create background indexing job for existing content
- Add incremental indexing on create/update operations
- Implement search query parsing and normalization (see the sketch after this list)
- Add search result pagination and sorting
- Create integration tests for search functionality
- Add search metrics and monitoring
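For the query parsing and normalization task, a minimal sketch of diacritic folding built on `golang.org/x/text` (the package choice and function shape are assumptions; the real service may delegate more of this to Weaviate's tokenizer):

```go
package search

import (
	"strings"
	"unicode"

	"golang.org/x/text/runes"
	"golang.org/x/text/transform"
	"golang.org/x/text/unicode/norm"
)

// NormalizeQuery trims, lowercases, and strips combining diacritical marks
// so that "café" and "cafe" hit the same index entries. Note that NFD-based
// stripping also folds Cyrillic ё→е and й→и, which is usually acceptable
// for Russian search but should be verified for Tatar.
func NormalizeQuery(q string) string {
	q = strings.ToLower(strings.TrimSpace(q))
	t := transform.Chain(norm.NFD, runes.Remove(runes.In(unicode.Mn)), norm.NFC)
	folded, _, err := transform.String(t, q)
	if err != nil {
		return q // fall back to the un-folded query
	}
	return folded
}
```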
Dependencies:
- Weaviate instance running (already in docker-compose)
- `internal/platform/search` client (exists)
- `internal/domain/search` interfaces (exists)
Definition of Done:
- All acceptance criteria met
- Unit tests passing (>80% coverage)
- Integration tests with real Weaviate instance
- Performance benchmarks documented
- Search analytics tracked
Story 1.2: Advanced Search Filters
Priority: P1 (High)
Estimate: 5 story points (1-2 days)
Labels: enhancement, search, backend
User Story:
As a researcher or literary enthusiast,
I want to filter search results by multiple criteria simultaneously,
So that I can narrow down to exactly the works I'm interested in.
Acceptance Criteria:
- Filter by literature type (poetry, prose, drama)
- Filter by time period (creation date ranges)
- Filter by multiple authors simultaneously
- Filter by genre/categories
- Filter by language availability
- Combine filters with AND/OR logic
- Save search filters as presets (future)
Technical Tasks:
- Extend `SearchFilters` domain model (see the sketch after this list)
- Implement filter translation to Weaviate queries
- Add faceted search capabilities
- Implement filter validation
- Add filter combination logic
- Create filter preset storage (optional)
- Add tests for all filter combinations
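A hedged sketch of what the extended `SearchFilters` model and its validation could look like; every field name here is illustrative, not the actual domain model:

```go
package search

import (
	"fmt"
	"time"
)

// SearchFilters combines the criteria above; zero values mean "no filter".
type SearchFilters struct {
	Types       []string // poetry, prose, drama
	AuthorIDs   []uint
	Categories  []string
	Languages   []string
	CreatedFrom *time.Time
	CreatedTo   *time.Time
	Conjunctive bool // true = AND all filter groups together, false = OR
}

// Validate rejects impossible combinations before they reach Weaviate.
func (f SearchFilters) Validate() error {
	if f.CreatedFrom != nil && f.CreatedTo != nil && f.CreatedTo.Before(*f.CreatedFrom) {
		return fmt.Errorf("invalid date range: from %s to %s", f.CreatedFrom, f.CreatedTo)
	}
	return nil
}
```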
🎯 EPIC 2: API Documentation (HIGH PRIORITY)
Story 2.1: Comprehensive GraphQL API Documentation
Priority: P1 (High)
Estimate: 5 story points (1-2 days)
Labels: documentation, api, devex
User Story:
As a frontend developer or API consumer,
I want complete documentation for all GraphQL queries and mutations,
So that I can integrate with the API without constantly asking questions.
Acceptance Criteria:
- Document all 80+ GraphQL resolvers
- Include example queries for each operation
- Document input types and validation rules
- Provide error response examples
- Document authentication requirements
- Include rate limiting information
- Add GraphQL Playground with example queries
- Auto-generate docs from schema annotations
Technical Tasks:
- Add descriptions to all GraphQL types in schema
- Document each query/mutation with examples
- Create `api/README.md` with comprehensive guide
- Add inline schema documentation
- Set up GraphQL Voyager for schema visualization
- Create API changelog
- Add versioning documentation
- Generate OpenAPI spec for REST endpoints (if any)
Deliverables:
- `api/README.md` - Complete API guide
- `api/EXAMPLES.md` - Query examples
- `api/CHANGELOG.md` - API version history
- Enhanced GraphQL schema with descriptions
- Interactive API explorer
Story 2.2: Developer Onboarding Documentation
Priority: P1 (High)
Estimate: 3 story points (1 day)
Labels: documentation, devex
User Story:
As a new developer joining the project,
I want clear setup instructions and architecture documentation,
So that I can become productive quickly without extensive hand-holding.
Acceptance Criteria:
- Updated `README.md` with quick start guide
- Architecture diagrams and explanations
- Development workflow documentation
- Testing strategy documentation
- Contribution guidelines
- Code style guide
- Troubleshooting common issues
Technical Tasks:
- Update root `README.md` with modern structure
- Create `docs/ARCHITECTURE.md` with diagrams
- Document CQRS and DDD patterns used
- Create `docs/DEVELOPMENT.md` workflow guide
- Document testing strategy in `docs/TESTING.md`
- Create `CONTRIBUTING.md` guide
- Add package-level `README.md` for complex packages
Deliverables:
- Refreshed `README.md`
- `docs/ARCHITECTURE.md`
- `docs/DEVELOPMENT.md`
- `docs/TESTING.md`
- `CONTRIBUTING.md`
🎯 EPIC 3: Observability & Monitoring (CRITICAL FOR PRODUCTION)
Story 3.1: Distributed Tracing with OpenTelemetry
Priority: P0 (Critical)
Estimate: 8 story points (2-3 days)
Labels: observability, monitoring, infrastructure
User Story:
As a DevOps engineer monitoring production,
I want distributed tracing across all services and database calls,
So that I can quickly identify performance bottlenecks and errors.
Acceptance Criteria:
- OpenTelemetry SDK integrated
- Automatic trace context propagation
- All HTTP handlers instrumented
- All database queries traced
- All GraphQL resolvers traced
- Custom spans for business logic
- Traces exported to OTLP collector
- Integration with Jaeger/Tempo
Technical Tasks:
- Add OpenTelemetry Go SDK dependencies
- Create `internal/observability/tracing` package
- Instrument HTTP middleware with auto-tracing
- Add database query tracing via GORM callbacks
- Instrument GraphQL execution
- Add custom spans for slow operations
- Set up trace sampling strategy
- Configure OTLP exporter
- Add Jaeger to docker-compose for local dev
- Document tracing best practices
Configuration:
```go
// Example trace configuration
type TracingConfig struct {
	Enabled      bool
	ServiceName  string
	SamplingRate float64
	OTLPEndpoint string
}
```
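A minimal initialization sketch for this configuration, assuming the standard OpenTelemetry Go SDK with the OTLP/gRPC exporter (exact wiring and package versions will depend on the codebase):

```go
package tracing

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
)

// Init builds a TracerProvider from TracingConfig and registers it globally.
// The returned shutdown func flushes pending spans on exit.
func Init(ctx context.Context, cfg TracingConfig) (func(context.Context) error, error) {
	if !cfg.Enabled {
		return func(context.Context) error { return nil }, nil
	}
	exp, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint(cfg.OTLPEndpoint),
		otlptracegrpc.WithInsecure(), // assumes a local or side-car collector
	)
	if err != nil {
		return nil, err
	}
	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exp),
		sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.TraceIDRatioBased(cfg.SamplingRate))),
		sdktrace.WithResource(resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceNameKey.String(cfg.ServiceName),
		)),
	)
	otel.SetTracerProvider(tp)
	return tp.Shutdown, nil
}
```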
Story 3.2: Prometheus Metrics & Alerting
Priority: P0 (Critical)
Estimate: 5 story points (1-2 days)
Labels: observability, monitoring, metrics
User Story:
As a site reliability engineer,
I want detailed metrics on API performance and system health,
So that I can detect issues before they impact users.
Acceptance Criteria:
- HTTP request metrics (latency, status codes, throughput)
- Database query metrics (query time, connection pool)
- Business metrics (works created, searches performed)
- System metrics (memory, CPU, goroutines)
- GraphQL-specific metrics (resolver performance)
- Metrics exposed on `/metrics` endpoint
- Prometheus scraping configured
- Grafana dashboards created
Technical Tasks:
- Enhance existing Prometheus middleware
- Add HTTP handler metrics (already partially done)
- Add database query duration histograms
- Create business metric counters
- Add GraphQL resolver metrics
- Create custom metrics for critical paths
- Set up metric labels strategy
- Create Grafana dashboard JSON
- Define SLOs and SLIs
- Create alerting rules YAML
Key Metrics:
```text
# HTTP Metrics
http_requests_total{method, path, status}
http_request_duration_seconds{method, path}

# Database Metrics
db_query_duration_seconds{query}
db_connections_current
db_connections_max

# Business Metrics
works_created_total{language}
searches_performed_total{type}
user_registrations_total

# GraphQL Metrics
graphql_resolver_duration_seconds{operation, resolver}
graphql_errors_total{operation, error_type}
```
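As a concrete anchor for the HTTP histogram above, a sketch with prometheus/client_golang (label set matches the table; buckets are a placeholder to be tuned against real traffic):

```go
package metrics

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var httpRequestDuration = promauto.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "http_request_duration_seconds",
		Help:    "HTTP request latency by method and route.",
		Buckets: prometheus.DefBuckets, // revisit once real latencies are known
	},
	[]string{"method", "path"},
)

// Instrument records latency for a handler. Pass a route template
// (e.g. /works/:id), never the raw URL, to keep label cardinality bounded.
func Instrument(path string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		next.ServeHTTP(w, r)
		httpRequestDuration.WithLabelValues(r.Method, path).Observe(time.Since(start).Seconds())
	})
}
```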
Story 3.3: Structured Logging Enhancements
Priority: P1 (High)
Estimate: 3 story points (1 day)
Labels: observability, logging
User Story:
As a developer debugging production issues,
I want rich, structured logs with request context,
So that I can quickly trace requests and identify root causes.
Acceptance Criteria:
- Request ID in all logs
- User ID in authenticated request logs
- Trace ID/Span ID in all logs
- Consistent log levels across codebase
- Sensitive data excluded from logs
- Structured fields for easy parsing
- Log sampling for high-volume endpoints
Technical Tasks:
- Enhance HTTP middleware to inject request ID
- Add user ID to context from JWT
- Add trace/span IDs to logger context
- Audit all logging statements for consistency
- Add field name constants for structured logging
- Implement log redaction for passwords/tokens
- Add log sampling configuration
- Create log aggregation guide (ELK/Loki)
Log Format Example:
```json
{
  "level": "info",
  "ts": "2025-11-27T10:30:45.123Z",
  "msg": "Work created successfully",
  "request_id": "req_abc123",
  "user_id": "user_456",
  "trace_id": "trace_xyz789",
  "span_id": "span_def321",
  "work_id": 789,
  "language": "en",
  "duration_ms": 45
}
```
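A sketch of the request-ID middleware behind that format, using the standard library's log/slog (the project's actual logger may differ; github.com/google/uuid is an assumed dependency):

```go
package middleware

import (
	"context"
	"log/slog"
	"net/http"

	"github.com/google/uuid"
)

type loggerKey struct{}

// WithRequestLogger attaches a request-scoped logger carrying request_id.
// Trace and span IDs would be added the same way once tracing is wired in.
func WithRequestLogger(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		reqID := r.Header.Get("X-Request-ID")
		if reqID == "" {
			reqID = uuid.NewString()
		}
		logger := slog.Default().With("request_id", reqID)
		ctx := context.WithValue(r.Context(), loggerKey{}, logger)
		w.Header().Set("X-Request-ID", reqID) // echo back for client-side correlation
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

// Logger returns the request-scoped logger, falling back to the default.
func Logger(ctx context.Context) *slog.Logger {
	if l, ok := ctx.Value(loggerKey{}).(*slog.Logger); ok {
		return l
	}
	return slog.Default()
}
```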
🎯 EPIC 4: Performance Optimization (MEDIUM PRIORITY)
Story 4.1: Read Models (DTOs) for Efficient Queries
Priority: P1 (High)
Estimate: 8 story points (2-3 days)
Labels: performance, architecture, refactoring
User Story:
As an API consumer,
I want fast query responses with only the data I need,
So that my application loads quickly and uses less bandwidth.
Acceptance Criteria:
- Create DTOs for all list queries
- DTOs include only fields needed by API
- Avoid N+1 queries with proper joins
- Reduce payload size by 30-50%
- Query response time improved by 20%
- No breaking changes to GraphQL schema
Technical Tasks:
- Create `internal/app/work/dto` package
- Define WorkListDTO, WorkDetailDTO
- Create TranslationListDTO, TranslationDetailDTO
- Define AuthorListDTO, AuthorDetailDTO
- Implement optimized SQL queries for DTOs
- Update query services to return DTOs
- Update GraphQL resolvers to map DTOs
- Add benchmarks comparing old vs new
- Update tests to use DTOs
- Document DTO usage patterns
Example DTO:
```go
// WorkListDTO - Optimized for list views
type WorkListDTO struct {
	ID               uint
	Title            string
	AuthorName       string
	AuthorID         uint
	Language         string
	CreatedAt        time.Time
	ViewCount        int
	LikeCount        int
	TranslationCount int
}

// WorkDetailDTO - Full information for single work
type WorkDetailDTO struct {
	*WorkListDTO
	Content      string
	Description  string
	Tags         []string
	Categories   []string
	Translations []TranslationSummaryDTO
	Author       AuthorSummaryDTO
	Analytics    WorkAnalyticsDTO
}
```
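One way the list query behind WorkListDTO might be written with GORM, projecting straight into the DTO instead of hydrating full aggregates (table and column names are assumptions about the schema; the counter columns are omitted for brevity):

```go
import (
	"context"

	"gorm.io/gorm"
)

// ListWorks fills WorkListDTO rows with a single joined query.
func ListWorks(ctx context.Context, db *gorm.DB, limit, offset int) ([]WorkListDTO, error) {
	var dtos []WorkListDTO
	err := db.WithContext(ctx).
		Table("works").
		Select(`works.id, works.title, works.language, works.created_at,
		        authors.id AS author_id, authors.name AS author_name`).
		Joins("JOIN authors ON authors.id = works.author_id").
		Limit(limit).Offset(offset).
		Scan(&dtos).Error
	return dtos, err
}
```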
Story 4.2: Redis Caching Strategy
Priority: P1 (High)
Estimate: 5 story points (1-2 days)
Labels: performance, caching, infrastructure
User Story:
As a user browsing popular works,
I want instant page loads for frequently accessed content,
So that I have a smooth, responsive experience.
Acceptance Criteria:
- Cache hot works (top 100 viewed)
- Cache author profiles
- Cache search results (5 min TTL)
- Cache translations by work ID
- Automatic cache invalidation on updates
- Cache hit rate > 70% for reads
- Cache warming for popular content
- Redis failover doesn't break app
Technical Tasks:
- Refactor `internal/data/cache` with decorator pattern
- Create `CachedWorkRepository` decorator
- Implement cache-aside pattern
- Add cache key versioning strategy
- Implement selective cache invalidation
- Add cache metrics (hit/miss rates)
- Create cache warming job
- Handle cache failures gracefully
- Document caching strategy
- Add cache configuration
Cache Key Strategy:
```text
work:{version}:{id}
author:{version}:{id}
translation:{version}:{work_id}:{lang}
search:{version}:{query_hash}
trending:{period}
```
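A compressed sketch of the CachedWorkRepository decorator with go-redis v9, following the cache-aside pattern and the key scheme above (the repository interface and Work entity below are stand-ins for the real domain types):

```go
package cache

import (
	"context"
	"encoding/json"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

// Work stands in for the real domain entity defined elsewhere.
type Work struct {
	ID    uint
	Title string
}

// WorkRepository mirrors the domain port this decorator wraps.
type WorkRepository interface {
	GetByID(ctx context.Context, id uint) (*Work, error)
}

type CachedWorkRepository struct {
	inner WorkRepository
	rdb   *redis.Client
	ttl   time.Duration
}

func (r *CachedWorkRepository) GetByID(ctx context.Context, id uint) (*Work, error) {
	key := fmt.Sprintf("work:v1:%d", id)
	if data, err := r.rdb.Get(ctx, key).Bytes(); err == nil {
		var w Work
		if json.Unmarshal(data, &w) == nil {
			return &w, nil // cache hit
		}
	}
	// Cache miss or Redis failure: always fall through to the source of truth.
	w, err := r.inner.GetByID(ctx, id)
	if err != nil {
		return nil, err
	}
	if data, err := json.Marshal(w); err == nil {
		r.rdb.Set(ctx, key, data, r.ttl) // best-effort write-back
	}
	return w, nil
}
```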
Story 4.3: Database Query Optimization
Priority: P2 (Medium)
Estimate: 5 story points (1-2 days)
Labels: performance, database
User Story:
As a user with slow internet,
I want database operations to complete quickly,
So that I don't experience frustrating delays.
Acceptance Criteria:
- All queries use proper indexes
- No N+1 query problems
- Eager loading for related entities
- Query time < 50ms for 95th percentile
- Connection pool properly sized
- Slow query logging enabled
- Query explain plans documented
Technical Tasks:
- Audit all repository queries
- Add missing database indexes
- Implement eager loading with GORM Preload (see the sketch after this list)
- Fix N+1 queries in GraphQL resolvers
- Optimize joins and subqueries
- Add query timeouts
- Configure connection pool settings
- Enable PostgreSQL slow query log
- Create query performance dashboard
- Document query optimization patterns
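For the eager-loading task referenced above, the GORM Preload pattern that removes the classic N+1 (association names are assumptions about the models):

```go
import (
	"context"

	"gorm.io/gorm"
)

// recentWorks loads works plus their authors and translations in three
// queries total, instead of 1 + 2N separate lookups.
func recentWorks(ctx context.Context, db *gorm.DB) ([]Work, error) {
	var works []Work
	err := db.WithContext(ctx).
		Preload("Author").       // one query for all authors
		Preload("Translations"). // one query for all translations
		Order("created_at DESC").
		Limit(20).
		Find(&works).Error
	return works, err
}
```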
🎯 EPIC 5: Deployment & DevOps (CRITICAL FOR PRODUCTION)
Story 5.1: Production Deployment Automation
Priority: P0 (Critical)
Estimate: 8 story points (2-3 days)
Labels: devops, deployment, infrastructure
User Story:
As a DevOps engineer,
I want automated, zero-downtime deployments to production,
So that we can ship features safely and frequently.
Acceptance Criteria:
- Automated deployment on tag push
- Blue-green or rolling deployment strategy
- Health checks before traffic routing
- Automatic rollback on failures
- Database migrations run automatically
- Smoke tests after deployment
- Deployment notifications (Slack/Discord)
- Deployment dashboard
Technical Tasks:
- Complete `.github/workflows/deploy.yml` implementation
- Set up staging environment
- Implement blue-green deployment strategy
- Add health check endpoints (`/health`, `/ready`)
- Create database migration runner
- Add pre-deployment smoke tests
- Configure load balancer for zero-downtime
- Set up deployment notifications
- Create rollback procedures
- Document deployment process
Health Check Endpoints:
```text
GET /health  -> {"status": "ok", "version": "1.2.3"}
GET /ready   -> {"ready": true, "db": "ok", "redis": "ok"}
GET /metrics -> Prometheus metrics
```
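A sketch of the /ready handler behind that contract, checking both dependencies with a short timeout (the dependency handles follow the JSON shape above):

```go
package health

import (
	"context"
	"database/sql"
	"encoding/json"
	"net/http"
	"time"

	"github.com/redis/go-redis/v9"
)

// Ready reports readiness only when both PostgreSQL and Redis respond.
func Ready(db *sql.DB, rdb *redis.Client) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
		defer cancel()

		status := map[string]any{"ready": true, "db": "ok", "redis": "ok"}
		if err := db.PingContext(ctx); err != nil {
			status["ready"], status["db"] = false, err.Error()
		}
		if err := rdb.Ping(ctx).Err(); err != nil {
			status["ready"], status["redis"] = false, err.Error()
		}
		w.Header().Set("Content-Type", "application/json")
		if status["ready"] == false {
			w.WriteHeader(http.StatusServiceUnavailable)
		}
		json.NewEncoder(w).Encode(status)
	}
}
```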
Story 5.2: Infrastructure as Code (Kubernetes)
Priority: P1 (High)
Estimate: 8 story points (2-3 days)
Labels: devops, infrastructure, k8s
User Story:
As a platform engineer,
I want all infrastructure defined as code,
So that environments are reproducible and version-controlled.
Acceptance Criteria:
- Kubernetes manifests for all services
- Helm charts for easy deployment
- ConfigMaps for configuration
- Secrets management with sealed secrets
- Horizontal Pod Autoscaling configured
- Ingress with TLS termination
- Persistent volumes for PostgreSQL/Redis
- Network policies for security
Technical Tasks:
- Enhance `deploy/k8s` manifests
- Create Deployment YAML for backend
- Create Service and Ingress YAMLs
- Create ConfigMap for app configuration
- Set up Sealed Secrets for sensitive data
- Create HorizontalPodAutoscaler
- Add resource limits and requests
- Create StatefulSets for databases
- Set up persistent volume claims
- Create Helm chart structure
- Document Kubernetes deployment
File Structure:
```text
deploy/k8s/
├── base/
│   ├── deployment.yaml
│   ├── service.yaml
│   ├── ingress.yaml
│   ├── configmap.yaml
│   └── hpa.yaml
├── overlays/
│   ├── staging/
│   └── production/
└── helm/
    └── tercul-backend/
        ├── Chart.yaml
        ├── values.yaml
        └── templates/
```
Story 5.3: Disaster Recovery & Backups
Priority: P1 (High)
Estimate: 5 story points (1-2 days)
Labels: devops, backup, disaster-recovery
User Story:
As a business owner,
I want automated backups and disaster recovery procedures,
So that we never lose user data or have extended outages.
Acceptance Criteria:
- Daily PostgreSQL backups
- Point-in-time recovery capability
- Backup retention policy (30 days)
- Backup restoration tested monthly
- Backup encryption at rest
- Off-site backup storage
- Disaster recovery runbook
- RTO < 1 hour, RPO < 15 minutes
Technical Tasks:
- Set up automated database backups
- Configure WAL archiving for PostgreSQL (see the snippet after this list)
- Implement backup retention policy
- Store backups in S3/GCS with encryption
- Create backup restoration script
- Test restoration procedure
- Create disaster recovery runbook
- Set up backup monitoring and alerts
- Document backup procedures
- Schedule regular DR drills
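For the WAL archiving task above, the relevant postgresql.conf settings look roughly like this; the archive destination is an assumption, and in practice a tool such as pgBackRest or WAL-G would own the archive_command:

```text
# postgresql.conf - continuous archiving (sketch)
wal_level = replica
archive_mode = on
archive_timeout = 300   # force a segment switch every 5 min, bounding RPO
archive_command = 'test ! -f /backups/wal/%f && cp %p /backups/wal/%f'
```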
🎯 EPIC 6: Security Hardening (HIGH PRIORITY)
Story 6.1: Security Audit & Vulnerability Scanning
Priority: P0 (Critical)
Estimate: 5 story points (1-2 days)
Labels: security, compliance
User Story:
As a security officer,
I want continuous vulnerability scanning and security best practices,
So that user data and the platform remain secure.
Acceptance Criteria:
- Dependency scanning with Dependabot (already active)
- SAST scanning with CodeQL
- Container scanning with Trivy
- No high/critical vulnerabilities
- Security headers configured
- Rate limiting on all endpoints
- Input validation on all mutations
- SQL injection prevention verified
Technical Tasks:
- Review existing security workflows (already good!)
- Add rate limiting middleware
- Implement input validation with go-playground/validator
- Add security headers middleware
- Audit SQL queries for injection risks
- Review JWT implementation for best practices
- Add CSRF protection for mutations
- Implement request signing for sensitive operations
- Create security incident response plan
- Document security practices
Security Headers:
```text
X-Frame-Options: DENY
X-Content-Type-Options: nosniff
X-XSS-Protection: 1; mode=block
Strict-Transport-Security: max-age=31536000
Content-Security-Policy: default-src 'self'
```
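All five headers fit in one small middleware; a standard-library sketch (the CSP value should be tightened to the real frontend's needs):

```go
package middleware

import "net/http"

// SecurityHeaders applies the baseline headers listed above to every response.
func SecurityHeaders(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		h := w.Header()
		h.Set("X-Frame-Options", "DENY")
		h.Set("X-Content-Type-Options", "nosniff")
		h.Set("X-XSS-Protection", "1; mode=block")
		h.Set("Strict-Transport-Security", "max-age=31536000")
		h.Set("Content-Security-Policy", "default-src 'self'")
		next.ServeHTTP(w, r)
	})
}
```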
Story 6.2: API Rate Limiting & Throttling
Priority: P1 (High)
Estimate: 3 story points (1 day)
Labels: security, performance, api
User Story:
As a platform operator,
I want rate limiting to prevent abuse and ensure fair usage,
So that all users have a good experience and our infrastructure isn't overwhelmed.
Acceptance Criteria:
- Rate limiting per user (authenticated)
- Rate limiting per IP (anonymous)
- Different limits for different operations
- 429 status code with retry-after header
- Rate limit info in response headers
- Configurable rate limits
- Redis-based distributed rate limiting
- Rate limit metrics and monitoring
Technical Tasks:
- Implement rate limiting middleware
- Use Redis for distributed rate limiting
- Configure different limits for read/write
- Add rate limit headers to responses
- Create rate limit exceeded error handling
- Add rate limit bypass for admins
- Monitor rate limit usage
- Document rate limits in API docs
- Add tests for rate limiting
- Create rate limit dashboard
Rate Limits:
```text
Authenticated Users:
- 1000 requests/hour (general)
- 100 writes/hour (mutations)
- 10 searches/minute

Anonymous Users:
- 100 requests/hour
- 10 writes/hour
- 5 searches/minute
```
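A minimal distributed limiter sketch using a fixed window in Redis (go-redis v9). A production limiter would more likely use a sliding window or a library such as go-redis/redis_rate, so treat this as illustrative:

```go
package ratelimit

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

// Allow increments a per-subject counter (user ID or IP) and permits the
// request while the count stays within limit for the current window.
func Allow(ctx context.Context, rdb *redis.Client, subject string, limit int64, window time.Duration) (bool, error) {
	// Bucket key rotates once per window (fixed-window strategy).
	key := fmt.Sprintf("rl:%s:%d", subject, time.Now().Unix()/int64(window.Seconds()))
	n, err := rdb.Incr(ctx, key).Result()
	if err != nil {
		return true, err // fail open so a Redis outage doesn't take the API down
	}
	if n == 1 {
		rdb.Expire(ctx, key, window) // first hit in the window sets the TTL
	}
	return n <= limit, nil
}
```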
🎯 EPIC 7: Developer Experience (MEDIUM PRIORITY)
Story 7.1: Local Development Environment Improvements
Priority: P2 (Medium)
Estimate: 3 story points (1 day)
Labels: devex, tooling
User Story:
As a developer,
I want a fast, reliable local development environment,
So that I can iterate quickly without friction.
Acceptance Criteria:
- One-command setup (`make setup`)
- Hot reload for Go code changes
- Database seeding with realistic data
- GraphQL Playground pre-configured
- All services start reliably
- Clear error messages when setup fails
- Development docs up-to-date
Technical Tasks:
- Create comprehensive `make setup` target
- Add air for hot reload in docker-compose
- Create database seeding script
- Add sample data fixtures
- Pre-configure GraphQL Playground
- Add health check script
- Improve error messages in Makefile
- Document common setup issues
- Create troubleshooting guide
- Add setup validation script
Story 7.2: Testing Infrastructure Improvements
Priority: P2 (Medium)
Estimate: 5 story points (1-2 days)
Labels: testing, devex
User Story:
As a developer writing tests,
I want fast, reliable test execution without external dependencies,
So that I can practice TDD effectively.
Acceptance Criteria:
- Unit tests run in <5 seconds
- Integration tests isolated with test containers
- Parallel test execution
- Test coverage reports
- Fixtures for common test scenarios
- Clear test failure messages
- Easy to run single test or package
Technical Tasks:
- Refactor `internal/testutil` for better isolation
- Implement test containers for integration tests
- Add parallel test execution (see the sketch after this list)
- Create reusable test fixtures
- Set up coverage reporting
- Add golden file testing utilities
- Create test data builders
- Improve test naming conventions
- Document testing best practices
- Add `make test-fast` and `make test-all` targets
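For the parallel-execution convention, a table-driven skeleton; the function under test, slugify, is a hypothetical placeholder:

```go
package testutil

import "testing"

// slugify is a stand-in for whatever pure function is under test.
func slugify(s string) string { return s }

func TestSlugify(t *testing.T) {
	t.Parallel() // lets this test run alongside other parallel tests

	cases := []struct {
		name, in, want string
	}{
		{"identity", "war-and-peace", "war-and-peace"},
		{"unicode", "шүрәле", "шүрәле"},
	}
	for _, tc := range cases {
		tc := tc // capture the range variable for the parallel closure
		t.Run(tc.name, func(t *testing.T) {
			t.Parallel()
			if got := slugify(tc.in); got != tc.want {
				t.Fatalf("slugify(%q) = %q, want %q", tc.in, got, tc.want)
			}
		})
	}
}
```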
📋 Task Summary & Prioritization
Sprint 1 (Week 1): Critical Production Readiness
- Search Implementation (Story 1.1) - 8 pts
- Distributed Tracing (Story 3.1) - 8 pts
- Prometheus Metrics (Story 3.2) - 5 pts
- Total: 21 points
Sprint 2 (Week 2): Performance & Documentation
- API Documentation (Story 2.1) - 5 pts
- Read Models/DTOs (Story 4.1) - 8 pts
- Redis Caching (Story 4.2) - 5 pts
- Structured Logging (Story 3.3) - 3 pts
- Total: 21 points
Sprint 3 (Week 3): Deployment & Security
- Production Deployment (Story 5.1) - 8 pts
- Security Audit (Story 6.1) - 5 pts
- Rate Limiting (Story 6.2) - 3 pts
- Developer Docs (Story 2.2) - 3 pts
- Total: 19 points
Sprint 4 (Week 4): Infrastructure & Polish
- Kubernetes IaC (Story 5.2) - 8 pts
- Disaster Recovery (Story 5.3) - 5 pts
- Advanced Search Filters (Story 1.2) - 5 pts
- Total: 18 points
Sprint 5 (Week 5): Optimization & DevEx
- Database Optimization (Story 4.3) - 5 pts
- Local Dev Environment (Story 7.1) - 3 pts
- Testing Infrastructure (Story 7.2) - 5 pts
- Total: 13 points
🎯 Success Metrics
Performance SLOs
- API response time p95 < 200ms
- Search response time p95 < 300ms
- Database query time p95 < 50ms
- Cache hit rate > 70%
Reliability SLOs
- Uptime > 99.9% (< 8.7 hours downtime/year)
- Error rate < 0.1%
- Mean Time To Recovery < 1 hour
- Zero data loss
Developer Experience
- Setup time < 15 minutes
- Test suite runs < 2 minutes
- Build time < 1 minute
- Documentation completeness > 90%
Next Steps:
- Review and prioritize these tasks with the team
- Create GitHub issues for Sprint 1 tasks
- Add tasks to project board
- Begin implementation starting with search and observability
This is a realistic, achievable roadmap based on the ACTUAL current state of the codebase! 🚀