# Tercul Backend - Production Readiness Tasks

**Last Updated:** December 2024
**Current Status:** Core features complete, production hardening in progress

> **Note:** This document tracks production readiness tasks. Some tasks may also be tracked in GitHub Issues.

---

## 📋 Quick Status Summary

### ✅ Fully Implemented
- **GraphQL API:** 100% of resolvers implemented and functional
- **Search:** Full Weaviate-based search with multi-class support, filtering, hybrid search
- **Authentication:** Complete auth system (register, login, JWT, password reset, email verification)
- **Background Jobs:** Sync jobs and linguistic analysis with proper error handling
- **Basic Observability:** Logging (zerolog), metrics (Prometheus), tracing (OpenTelemetry)
- **Architecture:** Clean CQRS/DDD architecture with proper DI
- **Testing:** Comprehensive test coverage with mocks

### ⚠️ Needs Production Hardening
- **Tracing:** Uses stdout exporter, needs OTLP for production
- **Metrics:** Missing GraphQL resolver metrics and business metrics
- **Caching:** No repository caching (only linguistics has caching)
- **DTOs:** Basic DTOs exist but need expansion
- **Configuration:** Still uses global singleton (`config.Cfg`)

### 📝 Documentation Status
- ✅ Basic API documentation exists (`api/README.md`)
- ✅ Project README updated
- ⚠️ Needs enhancement with examples and detailed usage patterns

---

## 📊 Current Reality Check

### ✅ What's Actually Working

- ✅ Full GraphQL API with 100% of resolvers implemented (all queries and mutations functional)
- ✅ Complete CQRS pattern (Commands & Queries) with proper separation
- ✅ Auth system (Register, Login, JWT, Password Reset, Email Verification) - fully implemented
- ✅ Work CRUD with authorization
- ✅ Translation management with analytics
- ✅ User management and profiles
- ✅ Collections, Comments, Likes, Bookmarks
- ✅ Contributions with review workflow
- ✅ Analytics service (views, likes, trending) - basic implementation
- ✅ **Search functionality** - Fully implemented with Weaviate (multi-class search, filtering, hybrid search)
- ✅ Clean Architecture with DDD patterns
- ✅ Comprehensive test coverage (passing tests with mocks)
- ✅ Basic CI infrastructure (`make lint-test` target)
- ✅ Docker setup and containerization
- ✅ Database migrations with goose
- ✅ Background jobs (sync, linguistic analysis) with proper error handling
- ✅ Basic observability (logging with zerolog, Prometheus metrics, OpenTelemetry tracing)

### ⚠️ What Needs Work

- ⚠️ **Observability Production Hardening:** Tracing uses stdout exporter (needs OTLP), missing GraphQL/business metrics → **Issues #31, #32, #33**
- ⚠️ **Repository Caching:** No caching decorators for repositories (only linguistics has caching) → **Issue #34**
- ⚠️ **DTO Optimization:** Basic DTOs exist but need expansion for list vs detail views → **Issue #35**
- ⚠️ **Configuration Refactoring:** Still uses global `config.Cfg` singleton → **Issue #36**
- ⚠️ Production deployment automation → **Issue #36**
- ⚠️ Security hardening (rate limiting, security headers) → **Issue #37**
- ⚠️ Infrastructure as Code (Kubernetes manifests) → **Issue #38**

---

## 🎯 EPIC 1: Search & Discovery (COMPLETED ✅)

### Story 1.1: Full-Text Search Implementation

**Priority:** ✅ **COMPLETED**
**Status:** Fully implemented and functional

**Current Implementation:**

- ✅ Weaviate-based full-text search fully implemented
- ✅ Multi-class search (Works, Translations, Authors)
- ✅ Hybrid search mode (BM25 + Vector) with configurable alpha
- ✅ Support for filtering by language, tags, dates, authors
- ✅ Relevance-ranked results with pagination
- ✅ Search service in `internal/app/search/service.go`
- ✅ Weaviate client wrapper in `internal/platform/search/weaviate_wrapper.go`
- ✅ Search schema management in `internal/platform/search/schema.go`

**Remaining Enhancements:**

- [ ] Add incremental indexing on create/update operations (currently manual sync; see the sketch after this list)
- [ ] Add search result caching (5 min TTL)
- [ ] Add search metrics and monitoring
- [ ] Performance optimization (target < 200ms for 95th percentile)
- [ ] Integration tests with real Weaviate instance

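The incremental-indexing item could be implemented as a repository decorator, so writes trigger a re-index without touching the handlers. A minimal sketch, assuming illustrative `Work`, `WorkRepository`, and `SearchIndexer` types (the real interfaces in `internal/data` and `internal/platform/search` may differ):

```go
package search

import "context"

// Illustrative types; the concrete repository and Weaviate wrapper
// signatures in this codebase may differ.
type Work struct {
    ID    uint
    Title string
}

type WorkRepository interface {
    Create(ctx context.Context, w *Work) error
    Update(ctx context.Context, w *Work) error
}

type SearchIndexer interface {
    IndexWork(ctx context.Context, w *Work) error
}

// IndexingWorkRepository decorates a WorkRepository so every successful
// write is followed by a best-effort re-index, replacing the manual sync.
type IndexingWorkRepository struct {
    inner   WorkRepository
    indexer SearchIndexer
    onError func(error) // hook for logging indexing failures
}

func (r *IndexingWorkRepository) Create(ctx context.Context, w *Work) error {
    if err := r.inner.Create(ctx, w); err != nil {
        return err
    }
    // An indexing failure must not fail the write; the periodic sync job
    // can repair the index later.
    if err := r.indexer.IndexWork(ctx, w); err != nil && r.onError != nil {
        r.onError(err)
    }
    return nil
}

func (r *IndexingWorkRepository) Update(ctx context.Context, w *Work) error {
    if err := r.inner.Update(ctx, w); err != nil {
        return err
    }
    if err := r.indexer.IndexWork(ctx, w); err != nil && r.onError != nil {
        r.onError(err)
    }
    return nil
}
```

An asynchronous variant could instead push work IDs onto a queue consumed by the worker process, keeping index latency out of the write path.
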

---

### Story 1.2: Advanced Search Filters

**Priority:** P1 (High)
**Estimate:** 5 story points (1-2 days)
**Labels:** `enhancement`, `search`, `backend`

**User Story:**

```
As a researcher or literary enthusiast,
I want to filter search results by multiple criteria simultaneously,
So that I can narrow down to exactly the works I'm interested in.
```

**Acceptance Criteria:**

- [ ] Filter by literature type (poetry, prose, drama)
- [ ] Filter by time period (creation date ranges)
- [ ] Filter by multiple authors simultaneously
- [ ] Filter by genre/categories
- [ ] Filter by language availability
- [ ] Combine filters with AND/OR logic
- [ ] Save search filters as presets (future)

**Technical Tasks:**

1. Extend the `SearchFilters` domain model (see the sketch after this list)
2. Implement filter translation to Weaviate queries
3. Add faceted search capabilities
4. Implement filter validation
5. Add filter combination logic
6. Create filter preset storage (optional)
7. Add tests for all filter combinations

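A minimal sketch of what the extended filter model and its validation could look like; the field names are assumptions for illustration, not the actual domain struct:

```go
package search

import (
    "fmt"
    "time"
)

// SearchFilters extends the existing filter model with the criteria from
// the acceptance list. Field names are illustrative.
type SearchFilters struct {
    LiteratureTypes []string   // e.g. "poetry", "prose", "drama"
    AuthorIDs       []uint     // multiple authors, combined with OR
    Genres          []string   // genre/category filter
    Languages       []string   // language availability
    CreatedAfter    *time.Time // start of the creation-date range
    CreatedBefore   *time.Time // end of the creation-date range
    Combine         string     // "AND" or "OR" across filter groups
}

// Validate rejects impossible combinations before they reach Weaviate.
func (f SearchFilters) Validate() error {
    if f.Combine != "" && f.Combine != "AND" && f.Combine != "OR" {
        return fmt.Errorf("invalid combine operator %q", f.Combine)
    }
    if f.CreatedAfter != nil && f.CreatedBefore != nil && f.CreatedAfter.After(*f.CreatedBefore) {
        return fmt.Errorf("creation-date range is inverted")
    }
    return nil
}
```

Translating this struct into Weaviate where-filters (task 2) then becomes a pure mapping function that is easy to unit-test per combination (task 7).
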

---

## 🎯 EPIC 2: API Documentation (HIGH PRIORITY)

### Story 2.1: Comprehensive GraphQL API Documentation

**Priority:** P1 (High)
**Estimate:** 5 story points (1-2 days)
**Labels:** `documentation`, `api`, `devex`

**User Story:**

```
As a frontend developer or API consumer,
I want complete documentation for all GraphQL queries and mutations,
So that I can integrate with the API without constantly asking questions.
```

**Acceptance Criteria:**

- [ ] Document all 80+ GraphQL resolvers
- [ ] Include example queries for each operation
- [ ] Document input types and validation rules
- [ ] Provide error response examples
- [ ] Document authentication requirements
- [ ] Include rate limiting information
- [ ] Add GraphQL Playground with example queries
- [ ] Auto-generate docs from schema annotations

**Technical Tasks:**

1. Add descriptions to all GraphQL types in the schema
2. Document each query/mutation with examples
3. Expand `api/README.md` into a comprehensive guide
4. Add inline schema documentation
5. Set up GraphQL Voyager for schema visualization
6. Create an API changelog
7. Add versioning documentation
8. Generate OpenAPI spec for REST endpoints (if any)

**Deliverables:**

- `api/README.md` - Complete API guide
- `api/EXAMPLES.md` - Query examples
- `api/CHANGELOG.md` - API version history
- Enhanced GraphQL schema with descriptions
- Interactive API explorer

---

### Story 2.2: Developer Onboarding Documentation

**Priority:** P1 (High)
**Estimate:** 3 story points (1 day)
**Labels:** `documentation`, `devex`

**User Story:**

```
As a new developer joining the project,
I want clear setup instructions and architecture documentation,
So that I can become productive quickly without extensive hand-holding.
```

**Acceptance Criteria:**

- [ ] Updated `README.md` with quick start guide
- [ ] Architecture diagrams and explanations
- [ ] Development workflow documentation
- [ ] Testing strategy documentation
- [ ] Contribution guidelines
- [ ] Code style guide
- [ ] Troubleshooting guide for common issues

**Technical Tasks:**

1. Update root `README.md` with modern structure
2. Create `docs/ARCHITECTURE.md` with diagrams
3. Document CQRS and DDD patterns used
4. Create `docs/DEVELOPMENT.md` workflow guide
5. Document testing strategy in `docs/TESTING.md`
6. Create `CONTRIBUTING.md` guide
7. Add package-level `README.md` files for complex packages

**Deliverables:**

- Refreshed `README.md`
- `docs/ARCHITECTURE.md`
- `docs/DEVELOPMENT.md`
- `docs/TESTING.md`
- `CONTRIBUTING.md`

---

## 🎯 EPIC 3: Observability & Monitoring (CRITICAL FOR PRODUCTION)

### Story 3.1: Distributed Tracing with OpenTelemetry

**Priority:** P0 (Critical)
**Estimate:** 5 story points (1-2 days)
**Labels:** `observability`, `monitoring`, `infrastructure`

**Current State:**
- ✅ OpenTelemetry SDK integrated
- ✅ Basic tracer provider exists in `internal/observability/tracing.go`
- ✅ HTTP middleware with tracing (`observability.TracingMiddleware`)
- ✅ Trace context propagation configured
- ⚠️ **Currently uses stdout exporter** (needs OTLP for production)
- ⚠️ Database query tracing not yet implemented
- ⚠️ GraphQL resolver tracing not yet implemented

**User Story:**

```
As a DevOps engineer monitoring production,
I want distributed tracing across all services and database calls,
So that I can quickly identify performance bottlenecks and errors.
```

**Acceptance Criteria:**

- [x] OpenTelemetry SDK integrated
- [x] Automatic trace context propagation
- [x] HTTP handlers instrumented
- [ ] All database queries traced (via GORM callbacks)
- [ ] All GraphQL resolvers traced
- [ ] Custom spans for business logic
- [ ] **Traces exported to OTLP collector** (currently stdout only)
- [ ] Integration with Jaeger/Tempo

**Technical Tasks:**

1. ✅ OpenTelemetry Go SDK dependencies (already added)
2. ✅ `internal/observability/tracing` package exists
3. ✅ HTTP middleware with auto-tracing
4. [ ] Add database query tracing via GORM callbacks
5. [ ] Instrument GraphQL execution
6. [ ] Add custom spans for slow operations
7. [ ] Set up trace sampling strategy
8. [ ] **Replace stdout exporter with OTLP exporter** (see the sketch after the configuration example)
9. [ ] Add Jaeger to docker-compose for local dev
10. [ ] Document tracing best practices

**Configuration:**

```go
// Example trace configuration (needs implementation)
type TracingConfig struct {
    Enabled      bool
    ServiceName  string
    SamplingRate float64
    OTLPEndpoint string
}
```

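For task 8, a minimal sketch of swapping in the official OTLP/gRPC exporter, reusing the `TracingConfig` example above. The function name and wiring are illustrative, not the existing `internal/observability/tracing.go` API:

```go
package observability

import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.17.0"
)

// InitTracerOTLP replaces the stdout exporter with an OTLP/gRPC exporter.
func InitTracerOTLP(ctx context.Context, cfg TracingConfig) (*sdktrace.TracerProvider, error) {
    exp, err := otlptracegrpc.New(ctx,
        otlptracegrpc.WithEndpoint(cfg.OTLPEndpoint), // e.g. "otel-collector:4317"
        otlptracegrpc.WithInsecure(),                 // enable TLS outside the cluster
    )
    if err != nil {
        return nil, err
    }
    tp := sdktrace.NewTracerProvider(
        // Batch spans instead of exporting synchronously per span.
        sdktrace.WithBatcher(exp),
        // Honor the parent's decision, sample roots at the configured rate.
        sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.TraceIDRatioBased(cfg.SamplingRate))),
        sdktrace.WithResource(resource.NewWithAttributes(
            semconv.SchemaURL,
            semconv.ServiceNameKey.String(cfg.ServiceName),
        )),
    )
    otel.SetTracerProvider(tp)
    return tp, nil
}
```

Locally, pointing `OTLPEndpoint` at a Jaeger all-in-one container with OTLP ingestion enabled (task 9) makes traces visible without any extra collector infrastructure.
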

---

### Story 3.2: Prometheus Metrics & Alerting

**Priority:** P0 (Critical)
**Estimate:** 3 story points (1 day)
**Labels:** `observability`, `monitoring`, `metrics`

**Current State:**
- ✅ Basic Prometheus metrics exist in `internal/observability/metrics.go`
- ✅ HTTP request metrics (latency, status codes)
- ✅ Database query metrics (query time, counts)
- ✅ Metrics exposed on `/metrics` endpoint
- ⚠️ Missing GraphQL resolver metrics
- ⚠️ Missing business metrics
- ⚠️ Missing system metrics

**User Story:**

```
As a site reliability engineer,
I want detailed metrics on API performance and system health,
So that I can detect issues before they impact users.
```

**Acceptance Criteria:**

- [x] HTTP request metrics (latency, status codes, throughput)
- [x] Database query metrics (query time, connection pool)
- [ ] Business metrics (works created, searches performed)
- [ ] System metrics (memory, CPU, goroutines)
- [ ] GraphQL-specific metrics (resolver performance)
- [x] Metrics exposed on `/metrics` endpoint
- [ ] Prometheus scraping configured
- [ ] Grafana dashboards created

**Technical Tasks:**

1. ✅ Prometheus middleware exists
2. ✅ HTTP handler metrics implemented
3. ✅ Database query duration histograms exist
4. [ ] Create business metric counters (see the sketch after the metric list)
5. [ ] Add GraphQL resolver metrics
6. [ ] Create custom metrics for critical paths
7. [ ] Set up metric labels strategy
8. [ ] Create Grafana dashboard JSON
9. [ ] Define SLOs and SLIs
10. [ ] Create alerting rules YAML

**Key Metrics:**

```
# HTTP Metrics
http_requests_total{method, path, status}
http_request_duration_seconds{method, path}

# Database Metrics
db_query_duration_seconds{query}
db_connections_current
db_connections_max

# Business Metrics
works_created_total{language}
searches_performed_total{type}
user_registrations_total

# GraphQL Metrics
graphql_resolver_duration_seconds{operation, resolver}
graphql_errors_total{operation, error_type}
```

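The business counters in the list above can be declared once with `promauto` and incremented from the command handlers. A sketch using the standard Prometheus Go client, with names matching the metric list:

```go
package observability

import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

// Business metric counters; promauto registers them with the default
// registry at package initialization, so they appear on /metrics.
var (
    WorksCreated = promauto.NewCounterVec(prometheus.CounterOpts{
        Name: "works_created_total",
        Help: "Number of works created, by language.",
    }, []string{"language"})

    SearchesPerformed = promauto.NewCounterVec(prometheus.CounterOpts{
        Name: "searches_performed_total",
        Help: "Number of searches performed, by search type.",
    }, []string{"type"})

    UserRegistrations = promauto.NewCounter(prometheus.CounterOpts{
        Name: "user_registrations_total",
        Help: "Number of successful user registrations.",
    })
)
```

The create-work command handler would then call, for example, `observability.WorksCreated.WithLabelValues(work.Language).Inc()` after a successful commit.
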

---

### Story 3.3: Structured Logging Enhancements

**Priority:** P1 (High)
**Estimate:** 2 story points (0.5-1 day)
**Labels:** `observability`, `logging`

**Current State:**
- ✅ Structured logging with zerolog implemented
- ✅ Request ID middleware exists (`observability.RequestIDMiddleware`)
- ✅ Trace/Span IDs added to logger context (`Logger.Ctx()`)
- ✅ Logging middleware injects logger into context
- ⚠️ User ID not yet added to authenticated request logs
- ⚠️ Log sampling not implemented

**User Story:**

```
As a developer debugging production issues,
I want rich, structured logs with request context,
So that I can quickly trace requests and identify root causes.
```

**Acceptance Criteria:**

- [x] Request ID in all logs
- [ ] User ID in authenticated request logs
- [x] Trace ID/Span ID in all logs
- [ ] Consistent log levels across the codebase (audit needed)
- [ ] Sensitive data excluded from logs
- [x] Structured fields for easy parsing
- [ ] Log sampling for high-volume endpoints

**Technical Tasks:**

1. ✅ HTTP middleware injects request ID
2. [ ] Add user ID to context from JWT in auth middleware (see the sketch after the log example)
3. ✅ Trace/span IDs added to logger context
4. [ ] Audit all logging statements for consistency
5. [ ] Add field name constants for structured logging
6. [ ] Implement log redaction for passwords/tokens
7. [ ] Add log sampling configuration
8. [ ] Create log aggregation guide (ELK/Loki)

**Log Format Example:**

```json
{
  "level": "info",
  "ts": "2025-11-27T10:30:45.123Z",
  "msg": "Work created successfully",
  "request_id": "req_abc123",
  "user_id": "user_456",
  "trace_id": "trace_xyz789",
  "span_id": "span_def321",
  "work_id": 789,
  "language": "en",
  "duration_ms": 45
}
```

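For task 2, a sketch of enriching the request-scoped zerolog logger with the user ID after JWT validation. `UserIDFromContext` is a hypothetical stand-in for the project's real auth helper:

```go
package observability

import (
    "net/http"

    "github.com/rs/zerolog"
)

// UserIDFromContext is a stand-in for the auth helper that would read the
// user ID placed in the context by the JWT middleware.
func UserIDFromContext(r *http.Request) (string, bool) {
    return "", false // real implementation reads the verified JWT claims
}

// UserLoggingMiddleware enriches the request-scoped logger with user_id so
// it appears on every subsequent log line for the request.
func UserLoggingMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        if userID, ok := UserIDFromContext(r); ok {
            logger := zerolog.Ctx(r.Context()).With().Str("user_id", userID).Logger()
            r = r.WithContext(logger.WithContext(r.Context()))
        }
        next.ServeHTTP(w, r)
    })
}
```

Mounted after the JWT middleware and the existing logging middleware, this produces the `user_id` field shown in the log format example above.
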

---

## 🎯 EPIC 4: Performance Optimization (MEDIUM PRIORITY)

### Story 4.1: Read Models (DTOs) for Efficient Queries

**Priority:** P1 (High)
**Estimate:** 6 story points (1-2 days)
**Labels:** `performance`, `architecture`, `refactoring`

**Current State:**
- ✅ Basic DTOs exist (`WorkDTO` in `internal/app/work/dto.go`)
- ✅ DTOs used in queries (`internal/app/work/queries.go`)
- ⚠️ DTOs are minimal (only ID, Title, Language)
- ⚠️ No distinction between list and detail DTOs
- ⚠️ Other aggregates don't have DTOs yet

**User Story:**

```
As an API consumer,
I want fast query responses with only the data I need,
So that my application loads quickly and uses less bandwidth.
```

**Acceptance Criteria:**

- [x] Basic DTOs created for work queries
- [ ] Create DTOs for all list queries (translation, author, user)
- [ ] DTOs include only fields needed by the API
- [ ] Avoid N+1 queries with proper joins
- [ ] Reduce payload size by 30-50%
- [ ] Query response time improved by 20%
- [ ] No breaking changes to the GraphQL schema

**Technical Tasks:**

1. ✅ `internal/app/work/dto.go` exists (basic)
2. [ ] Expand `WorkDTO` into `WorkListDTO` and `WorkDetailDTO`
3. [ ] Create `TranslationListDTO`, `TranslationDetailDTO`
4. [ ] Define `AuthorListDTO`, `AuthorDetailDTO`
5. [ ] Implement optimized SQL queries for DTOs with joins (see the query sketch after the example)
6. [ ] Update query services to return expanded DTOs
7. [ ] Update GraphQL resolvers to map DTOs (if needed)
8. [ ] Add benchmarks comparing old vs new
9. [ ] Update tests to use DTOs
10. [ ] Document DTO usage patterns

**Example DTO (needs expansion):**

```go
// Current minimal DTO
type WorkDTO struct {
    ID       uint
    Title    string
    Language string
}

// Target: WorkListDTO - optimized for list views
type WorkListDTO struct {
    ID               uint
    Title            string
    AuthorName       string
    AuthorID         uint
    Language         string
    CreatedAt        time.Time
    ViewCount        int
    LikeCount        int
    TranslationCount int
}

// Target: WorkDetailDTO - full information for a single work
type WorkDetailDTO struct {
    *WorkListDTO
    Content      string
    Description  string
    Tags         []string
    Categories   []string
    Translations []TranslationSummaryDTO
    Author       AuthorSummaryDTO
    Analytics    WorkAnalyticsDTO
}
```

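For task 5, a sketch of a single joined query that fills `WorkListDTO` directly, avoiding per-row lookups. Table and column names are assumptions about the schema, and the view/like counts (which would come from the analytics tables) are omitted:

```go
package work

import "gorm.io/gorm"

// ListWorks scans a joined query straight into the DTO, so a list view
// costs one round trip instead of one query per work.
func ListWorks(db *gorm.DB, limit, offset int) ([]WorkListDTO, error) {
    var dtos []WorkListDTO
    err := db.Table("works").
        Select(`works.id, works.title, works.language, works.created_at,
                authors.id AS author_id, authors.name AS author_name,
                (SELECT COUNT(*) FROM translations t WHERE t.work_id = works.id) AS translation_count`).
        Joins("LEFT JOIN authors ON authors.id = works.author_id").
        Limit(limit).Offset(offset).
        Scan(&dtos).Error
    return dtos, err
}
```

GORM's `Scan` maps the aliased columns onto the DTO fields by snake_case name, which is what makes the flat read model cheap to hydrate.
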

---

### Story 4.2: Redis Caching Strategy

**Priority:** P1 (High)
**Estimate:** 5 story points (1-2 days)
**Labels:** `performance`, `caching`, `infrastructure`

**Current State:**
- ✅ Redis client exists in `internal/platform/cache`
- ✅ Caching implemented for linguistics analysis (`internal/jobs/linguistics/analysis_cache.go`)
- ⚠️ **No repository caching** - `internal/data/cache` directory is empty
- ⚠️ No decorator pattern for repositories

**User Story:**

```
As a user browsing popular works,
I want instant page loads for frequently accessed content,
So that I have a smooth, responsive experience.
```

**Acceptance Criteria:**

- [ ] Cache hot works (top 100 viewed)
- [ ] Cache author profiles
- [ ] Cache search results (5 min TTL)
- [ ] Cache translations by work ID
- [ ] Automatic cache invalidation on updates
- [ ] Cache hit rate > 70% for reads
- [ ] Cache warming for popular content
- [ ] Redis failover doesn't break the app

**Technical Tasks:**

1. [ ] Create `internal/data/cache` decorators
2. [ ] Create `CachedWorkRepository` decorator (see the sketch after the key strategy)
3. [ ] Create `CachedAuthorRepository` decorator
4. [ ] Create `CachedTranslationRepository` decorator
5. [ ] Implement cache-aside pattern
6. [ ] Add cache key versioning strategy
7. [ ] Implement selective cache invalidation
8. [ ] Add cache metrics (hit/miss rates)
9. [ ] Create cache warming job
10. [ ] Handle cache failures gracefully
11. [ ] Document caching strategy
12. [ ] Add cache configuration

**Cache Key Strategy:**

```
work:{version}:{id}
author:{version}:{id}
translation:{version}:{work_id}:{lang}
search:{version}:{query_hash}
trending:{period}
```

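A sketch of the `CachedWorkRepository` decorator (tasks 2, 5, and 10) using the `work:{version}:{id}` key above and go-redis v9. The `Work` type and `WorkGetter` interface are stand-ins for the real repository contract:

```go
package cache

import (
    "context"
    "encoding/json"
    "fmt"
    "time"

    "github.com/redis/go-redis/v9"
)

const workKeyVersion = "v1" // bump to invalidate every cached work at once

// Work is a stand-in for the domain model.
type Work struct {
    ID    uint
    Title string
}

// WorkGetter is the narrow slice of the repository this decorator needs.
type WorkGetter interface {
    GetByID(ctx context.Context, id uint) (*Work, error)
}

type CachedWorkRepository struct {
    inner WorkGetter
    rdb   *redis.Client
    ttl   time.Duration
}

// GetByID implements cache-aside: try Redis, fall back to the database,
// then populate the cache. Any Redis error degrades to a plain DB read,
// so a Redis failover does not break the app.
func (r *CachedWorkRepository) GetByID(ctx context.Context, id uint) (*Work, error) {
    key := fmt.Sprintf("work:%s:%d", workKeyVersion, id)
    if data, err := r.rdb.Get(ctx, key).Bytes(); err == nil {
        var w Work
        if json.Unmarshal(data, &w) == nil {
            return &w, nil
        }
    }
    w, err := r.inner.GetByID(ctx, id)
    if err != nil {
        return nil, err
    }
    if data, err := json.Marshal(w); err == nil {
        r.rdb.Set(ctx, key, data, r.ttl) // best effort; ignore cache write errors
    }
    return w, nil
}
```

Per-entity deletion on update covers routine invalidation; bumping `workKeyVersion` handles the rarer schema-change case.
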

---

### Story 4.3: Database Query Optimization

**Priority:** P2 (Medium)
**Estimate:** 5 story points (1-2 days)
**Labels:** `performance`, `database`

**User Story:**

```
As a user with a slow internet connection,
I want database operations to complete quickly,
So that I don't experience frustrating delays.
```

**Acceptance Criteria:**

- [ ] All queries use proper indexes
- [ ] No N+1 query problems
- [ ] Eager loading for related entities
- [ ] Query time < 50ms for 95th percentile
- [ ] Connection pool properly sized
- [ ] Slow query logging enabled
- [ ] Query explain plans documented

**Technical Tasks:**

1. Audit all repository queries
2. Add missing database indexes
3. Implement eager loading with GORM `Preload` (see the sketch after this list)
4. Fix N+1 queries in GraphQL resolvers
5. Optimize joins and subqueries
6. Add query timeouts
7. Configure connection pool settings
8. Enable PostgreSQL slow query log
9. Create query performance dashboard
10. Document query optimization patterns

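For task 3, a minimal sketch of eliminating an N+1 with GORM's `Preload`, assuming illustrative `Work`/`Translation` models with the usual has-many convention:

```go
package data

import "gorm.io/gorm"

// Illustrative models; the real schema lives in the migrations.
type Translation struct {
    ID     uint
    WorkID uint
}

type Work struct {
    ID           uint
    Language     string
    Translations []Translation // has-many association
}

// Before: ranging over works and querying translations per work issues
// N+1 queries. With Preload, GORM fetches all matching translations in
// one additional query, for two queries total.
func ListWorksWithTranslations(db *gorm.DB) ([]Work, error) {
    var works []Work
    err := db.Preload("Translations").
        Where("language = ?", "en").
        Limit(50).
        Find(&works).Error
    return works, err
}
```

The same pattern applies inside GraphQL resolvers (task 4), where per-field database calls are the most common source of N+1s.
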

---

## 🎯 EPIC 5: Deployment & DevOps (CRITICAL FOR PRODUCTION)

### Story 5.1: Production Deployment Automation

**Priority:** P0 (Critical)
**Estimate:** 8 story points (2-3 days)
**Labels:** `devops`, `deployment`, `infrastructure`

**User Story:**

```
As a DevOps engineer,
I want automated, zero-downtime deployments to production,
So that we can ship features safely and frequently.
```

**Acceptance Criteria:**

- [ ] Automated deployment on tag push
- [ ] Blue-green or rolling deployment strategy
- [ ] Health checks before traffic routing
- [ ] Automatic rollback on failures
- [ ] Database migrations run automatically
- [ ] Smoke tests after deployment
- [ ] Deployment notifications (Slack/Discord)
- [ ] Deployment dashboard

**Technical Tasks:**

1. Complete `.github/workflows/deploy.yml` implementation
2. Set up staging environment
3. Implement blue-green deployment strategy
4. Add health check endpoints (`/health`, `/ready`) - see the sketch after the endpoint list
5. Create database migration runner
6. Add pre-deployment smoke tests
7. Configure load balancer for zero-downtime deploys
8. Set up deployment notifications
9. Create rollback procedures
10. Document deployment process

**Health Check Endpoints:**

```
GET /health  -> {"status": "ok", "version": "1.2.3"}
GET /ready   -> {"ready": true, "db": "ok", "redis": "ok"}
GET /metrics -> Prometheus metrics
```

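A minimal sketch of the two probe handlers (task 4), using only the standard library. The `Pinger` interface is an assumption standing in for `sql.DB.PingContext` and the Redis client's `Ping`:

```go
package server

import (
    "encoding/json"
    "net/http"
)

// Pinger abstracts a dependency health probe (database, Redis).
type Pinger interface {
    Ping() error
}

// HealthHandler is a liveness probe: the process is up and serving.
func HealthHandler(version string) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        w.Header().Set("Content-Type", "application/json")
        json.NewEncoder(w).Encode(map[string]string{"status": "ok", "version": version})
    }
}

// ReadyHandler is a readiness probe: dependencies are reachable, so the
// load balancer may route traffic here.
func ReadyHandler(db, rdb Pinger) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        resp := map[string]any{"ready": true, "db": "ok", "redis": "ok"}
        if err := db.Ping(); err != nil {
            resp["ready"], resp["db"] = false, err.Error()
        }
        if err := rdb.Ping(); err != nil {
            resp["ready"], resp["redis"] = false, err.Error()
        }
        w.Header().Set("Content-Type", "application/json")
        if resp["ready"] == false {
            w.WriteHeader(http.StatusServiceUnavailable)
        }
        json.NewEncoder(w).Encode(resp)
    }
}
```

Keeping liveness and readiness separate matters for zero-downtime rollouts: a pod that is alive but not yet ready receives no traffic instead of being restarted.
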

---

### Story 5.2: Infrastructure as Code (Kubernetes)

**Priority:** P1 (High)
**Estimate:** 8 story points (2-3 days)
**Labels:** `devops`, `infrastructure`, `k8s`

**User Story:**

```
As a platform engineer,
I want all infrastructure defined as code,
So that environments are reproducible and version-controlled.
```

**Acceptance Criteria:**

- [ ] Kubernetes manifests for all services
- [ ] Helm charts for easy deployment
- [ ] ConfigMaps for configuration
- [ ] Secrets management with sealed secrets
- [ ] Horizontal Pod Autoscaling configured
- [ ] Ingress with TLS termination
- [ ] Persistent volumes for PostgreSQL/Redis
- [ ] Network policies for security

**Technical Tasks:**

1. Enhance `deploy/k8s` manifests
2. Create Deployment YAML for backend
3. Create Service and Ingress YAMLs
4. Create ConfigMap for app configuration
5. Set up Sealed Secrets for sensitive data
6. Create HorizontalPodAutoscaler
7. Add resource limits and requests
8. Create StatefulSets for databases
9. Set up persistent volume claims
10. Create Helm chart structure
11. Document Kubernetes deployment

**File Structure:**

```
deploy/k8s/
├── base/
│   ├── deployment.yaml
│   ├── service.yaml
│   ├── ingress.yaml
│   ├── configmap.yaml
│   └── hpa.yaml
├── overlays/
│   ├── staging/
│   └── production/
└── helm/
    └── tercul-backend/
        ├── Chart.yaml
        ├── values.yaml
        └── templates/
```

---

### Story 5.3: Disaster Recovery & Backups

**Priority:** P1 (High)
**Estimate:** 5 story points (1-2 days)
**Labels:** `devops`, `backup`, `disaster-recovery`

**User Story:**

```
As a business owner,
I want automated backups and disaster recovery procedures,
So that we never lose user data or have extended outages.
```

**Acceptance Criteria:**

- [ ] Daily PostgreSQL backups
- [ ] Point-in-time recovery capability
- [ ] Backup retention policy (30 days)
- [ ] Backup restoration tested monthly
- [ ] Backup encryption at rest
- [ ] Off-site backup storage
- [ ] Disaster recovery runbook
- [ ] RTO < 1 hour, RPO < 15 minutes

**Technical Tasks:**

1. Set up automated database backups
2. Configure WAL archiving for PostgreSQL
3. Implement backup retention policy
4. Store backups in S3/GCS with encryption
5. Create backup restoration script
6. Test restoration procedure
7. Create disaster recovery runbook
8. Set up backup monitoring and alerts
9. Document backup procedures
10. Schedule regular DR drills

---

## 🎯 EPIC 6: Security Hardening (HIGH PRIORITY)

### Story 6.1: Security Audit & Vulnerability Scanning

**Priority:** P0 (Critical)
**Estimate:** 5 story points (1-2 days)
**Labels:** `security`, `compliance`

**User Story:**

```
As a security officer,
I want continuous vulnerability scanning and security best practices,
So that user data and the platform remain secure.
```

**Acceptance Criteria:**

- [ ] Dependency scanning with Dependabot (already active)
- [ ] SAST scanning with CodeQL
- [ ] Container scanning with Trivy
- [ ] No high/critical vulnerabilities
- [ ] Security headers configured
- [ ] Rate limiting on all endpoints
- [ ] Input validation on all mutations
- [ ] SQL injection prevention verified

**Technical Tasks:**

1. Review existing security workflows (already good!)
2. Add rate limiting middleware
3. Implement input validation with go-playground/validator
4. Add security headers middleware (see the sketch after the header list)
5. Audit SQL queries for injection risks
6. Review JWT implementation for best practices
7. Add CSRF protection for mutations
8. Implement request signing for sensitive operations
9. Create security incident response plan
10. Document security practices

**Security Headers:**

```
X-Frame-Options: DENY
X-Content-Type-Options: nosniff
X-XSS-Protection: 1; mode=block
Strict-Transport-Security: max-age=31536000
Content-Security-Policy: default-src 'self'
```

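Task 4 amounts to a few lines of stdlib middleware setting exactly the headers above:

```go
package middleware

import "net/http"

// SecurityHeaders sets the hardening headers from the list above on every
// response before the wrapped handler writes its body.
func SecurityHeaders(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        h := w.Header()
        h.Set("X-Frame-Options", "DENY")
        h.Set("X-Content-Type-Options", "nosniff")
        h.Set("X-XSS-Protection", "1; mode=block")
        h.Set("Strict-Transport-Security", "max-age=31536000")
        h.Set("Content-Security-Policy", "default-src 'self'")
        next.ServeHTTP(w, r)
    })
}
```
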

---

### Story 6.2: API Rate Limiting & Throttling

**Priority:** P1 (High)
**Estimate:** 3 story points (1 day)
**Labels:** `security`, `performance`, `api`

**User Story:**

```
As a platform operator,
I want rate limiting to prevent abuse and ensure fair usage,
So that all users have a good experience and our infrastructure isn't overwhelmed.
```

**Acceptance Criteria:**

- [ ] Rate limiting per user (authenticated)
- [ ] Rate limiting per IP (anonymous)
- [ ] Different limits for different operations
- [ ] 429 status code with Retry-After header
- [ ] Rate limit info in response headers
- [ ] Configurable rate limits
- [ ] Redis-based distributed rate limiting
- [ ] Rate limit metrics and monitoring

**Technical Tasks:**

1. Implement rate limiting middleware
2. Use Redis for distributed rate limiting (see the sketch after the limits)
3. Configure different limits for read/write
4. Add rate limit headers to responses
5. Create rate limit exceeded error handling
6. Add rate limit bypass for admins
7. Monitor rate limit usage
8. Document rate limits in API docs
9. Add tests for rate limiting
10. Create rate limit dashboard

**Rate Limits:**

```
Authenticated Users:
- 1000 requests/hour (general)
- 100 writes/hour (mutations)
- 10 searches/minute

Anonymous Users:
- 100 requests/hour
- 10 writes/hour
- 5 searches/minute
```

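For the Redis-backed limiter (task 2), a fixed-window counter is the simplest distributed sketch. This assumes go-redis v9; the caller maps a `false` result to a 429 with a `Retry-After` header. A sliding window or token bucket would be smoother at window boundaries:

```go
package middleware

import (
    "context"
    "fmt"
    "time"

    "github.com/redis/go-redis/v9"
)

// Allow implements a fixed-window counter shared across instances:
// INCR the per-subject key for the current window and set its expiry
// on first use. subject is a user ID or client IP; window must be >= 1s.
func Allow(ctx context.Context, rdb *redis.Client, subject string, limit int64, window time.Duration) (bool, error) {
    key := fmt.Sprintf("ratelimit:%s:%d", subject, time.Now().Unix()/int64(window.Seconds()))
    n, err := rdb.Incr(ctx, key).Result()
    if err != nil {
        return true, err // fail open so a Redis outage doesn't block all traffic
    }
    if n == 1 {
        rdb.Expire(ctx, key, window) // first hit in this window sets the TTL
    }
    return n <= limit, nil
}
```

Failing open is a deliberate trade-off here: it preserves availability during a Redis failover at the cost of briefly unenforced limits.
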

---

## 🎯 EPIC 7: Developer Experience (MEDIUM PRIORITY)

### Story 7.1: Local Development Environment Improvements

**Priority:** P2 (Medium)
**Estimate:** 3 story points (1 day)
**Labels:** `devex`, `tooling`

**User Story:**

```
As a developer,
I want a fast, reliable local development environment,
So that I can iterate quickly without friction.
```

**Acceptance Criteria:**

- [ ] One-command setup (`make setup`)
- [ ] Hot reload for Go code changes
- [ ] Database seeding with realistic data
- [ ] GraphQL Playground pre-configured
- [ ] All services start reliably
- [ ] Clear error messages when setup fails
- [ ] Development docs up-to-date

**Technical Tasks:**

1. Create comprehensive `make setup` target
2. Add `air` for hot reload in docker-compose
3. Create database seeding script
4. Add sample data fixtures
5. Pre-configure GraphQL Playground
6. Add health check script
7. Improve error messages in Makefile
8. Document common setup issues
9. Create troubleshooting guide
10. Add setup validation script

---

### Story 7.2: Testing Infrastructure Improvements

**Priority:** P2 (Medium)
**Estimate:** 5 story points (1-2 days)
**Labels:** `testing`, `devex`

**User Story:**

```
As a developer writing tests,
I want fast, reliable test execution without external dependencies,
So that I can practice TDD effectively.
```

**Acceptance Criteria:**

- [ ] Unit tests run in < 5 seconds
- [ ] Integration tests isolated with test containers
- [ ] Parallel test execution
- [ ] Test coverage reports
- [ ] Fixtures for common test scenarios
- [ ] Clear test failure messages
- [ ] Easy to run a single test or package

**Technical Tasks:**

1. Refactor `internal/testutil` for better isolation
2. Implement test containers for integration tests
3. Add parallel test execution
4. Create reusable test fixtures
5. Set up coverage reporting
6. Add golden file testing utilities
7. Create test data builders (see the sketch after this list)
8. Improve test naming conventions
9. Document testing best practices
10. Add `make test-fast` and `make test-all` targets

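For task 7, a small test-data builder keeps fixtures readable and lets each test override only what it cares about. The `Work` fields here are illustrative, not the real domain model:

```go
package testutil

// Work is an illustrative model for the builder sketch.
type Work struct {
    Title    string
    Language string
    Tags     []string
}

// WorkBuilder accumulates overrides on top of sensible defaults.
type WorkBuilder struct {
    w Work
}

// NewWork starts from defaults so tests only state what they override.
func NewWork() *WorkBuilder {
    return &WorkBuilder{w: Work{Title: "Untitled", Language: "en"}}
}

func (b *WorkBuilder) WithTitle(t string) *WorkBuilder    { b.w.Title = t; return b }
func (b *WorkBuilder) WithLanguage(l string) *WorkBuilder { b.w.Language = l; return b }
func (b *WorkBuilder) WithTags(tags ...string) *WorkBuilder {
    b.w.Tags = tags
    return b
}

func (b *WorkBuilder) Build() Work { return b.w }
```

A test then reads as intent rather than setup: `w := testutil.NewWork().WithLanguage("de").Build()`.
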

---

## 📋 Task Summary & Prioritization

### Sprint 1 (Week 1): Critical Production Readiness

1. **Search Implementation** (Story 1.1) - 8 pts ✅ (completed)
2. **Distributed Tracing** (Story 3.1) - 8 pts
3. **Prometheus Metrics** (Story 3.2) - 5 pts

**Total:** 21 points

### Sprint 2 (Week 2): Performance & Documentation

1. **API Documentation** (Story 2.1) - 5 pts
2. **Read Models/DTOs** (Story 4.1) - 8 pts
3. **Redis Caching** (Story 4.2) - 5 pts
4. **Structured Logging** (Story 3.3) - 3 pts

**Total:** 21 points

### Sprint 3 (Week 3): Deployment & Security

1. **Production Deployment** (Story 5.1) - 8 pts
2. **Security Audit** (Story 6.1) - 5 pts
3. **Rate Limiting** (Story 6.2) - 3 pts
4. **Developer Docs** (Story 2.2) - 3 pts

**Total:** 19 points

### Sprint 4 (Week 4): Infrastructure & Polish

1. **Kubernetes IaC** (Story 5.2) - 8 pts
2. **Disaster Recovery** (Story 5.3) - 5 pts
3. **Advanced Search Filters** (Story 1.2) - 5 pts

**Total:** 18 points

### Sprint 5 (Week 5): Optimization & DevEx

1. **Database Optimization** (Story 4.3) - 5 pts
2. **Local Dev Environment** (Story 7.1) - 3 pts
3. **Testing Infrastructure** (Story 7.2) - 5 pts

**Total:** 13 points

## 🎯 Success Metrics

### Performance SLOs

- API response time p95 < 200ms
- Search response time p95 < 300ms
- Database query time p95 < 50ms
- Cache hit rate > 70%

### Reliability SLOs

- Uptime > 99.9% (< 8.7 hours downtime/year)
- Error rate < 0.1%
- Mean Time To Recovery < 1 hour
- Zero data loss

### Developer Experience

- Setup time < 15 minutes
- Test suite runs in < 2 minutes
- Build time < 1 minute
- Documentation completeness > 90%

---

**Next Steps:**

1. Review and prioritize these tasks with the team
2. Create GitHub issues for Sprint 1 tasks
3. Add tasks to the project board
4. Begin implementation, starting with search and observability

**This is a realistic, achievable roadmap based on the ACTUAL current state of the codebase!** 🚀