
16. DevOps & Infrastructure

Deployment Architecture

Application Architecture

┌─────────────────────────────────────────────────────────────────┐
│                          Load Balancer                          │
│                           (AWS ALB/NLB)                         │
└─────────────────┬───────────────────────────────────────────────┘
                  │
    ┌─────────────┼─────────────┐
    │             │             │
┌───▼───┐   ┌─────▼─────┐   ┌───▼───┐
│  API  │   │  Worker   │   │  Web  │
│Gateway│   │ Services  │   │ Front │
│(Kong) │   │(Matching) │   │(Next) │
└───┬───┘   └─────┬─────┘   └───────┘
    │             │
┌───▼─────────────▼─────────────────┐
│         Service Mesh              │
│        (Istio/Linkerd)            │
│                                   │
│  ┌─────────────┬─────────────┐    │
│  │  Neo4j      │  PostgreSQL │    │
│  │  Cluster    │  + PostGIS  │    │
│  └─────────────┴─────────────┘    │
│                                   │
│  ┌───────────────────────────┐    │
│  │       Redis Cluster       │    │
│  │  (Cache + PubSub + Jobs)  │    │
│  └───────────────────────────┘    │
└───────────────────────────────────┘

Infrastructure Components

Production Stack:

  • Cloud Provider: AWS (EKS) or Google Cloud (GKE)
  • Kubernetes: Managed Kubernetes service
  • Load Balancing: AWS ALB/NLB or GCP Load Balancer
  • CDN: CloudFront or Cloudflare for static assets
  • Object Storage: S3 or GCS for backups and assets
  • Monitoring: Prometheus + Grafana (managed)
  • Logging: Loki or CloudWatch

Development Stack:

  • Local Development: Docker Compose + Kind (Kubernetes in Docker); a minimal Kind config sketch follows this list
  • CI/CD: GitHub Actions with self-hosted runners
  • Preview Environments: Ephemeral environments per PR
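
For the Kind part of the local setup, a cluster definition can live in the repository so every developer runs the same topology; the file path, cluster name, and port mapping below are illustrative rather than taken from the actual repo:

# dev/kind-cluster.yaml (illustrative)
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
name: city-resource-graph-dev
nodes:
  - role: control-plane
    extraPortMappings:
      # Expose a NodePort service on localhost:8080 for the API
      - containerPort: 30080
        hostPort: 8080
        protocol: TCP
  - role: worker

The cluster is created with "kind create cluster --config dev/kind-cluster.yaml", with Docker Compose providing the databases alongside it.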

Infrastructure as Code

Terraform Configuration Structure

infrastructure/
├── environments/
│   ├── dev/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── staging/
│   │   └── ...
│   └── prod/
│       └── ...
├── modules/
│   ├── eks/
│   ├── rds/
│   ├── elasticache/
│   ├── networking/
│   └── monitoring/
├── shared/
│   ├── providers.tf
│   ├── versions.tf
│   └── backend.tf
└── scripts/
    ├── init.sh
    └── plan.sh

Core Infrastructure Module

# infrastructure/modules/eks/main.tf
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 19.0"

  cluster_name    = var.cluster_name
  cluster_version = "1.27"

  vpc_id     = var.vpc_id
  subnet_ids = var.private_subnets

  # Managed node groups
  eks_managed_node_groups = {
    general = {
      instance_types = ["t3.large"]
      min_size       = 1
      max_size       = 10
      desired_size   = 3

      labels = {
        Environment = var.environment
        NodeGroup   = "general"
      }
    }

    matching = {
      instance_types = ["c6i.xlarge"]  # CPU-optimized for matching engine
      min_size       = 2
      max_size       = 20
      desired_size   = 5

      labels = {
        Environment = var.environment
        NodeGroup   = "matching"
      }
    }
  }
}

Database Infrastructure

# infrastructure/modules/database/main.tf
resource "aws_db_instance" "postgresql" {
  identifier             = "${var.environment}-city-resource-graph"
  engine                 = "postgres"
  engine_version         = "15.4"
  instance_class         = "db.r6g.large"
  allocated_storage      = 100
  max_allocated_storage  = 1000
  storage_type           = "gp3"

  # PostGIS is enabled per-database with `CREATE EXTENSION postgis;` (it is not a
  # shared_preload_libraries entry); the custom parameter group below preloads
  # pg_stat_statements for query statistics.
  parameter_group_name = aws_db_parameter_group.postgresql.name

  # Multi-AZ for production
  multi_az               = var.environment == "prod"
  backup_retention_period = 30

  # Security
  vpc_security_group_ids = [aws_security_group.database.id]
  db_subnet_group_name   = aws_db_subnet_group.database.name

  # Monitoring
  enabled_cloudwatch_logs_exports = ["postgresql", "upgrade"]
  monitoring_interval             = 60
  monitoring_role_arn            = aws_iam_role.rds_enhanced_monitoring.arn
}

resource "aws_db_parameter_group" "postgis" {
  family = "postgres15"
  name   = "${var.environment}-postgis"

  parameter {
    name  = "shared_preload_libraries"
    value = "postgis"
  }
}

Kubernetes Configuration

Application Deployment

# k8s/base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: city-resource-graph-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: city-resource-graph-api
  template:
    metadata:
      labels:
        app: city-resource-graph-api
    spec:
      containers:
      - name: api
        image: cityresourcegraph/api:latest
        ports:
        - containerPort: 8080
        env:
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: database-secret
              key: url
        - name: REDIS_URL
          valueFrom:
            secretKeyRef:
              name: redis-secret
              key: url
        resources:
          requests:
            memory: "512Mi"
            cpu: "250m"
          limits:
            memory: "1Gi"
            cpu: "500m"
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
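
The Deployment above is exposed through a plain ClusterIP Service, which is also the host the Istio configuration in the next section routes to. A minimal sketch (the file path is assumed; ports mirror the container port above):

# k8s/base/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: city-resource-graph-api
  labels:
    app: city-resource-graph-api
spec:
  selector:
    app: city-resource-graph-api
  ports:
  - name: http
    port: 80          # port other workloads and the mesh use
    targetPort: 8080  # containerPort of the api container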

Service Mesh Configuration

# k8s/base/istio.yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: city-resource-graph-api
spec:
  hosts:
  - city-resource-graph-api  # in-mesh host this routing applies to
  http:
  - match:
    - uri:
        prefix: "/api/v1"
    route:
    - destination:
        host: city-resource-graph-api
        subset: v1
  - match:
    - uri:
        prefix: "/api/v2"
    route:
    - destination:
        host: city-resource-graph-api
        subset: v2
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: city-resource-graph-api
spec:
  host: city-resource-graph-api
  subsets:
  - name: v1
    labels:
      version: v1
  - name: v2
    labels:
      version: v2
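
The v1/v2 subsets are what make canary rollouts possible: a VirtualService can split traffic between them by weight. A sketch with an arbitrary 90/10 split (the manifest name and weights are illustrative):

# k8s/base/istio-canary.yaml (illustrative)
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: city-resource-graph-api-canary
spec:
  hosts:
  - city-resource-graph-api
  http:
  - route:
    - destination:
        host: city-resource-graph-api
        subset: v1
      weight: 90   # stable version keeps most traffic
    - destination:
        host: city-resource-graph-api
        subset: v2
      weight: 10   # canary receives a small share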

CI/CD Pipeline

GitHub Actions Workflow

# .github/workflows/deploy.yml
name: Deploy to Kubernetes

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v4
    - uses: actions/setup-go@v4
      with:
        go-version: '1.21'
    - name: Test
      run: |
        go test -v -race -coverprofile=coverage.out ./...
        go tool cover -html=coverage.out -o coverage.html        

  build-and-push:
    needs: test
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
    steps:
    - name: Checkout repository
      uses: actions/checkout@v4

    - name: Log in to registry
      run: echo "${{ secrets.GITHUB_TOKEN }}" | docker login ghcr.io -u ${{ github.actor }} --password-stdin

    - name: Extract metadata
      id: meta
      uses: docker/metadata-action@v5
      with:
        images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
        tags: |
          type=ref,event=branch
          type=sha,prefix=,format=long

    - name: Build and push Docker image
      uses: docker/build-push-action@v5
      with:
        context: .
        push: ${{ github.event_name != 'pull_request' }}  # build only, no push, for PRs
        tags: ${{ steps.meta.outputs.tags }}
        labels: ${{ steps.meta.outputs.labels }}

  deploy:
    needs: build-and-push
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    environment: production
    steps:
    - name: Deploy to Kubernetes
      uses: azure/k8s-deploy@v4
      with:
        namespace: production
        manifests: |
          k8s/production/deployment.yaml
          k8s/production/service.yaml          
        images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}
        kubectl-version: latest

Database Migration Strategy

# k8s/jobs/migration.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: database-migration
spec:
  template:
    spec:
      containers:
      - name: migrate
        image: migrate/migrate:latest
        command: ["migrate", "-path", "/migrations", "-database", "$(DATABASE_URL)", "up"]
        env:
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: database-secret
              key: url
        volumeMounts:
        - name: migrations
          mountPath: /migrations
      volumes:
      - name: migrations
        configMap:
          name: database-migrations
      restartPolicy: Never

Monitoring & Observability

Prometheus Configuration

# k8s/monitoring/prometheus.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: city-resource-graph-alerts
spec:
  groups:
  - name: city-resource-graph
    rules:
    - alert: HighErrorRate
      expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High error rate detected"
        description: "Error rate is {{ $value | humanizePercentage }}"

    - alert: MatchingEngineSlow
      expr: histogram_quantile(0.95, sum by (le) (rate(matching_duration_seconds_bucket[5m]))) > 2
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Matching engine is slow"
        description: "95th percentile matching duration is {{ $value | printf \"%.2f\" }}s"

Grafana Dashboards

  • Application Metrics: Response times, error rates, throughput
  • Business Metrics: Match conversion rates, user engagement, revenue
  • Infrastructure Metrics: CPU/memory usage, database connections, cache hit rates
  • Domain Metrics: Matching accuracy, economic value calculations

Security & Compliance

Infrastructure Security

# k8s/security/network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-to-database
spec:
  podSelector:
    matchLabels:
      app: city-resource-graph-api
  policyTypes:
  - Egress
  egress:
  - to:
    - podSelector:
        matchLabels:
          app: postgresql
    ports:
    - protocol: TCP
      port: 5432
  - to:
    - podSelector:
        matchLabels:
          app: neo4j
    ports:
    - protocol: TCP
      port: 7687
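
Allow-list rules like the one above only take effect as a restriction when the namespace also denies traffic by default. A typical companion policy (DNS and any other required egress must then be allowed explicitly as well):

# k8s/security/default-deny.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
spec:
  podSelector: {}        # applies to every pod in the namespace
  policyTypes:
  - Ingress
  - Egress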

Secrets Management

  • AWS Secrets Manager or GCP Secret Manager for production
  • Sealed Secrets for Kubernetes-native secret management
  • External Secrets Operator for automatic secret rotation
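
With the External Secrets Operator, the database-secret consumed by the Deployment and the migration Job can be synced from AWS Secrets Manager rather than created by hand. A sketch, where the ClusterSecretStore name and the Secrets Manager key are assumptions:

# k8s/security/external-secret.yaml (illustrative)
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: database-secret
spec:
  refreshInterval: 1h              # re-sync (and pick up rotations) hourly
  secretStoreRef:
    kind: ClusterSecretStore
    name: aws-secrets-manager      # assumed store, configured separately
  target:
    name: database-secret          # Kubernetes Secret consumed by the Deployment
  data:
  - secretKey: url
    remoteRef:
      key: prod/city-resource-graph/database-url   # assumed Secrets Manager key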

Backup & Disaster Recovery

Database Backups

# Daily automated backup
pg_dump --host=$DB_HOST --username=$DB_USER --dbname=$DB_NAME \
        --format=custom --compress=9 --file=/backups/$(date +%Y%m%d_%H%M%S).backup

# Point-in-time recovery capability
# Retention: 30 days for daily, 1 year for weekly
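
Inside the cluster, the nightly pg_dump can run as a Kubernetes CronJob so backups do not depend on an operator's workstation. A sketch, where the image tag, schedule, and backup volume are assumptions:

# k8s/jobs/backup-cronjob.yaml (illustrative)
apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-backup
spec:
  schedule: "0 2 * * *"            # daily at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: pg-dump
            image: postgres:15
            command: ["/bin/sh", "-c"]
            args:
            - |
              pg_dump "$DATABASE_URL" --format=custom --compress=9 \
                --file=/backups/$(date +%Y%m%d_%H%M%S).backup
            env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: database-secret
                  key: url
            volumeMounts:
            - name: backups
              mountPath: /backups
          volumes:
          - name: backups
            persistentVolumeClaim:
              claimName: backup-storage   # assumed PVC; an S3 upload step can follow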

Disaster Recovery

  • Multi-region deployment for production
  • Cross-region backup replication (see the schedule sketch after this list)
  • Automated failover with Route 53 health checks
  • Recovery Time Objective (RTO): 4 hours
  • Recovery Point Objective (RPO): 1 hour
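
Cluster state (manifests, volumes) is backed up separately from the databases, typically with a Velero schedule whose object-store bucket is replicated to the second region. A sketch, assuming Velero is installed in the velero namespace and bucket replication is configured elsewhere:

# k8s/backup/velero-schedule.yaml (illustrative)
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-cluster-backup
  namespace: velero
spec:
  schedule: "0 3 * * *"            # daily at 03:00
  template:
    includedNamespaces:
    - production
    ttl: 720h                      # keep backups for 30 days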

Cost Optimization

Resource Optimization

# k8s/autoscaling/hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: city-resource-graph-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: city-resource-graph-api
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80

Cloud Cost Management

  • Reserved Instances: 70% of baseline capacity
  • Spot Instances: For batch processing and development (see the scheduling sketch after this list)
  • Auto-scaling: Scale-to-zero for development environments
  • Cost Allocation Tags: Track costs by service, environment, team
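
Batch workloads are steered onto spot capacity with a node selector and toleration. In the sketch below, the capacityType label is the one EKS sets on managed spot node groups, while the Job name, taint, and command are hypothetical:

# k8s/batch/spot-job.yaml (illustrative)
apiVersion: batch/v1
kind: Job
metadata:
  name: matching-batch-recompute             # hypothetical batch workload
spec:
  template:
    spec:
      restartPolicy: Never
      nodeSelector:
        eks.amazonaws.com/capacityType: SPOT # label EKS applies to spot managed node groups
      tolerations:
      - key: "spot"                          # assumed taint on the spot node group
        operator: "Exists"
        effect: "NoSchedule"
      containers:
      - name: batch
        image: cityresourcegraph/api:latest                 # reusing the API image (assumption)
        command: ["/app/api", "batch", "recompute-matches"] # hypothetical CLI command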

Documentation

Technical Documentation:

  1. API Documentation: OpenAPI/Swagger specification, interactive API explorer (Swagger UI, ReDoc), code examples (a minimal spec excerpt follows this list)
  2. Architecture Diagrams: C4 model diagrams (Context, Container, Component, Code), sequence diagrams, data flow diagrams, deployment architecture
  3. Runbooks: Operational procedures, troubleshooting guides, incident response procedures
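
As a flavour of the OpenAPI specification from point 1, a minimal excerpt might look like the following; the path, title, and responses are placeholders rather than the real contract:

# docs/api/openapi.yaml (illustrative excerpt)
openapi: 3.0.3
info:
  title: City Resource Graph API   # placeholder title
  version: 1.0.0
paths:
  /api/v1/organizations/{id}:      # placeholder endpoint
    get:
      summary: Fetch a single organization
      parameters:
      - name: id
        in: path
        required: true
        schema:
          type: string
      responses:
        "200":
          description: Organization found
        "404":
          description: Organization not found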

Developer Documentation:

  • Getting Started Guide: Local setup, development workflow
  • Contributing Guide: Code standards, PR process
  • Architecture Decisions: ADR index
  • API Client Libraries: SDKs for popular languages