## 16. DevOps & Infrastructure

### Deployment Architecture

#### Application Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│                          Load Balancer                          │
│                           (AWS ALB/NLB)                         │
└─────────────────┬───────────────────────────────────────────────┘
                  │
    ┌─────────────┼─────────────┐
    │             │             │
┌───▼───┐   ┌─────▼─────┐   ┌───▼───┐
│  API  │   │  Worker   │   │  Web  │
│Gateway│   │ Services  │   │ Front │
│(Kong) │   │(Matching) │   │(Next) │
└───┬───┘   └─────┬─────┘   └───────┘
    │             │
┌───▼─────────────▼─────────────────┐
│         Service Mesh              │
│        (Istio/Linkerd)            │
│                                   │
│  ┌─────────────┬─────────────┐    │
│  │  Neo4j      │  PostgreSQL │    │
│  │  Cluster    │  + PostGIS  │    │
│  └─────────────┴─────────────┘    │
│                                   │
│  ┌───────────────────────────┐    │
│  │       Redis Cluster       │    │
│  │  (Cache + PubSub + Jobs)  │    │
│  └───────────────────────────┘    │
└───────────────────────────────────┘
```

#### Infrastructure Components

**Production Stack**:
- **Cloud Provider**: AWS (EKS) or Google Cloud (GKE)
- **Kubernetes**: Managed Kubernetes service
- **Load Balancing**: AWS ALB/NLB or GCP Load Balancer
- **CDN**: CloudFront or Cloudflare for static assets
- **Object Storage**: S3 or GCS for backups and assets
- **Monitoring**: Prometheus + Grafana (managed)
- **Logging**: Loki or CloudWatch

**Development Stack**:
- **Local Development**: Docker Compose + Kind (Kubernetes in Docker)
- **CI/CD**: GitHub Actions with self-hosted runners
- **Preview Environments**: Ephemeral environments per PR

### Infrastructure as Code

#### Terraform Configuration Structure
```
infrastructure/
├── environments/
│   ├── dev/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── staging/
│   │   └── ...
│   └── prod/
│       └── ...
├── modules/
│   ├── eks/
│   ├── rds/
│   ├── elasticache/
│   ├── networking/
│   └── monitoring/
├── shared/
│   ├── providers.tf
│   ├── versions.tf
│   └── backend.tf
└── scripts/
    ├── init.sh
    └── plan.sh
```

#### Core Infrastructure Module
```hcl
# infrastructure/modules/eks/main.tf
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 19.0"

  cluster_name    = var.cluster_name
  cluster_version = "1.27"

  vpc_id     = var.vpc_id
  subnet_ids = var.private_subnets

  # Managed node groups
  eks_managed_node_groups = {
    general = {
      instance_types = ["t3.large"]
      min_size       = 1
      max_size       = 10
      desired_size   = 3

      labels = {
        Environment = var.environment
        NodeGroup   = "general"
      }
    }

    matching = {
      instance_types = ["c6i.xlarge"]  # CPU-optimized for matching engine
      min_size       = 2
      max_size       = 20
      desired_size   = 5

      labels = {
        Environment = var.environment
        NodeGroup   = "matching"
      }
    }
  }
}
```

#### Database Infrastructure
```hcl
# infrastructure/modules/rds/main.tf
resource "aws_db_instance" "postgresql" {
  identifier             = "${var.environment}-city-resource-graph"
  engine                 = "postgres"
  engine_version         = "15.4"
  instance_class         = "db.r6g.large"
  allocated_storage      = 100
  max_allocated_storage  = 1000
  storage_type           = "gp3"

  # Custom parameter group. PostGIS is not a preload library: on RDS it is
  # enabled per database with `CREATE EXTENSION postgis;` (e.g. in a migration).
  parameter_group_name = aws_db_parameter_group.postgis.name

  # Multi-AZ for production
  multi_az                = var.environment == "prod"
  backup_retention_period = 30

  # Security
  vpc_security_group_ids = [aws_security_group.database.id]
  db_subnet_group_name   = aws_db_subnet_group.database.name

  # Monitoring
  enabled_cloudwatch_logs_exports = ["postgresql", "upgrade"]
  monitoring_interval             = 60
  monitoring_role_arn             = aws_iam_role.rds_enhanced_monitoring.arn
}

resource "aws_db_parameter_group" "postgis" {
  family = "postgres15"
  name   = "${var.environment}-postgis"

  # shared_preload_libraries is a static parameter, so changes apply on reboot.
  parameter {
    name         = "shared_preload_libraries"
    value        = "pg_stat_statements"
    apply_method = "pending-reboot"
  }
}
```

### Kubernetes Configuration

#### Application Deployment
```yaml
# k8s/base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: city-resource-graph-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: city-resource-graph-api
  template:
    metadata:
      labels:
        app: city-resource-graph-api
        version: v1  # matched by the DestinationRule subsets
    spec:
      containers:
      - name: api
        image: cityresourcegraph/api:latest  # pin to an immutable tag or digest in production
        ports:
        - containerPort: 8080
        env:
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: database-secret
              key: url
        - name: REDIS_URL
          valueFrom:
            secretKeyRef:
              name: redis-secret
              key: url
        resources:
          requests:
            memory: "512Mi"
            cpu: "250m"
          limits:
            memory: "1Gi"
            cpu: "500m"
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
```

#### Service Mesh Configuration
```yaml
# k8s/base/istio.yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: city-resource-graph-api
spec:
  hosts:
  - city-resource-graph-api
  http:
  - match:
    - uri:
        prefix: "/api/v1"
    route:
    - destination:
        host: city-resource-graph-api
        subset: v1
  - match:
    - uri:
        prefix: "/api/v2"
    route:
    - destination:
        host: city-resource-graph-api
        subset: v2
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: city-resource-graph-api
spec:
  host: city-resource-graph-api
  subsets:
  - name: v1
    labels:
      version: v1
  - name: v2
    labels:
      version: v2
```

### CI/CD Pipeline

#### GitHub Actions Workflow
```yaml
# .github/workflows/deploy.yml
name: Deploy to Kubernetes

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v4
    - uses: actions/setup-go@v4
      with:
        go-version: '1.21'
    - name: Test
      run: |
        go test -v -race -coverprofile=coverage.out ./...
        go tool cover -html=coverage.out -o coverage.html

  build-and-push:
    needs: test
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
    steps:
    - name: Checkout repository
      uses: actions/checkout@v4

    - name: Log in to registry
      run: echo "${{ secrets.GITHUB_TOKEN }}" | docker login ghcr.io -u ${{ github.actor }} --password-stdin

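    - name: Extract metadata
      id: meta
      uses: docker/metadata-action@v5
      with:
        images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
        # Tag with the commit SHA so the deploy job can reference this exact image
        tags: |
          type=raw,value=${{ github.sha }}

    - name: Build and push Docker image
      uses: docker/build-push-action@v5
      with:
        context: .
        # Build on pull requests, but only push from main
        push: ${{ github.event_name != 'pull_request' }}
        tags: ${{ steps.meta.outputs.tags }}
        labels: ${{ steps.meta.outputs.labels }}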

  deploy:
    needs: build-and-push
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    environment: production
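    steps:
    - name: Checkout repository
      uses: actions/checkout@v4

    # Cluster credentials (kubeconfig) must be set up before this step, for
    # example with azure/k8s-set-context or `aws eks update-kubeconfig`.
    - name: Deploy to Kubernetes
      uses: azure/k8s-deploy@v4
      with:
        namespace: production
        manifests: |
          k8s/production/deployment.yaml
          k8s/production/service.yaml
        images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}
        kubectl-version: latest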
```

#### Database Migration Strategy
```yaml
# k8s/jobs/migration.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: database-migration
spec:
  backoffLimit: 0  # do not retry a failed migration automatically
  template:
    spec:
      containers:
      - name: migrate
        image: migrate/migrate:latest
        command: ["migrate", "-path", "/migrations", "-database", "$(DATABASE_URL)", "up"]
        env:
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: database-secret
              key: url
        volumeMounts:
        - name: migrations
          mountPath: /migrations
      volumes:
      - name: migrations
        configMap:
          name: database-migrations
      restartPolicy: Never
```

### Monitoring & Observability

#### Prometheus Configuration
```yaml
# k8s/monitoring/prometheus.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: city-resource-graph-alerts
spec:
  groups:
  - name: city-resource-graph
    rules:
    - alert: HighErrorRate
      expr: |
        sum(rate(http_requests_total{status=~"5.."}[5m]))
          / sum(rate(http_requests_total[5m])) > 0.05
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High error rate detected"
        description: "Error rate is {{ $value | humanizePercentage }}"

    - alert: MatchingEngineSlow
      expr: histogram_quantile(0.95, sum by (le) (rate(matching_duration_seconds_bucket[5m]))) > 2
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Matching engine is slow"
        description: "95th percentile matching duration is {{ $value | printf \"%.2f\" }}s"
```
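
The `MatchingEngineSlow` alert depends on `histogram_quantile`, which estimates a percentile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The sketch below reproduces that estimate in simplified form to show why bucket boundaries limit the alert's precision. The `bucket` type, `quantile` function, and sample counts are illustrative assumptions, not data from the matching service, and the real PromQL function handles more edge cases.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// bucket mirrors one `le` series of a histogram: the cumulative number of
// observations less than or equal to upperBound.
type bucket struct {
	upperBound float64
	count      float64
}

// quantile estimates the q-quantile from cumulative buckets. The highest
// bucket must be +Inf. Note that it sorts the slice it is given.
func quantile(q float64, buckets []bucket) float64 {
	if q < 0 || q > 1 || len(buckets) < 2 {
		return math.NaN()
	}
	sort.Slice(buckets, func(i, j int) bool {
		return buckets[i].upperBound < buckets[j].upperBound
	})
	last := len(buckets) - 1
	total := buckets[last].count
	if !math.IsInf(buckets[last].upperBound, 1) || total == 0 {
		return math.NaN()
	}

	// Find the first finite bucket whose cumulative count reaches the rank.
	rank := q * total
	b := sort.Search(last, func(i int) bool { return buckets[i].count >= rank })
	if b == last {
		// The rank falls in the +Inf bucket: report the highest finite bound.
		return buckets[last-1].upperBound
	}

	start, below := 0.0, 0.0
	if b > 0 {
		start, below = buckets[b-1].upperBound, buckets[b-1].count
	}
	inBucket := buckets[b].count - below
	if inBucket == 0 {
		return start
	}
	// Linear interpolation within the bucket.
	return start + (buckets[b].upperBound-start)*((rank-below)/inBucket)
}

// sampleBuckets returns made-up matching durations in seconds.
func sampleBuckets() []bucket {
	return []bucket{
		{0.5, 50}, {1, 80}, {2, 95}, {5, 100}, {math.Inf(1), 100},
	}
}

func main() {
	p95 := quantile(0.95, sampleBuckets())
	fmt.Printf("p95=%.2fs firing=%t\n", p95, p95 > 2) // p95=2.00s firing=false
}
```

With these sample counts the estimate lands exactly on the 2-second bucket bound, so the `> 2` condition does not fire. Placing a bucket boundary at the alert threshold keeps the estimate accurate where it matters.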

#### Grafana Dashboards
- **Application Metrics**: Response times, error rates, throughput
- **Business Metrics**: Match conversion rates, user engagement, revenue
- **Infrastructure Metrics**: CPU/memory usage, database connections, cache hit rates
- **Domain Metrics**: Matching accuracy, economic value calculations

### Security & Compliance

#### Infrastructure Security
```yaml
# k8s/security/network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-to-database
spec:
  podSelector:
    matchLabels:
      app: city-resource-graph-api
  policyTypes:
  - Egress
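  egress:
  # Allow DNS lookups; an Egress policy otherwise blocks name resolution
  - to:
    - namespaceSelector: {}
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
  # In-cluster PostgreSQL pods. For a managed database such as RDS, use an
  # ipBlock covering the database subnets instead of a podSelector.
  - to:
    - podSelector:
        matchLabels:
          app: postgresql
    ports:
    - protocol: TCP
      port: 5432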
  - to:
    - podSelector:
        matchLabels:
          app: neo4j
    ports:
    - protocol: TCP
      port: 7687
```

#### Secrets Management
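- **AWS Secrets Manager** or **GCP Secret Manager** as the source of truth for production secrets
- **Sealed Secrets** for storing encrypted secrets in Git alongside the manifests
- **External Secrets Operator** to sync secrets from the cloud secret manager into Kubernetes and pick up rotated values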

### Backup & Disaster Recovery

#### Database Backups
```bash
# Daily automated logical backup
pg_dump --host="$DB_HOST" --username="$DB_USER" --dbname="$DB_NAME" \
        --format=custom --compress=9 --file="/backups/$(date +%Y%m%d_%H%M%S).backup"

# Point-in-time recovery comes from RDS automated backups (backup_retention_period),
# not from pg_dump
# Retention: 30 days for daily, 1 year for weekly
```
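
The retention rule in the comment (30 days for daily backups, 1 year for weekly ones) could be enforced by a small pruning job. The following is a sketch under stated assumptions: the timestamp in the file name is treated as UTC, a backup taken on a Sunday is treated as the weekly one, and `shouldRetain`, `backupTime`, and `exampleNow` are hypothetical names. It only decides whether a file is kept; deleting files is left out.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

const (
	dailyRetention  = 30 * 24 * time.Hour
	weeklyRetention = 365 * 24 * time.Hour
)

// exampleNow is a fixed reference time so the example is reproducible.
var exampleNow = time.Date(2024, time.June, 1, 0, 0, 0, 0, time.UTC)

// backupTime parses the timestamp embedded in a name such as
// "20240107_020000.backup" (the format produced by date +%Y%m%d_%H%M%S).
func backupTime(name string) (time.Time, error) {
	return time.Parse("20060102_150405", strings.TrimSuffix(name, ".backup"))
}

// shouldRetain reports whether a backup is still inside its retention window.
// Assumption: Sunday backups count as weekly and are kept for one year.
func shouldRetain(name string, now time.Time) (bool, error) {
	taken, err := backupTime(name)
	if err != nil {
		return false, fmt.Errorf("parse %q: %w", name, err)
	}
	age := now.Sub(taken)
	if age <= dailyRetention {
		return true, nil
	}
	return taken.Weekday() == time.Sunday && age <= weeklyRetention, nil
}

func main() {
	names := []string{
		"20240520_020000.backup", // recent daily backup
		"20240107_020000.backup", // Sunday, older than 30 days
		"20240108_020000.backup", // Monday, older than 30 days
	}
	for _, name := range names {
		keep, err := shouldRetain(name, exampleNow)
		if err != nil {
			fmt.Println("skip:", err)
			continue
		}
		fmt.Printf("%s keep=%t\n", name, keep)
	}
}
```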

#### Disaster Recovery
- **Multi-region deployment** for production
- **Cross-region backup replication**
- **Automated failover** with Route 53 health checks
- **Recovery Time Objective (RTO)**: 4 hours
- **Recovery Point Objective (RPO)**: 1 hour

### Cost Optimization

#### Resource Optimization
```yaml
# k8s/autoscaling/hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: city-resource-graph-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: city-resource-graph-api
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
```
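
To reason about how this HPA reacts to load, it helps to work through the replica calculation Kubernetes documents: `desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric)`, evaluated per metric, with the largest result winning. The sketch below applies that formula to the 70% CPU and 80% memory targets above. The function names are illustrative, the 10% tolerance is the controller's default, and stabilization windows, missing metrics, and pod readiness are ignored.

```go
package main

import (
	"fmt"
	"math"
)

// tolerance is the default HPA tolerance: ratios within 10% of 1.0 do not
// trigger scaling.
const tolerance = 0.1

// desiredReplicas applies the per-metric formula
// ceil(currentReplicas * currentUtilization / targetUtilization).
func desiredReplicas(current int, currentUtil, targetUtil float64) int {
	ratio := currentUtil / targetUtil
	if math.Abs(ratio-1.0) <= tolerance {
		return current
	}
	return int(math.Ceil(float64(current) * ratio))
}

// recommend evaluates the CPU (70%) and memory (80%) targets, takes the
// larger proposal, and clamps it to [minReplicas, maxReplicas].
func recommend(current, minReplicas, maxReplicas int, cpuUtil, memUtil float64) int {
	n := desiredReplicas(current, cpuUtil, 70)
	if m := desiredReplicas(current, memUtil, 80); m > n {
		n = m
	}
	if n < minReplicas {
		n = minReplicas
	}
	if n > maxReplicas {
		n = maxReplicas
	}
	return n
}

func main() {
	// 3 replicas at 105% CPU of requests: ceil(3 * 105/70) = 5.
	fmt.Println(recommend(3, 3, 20, 105, 40)) // 5
	// A large spike is capped by maxReplicas.
	fmt.Println(recommend(10, 3, 20, 300, 50)) // 20
	// Low load never drops below minReplicas.
	fmt.Println(recommend(6, 3, 20, 20, 20)) // 3
}
```

Utilization is measured relative to the container's resource requests (250m CPU and 512Mi memory in the Deployment), so changing the requests shifts the point at which scaling starts.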

#### Cloud Cost Management
- **Reserved Instances**: 70% of baseline capacity
- **Spot Instances**: For batch processing and development
- **Auto-scaling**: Scale-to-zero for development environments
- **Cost Allocation Tags**: Track costs by service, environment, team

### Documentation

**Technical Documentation**:
1. **API Documentation**: OpenAPI/Swagger specification, interactive API explorer (Swagger UI, ReDoc), code examples
2. **Architecture Diagrams**: C4 model diagrams (Context, Container, Component, Code), sequence diagrams, data flow diagrams, deployment architecture
3. **Runbooks**: Operational procedures, troubleshooting guides, incident response procedures

**Developer Documentation**:
- Getting Started Guide: Local setup, development workflow
- Contributing Guide: Code standards, PR process
- Architecture Decisions: ADR index
- API Client Libraries: SDKs for popular languages

---