turash/bugulma/backend/DATA_OPTIMIZATION_REPORT.md

# Data Structure Optimization Report

*Generated: November 23, 2025*

## Executive Summary

This report analyzes the current data architecture across PostgreSQL and Neo4j databases for the city resource graph platform. The analysis identifies opportunities to improve performance, data integrity, and query efficiency through better indexing, relationship modeling, and architectural decisions.

**Current State:**

- PostgreSQL: 20 tables, 64 indexes
- Neo4j: 7 node types, 3 relationship types, 17 indexes
- Data Volume: 1,076 organizations, 9,133 sites, 9,862 addresses

---

## 1. PostgreSQL Database Analysis

### 1.1 Missing Critical Indexes

#### **High Priority - Performance Critical**

1. **Organizations Table**
   - ❌ **Missing**: GIN index on JSONB fields for advanced queries

     ```sql
     CREATE INDEX idx_org_certifications_gin ON organizations USING GIN (certifications);
     CREATE INDEX idx_org_business_focus_gin ON organizations USING GIN (business_focus);
     CREATE INDEX idx_org_technical_expertise_gin ON organizations USING GIN (technical_expertise);
     CREATE INDEX idx_org_products_gin ON organizations USING GIN (products);
     CREATE INDEX idx_org_sells_products_gin ON organizations USING GIN (sells_products);
     CREATE INDEX idx_org_offers_services_gin ON organizations USING GIN (offers_services);
     ```

   - **Impact**: Enables efficient filtering by certification, expertise, products/services
   - **Use Case**: "Find all organizations offering HVAC maintenance services"

   - ❌ **Missing**: Composite index for supply chain queries

     ```sql
     CREATE INDEX idx_org_supply_chain_sector ON organizations (supply_chain_role, industrial_sector);
     ```

   - **Impact**: Optimizes supply chain network queries

   - ❌ **Missing**: Index on trust_score for partner discovery

     ```sql
     CREATE INDEX idx_org_trust_score ON organizations (trust_score DESC);
     ```

   - **Impact**: Enables efficient filtering by trust level

2. **Sites Table**
   - ❌ **Missing**: Composite index for site search

     ```sql
     CREATE INDEX idx_sites_type_ownership ON sites (site_type, ownership);
     ```

   - **Impact**: Optimizes queries for available/leased industrial sites

   - ❌ **Missing**: GIN indexes on JSONB fields

     ```sql
     CREATE INDEX idx_sites_utilities_gin ON sites USING GIN (available_utilities);
     CREATE INDEX idx_sites_waste_mgmt_gin ON sites USING GIN (waste_management);
     ```

   - **Impact**: Enables filtering by specific utility availability

   - ❌ **Missing**: Functional index for available capacity

     ```sql
     CREATE INDEX idx_sites_has_space ON sites ((floor_area_m2 > 0)) WHERE floor_area_m2 IS NOT NULL;
     ```

3. **Resource Flows Table**
   - ❌ **Missing**: Critical composite index for matching algorithm

     ```sql
     CREATE INDEX idx_rf_matching ON resource_flows (type, direction, precision_level)
     WHERE precision_level IN ('measured', 'estimated');
     ```

   - **Impact**: Core matching query optimization (currently no data, but critical for future)

   - ❌ **Missing**: GIN indexes on JSONB fields

     ```sql
     CREATE INDEX idx_rf_quality_gin ON resource_flows USING GIN (quality);
     CREATE INDEX idx_rf_constraints_gin ON resource_flows USING GIN (constraints);
     ```

4. **Addresses Table**
   - ❌ **Missing**: Full-text search index

     ```sql
     CREATE INDEX idx_addresses_formatted_ru_trgm ON addresses USING GIN (formatted_ru gin_trgm_ops);
     ```

   - **Requires**: `CREATE EXTENSION IF NOT EXISTS pg_trgm;`
   - **Impact**: Enables fuzzy address search

   - ❌ **Missing**: Composite index for city/region queries

     ```sql
     CREATE INDEX idx_addresses_city_region ON addresses (city, region);
     ```

#### **Medium Priority - Query Optimization**

5. **Shared Assets Table**
   - ❌ **Missing**: Index for availability queries

     ```sql
     CREATE INDEX idx_shared_assets_available ON shared_assets (type, utilization_rate)
     WHERE operational_status = 'operational' AND utilization_rate < 1.0;
     ```

   - **Impact**: Quickly find available shared equipment

   - ❌ **Missing**: GIN index on current users

     ```sql
     CREATE INDEX idx_shared_assets_users_gin ON shared_assets USING GIN (current_users);
     ```

6. **Matches Table**
   - ❌ **Missing**: Composite index for active matches

     ```sql
     CREATE INDEX idx_matches_active ON matches (status, priority DESC, compatibility_score DESC)
     WHERE status IN ('suggested', 'negotiating', 'reserved');
     ```

   - ❌ **Missing**: Index for expiring reservations

     ```sql
     CREATE INDEX idx_matches_expiring ON matches (reserved_until)
     WHERE reserved_until IS NOT NULL AND status = 'reserved';
     ```

---

### 1.2 Redundant or Inefficient Indexes

1. **Organizations Table**
   - ⚠️ **Redundant**: `idx_organizations_created_at` - Low cardinality, rarely used alone
   - **Recommendation**: Consider dropping if not used for time-series analysis

2. **Sites Table**
   - ⚠️ **Dual Spatial Indexes**: Both `idx_site_location` (btree) and `idx_site_geometry` (gist)
   - **Recommendation**: Keep GIST for PostGIS operations, drop btree if not needed for exact lookups

---

### 1.3 Missing Constraints and Data Integrity

1. **Foreign Key Constraints** ✅ Good coverage
   - All major relationships have FK constraints

2. **Check Constraints Needed**

   ```sql
   -- Organizations
   ALTER TABLE organizations ADD CONSTRAINT chk_org_trust_score
   CHECK (trust_score >= 0 AND trust_score <= 1);

   ALTER TABLE organizations ADD CONSTRAINT chk_org_company_size
   CHECK (company_size >= 0);

   -- Sites
   ALTER TABLE sites ADD CONSTRAINT chk_site_floor_area
   CHECK (floor_area_m2 >= 0);

   ALTER TABLE sites ADD CONSTRAINT chk_site_capacity
   CHECK (crane_capacity_tonnes >= 0);

   -- Resource Flows
   ALTER TABLE resource_flows ADD CONSTRAINT chk_rf_valid_direction
   CHECK (direction IN ('input', 'output'));

   -- Shared Assets
   ALTER TABLE shared_assets ADD CONSTRAINT chk_asset_capacity
   CHECK (capacity >= 0);
   ```

3. **Partial Unique Constraints Needed**

   ```sql
   -- Prevent duplicate active matches for same resource pairs
   CREATE UNIQUE INDEX idx_matches_unique_active ON matches (source_resource_id, target_resource_id)
   WHERE status IN ('negotiating', 'reserved', 'contracted', 'live');
   ```

---

## 2. Neo4j Graph Database Analysis

### 2.1 Missing Critical Indexes

#### **High Priority**

1. **Organization Nodes**
   - ❌ **Missing**: Full-text search index

     ```cypher
     CREATE FULLTEXT INDEX organization_search_idx FOR (o:Organization) ON EACH [o.name, o.description];
     ```

   - ❌ **Missing**: Composite index for sector searches

     ```cypher
     CREATE INDEX org_sector_subtype_idx FOR (o:Organization) ON (o.sector, o.subtype);
     ```

   - ❌ **Missing**: Point index for spatial queries (currently using lat/long separately)

     ```cypher
     // After adding point property to nodes
     CREATE POINT INDEX org_location_point_idx FOR (o:Organization) ON (o.location);
     ```

2. **Site Nodes**
   - ❌ **Missing**: Point index for efficient spatial operations

     ```cypher
     CREATE POINT INDEX site_location_point_idx FOR (s:Site) ON (s.location);
     ```

   - **Impact**: Dramatically improves nearby site queries

3. **Address Nodes**
   - ❌ **Missing**: ALL indexes!

     ```cypher
     CREATE INDEX address_city_idx FOR (a:Address) ON (a.city);
     CREATE INDEX address_region_idx FOR (a:Address) ON (a.region);
     CREATE FULLTEXT INDEX address_search_idx FOR (a:Address) ON EACH [a.formatted_ru, a.formatted_en];
     ```

4. **Resource Flow Nodes**
   - ❌ **Missing**: Composite index for matching

     ```cypher
     CREATE INDEX rf_org_site_idx FOR (rf:ResourceFlow) ON (rf.organization_id, rf.site_id);
     ```

---

### 2.2 Missing Critical Relationships

#### **High Priority - Essential for Graph Queries**

1. **Address Relationships**
   - ❌ **LOCATED_AT exists but not populated!** (0 relationships currently)
   - **Should Have**: Organization→Address, Site→Address
   - **Fix Required**: Update graph sync to create these relationships

   ```cypher
   // Example of what should exist:
   MATCH (o:Organization {id: $org_id}), (a:Address {id: $addr_id})
   CREATE (o)-[:LOCATED_AT]->(a)
   ```

2. **Resource Flow Relationships**
   - ❌ **HOSTS**: Site→ResourceFlow (not yet created)

   ```cypher
   CREATE (s:Site)-[:HOSTS]->(rf:ResourceFlow)
   ```

   - ❌ **PRODUCES/CONSUMES**: Organization→ResourceFlow (semantic clarity)

   ```cypher
   CREATE (o:Organization)-[:PRODUCES]->(rf:ResourceFlow {direction: 'output'})
   CREATE (o:Organization)-[:CONSUMES]->(rf:ResourceFlow {direction: 'input'})
   ```

3. **Match Relationships**
   - ❌ **MATCHES**: ResourceFlow→ResourceFlow (not yet created)

   ```cypher
   CREATE (source:ResourceFlow)-[:MATCHES {score: 0.85}]->(target:ResourceFlow)
   ```

4. **Supply Chain Relationships**
   - ❌ **SUPPLIES**: Organization→Organization (supplier network)
   - ❌ **COLLABORATES_WITH**: Organization→Organization (existing partnerships)
   - ❌ **TRUSTS**: Organization→Organization (trust network from JSONB field)

   ```cypher
   CREATE (o1:Organization)-[:SUPPLIES {products: ['heat', 'steam']}]->(o2:Organization)
   CREATE (o1:Organization)-[:TRUSTS {score: 0.9}]->(o2:Organization)
   ```

5. **Shared Asset Relationships**
   - ❌ **OWNS_ASSET**: Organization→SharedAsset
   - ❌ **USES_ASSET**: Organization→SharedAsset
   - ❌ **HAS_ASSET**: Site→SharedAsset

   ```cypher
   CREATE (o:Organization)-[:OWNS_ASSET]->(sa:SharedAsset)
   CREATE (o:Organization)-[:USES_ASSET {since: date()}]->(sa:SharedAsset)
   ```

6. **Spatial Proximity Relationships** (Advanced)
   - 🔮 **NEAR**: Organization→Organization / Site→Site (for proximity analysis)

   ```cypher
   // Create relationships for entities within 5km
   MATCH (s1:Site), (s2:Site)
   WHERE s1.id < s2.id
   AND point.distance(s1.location, s2.location) < 5000
   CREATE (s1)-[:NEAR {distance_m: point.distance(s1.location, s2.location)}]->(s2)
   ```

#### **Medium Priority - Enhanced Analytics**

7. **Temporal Relationships**
   - 🔮 **OPERATED_AT**: Organization→Site (with time range)

   ```cypher
   CREATE (o:Organization)-[:OPERATED_AT {from: date('2020-01-01'), to: date('2023-12-31')}]->(s:Site)
   ```

8. **Categorical Relationships**
   - 🔮 **IN_SECTOR**: Organization→Sector (for hierarchical sector queries)
   - 🔮 **OF_TYPE**: Site→SiteType (for type-based traversal)

---

### 2.3 Graph Schema Improvements

#### **Property Graph Enhancements**

1. **Add Point Properties for Spatial Queries**

   ```cypher
   // Instead of storing lat/long separately, use Neo4j Point type
   MATCH (o:Organization)
   WHERE o.latitude IS NOT NULL AND o.longitude IS NOT NULL
   SET o.location = point({latitude: o.latitude, longitude: o.longitude})

   MATCH (s:Site)
   WHERE s.latitude IS NOT NULL AND s.longitude IS NOT NULL
   SET s.location = point({latitude: s.latitude, longitude: s.longitude})
   ```

2. **Add Computed Properties**

   ```cypher
   // Add degree centrality for network analysis
   MATCH (o:Organization)
   SET o.connection_count = size((o)-[:OPERATES_AT|SUPPLIES|COLLABORATES_WITH]-())

   // Add resource diversity score
   MATCH (o:Organization)
   SET o.resource_types_count = size([
     (o)-[:PRODUCES|CONSUMES]->(rf:ResourceFlow) | DISTINCT rf.type
   ])
   ```

---

## 3. Data Architecture Recommendations

### 3.1 PostgreSQL Optimizations

#### **Immediate Actions (High ROI)**

1. **Enable Required Extensions**

   ```sql
   CREATE EXTENSION IF NOT EXISTS pg_trgm;  -- Fuzzy text search
   CREATE EXTENSION IF NOT EXISTS btree_gin; -- Multi-column GIN indexes
   ```

2. **Create Missing Indexes (Priority Order)**
   - Organizations: JSONB GIN indexes (products, services, certifications)
   - Resource Flows: Matching composite index
   - Addresses: Trigram index for fuzzy search
   - Sites: Utilities GIN index
   - Matches: Active matches composite index

3. **Partition Large Tables** (When scale increases)

   ```sql
   -- Partition resource_flows by created_at (monthly partitions)
   CREATE TABLE resource_flows_partitioned (
     LIKE resource_flows INCLUDING ALL
   ) PARTITION BY RANGE (created_at);

   -- Create partitions
   CREATE TABLE resource_flows_2025_11 PARTITION OF resource_flows_partitioned
   FOR VALUES FROM ('2025-11-01') TO ('2025-12-01');
   ```

4. **Materialized Views for Common Queries**

   ```sql
   -- Active matches dashboard
   CREATE MATERIALIZED VIEW mv_active_matches AS
   SELECT
     m.*,
     so.name as source_org_name,
     to.name as target_org_name,
     srf.type as resource_type
   FROM matches m
   JOIN resource_flows srf ON m.source_resource_id = srf.id
   JOIN resource_flows trf ON m.target_resource_id = trf.id
   JOIN organizations so ON srf.organization_id = so.id
   JOIN organizations to ON trf.organization_id = to.id
   WHERE m.status IN ('suggested', 'negotiating', 'reserved', 'live')
   WITH DATA;

   CREATE UNIQUE INDEX ON mv_active_matches (id);
   CREATE INDEX ON mv_active_matches (compatibility_score DESC);

   -- Refresh strategy (can be automated with cron)
   REFRESH MATERIALIZED VIEW CONCURRENTLY mv_active_matches;
   ```

#### **Medium-Term Optimizations**

5. **JSONB Field Normalization** (When querying becomes complex)
   - Consider extracting frequently queried JSONB fields to columns
   - Example: `products` JSONB → separate `organization_products` table

6. **Archive Old Data**

   ```sql
   -- Archive old matches to separate table
   CREATE TABLE matches_archive (LIKE matches INCLUDING ALL);

   INSERT INTO matches_archive
   SELECT * FROM matches
   WHERE status IN ('failed', 'cancelled', 'completed')
   AND updated_at < NOW() - INTERVAL '1 year';

   DELETE FROM matches
   WHERE id IN (SELECT id FROM matches_archive);
   ```

---

### 3.2 Neo4j Optimizations

#### **Immediate Actions**

1. **Populate Missing Relationships**
   - Fix Address sync to create LOCATED_AT relationships
   - Add HOSTS relationships during ResourceFlow sync
   - Create MATCHES relationships during Match sync

2. **Create Critical Indexes**

   ```cypher
   // Full-text search
   CREATE FULLTEXT INDEX organization_search FOR (o:Organization) ON EACH [o.name, o.description];
   CREATE FULLTEXT INDEX site_search FOR (s:Site) ON EACH [s.name, s.current_use];

   // Spatial
   CREATE POINT INDEX org_location FOR (o:Organization) ON (o.location);
   CREATE POINT INDEX site_location FOR (s:Site) ON (s.location);

   // Address
   CREATE INDEX address_city FOR (a:Address) ON (a.city);
   ```

3. **Add Point Properties**

   ```cypher
   // Convert lat/lng to Point type for efficient spatial queries
   MATCH (n:Organization)
   WHERE n.latitude IS NOT NULL
   SET n.location = point({latitude: n.latitude, longitude: n.longitude});

   MATCH (n:Site)
   WHERE n.latitude IS NOT NULL
   SET n.location = point({latitude: n.latitude, longitude: n.longitude});
   ```

#### **Enhanced Graph Structure**

4. **Create Proximity Relationships**

   ```cypher
   // Create NEAR relationships for sites within 5km
   MATCH (s1:Site), (s2:Site)
   WHERE s1.id < s2.id
   AND point.distance(s1.location, s2.location) < 5000
   CREATE (s1)-[:NEAR {
     distance_m: point.distance(s1.location, s2.location),
     created_at: datetime()
   }]->(s2);
   ```

5. **Extract Network from JSONB**

   ```cypher
   // Create TRUSTS relationships from trust_network JSONB field
   MATCH (o:Organization)
   WHERE o.trust_network IS NOT NULL
   UNWIND o.trust_network AS trusted_id
   MATCH (trusted:Organization {id: trusted_id})
   MERGE (o)-[:TRUSTS {score: 0.8}]->(trusted);
   ```

6. **Create Hierarchical Structures**

   ```cypher
   // Create Sector nodes for better taxonomy queries
   CREATE (:Sector {name: 'manufacturing', level: 'primary'})
   CREATE (:Sector {name: 'oil_and_gas', level: 'secondary', parent: 'manufacturing'})

   // Connect organizations to sectors
   MATCH (o:Organization {sector: 'oil_and_gas'})
   MATCH (s:Sector {name: 'oil_and_gas'})
   CREATE (o)-[:IN_SECTOR]->(s);
   ```

---

## 4. Query Performance Examples

### 4.1 Before Optimization

**Query: Find potential heat matches within 10km**

```cypher
// Current (inefficient)
MATCH (source:ResourceFlow {type: 'heat', direction: 'output'})
MATCH (target:ResourceFlow {type: 'heat', direction: 'input'})
MATCH (source)-[:HOSTS]-(ss:Site)
MATCH (target)-[:HOSTS]-(ts:Site)
WHERE point.distance(
  point({latitude: ss.latitude, longitude: ss.longitude}),
  point({latitude: ts.latitude, longitude: ts.longitude})
) < 10000
RETURN source, target
```

**Estimated**: 5-10 seconds on 10,000 flows

### 4.2 After Optimization

```cypher
// Optimized with Point properties and indexes
MATCH (source:ResourceFlow {type: 'heat', direction: 'output'})
MATCH (target:ResourceFlow {type: 'heat', direction: 'input'})
MATCH (source)<-[:HOSTS]-(ss:Site)
MATCH (target)<-[:HOSTS]-(ts:Site)
WHERE point.distance(ss.location, ts.location) < 10000
RETURN source, target, point.distance(ss.location, ts.location) AS distance
ORDER BY distance
```

**Estimated**: <1 second with spatial index

---

## 5. Implementation Priority Matrix

### Phase 1: Critical (Week 1)

1. ✅ Create PostgreSQL GIN indexes on JSONB fields
2. ✅ Add Point properties to Neo4j nodes
3. ✅ Create Neo4j spatial indexes
4. ✅ Fix Address LOCATED_AT relationship sync
5. ✅ Add full-text search indexes in Neo4j

### Phase 2: Important (Week 2-3)

6. ✅ Create HOSTS relationships (Site→ResourceFlow)
7. ✅ Create MATCHES relationships
8. ✅ Add composite indexes in PostgreSQL
9. ✅ Create SUPPLIES/TRUSTS relationships
10. ✅ Add check constraints

### Phase 3: Optimization (Month 2)

11. ⏰ Create materialized views
12. ⏰ Add NEAR proximity relationships
13. ⏰ Implement sector hierarchy
14. ⏰ Archive old data

### Phase 4: Advanced (Month 3+)

15. 🔮 Partition large tables
16. 🔮 Normalize complex JSONB fields
17. 🔮 Add graph analytics (PageRank, Community Detection)
18. 🔮 Implement time-series partitioning

---

## 6. Expected Performance Improvements

### Query Performance

| Query Type | Before | After | Improvement |
|-----------|--------|-------|-------------|
| Find organizations by product | Full scan | Index scan | 100x faster |
| Spatial proximity (10km radius) | 5-10s | <100ms | 50-100x faster |
| Match discovery | N/A | <500ms | New capability |
| Full-text search | Not possible | <200ms | New capability |
| Supply chain traversal | N/A | <1s (3 hops) | New capability |

### Storage Efficiency

| Aspect | Current | Optimized | Benefit |
|--------|---------|-----------|---------|
| Index size | ~50MB | ~150MB | Better query performance |
| Redundant data | High (JSONB overlap) | Medium | Consider normalization later |
| Graph density | Low (3 rel types) | High (10+ rel types) | Richer analytics |

---

## 7. Monitoring Recommendations

### PostgreSQL

```sql
-- Identify slow queries
CREATE EXTENSION pg_stat_statements;

-- Check index usage
SELECT schemaname, tablename, indexname, idx_scan, idx_tup_read, idx_tup_fetch
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC;

-- Find unused indexes
SELECT indexrelname, idx_scan, idx_tup_read
FROM pg_stat_user_indexes
WHERE idx_scan = 0 AND indexrelname NOT LIKE '%_pkey';
```

### Neo4j

```cypher
// Query profiling
PROFILE
MATCH (o:Organization)-[:OPERATES_AT]->(s:Site)
WHERE point.distance(o.location, point({latitude: 54.5, longitude: 52.3})) < 5000
RETURN o, s;

// Check index usage
CALL db.stats.retrieve('QUERIES');
```

---

## 8. Next Steps

1. **Review this report** with the development team
2. **Prioritize implementations** based on immediate needs
3. **Create migration scripts** for index creation
4. **Update graph sync service** to create missing relationships
5. **Add monitoring** for query performance
6. **Schedule maintenance windows** for index creation
7. **Document new query patterns** for the team

---

## Appendix A: Complete Index Creation Script

See `migrations/add_optimization_indexes.sql` (to be created)

## Appendix B: Neo4j Relationship Sync Updates

See `internal/repository/graph_*_repository.go` updates needed

---

*End of Report*