# Data Structure Optimization Report

*Generated: November 23, 2025*

## Executive Summary

This report analyzes the current data architecture across PostgreSQL and Neo4j databases for the city resource graph platform. The analysis identifies opportunities to improve performance, data integrity, and query efficiency through better indexing, relationship modeling, and architectural decisions.

**Current State:**
- PostgreSQL: 20 tables, 64 indexes
- Neo4j: 7 node types, 3 relationship types, 17 indexes
- Data Volume: 1,076 organizations, 9,133 sites, 9,862 addresses

---

## 1. PostgreSQL Database Analysis

### 1.1 Missing Critical Indexes

#### **High Priority - Performance Critical**

1. **Organizations Table**
   - ❌ **Missing**: GIN index on JSONB fields for advanced queries
     ```sql
     CREATE INDEX idx_org_certifications_gin ON organizations USING GIN (certifications);
     CREATE INDEX idx_org_business_focus_gin ON organizations USING GIN (business_focus);
     CREATE INDEX idx_org_technical_expertise_gin ON organizations USING GIN (technical_expertise);
     CREATE INDEX idx_org_products_gin ON organizations USING GIN (products);
     CREATE INDEX idx_org_sells_products_gin ON organizations USING GIN (sells_products);
     CREATE INDEX idx_org_offers_services_gin ON organizations USING GIN (offers_services);
     ```
     - **Impact**: Enables efficient filtering by certification, expertise, products/services
     - **Use Case**: "Find all organizations offering HVAC maintenance services"
   - ❌ **Missing**: Composite index for supply chain queries
     ```sql
     CREATE INDEX idx_org_supply_chain_sector ON organizations (supply_chain_role, industrial_sector);
     ```
     - **Impact**: Optimizes supply chain network queries
   - ❌ **Missing**: Index on trust_score for partner discovery
     ```sql
     CREATE INDEX idx_org_trust_score ON organizations (trust_score DESC);
     ```
     - **Impact**: Enables efficient filtering by trust level

2.
**Sites Table**
   - ❌ **Missing**: Composite index for site search
     ```sql
     CREATE INDEX idx_sites_type_ownership ON sites (site_type, ownership);
     ```
     - **Impact**: Optimizes queries for available/leased industrial sites
   - ❌ **Missing**: GIN indexes on JSONB fields
     ```sql
     CREATE INDEX idx_sites_utilities_gin ON sites USING GIN (available_utilities);
     CREATE INDEX idx_sites_waste_mgmt_gin ON sites USING GIN (waste_management);
     ```
     - **Impact**: Enables filtering by specific utility availability
   - ❌ **Missing**: Expression index for available capacity
     ```sql
     CREATE INDEX idx_sites_has_space ON sites ((floor_area_m2 > 0))
       WHERE floor_area_m2 IS NOT NULL;
     ```

3. **Resource Flows Table**
   - ❌ **Missing**: Critical composite index for matching algorithm
     ```sql
     CREATE INDEX idx_rf_matching ON resource_flows (type, direction, precision_level)
       WHERE precision_level IN ('measured', 'estimated');
     ```
     - **Impact**: Core matching query optimization (the table currently holds no data, but the index is critical for future use)
   - ❌ **Missing**: GIN indexes on JSONB fields
     ```sql
     CREATE INDEX idx_rf_quality_gin ON resource_flows USING GIN (quality);
     CREATE INDEX idx_rf_constraints_gin ON resource_flows USING GIN (constraints);
     ```

4. **Addresses Table**
   - ❌ **Missing**: Trigram index for fuzzy text search
     ```sql
     CREATE INDEX idx_addresses_formatted_ru_trgm ON addresses USING GIN (formatted_ru gin_trgm_ops);
     ```
     - **Requires**: `CREATE EXTENSION IF NOT EXISTS pg_trgm;`
     - **Impact**: Enables fuzzy address search
   - ❌ **Missing**: Composite index for city/region queries
     ```sql
     CREATE INDEX idx_addresses_city_region ON addresses (city, region);
     ```

#### **Medium Priority - Query Optimization**

5.
**Shared Assets Table**
   - ❌ **Missing**: Index for availability queries
     ```sql
     CREATE INDEX idx_shared_assets_available ON shared_assets (type, utilization_rate)
       WHERE operational_status = 'operational' AND utilization_rate < 1.0;
     ```
     - **Impact**: Quickly find available shared equipment
   - ❌ **Missing**: GIN index on current users
     ```sql
     CREATE INDEX idx_shared_assets_users_gin ON shared_assets USING GIN (current_users);
     ```

6. **Matches Table**
   - ❌ **Missing**: Composite index for active matches
     ```sql
     CREATE INDEX idx_matches_active ON matches (status, priority DESC, compatibility_score DESC)
       WHERE status IN ('suggested', 'negotiating', 'reserved');
     ```
   - ❌ **Missing**: Index for expiring reservations
     ```sql
     CREATE INDEX idx_matches_expiring ON matches (reserved_until)
       WHERE reserved_until IS NOT NULL AND status = 'reserved';
     ```

---

### 1.2 Redundant or Inefficient Indexes

1. **Organizations Table**
   - ⚠️ **Redundant**: `idx_organizations_created_at` - Low cardinality, rarely used alone
   - **Recommendation**: Consider dropping if not used for time-series analysis

2. **Sites Table**
   - ⚠️ **Dual Spatial Indexes**: Both `idx_site_location` (btree) and `idx_site_geometry` (gist)
   - **Recommendation**: Keep GIST for PostGIS operations, drop btree if not needed for exact lookups

---

### 1.3 Missing Constraints and Data Integrity

1. **Foreign Key Constraints** ✅ Good coverage
   - All major relationships have FK constraints

2.
**Check Constraints Needed**
   ```sql
   -- Organizations
   ALTER TABLE organizations ADD CONSTRAINT chk_org_trust_score
     CHECK (trust_score >= 0 AND trust_score <= 1);
   ALTER TABLE organizations ADD CONSTRAINT chk_org_company_size
     CHECK (company_size >= 0);

   -- Sites
   ALTER TABLE sites ADD CONSTRAINT chk_site_floor_area
     CHECK (floor_area_m2 >= 0);
   ALTER TABLE sites ADD CONSTRAINT chk_site_capacity
     CHECK (crane_capacity_tonnes >= 0);

   -- Resource Flows
   ALTER TABLE resource_flows ADD CONSTRAINT chk_rf_valid_direction
     CHECK (direction IN ('input', 'output'));

   -- Shared Assets
   ALTER TABLE shared_assets ADD CONSTRAINT chk_asset_capacity
     CHECK (capacity >= 0);
   ```

3. **Partial Unique Constraints Needed**
   ```sql
   -- Prevent duplicate active matches for same resource pairs
   CREATE UNIQUE INDEX idx_matches_unique_active
     ON matches (source_resource_id, target_resource_id)
     WHERE status IN ('negotiating', 'reserved', 'contracted', 'live');
   ```

---

## 2. Neo4j Graph Database Analysis

### 2.1 Missing Critical Indexes

#### **High Priority**

1. **Organization Nodes**
   - ❌ **Missing**: Full-text search index
     ```cypher
     CREATE FULLTEXT INDEX organization_search_idx
     FOR (o:Organization) ON EACH [o.name, o.description];
     ```
   - ❌ **Missing**: Composite index for sector searches
     ```cypher
     CREATE INDEX org_sector_subtype_idx
     FOR (o:Organization) ON (o.sector, o.subtype);
     ```
   - ❌ **Missing**: Point index for spatial queries (currently using lat/long separately)
     ```cypher
     // After adding point property to nodes
     CREATE POINT INDEX org_location_point_idx
     FOR (o:Organization) ON (o.location);
     ```

2. **Site Nodes**
   - ❌ **Missing**: Point index for efficient spatial operations
     ```cypher
     CREATE POINT INDEX site_location_point_idx
     FOR (s:Site) ON (s.location);
     ```
     - **Impact**: Dramatically improves nearby site queries

3. **Address Nodes**
   - ❌ **Missing**: ALL indexes!
   ```cypher
   CREATE INDEX address_city_idx FOR (a:Address) ON (a.city);
   CREATE INDEX address_region_idx FOR (a:Address) ON (a.region);
   CREATE FULLTEXT INDEX address_search_idx
   FOR (a:Address) ON EACH [a.formatted_ru, a.formatted_en];
   ```

4. **Resource Flow Nodes**
   - ❌ **Missing**: Composite index for matching
     ```cypher
     CREATE INDEX rf_org_site_idx
     FOR (rf:ResourceFlow) ON (rf.organization_id, rf.site_id);
     ```

---

### 2.2 Missing Critical Relationships

#### **High Priority - Essential for Graph Queries**

1. **Address Relationships**
   - ❌ **LOCATED_AT exists but not populated!** (0 relationships currently)
   - **Should Have**: Organization→Address, Site→Address
   - **Fix Required**: Update graph sync to create these relationships
     ```cypher
     // Example of what should exist (MERGE keeps repeated syncs from duplicating the relationship):
     MATCH (o:Organization {id: $org_id}), (a:Address {id: $addr_id})
     MERGE (o)-[:LOCATED_AT]->(a)
     ```

2. **Resource Flow Relationships**
   - ❌ **HOSTS**: Site→ResourceFlow (not yet created)
     ```cypher
     // Schematic pattern: in the sync, MATCH the existing nodes first, then create the relationship
     CREATE (s:Site)-[:HOSTS]->(rf:ResourceFlow)
     ```
   - ❌ **PRODUCES/CONSUMES**: Organization→ResourceFlow (semantic clarity)
     ```cypher
     CREATE (o:Organization)-[:PRODUCES]->(rf:ResourceFlow {direction: 'output'});
     CREATE (o:Organization)-[:CONSUMES]->(rf:ResourceFlow {direction: 'input'});
     ```

3. **Match Relationships**
   - ❌ **MATCHES**: ResourceFlow→ResourceFlow (not yet created)
     ```cypher
     CREATE (source:ResourceFlow)-[:MATCHES {score: 0.85}]->(target:ResourceFlow)
     ```

4. **Supply Chain Relationships**
   - ❌ **SUPPLIES**: Organization→Organization (supplier network)
   - ❌ **COLLABORATES_WITH**: Organization→Organization (existing partnerships)
   - ❌ **TRUSTS**: Organization→Organization (trust network from JSONB field)
     ```cypher
     CREATE (o1:Organization)-[:SUPPLIES {products: ['heat', 'steam']}]->(o2:Organization);
     CREATE (o1:Organization)-[:TRUSTS {score: 0.9}]->(o2:Organization);
     ```

5.
**Shared Asset Relationships**
   - ❌ **OWNS_ASSET**: Organization→SharedAsset
   - ❌ **USES_ASSET**: Organization→SharedAsset
   - ❌ **HAS_ASSET**: Site→SharedAsset
     ```cypher
     CREATE (o:Organization)-[:OWNS_ASSET]->(sa:SharedAsset);
     CREATE (o:Organization)-[:USES_ASSET {since: date()}]->(sa:SharedAsset);
     ```

6. **Spatial Proximity Relationships** (Advanced)
   - 🔮 **NEAR**: Organization→Organization / Site→Site (for proximity analysis)
     ```cypher
     // Create relationships for entities within 5km
     MATCH (s1:Site), (s2:Site)
     WHERE s1.id < s2.id
       AND point.distance(s1.location, s2.location) < 5000
     CREATE (s1)-[:NEAR {distance_m: point.distance(s1.location, s2.location)}]->(s2)
     ```

#### **Medium Priority - Enhanced Analytics**

7. **Temporal Relationships**
   - 🔮 **OPERATED_AT**: Organization→Site (with time range)
     ```cypher
     CREATE (o:Organization)-[:OPERATED_AT {from: date('2020-01-01'), to: date('2023-12-31')}]->(s:Site)
     ```

8. **Categorical Relationships**
   - 🔮 **IN_SECTOR**: Organization→Sector (for hierarchical sector queries)
   - 🔮 **OF_TYPE**: Site→SiteType (for type-based traversal)

---

### 2.3 Graph Schema Improvements

#### **Property Graph Enhancements**

1. **Add Point Properties for Spatial Queries**
   ```cypher
   // Instead of storing lat/long separately, use Neo4j Point type
   MATCH (o:Organization)
   WHERE o.latitude IS NOT NULL AND o.longitude IS NOT NULL
   SET o.location = point({latitude: o.latitude, longitude: o.longitude});

   MATCH (s:Site)
   WHERE s.latitude IS NOT NULL AND s.longitude IS NOT NULL
   SET s.location = point({latitude: s.latitude, longitude: s.longitude});
   ```

2. **Add Computed Properties**
   ```cypher
   // Add degree centrality for network analysis
   // (COUNT { } replaces size() over a pattern, which Neo4j 5 no longer accepts)
   MATCH (o:Organization)
   SET o.connection_count = COUNT { (o)-[:OPERATES_AT|SUPPLIES|COLLABORATES_WITH]-() };

   // Add resource diversity score
   MATCH (o:Organization)
   OPTIONAL MATCH (o)-[:PRODUCES|CONSUMES]->(rf:ResourceFlow)
   WITH o, count(DISTINCT rf.type) AS resource_types
   SET o.resource_types_count = resource_types;
   ```

---

## 3. Data Architecture Recommendations

### 3.1 PostgreSQL Optimizations

#### **Immediate Actions (High ROI)**

1. **Enable Required Extensions**
   ```sql
   CREATE EXTENSION IF NOT EXISTS pg_trgm;    -- Fuzzy text search
   CREATE EXTENSION IF NOT EXISTS btree_gin;  -- GIN operator classes for scalar columns (multi-column GIN indexes)
   ```

2. **Create Missing Indexes (Priority Order)**
   - Organizations: JSONB GIN indexes (products, services, certifications)
   - Resource Flows: Matching composite index
   - Addresses: Trigram index for fuzzy search
   - Sites: Utilities GIN index
   - Matches: Active matches composite index

3. **Partition Large Tables** (When scale increases)
   ```sql
   -- Partition resource_flows by created_at (monthly partitions)
   -- Note: a primary key or unique constraint on a partitioned table must include
   -- the partition key (created_at); adjust any id-only key copied by INCLUDING ALL.
   CREATE TABLE resource_flows_partitioned (
     LIKE resource_flows INCLUDING ALL
   ) PARTITION BY RANGE (created_at);

   -- Create partitions
   CREATE TABLE resource_flows_2025_11 PARTITION OF resource_flows_partitioned
     FOR VALUES FROM ('2025-11-01') TO ('2025-12-01');
   ```

4. **Materialized Views for Common Queries**
   ```sql
   -- Active matches dashboard
   -- (aliases src_org/tgt_org are used because "to" is a reserved word in PostgreSQL)
   CREATE MATERIALIZED VIEW mv_active_matches AS
   SELECT
     m.*,
     src_org.name AS source_org_name,
     tgt_org.name AS target_org_name,
     srf.type AS resource_type
   FROM matches m
   JOIN resource_flows srf ON m.source_resource_id = srf.id
   JOIN resource_flows trf ON m.target_resource_id = trf.id
   JOIN organizations src_org ON srf.organization_id = src_org.id
   JOIN organizations tgt_org ON trf.organization_id = tgt_org.id
   WHERE m.status IN ('suggested', 'negotiating', 'reserved', 'live')
   WITH DATA;

   -- The unique index is required for REFRESH ... CONCURRENTLY
   CREATE UNIQUE INDEX ON mv_active_matches (id);
   CREATE INDEX ON mv_active_matches (compatibility_score DESC);

   -- Refresh strategy (can be automated with cron)
   REFRESH MATERIALIZED VIEW CONCURRENTLY mv_active_matches;
   ```

#### **Medium-Term Optimizations**

5. **JSONB Field Normalization** (When querying becomes complex)
   - Consider extracting frequently queried JSONB fields to columns
   - Example: `products` JSONB → separate `organization_products` table

6.
**Archive Old Data**
   ```sql
   -- Archive old matches to separate table
   CREATE TABLE matches_archive (LIKE matches INCLUDING ALL);

   -- Copy and delete in one transaction so a failure cannot leave rows half-moved
   BEGIN;
   INSERT INTO matches_archive
   SELECT * FROM matches
   WHERE status IN ('failed', 'cancelled', 'completed')
     AND updated_at < NOW() - INTERVAL '1 year';

   DELETE FROM matches WHERE id IN (SELECT id FROM matches_archive);
   COMMIT;
   ```

---

### 3.2 Neo4j Optimizations

#### **Immediate Actions**

1. **Populate Missing Relationships**
   - Fix Address sync to create LOCATED_AT relationships
   - Add HOSTS relationships during ResourceFlow sync
   - Create MATCHES relationships during Match sync

2. **Create Critical Indexes** (index names match Section 2.1 so the same index is not defined twice under different names)
   ```cypher
   // Full-text search
   CREATE FULLTEXT INDEX organization_search_idx FOR (o:Organization) ON EACH [o.name, o.description];
   CREATE FULLTEXT INDEX site_search_idx FOR (s:Site) ON EACH [s.name, s.current_use];

   // Spatial
   CREATE POINT INDEX org_location_point_idx FOR (o:Organization) ON (o.location);
   CREATE POINT INDEX site_location_point_idx FOR (s:Site) ON (s.location);

   // Address
   CREATE INDEX address_city_idx FOR (a:Address) ON (a.city);
   ```

3. **Add Point Properties**
   ```cypher
   // Convert lat/lng to Point type for efficient spatial queries
   MATCH (n:Organization)
   WHERE n.latitude IS NOT NULL AND n.longitude IS NOT NULL
   SET n.location = point({latitude: n.latitude, longitude: n.longitude});

   MATCH (n:Site)
   WHERE n.latitude IS NOT NULL AND n.longitude IS NOT NULL
   SET n.location = point({latitude: n.latitude, longitude: n.longitude});
   ```

#### **Enhanced Graph Structure**

4. **Create Proximity Relationships**
   ```cypher
   // Create NEAR relationships for sites within 5km
   MATCH (s1:Site), (s2:Site)
   WHERE s1.id < s2.id
     AND point.distance(s1.location, s2.location) < 5000
   CREATE (s1)-[:NEAR {
     distance_m: point.distance(s1.location, s2.location),
     created_at: datetime()
   }]->(s2);
   ```

5.
**Extract Network from JSONB**
   ```cypher
   // Create TRUSTS relationships from trust_network JSONB field
   MATCH (o:Organization)
   WHERE o.trust_network IS NOT NULL
   UNWIND o.trust_network AS trusted_id
   MATCH (trusted:Organization {id: trusted_id})
   MERGE (o)-[:TRUSTS {score: 0.8}]->(trusted);
   ```

6. **Create Hierarchical Structures**
   ```cypher
   // Create Sector nodes for better taxonomy queries
   CREATE (:Sector {name: 'manufacturing', level: 'primary'});
   CREATE (:Sector {name: 'oil_and_gas', level: 'secondary', parent: 'manufacturing'});

   // Connect organizations to sectors
   MATCH (o:Organization {sector: 'oil_and_gas'})
   MATCH (s:Sector {name: 'oil_and_gas'})
   CREATE (o)-[:IN_SECTOR]->(s);
   ```

---

## 4. Query Performance Examples

### 4.1 Before Optimization

**Query: Find potential heat matches within 10km**

```cypher
// Current (inefficient): builds a Point from lat/long for every candidate pair
MATCH (source:ResourceFlow {type: 'heat', direction: 'output'})
MATCH (target:ResourceFlow {type: 'heat', direction: 'input'})
MATCH (source)-[:HOSTS]-(ss:Site)
MATCH (target)-[:HOSTS]-(ts:Site)
WHERE point.distance(
  point({latitude: ss.latitude, longitude: ss.longitude}),
  point({latitude: ts.latitude, longitude: ts.longitude})
) < 10000
RETURN source, target
```

**Estimated**: 5-10 seconds on 10,000 flows

### 4.2 After Optimization

```cypher
// Optimized with Point properties and indexes
MATCH (source:ResourceFlow {type: 'heat', direction: 'output'})
MATCH (target:ResourceFlow {type: 'heat', direction: 'input'})
MATCH (source)<-[:HOSTS]-(ss:Site)
MATCH (target)<-[:HOSTS]-(ts:Site)
WHERE point.distance(ss.location, ts.location) < 10000
RETURN source, target, point.distance(ss.location, ts.location) AS distance
ORDER BY distance
```

**Estimated**: <1 second with spatial index

---

## 5. Implementation Priority Matrix

### Phase 1: Critical (Week 1)

1. ✅ Create PostgreSQL GIN indexes on JSONB fields
2. ✅ Add Point properties to Neo4j nodes
3. ✅ Create Neo4j spatial indexes
4. ✅ Fix Address LOCATED_AT relationship sync
5.
✅ Add full-text search indexes in Neo4j

### Phase 2: Important (Week 2-3)

6. ✅ Create HOSTS relationships (Site→ResourceFlow)
7. ✅ Create MATCHES relationships
8. ✅ Add composite indexes in PostgreSQL
9. ✅ Create SUPPLIES/TRUSTS relationships
10. ✅ Add check constraints

### Phase 3: Optimization (Month 2)

11. ⏰ Create materialized views
12. ⏰ Add NEAR proximity relationships
13. ⏰ Implement sector hierarchy
14. ⏰ Archive old data

### Phase 4: Advanced (Month 3+)

15. 🔮 Partition large tables
16. 🔮 Normalize complex JSONB fields
17. 🔮 Add graph analytics (PageRank, Community Detection)
18. 🔮 Implement time-series partitioning

---

## 6. Expected Performance Improvements

### Query Performance

| Query Type | Before | After | Improvement |
|-----------|--------|-------|-------------|
| Find organizations by product | Full scan | Index scan | 100x faster |
| Spatial proximity (10km radius) | 5-10s | <100ms | 50-100x faster |
| Match discovery | N/A | <500ms | New capability |
| Full-text search | Not possible | <200ms | New capability |
| Supply chain traversal | N/A | <1s (3 hops) | New capability |

### Storage Efficiency

| Aspect | Current | Optimized | Benefit |
|--------|---------|-----------|---------|
| Index size | ~50MB | ~150MB | Better query performance |
| Redundant data | High (JSONB overlap) | Medium | Consider normalization later |
| Graph density | Low (3 rel types) | High (10+ rel types) | Richer analytics |

---

## 7. Monitoring Recommendations

### PostgreSQL

```sql
-- Identify slow queries (pg_stat_statements must also be listed in shared_preload_libraries)
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Check index usage
SELECT schemaname, relname, indexrelname, idx_scan, idx_tup_read, idx_tup_fetch
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC;

-- Find unused indexes
SELECT indexrelname, idx_scan, idx_tup_read
FROM pg_stat_user_indexes
WHERE idx_scan = 0
  AND indexrelname NOT LIKE '%_pkey';
```

### Neo4j

```cypher
// Query profiling (the plan shows whether an index is used)
PROFILE
MATCH (o:Organization)-[:OPERATES_AT]->(s:Site)
WHERE point.distance(o.location, point({latitude: 54.5, longitude: 52.3})) < 5000
RETURN o, s;

// Retrieve collected query statistics (run CALL db.stats.collect('QUERIES') first)
CALL db.stats.retrieve('QUERIES');
```

---

## 8. Next Steps

1. **Review this report** with the development team
2. **Prioritize implementations** based on immediate needs
3. **Create migration scripts** for index creation
4. **Update graph sync service** to create missing relationships
5. **Add monitoring** for query performance
6. **Schedule maintenance windows** for index creation
7. **Document new query patterns** for the team

---

## Appendix A: Complete Index Creation Script

See `migrations/add_optimization_indexes.sql` (to be created)

## Appendix B: Neo4j Relationship Sync Updates

See `internal/repository/graph_*_repository.go` updates needed

---

*End of Report*