turash/bugulma/backend/DATA_OPTIMIZATION_REPORT.md
Damir Mukimov 000eab4740
Major repository reorganization and missing backend endpoints implementation
Repository Structure:
- Move files from cluttered root directory into organized structure
- Create archive/ for archived data and scraper results
- Create bugulma/ for the complete application (frontend + backend)
- Create data/ for sample datasets and reference materials
- Create docs/ for comprehensive documentation structure
- Create scripts/ for utility scripts and API tools

Backend Implementation:
- Implement 3 missing backend endpoints identified in gap analysis:
  * GET /api/v1/organizations/{id}/matching/direct - Direct symbiosis matches
  * GET /api/v1/users/me/organizations - User organizations
  * POST /api/v1/proposals/{id}/status - Update proposal status
- Add complete proposal domain model, repository, and service layers
- Create database migration for proposals table
- Fix CLI server command registration issue

API Documentation:
- Add comprehensive proposals.md API documentation
- Update README.md with Users and Proposals API sections
- Document all request/response formats, error codes, and business rules

Code Quality:
- Follow existing Go backend architecture patterns
- Add proper error handling and validation
- Match frontend expected response schemas
- Maintain clean separation of concerns (handler -> service -> repository)
2025-11-25 06:01:16 +01:00

677 lines
20 KiB
Markdown
Raw Permalink Blame History

This file contains invisible Unicode characters

This file contains invisible Unicode characters that are indistinguishable to humans but may be processed differently by a computer. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Data Structure Optimization Report
*Generated: November 23, 2025*
## Executive Summary
This report analyzes the current data architecture across PostgreSQL and Neo4j databases for the city resource graph platform. The analysis identifies opportunities to improve performance, data integrity, and query efficiency through better indexing, relationship modeling, and architectural decisions.
**Current State:**
- PostgreSQL: 20 tables, 64 indexes
- Neo4j: 7 node types, 3 relationship types, 17 indexes
- Data Volume: 1,076 organizations, 9,133 sites, 9,862 addresses
---
## 1. PostgreSQL Database Analysis
### 1.1 Missing Critical Indexes
#### **High Priority - Performance Critical**
1. **Organizations Table**
-**Missing**: GIN index on JSONB fields for advanced queries
```sql
CREATE INDEX idx_org_certifications_gin ON organizations USING GIN (certifications);
CREATE INDEX idx_org_business_focus_gin ON organizations USING GIN (business_focus);
CREATE INDEX idx_org_technical_expertise_gin ON organizations USING GIN (technical_expertise);
CREATE INDEX idx_org_products_gin ON organizations USING GIN (products);
CREATE INDEX idx_org_sells_products_gin ON organizations USING GIN (sells_products);
CREATE INDEX idx_org_offers_services_gin ON organizations USING GIN (offers_services);
```
- **Impact**: Enables efficient filtering by certification, expertise, products/services
- **Use Case**: "Find all organizations offering HVAC maintenance services"
-**Missing**: Composite index for supply chain queries
```sql
CREATE INDEX idx_org_supply_chain_sector ON organizations (supply_chain_role, industrial_sector);
```
- **Impact**: Optimizes supply chain network queries
-**Missing**: Index on trust_score for partner discovery
```sql
CREATE INDEX idx_org_trust_score ON organizations (trust_score DESC);
```
- **Impact**: Enables efficient filtering by trust level
2. **Sites Table**
-**Missing**: Composite index for site search
```sql
CREATE INDEX idx_sites_type_ownership ON sites (site_type, ownership);
```
- **Impact**: Optimizes queries for available/leased industrial sites
-**Missing**: GIN indexes on JSONB fields
```sql
CREATE INDEX idx_sites_utilities_gin ON sites USING GIN (available_utilities);
CREATE INDEX idx_sites_waste_mgmt_gin ON sites USING GIN (waste_management);
```
- **Impact**: Enables filtering by specific utility availability
-**Missing**: Functional index for available capacity
```sql
CREATE INDEX idx_sites_has_space ON sites ((floor_area_m2 > 0)) WHERE floor_area_m2 IS NOT NULL;
```
3. **Resource Flows Table**
-**Missing**: Critical composite index for matching algorithm
```sql
CREATE INDEX idx_rf_matching ON resource_flows (type, direction, precision_level)
WHERE precision_level IN ('measured', 'estimated');
```
- **Impact**: Core matching query optimization (currently no data, but critical for future)
-**Missing**: GIN indexes on JSONB fields
```sql
CREATE INDEX idx_rf_quality_gin ON resource_flows USING GIN (quality);
CREATE INDEX idx_rf_constraints_gin ON resource_flows USING GIN (constraints);
```
4. **Addresses Table**
-**Missing**: Full-text search index
```sql
CREATE INDEX idx_addresses_formatted_ru_trgm ON addresses USING GIN (formatted_ru gin_trgm_ops);
```
- **Requires**: `CREATE EXTENSION IF NOT EXISTS pg_trgm;`
- **Impact**: Enables fuzzy address search
-**Missing**: Composite index for city/region queries
```sql
CREATE INDEX idx_addresses_city_region ON addresses (city, region);
```
#### **Medium Priority - Query Optimization**
5. **Shared Assets Table**
-**Missing**: Index for availability queries
```sql
CREATE INDEX idx_shared_assets_available ON shared_assets (type, utilization_rate)
WHERE operational_status = 'operational' AND utilization_rate < 1.0;
```
- **Impact**: Quickly find available shared equipment
- **Missing**: GIN index on current users
```sql
CREATE INDEX idx_shared_assets_users_gin ON shared_assets USING GIN (current_users);
```
6. **Matches Table**
- **Missing**: Composite index for active matches
```sql
CREATE INDEX idx_matches_active ON matches (status, priority DESC, compatibility_score DESC)
WHERE status IN ('suggested', 'negotiating', 'reserved');
```
- **Missing**: Index for expiring reservations
```sql
CREATE INDEX idx_matches_expiring ON matches (reserved_until)
WHERE reserved_until IS NOT NULL AND status = 'reserved';
```
---
### 1.2 Redundant or Inefficient Indexes
1. **Organizations Table**
- **Redundant**: `idx_organizations_created_at` - Low cardinality, rarely used alone
- **Recommendation**: Consider dropping if not used for time-series analysis
2. **Sites Table**
- **Dual Spatial Indexes**: Both `idx_site_location` (btree) and `idx_site_geometry` (gist)
- **Recommendation**: Keep GIST for PostGIS operations, drop btree if not needed for exact lookups
---
### 1.3 Missing Constraints and Data Integrity
1. **Foreign Key Constraints** Good coverage
- All major relationships have FK constraints
2. **Check Constraints Needed**
```sql
-- Organizations
ALTER TABLE organizations ADD CONSTRAINT chk_org_trust_score
CHECK (trust_score >= 0 AND trust_score <= 1);
ALTER TABLE organizations ADD CONSTRAINT chk_org_company_size
CHECK (company_size >= 0);
-- Sites
ALTER TABLE sites ADD CONSTRAINT chk_site_floor_area
CHECK (floor_area_m2 >= 0);
ALTER TABLE sites ADD CONSTRAINT chk_site_capacity
CHECK (crane_capacity_tonnes >= 0);
-- Resource Flows
ALTER TABLE resource_flows ADD CONSTRAINT chk_rf_valid_direction
CHECK (direction IN ('input', 'output'));
-- Shared Assets
ALTER TABLE shared_assets ADD CONSTRAINT chk_asset_capacity
CHECK (capacity >= 0);
```
3. **Partial Unique Constraints Needed**
```sql
-- Prevent duplicate active matches for same resource pairs
CREATE UNIQUE INDEX idx_matches_unique_active ON matches (source_resource_id, target_resource_id)
WHERE status IN ('negotiating', 'reserved', 'contracted', 'live');
```
---
## 2. Neo4j Graph Database Analysis
### 2.1 Missing Critical Indexes
#### **High Priority**
1. **Organization Nodes**
-**Missing**: Full-text search index
```cypher
CREATE FULLTEXT INDEX organization_search_idx FOR (o:Organization) ON EACH [o.name, o.description];
```
-**Missing**: Composite index for sector searches
```cypher
CREATE INDEX org_sector_subtype_idx FOR (o:Organization) ON (o.sector, o.subtype);
```
-**Missing**: Point index for spatial queries (currently using lat/long separately)
```cypher
// After adding point property to nodes
CREATE POINT INDEX org_location_point_idx FOR (o:Organization) ON (o.location);
```
2. **Site Nodes**
-**Missing**: Point index for efficient spatial operations
```cypher
CREATE POINT INDEX site_location_point_idx FOR (s:Site) ON (s.location);
```
- **Impact**: Dramatically improves nearby site queries
3. **Address Nodes**
-**Missing**: ALL indexes!
```cypher
CREATE INDEX address_city_idx FOR (a:Address) ON (a.city);
CREATE INDEX address_region_idx FOR (a:Address) ON (a.region);
CREATE FULLTEXT INDEX address_search_idx FOR (a:Address) ON EACH [a.formatted_ru, a.formatted_en];
```
4. **Resource Flow Nodes**
-**Missing**: Composite index for matching
```cypher
CREATE INDEX rf_org_site_idx FOR (rf:ResourceFlow) ON (rf.organization_id, rf.site_id);
```
---
### 2.2 Missing Critical Relationships
#### **High Priority - Essential for Graph Queries**
1. **Address Relationships**
-**LOCATED_AT exists but not populated!** (0 relationships currently)
- **Should Have**: Organization→Address, Site→Address
- **Fix Required**: Update graph sync to create these relationships
```cypher
// Example of what should exist:
MATCH (o:Organization {id: $org_id}), (a:Address {id: $addr_id})
CREATE (o)-[:LOCATED_AT]->(a)
```
2. **Resource Flow Relationships**
-**HOSTS**: Site→ResourceFlow (not yet created)
```cypher
CREATE (s:Site)-[:HOSTS]->(rf:ResourceFlow)
```
-**PRODUCES/CONSUMES**: Organization→ResourceFlow (semantic clarity)
```cypher
CREATE (o:Organization)-[:PRODUCES]->(rf:ResourceFlow {direction: 'output'})
CREATE (o:Organization)-[:CONSUMES]->(rf:ResourceFlow {direction: 'input'})
```
3. **Match Relationships**
-**MATCHES**: ResourceFlow→ResourceFlow (not yet created)
```cypher
CREATE (source:ResourceFlow)-[:MATCHES {score: 0.85}]->(target:ResourceFlow)
```
4. **Supply Chain Relationships**
-**SUPPLIES**: Organization→Organization (supplier network)
-**COLLABORATES_WITH**: Organization→Organization (existing partnerships)
-**TRUSTS**: Organization→Organization (trust network from JSONB field)
```cypher
CREATE (o1:Organization)-[:SUPPLIES {products: ['heat', 'steam']}]->(o2:Organization)
CREATE (o1:Organization)-[:TRUSTS {score: 0.9}]->(o2:Organization)
```
5. **Shared Asset Relationships**
-**OWNS_ASSET**: Organization→SharedAsset
-**USES_ASSET**: Organization→SharedAsset
-**HAS_ASSET**: Site→SharedAsset
```cypher
CREATE (o:Organization)-[:OWNS_ASSET]->(sa:SharedAsset)
CREATE (o:Organization)-[:USES_ASSET {since: date()}]->(sa:SharedAsset)
```
6. **Spatial Proximity Relationships** (Advanced)
- 🔮 **NEAR**: Organization→Organization / Site→Site (for proximity analysis)
```cypher
// Create relationships for entities within 5km
MATCH (s1:Site), (s2:Site)
WHERE s1.id < s2.id
AND point.distance(s1.location, s2.location) < 5000
CREATE (s1)-[:NEAR {distance_m: point.distance(s1.location, s2.location)}]->(s2)
```
#### **Medium Priority - Enhanced Analytics**
7. **Temporal Relationships**
- 🔮 **OPERATED_AT**: Organization→Site (with time range)
```cypher
CREATE (o:Organization)-[:OPERATED_AT {from: date('2020-01-01'), to: date('2023-12-31')}]->(s:Site)
```
8. **Categorical Relationships**
- 🔮 **IN_SECTOR**: Organization→Sector (for hierarchical sector queries)
- 🔮 **OF_TYPE**: Site→SiteType (for type-based traversal)
---
### 2.3 Graph Schema Improvements
#### **Property Graph Enhancements**
1. **Add Point Properties for Spatial Queries**
```cypher
// Instead of storing lat/long separately, use Neo4j Point type
MATCH (o:Organization)
WHERE o.latitude IS NOT NULL AND o.longitude IS NOT NULL
SET o.location = point({latitude: o.latitude, longitude: o.longitude})
MATCH (s:Site)
WHERE s.latitude IS NOT NULL AND s.longitude IS NOT NULL
SET s.location = point({latitude: s.latitude, longitude: s.longitude})
```
2. **Add Computed Properties**
```cypher
// Add degree centrality for network analysis
MATCH (o:Organization)
SET o.connection_count = size((o)-[:OPERATES_AT|SUPPLIES|COLLABORATES_WITH]-())
// Add resource diversity score
MATCH (o:Organization)
SET o.resource_types_count = size([
(o)-[:PRODUCES|CONSUMES]->(rf:ResourceFlow) | DISTINCT rf.type
])
```
---
## 3. Data Architecture Recommendations
### 3.1 PostgreSQL Optimizations
#### **Immediate Actions (High ROI)**
1. **Enable Required Extensions**
```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm; -- Fuzzy text search
CREATE EXTENSION IF NOT EXISTS btree_gin; -- Multi-column GIN indexes
```
2. **Create Missing Indexes (Priority Order)**
- Organizations: JSONB GIN indexes (products, services, certifications)
- Resource Flows: Matching composite index
- Addresses: Trigram index for fuzzy search
- Sites: Utilities GIN index
- Matches: Active matches composite index
3. **Partition Large Tables** (When scale increases)
```sql
-- Partition resource_flows by created_at (monthly partitions)
CREATE TABLE resource_flows_partitioned (
LIKE resource_flows INCLUDING ALL
) PARTITION BY RANGE (created_at);
-- Create partitions
CREATE TABLE resource_flows_2025_11 PARTITION OF resource_flows_partitioned
FOR VALUES FROM ('2025-11-01') TO ('2025-12-01');
```
4. **Materialized Views for Common Queries**
```sql
-- Active matches dashboard
CREATE MATERIALIZED VIEW mv_active_matches AS
SELECT
m.*,
so.name as source_org_name,
to.name as target_org_name,
srf.type as resource_type
FROM matches m
JOIN resource_flows srf ON m.source_resource_id = srf.id
JOIN resource_flows trf ON m.target_resource_id = trf.id
JOIN organizations so ON srf.organization_id = so.id
JOIN organizations to ON trf.organization_id = to.id
WHERE m.status IN ('suggested', 'negotiating', 'reserved', 'live')
WITH DATA;
CREATE UNIQUE INDEX ON mv_active_matches (id);
CREATE INDEX ON mv_active_matches (compatibility_score DESC);
-- Refresh strategy (can be automated with cron)
REFRESH MATERIALIZED VIEW CONCURRENTLY mv_active_matches;
```
#### **Medium-Term Optimizations**
5. **JSONB Field Normalization** (When querying becomes complex)
- Consider extracting frequently queried JSONB fields to columns
- Example: `products` JSONB → separate `organization_products` table
6. **Archive Old Data**
```sql
-- Archive old matches to separate table
CREATE TABLE matches_archive (LIKE matches INCLUDING ALL);
INSERT INTO matches_archive
SELECT * FROM matches
WHERE status IN ('failed', 'cancelled', 'completed')
AND updated_at < NOW() - INTERVAL '1 year';
DELETE FROM matches
WHERE id IN (SELECT id FROM matches_archive);
```
---
### 3.2 Neo4j Optimizations
#### **Immediate Actions**
1. **Populate Missing Relationships**
- Fix Address sync to create LOCATED_AT relationships
- Add HOSTS relationships during ResourceFlow sync
- Create MATCHES relationships during Match sync
2. **Create Critical Indexes**
```cypher
// Full-text search
CREATE FULLTEXT INDEX organization_search FOR (o:Organization) ON EACH [o.name, o.description];
CREATE FULLTEXT INDEX site_search FOR (s:Site) ON EACH [s.name, s.current_use];
// Spatial
CREATE POINT INDEX org_location FOR (o:Organization) ON (o.location);
CREATE POINT INDEX site_location FOR (s:Site) ON (s.location);
// Address
CREATE INDEX address_city FOR (a:Address) ON (a.city);
```
3. **Add Point Properties**
```cypher
// Convert lat/lng to Point type for efficient spatial queries
MATCH (n:Organization)
WHERE n.latitude IS NOT NULL
SET n.location = point({latitude: n.latitude, longitude: n.longitude});
MATCH (n:Site)
WHERE n.latitude IS NOT NULL
SET n.location = point({latitude: n.latitude, longitude: n.longitude});
```
#### **Enhanced Graph Structure**
4. **Create Proximity Relationships**
```cypher
// Create NEAR relationships for sites within 5km
MATCH (s1:Site), (s2:Site)
WHERE s1.id < s2.id
AND point.distance(s1.location, s2.location) < 5000
CREATE (s1)-[:NEAR {
distance_m: point.distance(s1.location, s2.location),
created_at: datetime()
}]->(s2);
```
5. **Extract Network from JSONB**
```cypher
// Create TRUSTS relationships from trust_network JSONB field
MATCH (o:Organization)
WHERE o.trust_network IS NOT NULL
UNWIND o.trust_network AS trusted_id
MATCH (trusted:Organization {id: trusted_id})
MERGE (o)-[:TRUSTS {score: 0.8}]->(trusted);
```
6. **Create Hierarchical Structures**
```cypher
// Create Sector nodes for better taxonomy queries
CREATE (:Sector {name: 'manufacturing', level: 'primary'})
CREATE (:Sector {name: 'oil_and_gas', level: 'secondary', parent: 'manufacturing'})
// Connect organizations to sectors
MATCH (o:Organization {sector: 'oil_and_gas'})
MATCH (s:Sector {name: 'oil_and_gas'})
CREATE (o)-[:IN_SECTOR]->(s);
```
---
## 4. Query Performance Examples
### 4.1 Before Optimization
**Query: Find potential heat matches within 10km**
```cypher
// Current (inefficient)
MATCH (source:ResourceFlow {type: 'heat', direction: 'output'})
MATCH (target:ResourceFlow {type: 'heat', direction: 'input'})
MATCH (source)-[:HOSTS]-(ss:Site)
MATCH (target)-[:HOSTS]-(ts:Site)
WHERE point.distance(
point({latitude: ss.latitude, longitude: ss.longitude}),
point({latitude: ts.latitude, longitude: ts.longitude})
) < 10000
RETURN source, target
```
**Estimated**: 5-10 seconds on 10,000 flows
### 4.2 After Optimization
```cypher
// Optimized with Point properties and indexes
MATCH (source:ResourceFlow {type: 'heat', direction: 'output'})
MATCH (target:ResourceFlow {type: 'heat', direction: 'input'})
MATCH (source)<-[:HOSTS]-(ss:Site)
MATCH (target)<-[:HOSTS]-(ts:Site)
WHERE point.distance(ss.location, ts.location) < 10000
RETURN source, target, point.distance(ss.location, ts.location) AS distance
ORDER BY distance
```
**Estimated**: <1 second with spatial index
---
## 5. Implementation Priority Matrix
### Phase 1: Critical (Week 1)
1. Create PostgreSQL GIN indexes on JSONB fields
2. Add Point properties to Neo4j nodes
3. Create Neo4j spatial indexes
4. Fix Address LOCATED_AT relationship sync
5. Add full-text search indexes in Neo4j
### Phase 2: Important (Week 2-3)
6. Create HOSTS relationships (SiteResourceFlow)
7. Create MATCHES relationships
8. Add composite indexes in PostgreSQL
9. Create SUPPLIES/TRUSTS relationships
10. Add check constraints
### Phase 3: Optimization (Month 2)
11. Create materialized views
12. Add NEAR proximity relationships
13. Implement sector hierarchy
14. Archive old data
### Phase 4: Advanced (Month 3+)
15. 🔮 Partition large tables
16. 🔮 Normalize complex JSONB fields
17. 🔮 Add graph analytics (PageRank, Community Detection)
18. 🔮 Implement time-series partitioning
---
## 6. Expected Performance Improvements
### Query Performance
| Query Type | Before | After | Improvement |
|-----------|--------|-------|-------------|
| Find organizations by product | Full scan | Index scan | 100x faster |
| Spatial proximity (10km radius) | 5-10s | <100ms | 50-100x faster |
| Match discovery | N/A | <500ms | New capability |
| Full-text search | Not possible | <200ms | New capability |
| Supply chain traversal | N/A | <1s (3 hops) | New capability |
### Storage Efficiency
| Aspect | Current | Optimized | Benefit |
|--------|---------|-----------|---------|
| Index size | ~50MB | ~150MB | Better query performance |
| Redundant data | High (JSONB overlap) | Medium | Consider normalization later |
| Graph density | Low (3 rel types) | High (10+ rel types) | Richer analytics |
---
## 7. Monitoring Recommendations
### PostgreSQL
```sql
-- Identify slow queries
CREATE EXTENSION pg_stat_statements;
-- Check index usage
SELECT schemaname, tablename, indexname, idx_scan, idx_tup_read, idx_tup_fetch
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC;
-- Find unused indexes
SELECT indexrelname, idx_scan, idx_tup_read
FROM pg_stat_user_indexes
WHERE idx_scan = 0 AND indexrelname NOT LIKE '%_pkey';
```
### Neo4j
```cypher
// Query profiling
PROFILE
MATCH (o:Organization)-[:OPERATES_AT]->(s:Site)
WHERE point.distance(o.location, point({latitude: 54.5, longitude: 52.3})) < 5000
RETURN o, s;
// Check index usage
CALL db.stats.retrieve('QUERIES');
```
---
## 8. Next Steps
1. **Review this report** with the development team
2. **Prioritize implementations** based on immediate needs
3. **Create migration scripts** for index creation
4. **Update graph sync service** to create missing relationships
5. **Add monitoring** for query performance
6. **Schedule maintenance windows** for index creation
7. **Document new query patterns** for the team
---
## Appendix A: Complete Index Creation Script
See `migrations/add_optimization_indexes.sql` (to be created)
## Appendix B: Neo4j Relationship Sync Updates
See `internal/repository/graph_*_repository.go` updates needed
---
*End of Report*