## 7. Graph Database Design

### Base Technology

**Graph Database Selection**: Start with **Neo4j** for the MVP (best documentation, largest ecosystem), and plan a migration path to **TigerGraph** if scale exceeds 10B nodes.

**Decision Criteria**:

1. **Scalability**:
   - Neo4j: Strong until ~50B nodes, then requires clustering
   - ArangoDB: Better horizontal scaling
   - TigerGraph: Designed for very large graphs (100B+ nodes)
   - Memgraph: Fast but less mature ecosystem
2. **Geospatial Support**:
   - Neo4j: Native point type and distance functions cover simple radius queries; complex geometries and spatial joins require the APOC library + PostGIS integration
   - ArangoDB: Built-in geospatial indexes
   - TigerGraph: Requires external PostGIS
3. **Query Performance**: Benchmark common queries (5km radius, temporal overlap, quality matching)
4. **Ecosystem**: Community size, cloud managed options, integration with existing stack
5. **Cost**: Licensing, cloud costs, operational complexity

### Relationships

```
(Business)-[:OPERATES_AT]->(Site)
(Site)-[:HOSTS]->(ResourceFlow)
(ResourceFlow)-[:MATCHABLE_TO {efficiency, distance, savings}]->(ResourceFlow)
(Site)-[:HOSTS]->(SharedAsset)
(Business)-[:OFFERS]->(Service)
(Business)-[:SELLS]->(Product)
```

### Hybrid Architecture for Geospatial Queries

**Architecture**:

- **Neo4j**: Stores graph structure, relationships, quality/temporal properties
- **PostgreSQL+PostGIS**: Stores detailed geospatial data, handles complex distance calculations, spatial joins
- **Synchronization**: Event-driven sync (Site created/updated → sync to PostGIS)

**Query Pattern**:

```
1. PostGIS: Find all sites within 5km radius (fast spatial index)
2. Neo4j: Filter by ResourceFlow types, quality, temporal overlap (graph traversal)
3. Join results in application layer or use Neo4j spatial plugin
```

**Alternative**: Use Neo4j APOC spatial procedures if the graph is the primary store.

### Zone-First Architecture for Data Sovereignty

**Problem**: Global graph vs. local adoption conflict. EU-wide matching requires a unified schema, but local clusters need low-latency, sovereign data control.

**Solution**: **Zone-first graph architecture** where each geographic/regulatory zone operates semi-autonomously.

**Zone Types**:

- **City Zones**: Municipal boundaries, operated by city governments
- **Industrial Park Zones**: Single park operators, private industrial clusters
- **Regional Zones**: County/state level, cross-municipality coordination
- **Country Zones**: National regulatory compliance, standardized schemas

**Architecture Pattern**:

```
Zone Database (Local Neo4j/PostgreSQL)
├── Local Graph: Sites, flows, businesses within zone
├── Local Rules: Zone-specific matching logic, regulations
├── Selective Publishing: Choose what to expose globally
└── Data Sovereignty: Zone operator controls data visibility

Global Federation Layer
├── Cross-zone matching requests
├── Federated queries (zone A requests zone B data)
├── Anonymized global analytics
└── Selective data sharing agreements
```

**Key Benefits**:

- **Data Sovereignty**: Cities/utilities control their data, which supports GDPR compliance
- **Low Latency**: Local queries stay within zone boundaries
- **Regulatory Flexibility**: Each zone adapts to local waste/energy rules
- **Scalable Adoption**: Start with single zones, federate gradually
- **Trust Building**: Local operators maintain control while enabling cross-zone matches

**Implementation**:

- **Zone Registry**: Global catalog of active zones with API endpoints
- **Federation Protocol**: Standardized cross-zone query interface
- **Data Contracts**: Per-zone agreements on what data is shared globally
- **Migration Path**: Start mono-zone, add federation as network grows

### Indexing Strategy

**Required Indexes**:

- **Spatial Index**: Site locations (latitude, longitude)
- **Temporal Index**: ResourceFlow availability windows, seasonality
- **Composite Indexes**:
  - (ResourceFlow.type, ResourceFlow.direction, Site.location); Neo4j composite indexes cover a single label, so this one requires denormalizing the site location onto ResourceFlow
  - (ResourceFlow.quality.temperature_celsius, ResourceFlow.type)
- **Full-Text Search**: Business names, NACE codes, service domains

**Index Maintenance**:

- Monitor query performance and index usage
- Use Neo4j's `EXPLAIN` and `PROFILE` for query optimization (`EXPLAIN` shows the plan without running the query; `PROFILE` runs it and reports actual row counts and database hits)
- Consider partitioning large graphs by geographic regions

### Why Graph DB

Queries like: "find all output nodes within 5 km producing heat 35–60 °C that match any input nodes needing heat 30–55 °C, ΔT ≤ 10 K, availability overlap ≥ 70 %, and net savings > €0.02/kWh."

That's a multi-criteria graph traversal, which is exactly the workload a graph database is built for.

---
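
To make the example query from "Why Graph DB" concrete, the sketch below spells out its criteria as a plain Python predicate over one output/input pair. It is only an illustration under several assumptions that the source does not specify: the `Flow` class and its fields are hypothetical, ΔT is read as the absolute difference between supply and demand temperature, availability overlap is read as the share of the input's required hours that the output covers, and net savings are passed in precomputed.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt


@dataclass(frozen=True)
class Flow:
    """Hypothetical flattened view of a ResourceFlow plus its Site location."""
    lat: float
    lon: float
    temp_c: float       # supply temperature (output) or required temperature (input)
    hours: frozenset    # hours of the week (0..167) in which the flow is available


def haversine_km(a: Flow, b: Flow) -> float:
    """Great-circle distance between the two sites in kilometres."""
    dlat = radians(b.lat - a.lat)
    dlon = radians(b.lon - a.lon)
    h = (sin(dlat / 2) ** 2
         + cos(radians(a.lat)) * cos(radians(b.lat)) * sin(dlon / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(h))


def overlap_fraction(out: Flow, inp: Flow) -> float:
    """Assumed definition: share of the input's hours that the output covers."""
    if not inp.hours:
        return 0.0
    return len(out.hours & inp.hours) / len(inp.hours)


def is_match(out: Flow, inp: Flow, net_savings_eur_per_kwh: float,
             max_km: float = 5.0, max_delta_k: float = 10.0,
             min_overlap: float = 0.70, min_savings: float = 0.02) -> bool:
    """Conjunction of the criteria named in the example query."""
    return (
        35.0 <= out.temp_c <= 60.0                        # output produces heat 35-60 °C
        and 30.0 <= inp.temp_c <= 55.0                    # input needs heat 30-55 °C
        and abs(out.temp_c - inp.temp_c) <= max_delta_k   # ΔT <= 10 K (assumed absolute)
        and haversine_km(out, inp) <= max_km              # within 5 km
        and overlap_fraction(out, inp) >= min_overlap     # availability overlap >= 70 %
        and net_savings_eur_per_kwh > min_savings         # net savings > EUR 0.02/kWh
    )
```

In the hybrid architecture described in this section, the distance check would normally run against the spatial index and the remaining filters inside the graph traversal; the predicate only makes the individual criteria explicit, for example as a reference implementation for tests.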