

7. Graph Database Design

Base Technology

Graph Database Selection: Start with Neo4j for MVP (best documentation, largest ecosystem), plan migration path to TigerGraph if scale exceeds 10B nodes.

Decision Criteria:

  1. Scalability:

    • Neo4j: Strong until ~50B nodes, then requires clustering
    • ArangoDB: Better horizontal scaling
    • TigerGraph: Designed for very large graphs (100B+ nodes)
    • Memgraph: Fast but less mature ecosystem
  2. Geospatial Support:

    • Neo4j: Native point type and point indexes cover radius queries; polygons and spatial joins need the Neo4j Spatial plugin or PostGIS integration
    • ArangoDB: Built-in geospatial indexes
    • TigerGraph: Requires external PostGIS
  3. Query Performance: Benchmark common queries (5 km radius, temporal overlap, quality matching)

  4. Ecosystem: Community size, cloud managed options, integration with existing stack

  5. Cost: Licensing, cloud costs, operational complexity

Relationships

(Business)-[:OPERATES_AT]->(Site)
(Site)-[:HOSTS]->(ResourceFlow)
(ResourceFlow)-[:MATCHABLE_TO {efficiency, distance, savings}]->(ResourceFlow)
(Site)-[:HOSTS]->(SharedAsset)
(Business)-[:OFFERS]->(Service)
(Business)-[:SELLS]->(Product)
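
The relationship patterns above can be checked mechanically before edges are written to the graph. The following is a minimal sketch, not the project's actual validation code: `SCHEMA`, `MATCH_PROPS`, and `validate_edge` are hypothetical names, and requiring all three `MATCHABLE_TO` properties is an assumption.

```python
# Allowed (source label, relationship type, target label) triples,
# copied from the relationship list in this section.
SCHEMA = {
    ("Business", "OPERATES_AT", "Site"),
    ("Site", "HOSTS", "ResourceFlow"),
    ("ResourceFlow", "MATCHABLE_TO", "ResourceFlow"),
    ("Site", "HOSTS", "SharedAsset"),
    ("Business", "OFFERS", "Service"),
    ("Business", "SELLS", "Product"),
}

# Properties attached to MATCHABLE_TO edges (assumed mandatory here).
MATCH_PROPS = {"efficiency", "distance", "savings"}


def validate_edge(src_label, rel_type, dst_label, props=None):
    """Return a list of problems; an empty list means the edge fits the schema."""
    problems = []
    if (src_label, rel_type, dst_label) not in SCHEMA:
        problems.append(f"unknown pattern ({src_label})-[:{rel_type}]->({dst_label})")
    if rel_type == "MATCHABLE_TO":
        missing = MATCH_PROPS - set(props or {})
        if missing:
            problems.append("missing properties: " + ", ".join(sorted(missing)))
    return problems
```

For example, `validate_edge("Business", "OPERATES_AT", "Site")` returns an empty list, while a reversed pattern or a `MATCHABLE_TO` edge without `distance` is reported.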

Hybrid Architecture for Geospatial Queries

Architecture:

  • Neo4j: Stores graph structure, relationships, quality/temporal properties
  • PostgreSQL+PostGIS: Stores detailed geospatial data, handles complex distance calculations, spatial joins
  • Synchronization: Event-driven sync (Site created/updated → sync to PostGIS)
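
The synchronization step could look roughly like the sketch below, which turns a site event into a parameterized PostGIS upsert. The `SiteEvent` type, the `sites` table layout, and the `%s` placeholder style (as used by psycopg-style DB-API drivers) are assumptions for illustration; `ST_MakePoint` and `ST_SetSRID` are standard PostGIS functions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteEvent:
    """Hypothetical event emitted when a Site node is created or updated."""
    site_id: str
    latitude: float
    longitude: float


# Hypothetical table: sites(site_id TEXT PRIMARY KEY, geom geometry(Point, 4326)).
UPSERT_SQL = (
    "INSERT INTO sites (site_id, geom) "
    "VALUES (%s, ST_SetSRID(ST_MakePoint(%s, %s), 4326)) "
    "ON CONFLICT (site_id) DO UPDATE SET geom = EXCLUDED.geom"
)


def to_upsert(event):
    """Translate a site event into (sql, params) for a DB-API cursor.

    ST_MakePoint takes x (longitude) first, then y (latitude).
    """
    if not (-90 <= event.latitude <= 90 and -180 <= event.longitude <= 180):
        raise ValueError(f"coordinates out of range for site {event.site_id}")
    return UPSERT_SQL, (event.site_id, event.longitude, event.latitude)
```

Because the statement is an upsert, replaying the same event is harmless, which suits at-least-once event delivery.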

Query Pattern:

1. PostGIS: Find all sites within 5 km radius (fast spatial index)
2. Neo4j: Filter by ResourceFlow types, quality, temporal overlap (graph traversal)
3. Join results in application layer or use Neo4j spatial plugin

Alternative: Use Neo4j APOC spatial procedures if graph is primary store.
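
The three-step query pattern can be sketched in memory as follows. This is an illustration only: a haversine distance stands in for the PostGIS radius query, a plain list filter stands in for the Neo4j traversal, and all function names and record fields (`site_id`, `type`, `direction`) are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; stands in for a PostGIS distance query."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def sites_within(sites, lat, lon, radius_km):
    """Step 1: spatial filter. `sites` maps site_id -> (lat, lon)."""
    return {sid for sid, (slat, slon) in sites.items()
            if haversine_km(lat, lon, slat, slon) <= radius_km}


def flows_matching(flows, flow_type, direction):
    """Step 2: attribute filter; stands in for the graph traversal."""
    return [f for f in flows if f["type"] == flow_type and f["direction"] == direction]


def nearby_flows(sites, flows, lat, lon, radius_km, flow_type, direction):
    """Step 3: join both result sets on site_id in the application layer."""
    nearby = sites_within(sites, lat, lon, radius_km)
    return [f for f in flows_matching(flows, flow_type, direction) if f["site_id"] in nearby]
```

Running the spatial filter first keeps the candidate set small before the more expensive attribute and relationship checks.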

Zone-First Architecture for Data Sovereignty

Problem: A single global graph conflicts with local adoption. EU-wide matching requires a unified schema, but local clusters need low latency and sovereign control over their data.

Solution: Zone-first graph architecture where each geographic/regulatory zone operates semi-autonomously:

Zone Types:

  • City Zones: Municipal boundaries, operated by city governments
  • Industrial Park Zones: Single park operators, private industrial clusters
  • Regional Zones: County/state level, cross-municipality coordination
  • Country Zones: National regulatory compliance, standardized schemas

Architecture Pattern:

Zone Database (Local Neo4j/PostgreSQL)
├── Local Graph: Sites, flows, businesses within zone
├── Local Rules: Zone-specific matching logic, regulations
├── Selective Publishing: Choose what to expose globally
└── Data Sovereignty: Zone operator controls data visibility

Global Federation Layer
├── Cross-zone matching requests
├── Federated queries (zone A requests zone B data)
├── Anonymized global analytics
└── Selective data sharing agreements

Key Benefits:

  • Data Sovereignty: Cities/utilities control their data, GDPR compliance
  • Low Latency: Local queries stay within zone boundaries
  • Regulatory Flexibility: Each zone adapts to local waste/energy rules
  • Scalable Adoption: Start with single zones, federate gradually
  • Trust Building: Local operators maintain control while enabling cross-zone matches

Implementation:

  • Zone Registry: Global catalog of active zones with API endpoints
  • Federation Protocol: Standardized cross-zone query interface
  • Data Contracts: Per-zone agreements on what data is shared globally
  • Migration Path: Start mono-zone, add federation as network grows
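
A minimal sketch of how a zone registry and selective publishing might fit together is shown below. `Zone`, `ZoneRegistry`, and their methods are hypothetical names, and modelling a data contract as a set of published field names is an assumption. Federated predicates run against the published view only, so a remote zone cannot filter on fields it is not allowed to see.

```python
class Zone:
    """A zone that owns its records and decides which fields leave the zone."""

    def __init__(self, zone_id, published_fields):
        self.zone_id = zone_id
        self.published_fields = set(published_fields)  # the zone's data contract
        self._flows = []

    def add_flow(self, flow):
        self._flows.append(dict(flow))

    def local_query(self, predicate):
        """Local callers see full records."""
        return [dict(f) for f in self._flows if predicate(f)]

    def federated_query(self, predicate):
        """Remote callers see, and can filter on, published fields only."""
        views = ({k: v for k, v in f.items() if k in self.published_fields}
                 for f in self._flows)
        return [v for v in views if predicate(v)]


class ZoneRegistry:
    """Global catalog of zones; fans a request out to every other zone."""

    def __init__(self):
        self._zones = {}

    def register(self, zone):
        self._zones[zone.zone_id] = zone

    def cross_zone_query(self, requesting_zone_id, predicate):
        return {zone_id: zone.federated_query(predicate)
                for zone_id, zone in self._zones.items()
                if zone_id != requesting_zone_id}
```

In a real deployment the registry would hold API endpoints rather than in-process objects, but the visibility rule stays the same.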

Indexing Strategy

Required Indexes:

  • Spatial Index: Site locations (latitude, longitude)
  • Temporal Index: ResourceFlow availability windows, seasonality
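  • Composite Indexes (Neo4j composite indexes cover properties of a single label, so cross-label and nested fields must be denormalized onto ResourceFlow):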
    • (ResourceFlow.type, ResourceFlow.direction, Site.location)
    • (ResourceFlow.quality.temperature_celsius, ResourceFlow.type)
  • Full-Text Search: Business names, NACE codes, service domains
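
The index list above could be turned into Cypher statements along these lines. This sketch assumes Neo4j 5 index syntax. The index names and the property names (`location`, `temperature_celsius`, `available_from`, `available_to`, `name`, `nace_code`) are hypothetical. Each composite index stays within one label, and the temperature is a flat property, because Neo4j does not index across labels or into nested properties.

```python
def range_index(name, label, properties):
    """Cypher for a single- or multi-property (composite) range index."""
    props = ", ".join(f"n.{p}" for p in properties)
    return f"CREATE INDEX {name} IF NOT EXISTS FOR (n:{label}) ON ({props})"


def point_index(name, label, prop):
    """Cypher for a spatial point index on one property."""
    return f"CREATE POINT INDEX {name} IF NOT EXISTS FOR (n:{label}) ON (n.{prop})"


def fulltext_index(name, label, properties):
    """Cypher for a full-text index over several string properties."""
    props = ", ".join(f"n.{p}" for p in properties)
    return f"CREATE FULLTEXT INDEX {name} IF NOT EXISTS FOR (n:{label}) ON EACH [{props}]"


STATEMENTS = [
    point_index("site_location", "Site", "location"),
    range_index("flow_availability", "ResourceFlow", ["available_from", "available_to"]),
    range_index("flow_type_direction", "ResourceFlow", ["type", "direction"]),
    range_index("flow_temperature_type", "ResourceFlow", ["temperature_celsius", "type"]),
    fulltext_index("business_search", "Business", ["name", "nace_code"]),
]
```

Each string in `STATEMENTS` can be sent as an ordinary query, for example through the Neo4j Python driver's `session.run`. The `IF NOT EXISTS` clause makes the migration safe to rerun.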

Index Maintenance:

  • Monitor query performance and index usage
  • Use Neo4j's EXPLAIN (plan only) and PROFILE (plan with runtime statistics) for query optimization
  • Consider partitioning large graphs by geographic regions

Why Graph DB

Consider a query such as: "find all output nodes within 5 km producing heat at 35–60 °C that match any input nodes needing heat at 30–55 °C, with ΔT ≤ 10 K, availability overlap ≥ 70 %, and net savings > €0.02/kWh."

That is a multi-criteria graph traversal, which a graph database handles naturally.
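
The per-pair test behind such a query can be sketched as a plain predicate. This is a hedged illustration, not the project's matching engine: `HeatFlow`, `availability_overlap`, and `is_match` are hypothetical names. It assumes that ΔT means supply temperature minus required temperature and that overlap is measured against the demand's available hours. The temperature bands from the query are left as a pre-filter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HeatFlow:
    temperature_c: float        # supply temperature (output) or required temperature (input)
    available_hours: frozenset  # hours of the week (0..167) when the flow is available


def availability_overlap(output, demand):
    """Fraction of the demand's hours that the output also covers."""
    if not demand.available_hours:
        return 0.0
    shared = output.available_hours & demand.available_hours
    return len(shared) / len(demand.available_hours)


def is_match(output, demand, distance_km, net_savings_eur_per_kwh,
             max_distance_km=5.0, max_delta_t_k=10.0,
             min_overlap=0.70, min_savings=0.02):
    """Apply the distance, temperature, availability, and savings criteria."""
    delta_t = output.temperature_c - demand.temperature_c
    return (
        distance_km <= max_distance_km
        and 0 <= delta_t <= max_delta_t_k  # hot enough, but not wastefully hotter
        and availability_overlap(output, demand) >= min_overlap
        and net_savings_eur_per_kwh > min_savings
    )
```

Pairs that pass this predicate are the natural candidates for a `MATCHABLE_TO` edge, with distance and savings stored as edge properties.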