
7. Graph Database Design

Base Technology

Graph Database Selection: Start with Neo4j for the MVP (best documentation, largest ecosystem), and plan a migration path to TigerGraph if scale exceeds ~10B nodes.

Decision Criteria:

  1. Scalability:
    • Neo4j: Strong until ~50B nodes, then requires clustering
    • ArangoDB: Better horizontal scaling
    • TigerGraph: Designed for very large graphs (100B+ nodes)
    • Memgraph: Fast but less mature ecosystem
  2. Geospatial Support:
    • Neo4j: Requires APOC library + PostGIS integration
    • ArangoDB: Built-in geospatial indexes
    • TigerGraph: Requires external PostGIS
  3. Query Performance: Benchmark common queries (5 km radius, temporal overlap, quality matching)
  4. Ecosystem: Community size, cloud managed options, integration with existing stack
  5. Cost: Licensing, cloud costs, operational complexity

Relationships

(Business)-[:OPERATES_AT]->(Site)
(Site)-[:HOSTS]->(ResourceFlow)
(ResourceFlow)-[:MATCHABLE_TO {efficiency, distance, savings}]->(ResourceFlow)
(Site)-[:HOSTS]->(SharedAsset)
(Business)-[:OFFERS]->(Service)
(Business)-[:SELLS]->(Product)
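
To make the MATCHABLE_TO edge concrete, here is a minimal Cypher sketch of how the matching pipeline could materialize it between two compatible flows. Details beyond the schema above (Site.location stored as a native point, the hard-coded metric values) are illustrative assumptions:

```cypher
// Materialize a match edge between a heat output and a heat input,
// storing the match metrics on the relationship itself.
MATCH (s1:Site)-[:HOSTS]->(supply:ResourceFlow {type: 'heat', direction: 'output'})
MATCH (s2:Site)-[:HOSTS]->(demand:ResourceFlow {type: 'heat', direction: 'input'})
WHERE s1 <> s2
MERGE (supply)-[m:MATCHABLE_TO]->(demand)
SET m.distance   = point.distance(s1.location, s2.location), // metres
    m.efficiency = 0.85, // placeholder: computed by the matching service
    m.savings    = 0.03  // placeholder: €/kWh, computed by the matching service
```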

Hybrid Architecture for Geospatial Queries

Architecture:

  • Neo4j: Stores graph structure, relationships, quality/temporal properties
  • PostgreSQL+PostGIS: Stores detailed geospatial data, handles complex distance calculations, spatial joins
  • Synchronization: Event-driven sync (Site created/updated → sync to PostGIS)

Query Pattern:

1. PostGIS: Find all sites within a 5 km radius (fast spatial index)
2. Neo4j: Filter by ResourceFlow type, quality, and temporal overlap (graph traversal)
3. Join results in the application layer, or use the Neo4j spatial plugin

Alternative: Use Neo4j APOC spatial procedures if graph is primary store.
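
For that alternative, the radius filter can also be pushed into plain Cypher using Neo4j's native point type instead of APOC, collapsing steps 1 and 2 into a single query. A sketch, assuming Site.location is stored as a point and backed by a point index:

```cypher
// Radius filter via the native spatial index, then graph traversal
// for flow compatibility, all in one round trip.
MATCH (origin:Site {id: $siteId})
MATCH (candidate:Site)-[:HOSTS]->(flow:ResourceFlow)
WHERE candidate <> origin
  AND point.distance(origin.location, candidate.location) <= 5000 // metres
  AND flow.direction = 'output'
  AND flow.type = $resourceType
RETURN candidate, flow
```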

Zone-First Architecture for Data Sovereignty

Problem: The global graph and local adoption pull in opposite directions. EU-wide matching requires a unified schema, but local clusters need low latency and sovereign control over their data.

Solution: Zone-first graph architecture where each geographic/regulatory zone operates semi-autonomously:

Zone Types:

  • City Zones: Municipal boundaries, operated by city governments
  • Industrial Park Zones: Single park operators, private industrial clusters
  • Regional Zones: County/state level, cross-municipality coordination
  • Country Zones: National regulatory compliance, standardized schemas

Architecture Pattern:

Zone Database (Local Neo4j/PostgreSQL)
├── Local Graph: Sites, flows, businesses within zone
├── Local Rules: Zone-specific matching logic, regulations
├── Selective Publishing: Choose what to expose globally
└── Data Sovereignty: Zone operator controls data visibility

Global Federation Layer
├── Cross-zone matching requests
├── Federated queries (zone A requests zone B data)
├── Anonymized global analytics
└── Selective data sharing agreements

Key Benefits:

  • Data Sovereignty: Cities/utilities control their data, GDPR compliance
  • Low Latency: Local queries stay within zone boundaries
  • Regulatory Flexibility: Each zone adapts to local waste/energy rules
  • Scalable Adoption: Start with single zones, federate gradually
  • Trust Building: Local operators maintain control while enabling cross-zone matches

Implementation:

  • Zone Registry: Global catalog of active zones with API endpoints
  • Federation Protocol: Standardized cross-zone query interface
  • Data Contracts: Per-zone agreements on what data is shared globally
  • Migration Path: Start mono-zone, add federation as network grows
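
At the query level, one possible shape for the federation protocol, assuming zones are exposed as constituents of a Neo4j composite (Fabric) database, is a cross-zone UNION where each branch sees only what that zone's data contract publishes. The zone names and the published flag here are hypothetical:

```cypher
// Hypothetical federated query across two zones; each zone answers
// only with flows it has chosen to publish globally.
USE federation.city_zone_a
MATCH (s:Site)-[:HOSTS]->(f:ResourceFlow {type: 'heat', direction: 'output'})
WHERE f.published = true
RETURN s.id AS site, f.id AS flow, 'city_zone_a' AS zone
UNION
USE federation.industrial_park_b
MATCH (s:Site)-[:HOSTS]->(f:ResourceFlow {type: 'heat', direction: 'output'})
WHERE f.published = true
RETURN s.id AS site, f.id AS flow, 'industrial_park_b' AS zone
```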

Indexing Strategy

Required Indexes:

  • Spatial Index: Site locations (latitude, longitude)
  • Temporal Index: ResourceFlow availability windows, seasonality
  • Composite Indexes:
    • (ResourceFlow.type, ResourceFlow.direction, Site.location)
    • (ResourceFlow.quality.temperature_celsius, ResourceFlow.type)
  • Full-Text Search: Business names, NACE codes, service domains
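
In Neo4j 5 syntax these translate roughly as follows. Note that a Neo4j index covers a single label, so cross-entity combinations like (ResourceFlow.type, Site.location) decompose into per-label indexes joined at query time; property names such as nace_code and the availability window fields are assumptions:

```cypher
// Spatial index on Site locations (native point type)
CREATE POINT INDEX site_location IF NOT EXISTS
FOR (s:Site) ON (s.location);

// Temporal index on ResourceFlow availability windows (assumed fields)
CREATE RANGE INDEX flow_availability IF NOT EXISTS
FOR (f:ResourceFlow) ON (f.available_from, f.available_to);

// Composite index backing the main matching lookup
CREATE RANGE INDEX flow_type_direction IF NOT EXISTS
FOR (f:ResourceFlow) ON (f.type, f.direction);

// Full-text search over business names and NACE codes
CREATE FULLTEXT INDEX business_search IF NOT EXISTS
FOR (b:Business) ON EACH [b.name, b.nace_code];
```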

Index Maintenance:

  • Monitor query performance and index usage
  • Use Neo4j's EXPLAIN and PROFILE to analyze and tune slow queries
  • Consider partitioning large graphs by geographic regions

Why Graph DB

Queries like: "find all output nodes within 5 km producing heat at 35–60 °C that match any input nodes needing heat at 30–55 °C, with ΔT ≤ 10 K, availability overlap ≥ 70 %, and net savings > €0.02/kWh."

That's a multi-criteria graph traversal — a perfect fit for a graph database.
That's a multi-criteria graph traversal — perfect fit.