Repository Structure:
- Move files from cluttered root directory into organized structure
- Create archive/ for archived data and scraper results
- Create bugulma/ for the complete application (frontend + backend)
- Create data/ for sample datasets and reference materials
- Create docs/ for comprehensive documentation structure
- Create scripts/ for utility scripts and API tools
Backend Implementation:
- Implement 3 missing backend endpoints identified in gap analysis:
* GET /api/v1/organizations/{id}/matching/direct - Direct symbiosis matches
* GET /api/v1/users/me/organizations - User organizations
* POST /api/v1/proposals/{id}/status - Update proposal status
- Add complete proposal domain model, repository, and service layers
- Create database migration for proposals table
- Fix CLI server command registration issue
API Documentation:
- Add comprehensive proposals.md API documentation
- Update README.md with Users and Proposals API sections
- Document all request/response formats, error codes, and business rules
Code Quality:
- Follow existing Go backend architecture patterns
- Add proper error handling and validation
- Match frontend expected response schemas
- Maintain clean separation of concerns (handler -> service -> repository)
4. Data Model (Schema / Ontology)
The platform uses a structured data model to represent industrial resources, businesses, sites, and relationships. Each resource is captured with what, how much, quality, when, and where attributes.
Core Schemas
ResourceFlow - Represents resource inputs, outputs, and services at specific sites: See schemas/resource_flow.json for complete schema definition
Site - Represents physical locations and buildings where business activities occur: See schemas/site.json for complete schema definition
Business - Represents legal/commercial entities and their operational capabilities: See schemas/business.json for complete schema definition
SharedAsset - Equipment and infrastructure at specific sites that can be shared among businesses: See schemas/shared_asset.json for complete schema definition
Businesses then just publish ResourceFlow objects to the system.
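Illustratively, a published flow could look like the sketch below (the field names mirror the SQL schema later in this section, but the IDs and values are made up; the authoritative definition is schemas/resource_flow.json):

```python
# Hypothetical ResourceFlow a business might publish. Field names follow
# the resource_flows table in this document; IDs and values are invented.
waste_heat_flow = {
    "business_id": "business-123",            # who publishes the flow
    "site_id": "site-456",                    # where it occurs
    "direction": "output",                    # an offered output
    "type": "heat",                           # what
    "temperature_celsius": 85.0,              # quality
    "physical_state": "liquid",
    "amount": 120.0,                          # how much
    "unit": "MWh",
    "temporal_unit": "month",
    "availability": {"mon-fri": "06:00-22:00"},  # when
    "precision_level": "estimated",
}

# Sanity checks mirroring the CHECK constraints in the SQL schema below.
assert waste_heat_flow["direction"] in ("input", "output")
assert waste_heat_flow["physical_state"] in ("solid", "liquid", "gas")
```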
Neo4j Graph Database Schema
Node Schemas
// Business Node
CREATE CONSTRAINT business_id_unique IF NOT EXISTS
FOR (b:Business) REQUIRE b.id IS UNIQUE;
CREATE CONSTRAINT business_email_unique IF NOT EXISTS
FOR (b:Business) REQUIRE b.email IS UNIQUE;
// Site Node
CREATE CONSTRAINT site_id_unique IF NOT EXISTS
FOR (s:Site) REQUIRE s.id IS UNIQUE;
// ResourceFlow Node
CREATE CONSTRAINT resource_flow_id_unique IF NOT EXISTS
FOR (rf:ResourceFlow) REQUIRE rf.id IS UNIQUE;
// Indexes for performance
CREATE INDEX business_name_index IF NOT EXISTS FOR (b:Business) ON (b.name);
CREATE INDEX site_location_index IF NOT EXISTS FOR (s:Site) ON (s.latitude, s.longitude);
CREATE INDEX resource_flow_type_direction_index IF NOT EXISTS
FOR (rf:ResourceFlow) ON (rf.type, rf.direction);
CREATE INDEX resource_flow_quality_temp_index IF NOT EXISTS
FOR (rf:ResourceFlow) ON (rf.temperature_celsius);
Relationship Schemas
// Core Relationships
// Core Relationships
// Neo4j does not enforce relationship schemas, so these patterns are
// documented here and created/maintained by the application layer:
// (:Business)-[:OPERATES_AT]->(:Site)
// (:Site)-[:HOSTS]->(:ResourceFlow)
// (:ResourceFlow)-[:MATCHABLE_TO]->(:ResourceFlow)
// (:Business)-[:OFFERS]->(:Service)
// (:Business)-[:SELLS]->(:Product)
// (:Site)-[:HOSTS]->(:SharedAsset)
// (:Business)-[:TRUSTS]->(:Business)
Node Property Constraints
// Neo4j property existence constraints cover one property each,
// so each required property gets its own constraint.
// Business Node Properties
CREATE CONSTRAINT business_id_not_null IF NOT EXISTS
FOR (b:Business) REQUIRE b.id IS NOT NULL;
CREATE CONSTRAINT business_name_not_null IF NOT EXISTS
FOR (b:Business) REQUIRE b.name IS NOT NULL;
CREATE CONSTRAINT business_email_not_null IF NOT EXISTS
FOR (b:Business) REQUIRE b.email IS NOT NULL;
// Site Node Properties (repeat likewise for s.name, s.latitude, s.longitude)
CREATE CONSTRAINT site_id_not_null IF NOT EXISTS
FOR (s:Site) REQUIRE s.id IS NOT NULL;
// ResourceFlow Node Properties (repeat likewise for rf.business_id,
// rf.site_id, rf.direction, rf.type)
CREATE CONSTRAINT resource_flow_id_not_null IF NOT EXISTS
FOR (rf:ResourceFlow) REQUIRE rf.id IS NOT NULL;
PostgreSQL Spatial Database Schema
Core Tables
-- Enable required extensions
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
CREATE EXTENSION IF NOT EXISTS "postgis";
-- Business table
CREATE TABLE businesses (
id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
name VARCHAR(255) NOT NULL,
legal_form VARCHAR(50),
primary_contact_email VARCHAR(255) NOT NULL,
primary_contact_phone VARCHAR(50),
industrial_sector VARCHAR(10), -- NACE code
company_size INTEGER,
years_operation INTEGER,
supply_chain_role VARCHAR(50),
certifications JSONB DEFAULT '[]',
business_focus JSONB DEFAULT '[]',
strategic_vision TEXT,
drivers_barriers TEXT,
readiness_maturity INTEGER CHECK (readiness_maturity BETWEEN 1 AND 5),
created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);
-- Site table with spatial index
CREATE TABLE sites (
id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
name VARCHAR(255) NOT NULL,
address TEXT,
latitude DECIMAL(10,8) NOT NULL,
longitude DECIMAL(11,8) NOT NULL,
location GEOGRAPHY(POINT, 4326), -- PostGIS spatial column
site_type VARCHAR(50),
floor_area_m2 DECIMAL(12,2),
ownership VARCHAR(50),
owner_business_id UUID REFERENCES businesses(id),
available_utilities JSONB DEFAULT '[]',
parking_spaces INTEGER,
loading_docks INTEGER,
crane_capacity_tonnes DECIMAL(8,2),
energy_rating VARCHAR(100),
waste_management JSONB DEFAULT '[]',
environmental_impact TEXT,
created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);
-- Resource flows table
CREATE TABLE resource_flows (
id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
business_id UUID NOT NULL REFERENCES businesses(id),
site_id UUID NOT NULL REFERENCES sites(id),
direction VARCHAR(10) NOT NULL CHECK (direction IN ('input', 'output')),
type VARCHAR(50) NOT NULL,
-- Quality parameters
temperature_celsius DECIMAL(6,2),
pressure_bar DECIMAL(8,2),
purity_pct DECIMAL(5,2),
grade VARCHAR(100),
hazardousness BOOLEAN DEFAULT FALSE,
composition TEXT,
physical_state VARCHAR(20) CHECK (physical_state IN ('solid', 'liquid', 'gas')),
-- Quantity parameters
amount DECIMAL(15,4),
unit VARCHAR(50),
temporal_unit VARCHAR(50),
variability DECIMAL(4,3),
-- Time profile
availability JSONB DEFAULT '{}',
seasonality JSONB DEFAULT '[]',
supply_pattern VARCHAR(50),
-- Economic data
cost_in DECIMAL(10,4), -- €/unit
cost_out DECIMAL(10,4), -- €/unit
waste_disposal_cost DECIMAL(10,4),
primary_input_cost DECIMAL(10,4),
transportation_cost DECIMAL(8,4), -- €/km
cost_sharing_fraction DECIMAL(3,2),
-- Constraints
max_distance_km DECIMAL(8,2),
requires_permit BOOLEAN DEFAULT FALSE,
min_quality_threshold TEXT,
regulatory_compliance BOOLEAN DEFAULT TRUE,
-- Precision level
precision_level VARCHAR(20) DEFAULT 'estimated'
CHECK (precision_level IN ('rough', 'estimated', 'measured')),
-- Validation
source_type VARCHAR(50) DEFAULT 'declared'
CHECK (source_type IN ('declared', 'device', 'calculated')),
device_signature VARCHAR(255), -- For verified device data
created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);
-- Versioned resource flows for temporal tracking
CREATE TABLE resource_flow_versions (
id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
resource_flow_id UUID NOT NULL REFERENCES resource_flows(id),
version_number INTEGER NOT NULL,
changes JSONB NOT NULL, -- What changed
changed_by UUID, -- User who made change
change_reason TEXT,
valid_from TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
valid_to TIMESTAMP WITH TIME ZONE,
created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
UNIQUE(resource_flow_id, version_number)
);
Indexes and Constraints
-- Spatial index for location-based queries
CREATE INDEX idx_sites_location ON sites USING GIST (location);
-- Performance indexes
CREATE INDEX idx_resource_flows_business_site ON resource_flows (business_id, site_id);
CREATE INDEX idx_resource_flows_type_direction ON resource_flows (type, direction);
CREATE INDEX idx_resource_flows_temperature ON resource_flows (temperature_celsius);
CREATE INDEX idx_resource_flows_precision ON resource_flows (precision_level);
CREATE INDEX idx_resource_flows_updated ON resource_flows (updated_at);
-- Partial indexes for common queries
CREATE INDEX idx_resource_flows_output_heat ON resource_flows (site_id, temperature_celsius)
WHERE direction = 'output' AND type = 'heat';
CREATE INDEX idx_resource_flows_input_heat ON resource_flows (site_id, temperature_celsius)
WHERE direction = 'input' AND type = 'heat';
-- JSONB indexes for complex queries
CREATE INDEX idx_sites_utilities ON sites USING GIN (available_utilities);
CREATE INDEX idx_businesses_certifications ON businesses USING GIN (certifications);
Data Model Enhancements
Multi-Tenancy Support
Recommendation: Support multi-tenancy from day one, even if single-tenant initially.
Approach: Hybrid
- One graph per region/municipality (e.g., one per industrial park or district)
- Graph federation for cross-region queries
- A tenant_id property on all nodes for future cross-tenant analytics (aggregated)
Data Versioning and Audit Trail
Implementation:
- Event Sourcing: Store all changes as events
- Snapshots: Periodically create snapshots for fast current-state queries
- Audit Log: Complete history of who changed what and when
Entities Requiring Versioning:
- ResourceFlow (quantity, quality changes affect matches)
- Business (certifications, capabilities)
- Site (location, infrastructure changes)
- Match (status changes: proposed → accepted → implemented)
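The event-sourcing approach above can be sketched in a few lines; the class and function names here are illustrative, not the platform's API:

```python
# Minimal event-sourcing sketch (names are illustrative, not the platform
# API): every change is stored as an event; current state is rebuilt by
# replaying the events, and a snapshot can be taken for fast reads.
from dataclasses import dataclass, field

@dataclass
class Event:
    entity_id: str
    changes: dict       # what changed
    changed_by: str     # who changed it (audit trail)

@dataclass
class EventStore:
    events: list = field(default_factory=list)

    def append(self, event: Event) -> None:
        self.events.append(event)

    def replay(self, entity_id: str) -> dict:
        """Fold all events for one entity into its current state."""
        state: dict = {}
        for e in self.events:
            if e.entity_id == entity_id:
                state.update(e.changes)
        return state

store = EventStore()
store.append(Event("rf-1", {"amount": 100.0, "unit": "MWh"}, "user-a"))
store.append(Event("rf-1", {"amount": 120.0}, "user-b"))  # quantity change
snapshot = store.replay("rf-1")  # current state for fast queries
```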
Temporal Data Handling
Architecture:
- Graph Database: Current state, relationships, metadata
- Time-Series Database: Historical ResourceFlow measurements, load curves, seasonal patterns
- Integration: Sync aggregated time-series data to graph nodes as properties
Use Cases:
- Historical analysis: "What was heat demand last winter?"
- Pattern detection: Identify recurring availability patterns
- Forecasting: Predict future resource availability
- Load curves: Real-time meter data integration
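The integration step, syncing aggregated time-series data onto graph-node properties, reduces to a small fold; the property name and numbers below are made up for illustration:

```python
# Sketch of syncing aggregated time-series data to a graph node: reduce
# raw monthly meter totals to a single node property. The property name
# and the readings are invented for illustration.
from statistics import mean

readings = [("2024-12", 118.0), ("2025-01", 131.0), ("2025-02", 124.0)]  # MWh/month
node_properties = {"avg_monthly_heat_mwh": round(mean(v for _, v in readings), 1)}
```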
Data Quality and Validation
Layers:
- Input Validation: Schema validation (JSON Schema, Zod, Pydantic)
- Business Logic Validation: Quality ranges, quantity constraints
- Cross-Entity Validation: Ensure Site belongs to Business, ResourceFlow belongs to Site
- Data Quality Metrics: Completeness, accuracy, consistency scores
Implementation:
- Validation at API layer using github.com/go-playground/validator/v10
- Background data quality jobs (Go workers)
- User-facing data quality dashboard
- Automated data quality reports
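The API backend does this validation in Go with github.com/go-playground/validator/v10; as a language-neutral sketch, the same checks (taken from the CHECK constraints in the SQL schema above) look like:

```python
# Illustrative input-validation sketch mirroring the CHECK constraints in
# the resource_flows table. The real backend validates at the API layer
# with github.com/go-playground/validator/v10.
VALID_DIRECTIONS = {"input", "output"}
VALID_STATES = {"solid", "liquid", "gas"}
VALID_PRECISION = {"rough", "estimated", "measured"}

def validate_resource_flow(flow: dict) -> list:
    """Return a list of validation errors (empty list means valid)."""
    errors = []
    for required in ("business_id", "site_id", "direction", "type"):
        if not flow.get(required):
            errors.append(f"{required} is required")
    if flow.get("direction") not in VALID_DIRECTIONS:
        errors.append("direction must be 'input' or 'output'")
    if "physical_state" in flow and flow["physical_state"] not in VALID_STATES:
        errors.append("physical_state must be solid, liquid or gas")
    if flow.get("precision_level", "estimated") not in VALID_PRECISION:
        errors.append("precision_level must be rough, estimated or measured")
    purity = flow.get("purity_pct")
    if purity is not None and not 0 <= purity <= 100:
        errors.append("purity_pct must be between 0 and 100")
    return errors

ok = validate_resource_flow({"business_id": "b1", "site_id": "s1",
                             "direction": "output", "type": "heat"})
bad = validate_resource_flow({"direction": "sideways", "type": "heat"})
```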
Data Precision Levels & Privacy Tiers
Precision Levels (allow rough data without blocking adoption):
- rough (±50%): Ballpark estimates for initial matching
- estimated (±20%): Calculated from known processes
- measured (±5%): From meters/IoT devices
Matching Engine: Weighs "measured" matches higher but includes rough estimates in results.
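One way to realise that weighting is a simple per-level multiplier; only the ordering (measured > estimated > rough) comes from the text above, the multiplier values here are assumptions:

```python
# Illustrative precision weighting for match scores. The multiplier
# values are assumptions; only the ordering measured > estimated > rough
# is specified by the text above.
PRECISION_WEIGHT = {"measured": 1.0, "estimated": 0.8, "rough": 0.5}

def weighted_score(base_score: float, precision_level: str) -> float:
    # Unknown levels fall back to the most conservative weight, so rough
    # estimates are down-weighted but never excluded from results.
    return base_score * PRECISION_WEIGHT.get(precision_level, 0.5)

# A measured match outranks a rough one with the same base score.
assert weighted_score(0.9, "measured") > weighted_score(0.9, "rough")
```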
Privacy Tiers & Ownership Model (GDPR-compliant, prevents data resale):
Visibility Matrix by Stakeholder Type:
- Individual Companies: See potential matches only (anonymized: "Company X in sector Y, 5km away")
- Cities/Municipalities: Aggregate cluster views (total waste heat available, CO₂ potential) - no individual company data
- Utilities: Network topology data for planning (pipe routes, capacity) - no commercial pricing
- Platform: Raw data for matching only (never shared, GDPR processor role)
Privacy Tiers (differential privacy applied):
- public: Resource type, rough quantity range, location cluster (visible to all platform users)
- network-only: Detailed specs, pricing (visible only to potential matches after mutual opt-in)
- private: Full technical details, ERP feeds (visible only to platform after k-anonymization)
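A sketch of how these visibility tiers could gate what a viewer sees; the field groupings are assumptions for illustration, only the tier semantics come from the text above:

```python
# Illustrative tier gating: which ResourceFlow fields a viewer may see.
# The field groupings are assumptions; the tiers come from the text above.
TIER_FIELDS = {
    "public": {"type", "quantity_range", "location_cluster"},
    "network-only": {"type", "quantity_range", "location_cluster",
                     "temperature_celsius", "pricing"},
}

def visible_fields(flow: dict, tier: str, mutual_opt_in: bool) -> dict:
    """Return the subset of `flow` a viewer at `tier` may see.

    'private' data never leaves the platform, so it yields nothing here.
    """
    if tier == "network-only" and not mutual_opt_in:
        tier = "public"                      # no opt-in: public view only
    allowed = TIER_FIELDS.get(tier, set())   # 'private' -> empty set
    return {k: v for k, v in flow.items() if k in allowed}

flow = {"type": "heat", "quantity_range": "100-150 MWh/month",
        "location_cluster": "district-7", "pricing": "32 EUR/MWh",
        "temperature_celsius": 85.0}
assert "pricing" not in visible_fields(flow, "network-only", mutual_opt_in=False)
assert "pricing" in visible_fields(flow, "network-only", mutual_opt_in=True)
```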
Data Ownership Rules:
- Company Data: Companies retain full ownership - can revoke visibility, delete data, export at any time
- Utility Data: Platform acts as processor only - utilities cannot resell or commercialize data
- Municipal Data: Cities get read-only aggregates - individual facility data never exposed
- Platform Data: Matching algorithms and aggregated analytics (anonymized)
GDPR/DPA Compliance Layer:
- Legal Basis: Contractual necessity for service provision, legitimate interest for anonymized analytics
- Data Processing Agreement: Standard DPA template for all enterprise customers
- Anonymization Pipeline: k-anonymity for sensitive flows (minimum 5 similar entities)
- Right to Erasure: Full data deletion within 30 days, cascade to all historical matches
- Data Portability: Export facility data in structured format (JSON/CSV)
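The k-anonymity rule above (minimum 5 similar entities) amounts to a suppression filter over group counts, sketched below:

```python
# Sketch of the k-anonymity rule: an aggregate is released only if at
# least k entities fall into the group; smaller groups are suppressed.
from collections import Counter

K_MIN = 5  # "minimum 5 similar entities" from the compliance rules above

def k_anonymous_counts(group_keys: list, k: int = K_MIN) -> dict:
    """Count entities per group, dropping groups smaller than k."""
    counts = Counter(group_keys)
    return {group: n for group, n in counts.items() if n >= k}

groups = ["heat/district-7"] * 6 + ["solvent/district-7"] * 2
released = k_anonymous_counts(groups)
# The solvent group (2 entities) is suppressed; heat (6 entities) is released.
assert released == {"heat/district-7": 6}
```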
Structured History Storage (Data Moat & Analytics Foundation):
- Versioned Resource Profiles: Time-stamped snapshots of all resource declarations with validity periods
- Match Attempt Logging: Complete audit trail of all matching attempts, scores, and outcomes
- Failure Intelligence Layer: Structured reasons for failed/declined matches (distance, capex, legal, unwilling)
- Economic Snapshot Preservation: Frozen prices, volumes, assumptions at time of calculation
- Implementation Tracking: Status pipeline (proposed → accepted → technical → economic → legal → capex → operational)
History Storage Architecture:
-- Layer 1: Always versioned (MRV/analytics foundation)
resource_profile_history (
id, facility_id, resource_type, payload_json,
valid_from_ts, valid_to_ts, source, quality_flag
)
-- Layer 2: Match intelligence (algorithm training)
match_attempt_log (
id, timestamp, candidate_a_id, candidate_b_id,
engine_version, score, outcome, outcome_reason_code
)
-- Layer 3: Event telemetry (optional, aggregate-only)
sensor_snapshot (id, facility_id, timestamp, readings_json)
marketplace_transaction (id, timestamp, buyer_id, seller_id, value)
History Benefits:
- MRV Compliance: Prove before/after reductions, time-series validation
- Algorithm Improvement: Failed matches inform better scoring models
- Policy Intelligence: "If we relax temp limits by 5°C, unlock 11 more matches"
- Municipal Analytics: Quarterly CO₂ dashboards with trend analysis
- Sales Intelligence: Demonstrate value over time to prospects
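The "relax temp limits" what-if above can be computed directly from the logged flows; a sketch, with made-up temperatures:

```python
# Sketch of a policy what-if over logged heat flows: how many
# supply/demand pairs become feasible if the temperature limit is
# relaxed? All temperatures below are invented for illustration.
def feasible_pairs(outputs, inputs, relax_celsius=0.0):
    """Count (output, input) pairs where the supply temperature meets the
    demand's minimum, after relaxing the requirement by relax_celsius."""
    return sum(1 for supply_t in outputs for min_t in inputs
               if supply_t >= min_t - relax_celsius)

supply = [62.0, 70.0, 88.0]   # logged output heat temperatures (degC)
demand = [65.0, 75.0]         # minimum input temperatures (degC)
baseline = feasible_pairs(supply, demand)                 # strict limits
relaxed = feasible_pairs(supply, demand, relax_celsius=5.0)
assert relaxed >= baseline    # relaxing limits can only unlock matches
```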
Validation Layer:
- Device-signed flows: source = device:modbus:123 (trusted, labeled as "verified")
- Declared flows: Manual entry (labeled as "estimated," lower matching priority)
- Versioned resources: Time-stamped ResourceFlow versions for temporal analysis and delta matching
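A device-signed flow could be verified with a shared-secret HMAC; the document only specifies a device_signature field, so the signing scheme below is an assumption for illustration:

```python
# Minimal device-signature sketch using HMAC-SHA256. The schema above
# only specifies a device_signature column; this shared-secret scheme
# is one possible implementation, not the platform's specified one.
import hashlib
import hmac
import json

def sign_reading(reading: dict, device_secret: bytes) -> str:
    # Canonical JSON so both sides serialize the payload identically.
    payload = json.dumps(reading, sort_keys=True).encode()
    return hmac.new(device_secret, payload, hashlib.sha256).hexdigest()

def verify_reading(reading: dict, signature: str, device_secret: bytes) -> bool:
    # Constant-time comparison to avoid timing side channels.
    return hmac.compare_digest(sign_reading(reading, device_secret), signature)

secret = b"device-modbus-123-secret"   # provisioned per device (invented)
reading = {"temperature_celsius": 85.2, "timestamp": "2025-01-15T06:00:00Z"}
sig = sign_reading(reading, secret)
assert verify_reading(reading, sig, secret)         # verified -> "measured"
assert not verify_reading(reading, sig, b"wrong")   # forged source rejected
```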