docs(phase12): add metering service milestone
Complete Phase 12 Economics & Billing planning with detailed metering service design:
- UsageCollector for real-time metric capture from all compute nodes
- UsageAggregator for time-windowed billing aggregation
- QuotaManager for usage limits and rate limiting
- UsageAnalytics for cost optimization insights
- Comprehensive data flow and implementation plan
parent 3c9470abba
commit e20e5cb11f
1 changed file with 355 additions and 0 deletions

# Milestone 2: Metering Service

> Track and aggregate resource usage across all Synor L2 services for accurate billing

## Overview

The Metering Service captures real-time resource consumption across Synor Compute's heterogeneous infrastructure (CPU, GPU, TPU, NPU, LPU, FPGA, WebGPU, WASM). It provides high-fidelity usage data for billing, analytics, and cost optimization.

## Components

### 1. UsageCollector

**Purpose**: Capture granular usage metrics from compute nodes and services.

**Metrics Tracked**:
- **Compute Time**: CPU/GPU/TPU/NPU/LPU/FPGA/WebGPU/WASM seconds
- **Memory**: Peak and average RAM/VRAM usage (GB·hours)
- **Storage**: Read/write IOPS, bandwidth (GB transferred)
- **Network**: Ingress/egress bandwidth (GB)
- **Tensor Operations**: FLOPs (total floating-point operations)
- **Model Inference**: Requests, tokens processed, latency percentiles
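
As a quick sanity check on the GB·hours unit above, capacity is multiplied by hold time; the numbers here are made up for illustration:

```rust
fn main() {
    // 16 GB of VRAM held for 90 minutes → 24 GB·hours of billable memory.
    let gb = 16.0_f64;
    let hours = 1.5_f64;
    let gb_hours = gb * hours;
    println!("{gb_hours} GB·hours"); // prints "24 GB·hours"
}
```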

**Implementation**:
```rust
pub struct UsageCollector {
    metrics_buffer: RingBuffer<UsageMetric>,
    flush_interval: Duration,
    redis_client: RedisClient,
}

pub struct UsageMetric {
    pub timestamp: i64,
    pub user_id: String,
    pub resource_type: ResourceType, // CPU, GPU, TPU, etc.
    pub operation: Operation,        // TensorOp, MatMul, Inference, etc.
    pub quantity: f64,               // Compute seconds, FLOPs, tokens, etc.
    pub metadata: HashMap<String, String>,
}

pub enum ResourceType {
    CpuCompute,
    GpuCompute(GpuType), // CUDA, ROCm, Metal, WebGPU
    TpuCompute,
    NpuCompute,
    LpuCompute,
    FpgaCompute,
    WasmCompute,
    Memory,
    Storage,
    Network,
}

impl UsageCollector {
    pub async fn record(&mut self, metric: UsageMetric) -> Result<()> {
        self.metrics_buffer.push(metric);

        if self.should_flush() {
            self.flush_to_redis().await?;
        }

        Ok(())
    }

    async fn flush_to_redis(&mut self) -> Result<()> {
        let batch = self.metrics_buffer.drain();

        for metric in batch {
            // One sorted-set entry per user, scored by timestamp so
            // aggregators can range-query metrics by time.
            let key = format!("usage:{}:{}", metric.user_id, metric.timestamp);
            self.redis_client.zadd(key, metric.timestamp, &metric).await?;
        }

        Ok(())
    }
}
```

### 2. UsageAggregator

**Purpose**: Aggregate raw metrics into billable line items.

**Aggregation Windows**:
- **Real-time**: 1-minute windows for live dashboards
- **Hourly**: For usage alerts and rate limiting
- **Daily**: For invoice generation
- **Monthly**: For billing cycles
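
Each tier maps a raw timestamp to the start of its window by truncation; a minimal sketch (the `window_start` helper is illustrative, not part of the service API):

```rust
// Truncate a Unix timestamp (seconds) to the start of its aggregation window.
fn window_start(timestamp: i64, window_secs: i64) -> i64 {
    timestamp - timestamp.rem_euclid(window_secs)
}

fn main() {
    let ts: i64 = 1_700_000_123; // arbitrary Unix timestamp
    let minute = window_start(ts, 60);     // real-time tier
    let hour = window_start(ts, 3_600);    // hourly tier
    let day = window_start(ts, 86_400);    // daily tier
    println!("{minute} {hour} {day}");
}
```

All metrics falling into the same window share the same `window_start`, which becomes the `period_start` of the aggregate.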

**Implementation**:
```rust
pub struct UsageAggregator {
    window_size: Duration,
    pricing_oracle: Arc<PricingOracle>,
}

pub struct AggregatedUsage {
    pub user_id: String,
    pub period_start: i64,
    pub period_end: i64,
    pub resource_usage: HashMap<ResourceType, ResourceUsage>,
    pub total_cost_usd: f64,
    pub total_cost_synor: f64,
}

#[derive(Default)]
pub struct ResourceUsage {
    pub quantity: f64,
    pub unit: String,  // "compute-seconds", "GB", "TFLOPS", etc.
    pub rate_usd: f64, // USD per unit
    pub cost_usd: f64, // quantity * rate_usd
}

impl UsageAggregator {
    pub async fn aggregate_period(
        &self,
        user_id: &str,
        start: i64,
        end: i64,
    ) -> Result<AggregatedUsage> {
        let raw_metrics = self.fetch_metrics(user_id, start, end).await?;
        let mut resource_usage = HashMap::new();

        // Sum raw quantities per resource type
        for metric in raw_metrics {
            let entry = resource_usage
                .entry(metric.resource_type)
                .or_insert_with(ResourceUsage::default);

            entry.quantity += metric.quantity;
        }

        // Apply pricing
        let mut total_cost_usd = 0.0;
        for (resource_type, usage) in resource_usage.iter_mut() {
            usage.rate_usd = self.pricing_oracle.get_rate(resource_type).await?;
            usage.cost_usd = usage.quantity * usage.rate_usd;
            total_cost_usd += usage.cost_usd;
        }

        // Convert the USD total into SYNOR at the oracle-reported spot price
        let synor_price = self.pricing_oracle.get_synor_price_usd().await?;
        let total_cost_synor = total_cost_usd / synor_price;

        Ok(AggregatedUsage {
            user_id: user_id.to_string(),
            period_start: start,
            period_end: end,
            resource_usage,
            total_cost_usd,
            total_cost_synor,
        })
    }
}
```
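
To make the pricing step concrete, here is a worked example with entirely hypothetical rates (real rates come from the PricingOracle):

```rust
fn main() {
    // quantity * rate_usd per resource, summed:
    let gpu_seconds = 7_200.0_f64; // 2 GPU-hours
    let gpu_rate_usd = 0.000_5;    // hypothetical USD per GPU-second
    let egress_gb = 50.0;
    let egress_rate_usd = 0.05;    // hypothetical USD per GB

    let total_cost_usd = gpu_seconds * gpu_rate_usd + egress_gb * egress_rate_usd;

    // Convert to SYNOR at the oracle-reported token price:
    let synor_price_usd = 0.25;    // hypothetical USD per SYNOR
    let total_cost_synor = total_cost_usd / synor_price_usd;

    println!("${total_cost_usd:.2} = {total_cost_synor} SYNOR"); // $6.10 = 24.4 SYNOR
}
```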

### 3. QuotaManager

**Purpose**: Enforce usage limits and prevent abuse.

**Features**:
- Per-user compute quotas (e.g., 100 GPU-hours/month)
- Rate limiting (e.g., 1000 requests/minute)
- Burst allowances with token bucket algorithm
- Soft limits (warnings) vs. hard limits (throttling)

**Implementation**:
```rust
pub struct QuotaManager {
    redis_client: RedisClient,
}

pub struct Quota {
    pub resource_type: ResourceType,
    pub limit: f64,
    pub period: Duration,
    pub current_usage: f64,
    pub reset_at: i64,
}

impl QuotaManager {
    pub async fn check_quota(
        &self,
        user_id: &str,
        resource_type: ResourceType,
        requested_amount: f64,
    ) -> Result<QuotaStatus> {
        let quota = self.get_quota(user_id, &resource_type).await?;

        if quota.current_usage + requested_amount > quota.limit {
            return Ok(QuotaStatus::Exceeded {
                limit: quota.limit,
                current: quota.current_usage,
                requested: requested_amount,
                reset_at: quota.reset_at,
            });
        }

        Ok(QuotaStatus::Available)
    }

    pub async fn consume_quota(
        &mut self,
        user_id: &str,
        resource_type: ResourceType,
        amount: f64,
    ) -> Result<()> {
        // Float increment so fractional usage accumulates atomically in Redis
        let key = format!("quota:{}:{:?}", user_id, resource_type);
        self.redis_client.incr_by(key, amount).await?;
        Ok(())
    }
}
```
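
The token-bucket burst allowance listed in the features can be sketched in a few lines. This in-memory version is illustrative only; the real QuotaManager would keep bucket state in Redis:

```rust
use std::time::Instant;

// Minimal token bucket: bursts up to `capacity`, refills at `refill_rate`/sec.
struct TokenBucket {
    capacity: f64,
    tokens: f64,
    refill_rate: f64,
    last_refill: Instant,
}

impl TokenBucket {
    fn new(capacity: f64, refill_rate: f64) -> Self {
        Self { capacity, tokens: capacity, refill_rate, last_refill: Instant::now() }
    }

    fn try_consume(&mut self, amount: f64) -> bool {
        // Refill proportionally to elapsed time, capped at capacity.
        let now = Instant::now();
        let elapsed = now.duration_since(self.last_refill).as_secs_f64();
        self.tokens = (self.tokens + elapsed * self.refill_rate).min(self.capacity);
        self.last_refill = now;

        if self.tokens >= amount {
            self.tokens -= amount;
            true
        } else {
            false
        }
    }
}

fn main() {
    // 10-request burst, 1 request/second sustained.
    let mut bucket = TokenBucket::new(10.0, 1.0);
    let allowed = (0..15).filter(|_| bucket.try_consume(1.0)).count();
    println!("allowed {allowed} of 15"); // 10 pass immediately; the rest are throttled
}
```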

### 4. UsageAnalytics

**Purpose**: Provide insights for cost optimization and capacity planning.

**Dashboards**:
- **User Dashboard**: Real-time usage, cost trends, quota status
- **Admin Dashboard**: Top consumers, resource utilization, anomaly detection
- **Forecast Dashboard**: Projected costs, growth trends

**Metrics**:
```rust
pub struct UsageAnalytics {
    timeseries_db: TimeseriesDB,
}

pub struct CostTrend {
    pub timestamps: Vec<i64>,
    pub costs: Vec<f64>,
    pub resource_breakdown: HashMap<ResourceType, Vec<f64>>,
}

impl UsageAnalytics {
    pub async fn get_cost_trend(
        &self,
        user_id: &str,
        window: Duration,
    ) -> Result<CostTrend> {
        // Query timeseries DB for aggregated usage
        // Generate cost trends and resource breakdowns
        todo!()
    }

    pub async fn detect_anomalies(
        &self,
        user_id: &str,
    ) -> Result<Vec<UsageAnomaly>> {
        // Statistical anomaly detection (z-score, IQR)
        // Notify on sudden spikes or unusual patterns
        todo!()
    }
}
```
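
The `detect_anomalies` stub could start from a plain z-score filter over a usage series; a self-contained sketch (sample data and threshold are illustrative):

```rust
// Flag indices whose value lies more than `threshold` standard deviations
// from the mean of the series.
fn z_score_anomalies(values: &[f64], threshold: f64) -> Vec<usize> {
    let n = values.len() as f64;
    let mean = values.iter().sum::<f64>() / n;
    let variance = values.iter().map(|v| (v - mean).powi(2)).sum::<f64>() / n;
    let std_dev = variance.sqrt();

    values
        .iter()
        .enumerate()
        .filter(|(_, v)| std_dev > 0.0 && ((*v - mean) / std_dev).abs() > threshold)
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    // Hourly GPU-seconds with one obvious spike at index 4.
    let usage = [100.0, 110.0, 95.0, 105.0, 900.0, 98.0];
    println!("{:?}", z_score_anomalies(&usage, 2.0)); // [4]
}
```

An IQR-based variant would be more robust to multiple simultaneous spikes, since extreme values inflate the standard deviation used here.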

## Data Flow

```
┌──────────────────┐
│ Compute Nodes    │
│ (CPU/GPU/TPU...) │
└────────┬─────────┘
         │ emit metrics
         ↓
┌──────────────────┐
│ UsageCollector   │
│ (batch buffer)   │
└────────┬─────────┘
         │ flush every 1s
         ↓
┌──────────────────┐
│ Redis Stream     │
│ (raw metrics)    │
└────────┬─────────┘
         │ consume
         ↓
┌──────────────────┐
│ UsageAggregator  │
│ (time windows)   │
└────────┬─────────┘
         │ write
         ↓
┌──────────────────┐
│ Timeseries DB    │
│ (InfluxDB /      │
│ VictoriaMetrics) │
└────────┬─────────┘
         │ query
         ↓
┌──────────────────┐
│ BillingEngine    │
│ (invoice gen)    │
└──────────────────┘
```

## Implementation Plan

### Phase 1: Core Metrics Collection (Weeks 1-2)
- [ ] Implement UsageCollector in Rust
- [ ] Integrate with Synor Compute worker nodes
- [ ] Set up Redis Streams for metric buffering
- [ ] Add metric collection to all compute operations

### Phase 2: Aggregation & Storage (Weeks 3-4)
- [ ] Implement UsageAggregator with hourly/daily windows
- [ ] Deploy InfluxDB or VictoriaMetrics for timeseries storage
- [ ] Create aggregation jobs (cron or stream processors)
- [ ] Build query APIs for usage data

### Phase 3: Quota Management (Weeks 5-6)
- [ ] Implement QuotaManager with Redis-backed quotas
- [ ] Add quota checks to the orchestrator before job dispatch
- [ ] Implement rate limiting with the token bucket algorithm
- [ ] Create admin UI for setting user quotas

### Phase 4: Analytics & Dashboards (Weeks 7-8)
- [ ] Build UsageAnalytics module
- [ ] Create user dashboard (Next.js + Chart.js)
- [ ] Add cost trend visualization
- [ ] Implement anomaly detection alerts

## Testing Strategy

### Unit Tests
- Metric recording and buffering
- Aggregation window calculations
- Quota enforcement logic
- Anomaly detection algorithms

### Integration Tests
- End-to-end metric flow (collector → Redis → aggregator → DB)
- Quota limits preventing over-consumption
- Pricing oracle integration for cost calculations

### Load Tests
- 10,000 metrics/second ingestion
- Aggregation performance with 1M+ metrics
- Query latency for large time windows

## Success Metrics

- **Accuracy**: <1% discrepancy between raw metrics and billable amounts
- **Latency**: <100 ms p99 for metric recording
- **Throughput**: 100,000 metrics/second ingestion
- **Retention**: 1 year of historical data for analytics
- **Uptime**: 99.9% availability for the metering service

## Security Considerations

- **Data Integrity**: HMAC signatures on metrics to prevent tampering
- **Access Control**: Users can only query their own usage data
- **Audit Logs**: Track all quota changes and metric adjustments
- **Rate Limiting**: Prevent abuse of analytics APIs

## Cost Optimization

- **Batch Processing**: Group metrics into 1-second batches to reduce Redis ops
- **Compression**: Use columnar storage for timeseries data
- **TTL Policies**: Auto-expire raw metrics after 7 days (keep aggregated data)
- **Caching**: Cache quota values for 60 seconds to reduce Redis load