docs(phase12): add metering service milestone
Complete Phase 12 Economics & Billing planning with detailed metering service design:
- UsageCollector for real-time metric capture from all compute nodes
- UsageAggregator for time-windowed billing aggregation
- QuotaManager for usage limits and rate limiting
- UsageAnalytics for cost optimization insights
- Comprehensive data flow and implementation plan
parent 3c9470abba
commit e20e5cb11f
1 changed file with 355 additions and 0 deletions

# Milestone 2: Metering Service

> Track and aggregate resource usage across all Synor L2 services for accurate billing

## Overview

The Metering Service captures real-time resource consumption across Synor Compute's heterogeneous infrastructure (CPU, GPU, TPU, NPU, LPU, FPGA, WebGPU, WASM). It provides high-fidelity usage data for billing, analytics, and cost optimization.

## Components

### 1. UsageCollector

**Purpose**: Capture granular usage metrics from compute nodes and services.

**Metrics Tracked**:
- **Compute Time**: CPU/GPU/TPU/NPU/LPU/FPGA/WebGPU/WASM seconds
- **Memory**: Peak and average RAM/VRAM usage (GB·hours)
- **Storage**: Read/write IOPS, bandwidth (GB transferred)
- **Network**: Ingress/egress bandwidth (GB)
- **Tensor Operations**: FLOPs (total floating-point operations)
- **Model Inference**: Requests, tokens processed, latency percentiles
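
As a quick sanity check on the GB·hours unit above, capacity is multiplied by hold time; the numbers here are made up for illustration:

```rust
fn main() {
    // 16 GB of VRAM held for 90 minutes → 24 GB·hours of billable memory.
    let gb = 16.0_f64;
    let hours = 1.5_f64;
    let gb_hours = gb * hours;
    println!("{gb_hours} GB·hours"); // prints "24 GB·hours"
}
```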

**Implementation**:
```rust
pub struct UsageCollector {
    metrics_buffer: RingBuffer<UsageMetric>,
    flush_interval: Duration,
    redis_client: RedisClient,
}

pub struct UsageMetric {
    pub timestamp: i64,
    pub user_id: String,
    pub resource_type: ResourceType, // CPU, GPU, TPU, etc.
    pub operation: Operation,        // TensorOp, MatMul, Inference, etc.
    pub quantity: f64,               // Compute seconds, FLOPs, tokens, etc.
    pub metadata: HashMap<String, String>,
}

pub enum ResourceType {
    CpuCompute,
    GpuCompute(GpuType), // CUDA, ROCm, Metal, WebGPU
    TpuCompute,
    NpuCompute,
    LpuCompute,
    FpgaCompute,
    WasmCompute,
    Memory,
    Storage,
    Network,
}

impl UsageCollector {
    pub async fn record(&mut self, metric: UsageMetric) -> Result<()> {
        self.metrics_buffer.push(metric);

        if self.should_flush() {
            self.flush_to_redis().await?;
        }

        Ok(())
    }

    async fn flush_to_redis(&mut self) -> Result<()> {
        let batch = self.metrics_buffer.drain();

        for metric in batch {
            // One sorted-set entry per user, scored by timestamp so
            // aggregators can range-query metrics by time.
            let key = format!("usage:{}:{}", metric.user_id, metric.timestamp);
            self.redis_client.zadd(key, metric.timestamp, &metric).await?;
        }

        Ok(())
    }
}
```

### 2. UsageAggregator

**Purpose**: Aggregate raw metrics into billable line items.

**Aggregation Windows**:
- **Real-time**: 1-minute windows for live dashboards
- **Hourly**: For usage alerts and rate limiting
- **Daily**: For invoice generation
- **Monthly**: For billing cycles
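
Each tier maps a raw timestamp to the start of its window by truncation; a minimal sketch (the `window_start` helper is illustrative, not part of the service API):

```rust
// Truncate a Unix timestamp (seconds) to the start of its aggregation window.
fn window_start(timestamp: i64, window_secs: i64) -> i64 {
    timestamp - timestamp.rem_euclid(window_secs)
}

fn main() {
    let ts: i64 = 1_700_000_123; // arbitrary Unix timestamp
    let minute = window_start(ts, 60);     // real-time tier
    let hour = window_start(ts, 3_600);    // hourly tier
    let day = window_start(ts, 86_400);    // daily tier
    println!("{minute} {hour} {day}");
}
```

All metrics falling into the same window share the same `window_start`, which becomes the `period_start` of the aggregate.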

**Implementation**:
```rust
pub struct UsageAggregator {
    window_size: Duration,
    pricing_oracle: Arc<PricingOracle>,
}

pub struct AggregatedUsage {
    pub user_id: String,
    pub period_start: i64,
    pub period_end: i64,
    pub resource_usage: HashMap<ResourceType, ResourceUsage>,
    pub total_cost_usd: f64,
    pub total_cost_synor: f64,
}

#[derive(Default)]
pub struct ResourceUsage {
    pub quantity: f64,
    pub unit: String,  // "compute-seconds", "GB", "TFLOPS", etc.
    pub rate_usd: f64, // USD per unit
    pub cost_usd: f64, // quantity * rate_usd
}

impl UsageAggregator {
    pub async fn aggregate_period(
        &self,
        user_id: &str,
        start: i64,
        end: i64,
    ) -> Result<AggregatedUsage> {
        let raw_metrics = self.fetch_metrics(user_id, start, end).await?;
        let mut resource_usage = HashMap::new();

        // Sum raw quantities per resource type
        for metric in raw_metrics {
            let entry = resource_usage
                .entry(metric.resource_type)
                .or_insert_with(ResourceUsage::default);

            entry.quantity += metric.quantity;
        }

        // Apply pricing
        let mut total_cost_usd = 0.0;
        for (resource_type, usage) in resource_usage.iter_mut() {
            usage.rate_usd = self.pricing_oracle.get_rate(resource_type).await?;
            usage.cost_usd = usage.quantity * usage.rate_usd;
            total_cost_usd += usage.cost_usd;
        }

        // Convert the USD total into SYNOR at the oracle-reported spot price
        let synor_price = self.pricing_oracle.get_synor_price_usd().await?;
        let total_cost_synor = total_cost_usd / synor_price;

        Ok(AggregatedUsage {
            user_id: user_id.to_string(),
            period_start: start,
            period_end: end,
            resource_usage,
            total_cost_usd,
            total_cost_synor,
        })
    }
}
```
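
To make the pricing step concrete, here is a worked example with entirely hypothetical rates (real rates come from the PricingOracle):

```rust
fn main() {
    // quantity * rate_usd per resource, summed:
    let gpu_seconds = 7_200.0_f64; // 2 GPU-hours
    let gpu_rate_usd = 0.000_5;    // hypothetical USD per GPU-second
    let egress_gb = 50.0;
    let egress_rate_usd = 0.05;    // hypothetical USD per GB

    let total_cost_usd = gpu_seconds * gpu_rate_usd + egress_gb * egress_rate_usd;

    // Convert to SYNOR at the oracle-reported token price:
    let synor_price_usd = 0.25;    // hypothetical USD per SYNOR
    let total_cost_synor = total_cost_usd / synor_price_usd;

    println!("${total_cost_usd:.2} = {total_cost_synor} SYNOR"); // $6.10 = 24.4 SYNOR
}
```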

### 3. QuotaManager

**Purpose**: Enforce usage limits and prevent abuse.

**Features**:
- Per-user compute quotas (e.g., 100 GPU-hours/month)
- Rate limiting (e.g., 1000 requests/minute)
- Burst allowances with token bucket algorithm
- Soft limits (warnings) vs. hard limits (throttling)

**Implementation**:
```rust
pub struct QuotaManager {
    redis_client: RedisClient,
}

pub struct Quota {
    pub resource_type: ResourceType,
    pub limit: f64,
    pub period: Duration,
    pub current_usage: f64,
    pub reset_at: i64,
}

impl QuotaManager {
    pub async fn check_quota(
        &self,
        user_id: &str,
        resource_type: ResourceType,
        requested_amount: f64,
    ) -> Result<QuotaStatus> {
        let quota = self.get_quota(user_id, &resource_type).await?;

        if quota.current_usage + requested_amount > quota.limit {
            return Ok(QuotaStatus::Exceeded {
                limit: quota.limit,
                current: quota.current_usage,
                requested: requested_amount,
                reset_at: quota.reset_at,
            });
        }

        Ok(QuotaStatus::Available)
    }

    pub async fn consume_quota(
        &mut self,
        user_id: &str,
        resource_type: ResourceType,
        amount: f64,
    ) -> Result<()> {
        // Float increment so fractional usage accumulates atomically in Redis
        let key = format!("quota:{}:{:?}", user_id, resource_type);
        self.redis_client.incr_by(key, amount).await?;
        Ok(())
    }
}
```
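
The token-bucket burst allowance listed in the features can be sketched in a few lines. This in-memory version is illustrative only; the real QuotaManager would keep bucket state in Redis:

```rust
use std::time::Instant;

// Minimal token bucket: bursts up to `capacity`, refills at `refill_rate`/sec.
struct TokenBucket {
    capacity: f64,
    tokens: f64,
    refill_rate: f64,
    last_refill: Instant,
}

impl TokenBucket {
    fn new(capacity: f64, refill_rate: f64) -> Self {
        Self { capacity, tokens: capacity, refill_rate, last_refill: Instant::now() }
    }

    fn try_consume(&mut self, amount: f64) -> bool {
        // Refill proportionally to elapsed time, capped at capacity.
        let now = Instant::now();
        let elapsed = now.duration_since(self.last_refill).as_secs_f64();
        self.tokens = (self.tokens + elapsed * self.refill_rate).min(self.capacity);
        self.last_refill = now;

        if self.tokens >= amount {
            self.tokens -= amount;
            true
        } else {
            false
        }
    }
}

fn main() {
    // 10-request burst, 1 request/second sustained.
    let mut bucket = TokenBucket::new(10.0, 1.0);
    let allowed = (0..15).filter(|_| bucket.try_consume(1.0)).count();
    println!("allowed {allowed} of 15"); // 10 pass immediately; the rest are throttled
}
```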

### 4. UsageAnalytics

**Purpose**: Provide insights for cost optimization and capacity planning.

**Dashboards**:
- **User Dashboard**: Real-time usage, cost trends, quota status
- **Admin Dashboard**: Top consumers, resource utilization, anomaly detection
- **Forecast Dashboard**: Projected costs, growth trends

**Metrics**:
```rust
pub struct UsageAnalytics {
    timeseries_db: TimeseriesDB,
}

pub struct CostTrend {
    pub timestamps: Vec<i64>,
    pub costs: Vec<f64>,
    pub resource_breakdown: HashMap<ResourceType, Vec<f64>>,
}

impl UsageAnalytics {
    pub async fn get_cost_trend(
        &self,
        user_id: &str,
        window: Duration,
    ) -> Result<CostTrend> {
        // Query timeseries DB for aggregated usage
        // Generate cost trends and resource breakdowns
        todo!()
    }

    pub async fn detect_anomalies(
        &self,
        user_id: &str,
    ) -> Result<Vec<UsageAnomaly>> {
        // Statistical anomaly detection (z-score, IQR)
        // Notify on sudden spikes or unusual patterns
        todo!()
    }
}
```
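
The `detect_anomalies` stub could start from a plain z-score filter over a usage series; a self-contained sketch (sample data and threshold are illustrative):

```rust
// Flag indices whose value lies more than `threshold` standard deviations
// from the mean of the series.
fn z_score_anomalies(values: &[f64], threshold: f64) -> Vec<usize> {
    let n = values.len() as f64;
    let mean = values.iter().sum::<f64>() / n;
    let variance = values.iter().map(|v| (v - mean).powi(2)).sum::<f64>() / n;
    let std_dev = variance.sqrt();

    values
        .iter()
        .enumerate()
        .filter(|(_, v)| std_dev > 0.0 && ((*v - mean) / std_dev).abs() > threshold)
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    // Hourly GPU-seconds with one obvious spike at index 4.
    let usage = [100.0, 110.0, 95.0, 105.0, 900.0, 98.0];
    println!("{:?}", z_score_anomalies(&usage, 2.0)); // [4]
}
```

An IQR-based variant would be more robust to multiple simultaneous spikes, since extreme values inflate the standard deviation used here.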

## Data Flow

```
┌──────────────────┐
│ Compute Nodes    │
│ (CPU/GPU/TPU...) │
└────────┬─────────┘
         │ emit metrics
         ↓
┌──────────────────┐
│ UsageCollector   │
│ (batch buffer)   │
└────────┬─────────┘
         │ flush every 1s
         ↓
┌──────────────────┐
│ Redis Stream     │
│ (raw metrics)    │
└────────┬─────────┘
         │ consume
         ↓
┌──────────────────┐
│ UsageAggregator  │
│ (time windows)   │
└────────┬─────────┘
         │ write
         ↓
┌──────────────────┐
│ Timeseries DB    │
│ (InfluxDB /      │
│ VictoriaMetrics) │
└────────┬─────────┘
         │ query
         ↓
┌──────────────────┐
│ BillingEngine    │
│ (invoice gen)    │
└──────────────────┘
```

## Implementation Plan

### Phase 1: Core Metrics Collection (Weeks 1-2)
- [ ] Implement UsageCollector in Rust
- [ ] Integrate with Synor Compute worker nodes
- [ ] Set up Redis Streams for metric buffering
- [ ] Add metric collection to all compute operations

### Phase 2: Aggregation & Storage (Weeks 3-4)
- [ ] Implement UsageAggregator with hourly/daily windows
- [ ] Deploy InfluxDB or VictoriaMetrics for timeseries storage
- [ ] Create aggregation jobs (cron or stream processors)
- [ ] Build query APIs for usage data

### Phase 3: Quota Management (Weeks 5-6)
- [ ] Implement QuotaManager with Redis-backed quotas
- [ ] Add quota checks to the orchestrator before job dispatch
- [ ] Implement rate limiting with the token bucket algorithm
- [ ] Create admin UI for setting user quotas

### Phase 4: Analytics & Dashboards (Weeks 7-8)
- [ ] Build UsageAnalytics module
- [ ] Create user dashboard (Next.js + Chart.js)
- [ ] Add cost trend visualization
- [ ] Implement anomaly detection alerts

## Testing Strategy

### Unit Tests
- Metric recording and buffering
- Aggregation window calculations
- Quota enforcement logic
- Anomaly detection algorithms

### Integration Tests
- End-to-end metric flow (collector → Redis → aggregator → DB)
- Quota limits preventing over-consumption
- Pricing oracle integration for cost calculations

### Load Tests
- 10,000 metrics/second ingestion
- Aggregation performance with 1M+ metrics
- Query latency for large time windows

## Success Metrics

- **Accuracy**: <1% discrepancy between raw metrics and billable amounts
- **Latency**: <100 ms p99 for metric recording
- **Throughput**: 100,000 metrics/second ingestion
- **Retention**: 1 year of historical data for analytics
- **Uptime**: 99.9% availability for the metering service

## Security Considerations

- **Data Integrity**: HMAC signatures on metrics to prevent tampering
- **Access Control**: Users can only query their own usage data
- **Audit Logs**: Track all quota changes and metric adjustments
- **Rate Limiting**: Prevent abuse of analytics APIs

## Cost Optimization

- **Batch Processing**: Group metrics into 1-second batches to reduce Redis ops
- **Compression**: Use columnar storage for timeseries data
- **TTL Policies**: Auto-expire raw metrics after 7 days (keep aggregated data)
- **Caching**: Cache quota values for 60 seconds to reduce Redis load