synor/docs/PLAN/PHASE11-Synor-Compute-L2-Part3-HeterogeneousCompute.md
Gulshan Yadav 4c36ddbdc2 feat(compute): add Phase 11 Synor Compute L2 heterogeneous compute layer
- Add synor-compute crate for heterogeneous compute orchestration
- Implement processor abstraction for CPU/GPU/TPU/NPU/LPU/FPGA/DSP
- Add device registry with cross-vendor capability tracking
- Implement task scheduler with work stealing and load balancing
- Add energy-aware and latency-aware balancing strategies
- Create spot market for compute resources with order matching
- Add memory manager with tensor handles and cross-device transfers
- Support processor capability profiles (H100, TPU v5p, Groq LPU, etc.)
- Implement priority work queues with task decomposition

Processor types supported:
- CPU (x86-64 AVX512, ARM64 SVE, RISC-V Vector)
- GPU (NVIDIA CUDA, AMD ROCm, Intel OneAPI, Apple Metal)
- TPU (v2-v5p, Edge TPU)
- NPU (Apple Neural Engine, Qualcomm Hexagon, Intel VPU)
- LPU (Groq Language Processing Unit)
- FPGA (Xilinx, Intel Altera)
- DSP (TI, Analog Devices)
- WebGPU and WASM runtimes
2026-01-11 13:53:57 +05:30


# Phase 11 Part 3: Heterogeneous Multi-Processor Compute
> **Goal**: Utilize ALL processor types simultaneously (CPU+GPU+TPU+NPU+LPU+Custom) with intelligent task scheduling to achieve maximum throughput and zero idle processors.
---
## Executive Summary
Modern compute workloads can be decomposed into subtasks, each best suited to a different processor type:
| Processor | Optimal For | Examples |
|-----------|-------------|----------|
| **CPU** | Sequential logic, control flow, I/O | Data loading, preprocessing, orchestration |
| **GPU** | Parallel matrix operations | Neural network layers, convolutions |
| **TPU** | Tensor operations, ML inference | Transformer attention, matrix multiply |
| **NPU** | Low-power inference | Edge inference, mobile AI |
| **LPU** | Sequential inference (Groq) | LLM token generation |
| **FPGA** | Custom bit-level operations | Cryptography, specialized kernels |
| **DSP** | Signal processing | Audio, video, sensor data |
**Key Insight**: A single AI training job contains ALL these subtask types. By routing each subtask to the optimal processor, we target a **2-5x speedup** over GPU-only execution.
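The routing arithmetic behind that target can be sketched in a few lines: given per-device cost estimates for each subtask (the numbers below are illustrative, not benchmarks), greedy per-op routing versus a single-device baseline yields the speedup directly.

```rust
/// Total time with every op on `baseline`, divided by total time when each op
/// runs on its cheapest device. `ops` holds (device, cost) estimates per op.
pub fn speedup(ops: &[Vec<(&str, f64)>], baseline: &str) -> f64 {
    let baseline_total: f64 = ops.iter()
        .map(|costs| costs.iter().find(|(d, _)| *d == baseline).expect("baseline missing").1)
        .sum();
    // Greedy routing: each op goes to its cheapest device
    let routed_total: f64 = ops.iter()
        .map(|costs| costs.iter().map(|(_, c)| *c).fold(f64::INFINITY, f64::min))
        .sum();
    baseline_total / routed_total
}
```

A real scheduler must also account for transfer costs and dependencies (Part 2 below); this shows only the upper-bound arithmetic.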
---
## Architecture: Unified Heterogeneous Scheduler
```
┌─────────────────────────────────────────────────────────────────────────────┐
│ SYNOR HETEROGENEOUS COMPUTE ENGINE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ TASK DECOMPOSER │ │
│ │ Analyzes workload → Identifies subtasks → Maps to optimal processors │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ HETEROGENEOUS SCHEDULER │ │
│ │ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ │ │
│ │ │ CPU │ │ GPU │ │ TPU │ │ NPU │ │ LPU │ │ FPGA │ │ DSP │ │ │
│ │ │Queue │ │Queue │ │Queue │ │Queue │ │Queue │ │Queue │ │Queue │ │ │
│ │ └──┬───┘ └──┬───┘ └──┬───┘ └──┬───┘ └──┬───┘ └──┬───┘ └──┬───┘ │ │
│ └─────┼────────┼────────┼────────┼────────┼────────┼────────┼────────────┘ │
│ │ │ │ │ │ │ │ │
│ ▼ ▼ ▼ ▼ ▼ ▼ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ PROCESSOR FABRIC │ │
│ │ │ │
│ │ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ │ │
│ │ │ CPU │ │ GPU │ │ TPU │ │ NPU │ │ LPU │ │ │
│ │ │Cluster │ │Cluster │ │ Pods │ │ Array │ │ Rack │ │ │
│ │ │ │ │ │ │ │ │ │ │ │ │ │
│ │ │ ┌────┐ │ │ ┌────┐ │ │ ┌────┐ │ │ ┌────┐ │ │ ┌────┐ │ │ │
│ │ │ │Core│ │ │ │CUDA│ │ │ │MXU │ │ │ │ NPE│ │ │ │TSP │ │ │ │
│ │ │ │Core│ │ │ │CUDA│ │ │ │MXU │ │ │ │ NPE│ │ │ │TSP │ │ │ │
│ │ │ │Core│ │ │ │CUDA│ │ │ │MXU │ │ │ │ NPE│ │ │ │TSP │ │ │ │
│ │ │ └────┘ │ │ └────┘ │ │ └────┘ │ │ └────┘ │ │ └────┘ │ │ │
│ │ └────────┘ └────────┘ └────────┘ └────────┘ └────────┘ │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ UNIFIED MEMORY FABRIC │ │
│ │ Zero-copy data sharing │ Automatic placement │ Cache coherency │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
```
---
## Part 1: Processor Type Definitions
### 1.1 Unified Processor Abstraction
```rust
// synor-compute/src/heterogeneous/processor.rs
use std::time::Duration;
use async_trait::async_trait; // async fns in traits assume the `async-trait` crate

/// Unified abstraction for any processor type
#[async_trait]
pub trait Processor: Send + Sync {
    /// Processor type identifier
    fn processor_type(&self) -> ProcessorType;
    /// Get capabilities
    fn capabilities(&self) -> &ProcessorCapabilities;
    /// Check if processor can execute operation
    fn can_execute(&self, op: &Operation) -> bool;
    /// Estimate execution time for operation
    fn estimate_time(&self, op: &Operation) -> Duration;
    /// Estimate energy consumption for operation, in joules
    fn estimate_energy(&self, op: &Operation) -> f64;
    /// Execute operation
    async fn execute(&self, op: Operation) -> Result<OperationResult, ProcessorError>;
    /// Current utilization (0.0 - 1.0)
    fn utilization(&self) -> f64;
    /// Available memory in bytes
    fn available_memory(&self) -> u64;
}

/// All supported processor types
///
/// Note: `Custom` carries `String`s, so this enum is `Clone` but cannot be `Copy`.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub enum ProcessorType {
    /// Central Processing Unit
    Cpu(CpuVariant),
    /// Graphics Processing Unit
    Gpu(GpuVariant),
    /// Tensor Processing Unit (Google)
    Tpu(TpuVersion),
    /// Neural Processing Unit (various vendors)
    Npu(NpuVariant),
    /// Language Processing Unit (Groq)
    Lpu,
    /// Field Programmable Gate Array
    Fpga(FpgaVendor),
    /// Digital Signal Processor
    Dsp(DspVariant),
    /// Custom/Unknown Accelerator
    Custom { vendor: String, model: String },
}

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub enum CpuVariant {
    X86_64 { avx: AvxSupport },
    Arm64 { sve: bool },
    RiscV { vector: bool },
}

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub enum GpuVariant {
    NvidiaCuda { compute_capability: (u8, u8) },
    AmdRocm { gfx_version: u32 },
    IntelOneApi,
    AppleMetal,
    QualcommAdreno,
    ArmMali,
    WebGpu,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub enum TpuVersion {
    V2, V3, V4, V4i, V5e, V5p,
    EdgeTpu,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub enum NpuVariant {
    AppleNeuralEngine { cores: u32 },
    QualcommHexagon { version: u32 },
    IntelVpu,
    HuaweiAscend,
    GoogleEdgeTpu,
    /// Integer TOPS so the enum can stay `Eq + Hash` (f32 implements neither)
    Custom { tops: u32 },
}
```
### 1.2 Processor Capabilities
```rust
// synor-compute/src/heterogeneous/capabilities.rs
use std::collections::HashSet;

/// Detailed processor capabilities
#[derive(Clone, Debug)]
pub struct ProcessorCapabilities {
    /// Compute throughput
    pub compute: ComputeThroughput,
    /// Memory specs
    pub memory: MemorySpecs,
    /// Supported operations
    pub operations: HashSet<OperationType>,
    /// Supported data types
    pub data_types: HashSet<DataType>,
    /// Power characteristics
    pub power: PowerCharacteristics,
    /// Optimal workload characteristics
    pub optimal_for: Vec<WorkloadCharacteristic>,
}

#[derive(Clone, Debug)]
pub struct ComputeThroughput {
    /// FP64 TFLOPS
    pub fp64_tflops: f64,
    /// FP32 TFLOPS
    pub fp32_tflops: f64,
    /// FP16 TFLOPS
    pub fp16_tflops: f64,
    /// BF16 TFLOPS
    pub bf16_tflops: f64,
    /// INT8 TOPS
    pub int8_tops: f64,
    /// INT4 TOPS
    pub int4_tops: f64,
    /// Sparse operations multiplier
    pub sparsity_speedup: f64,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub enum OperationType {
    // Matrix operations
    MatMul,
    Conv2d,
    Conv3d,
    DepthwiseConv,
    BatchNorm,
    LayerNorm,
    // Attention operations
    SelfAttention,
    CrossAttention,
    FlashAttention,
    // Activation functions
    ReLU,
    GeLU,
    SiLU,
    Softmax,
    // Reduction operations
    Sum,
    Mean,
    Max,
    ArgMax,
    // Data movement
    Transpose,
    Reshape,
    Concat,
    Split,
    Gather,
    Scatter,
    // Special operations
    Embedding,
    RoPE, // Rotary Position Embedding
    KVCache,
    TopK,
    Sampling,
    // I/O operations
    DataLoad,
    DataPreprocess,
    Tokenization,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub enum WorkloadCharacteristic {
    /// High parallelism (GPU, TPU)
    HighlyParallel,
    /// Sequential dependencies (CPU, LPU)
    Sequential,
    /// Memory bandwidth bound (GPU)
    MemoryBound,
    /// Compute bound (TPU)
    ComputeBound,
    /// Low latency required (NPU, edge)
    LowLatency,
    /// Low power required (NPU, mobile)
    LowPower,
    /// Large batch sizes (GPU, TPU)
    LargeBatch,
    /// Small batch sizes (CPU, LPU)
    SmallBatch,
    /// Variable length sequences (LPU)
    VariableLength,
    /// Fixed tensor shapes (TPU)
    FixedShape,
}
```
### 1.3 Processor Profiles by Type
```rust
// synor-compute/src/heterogeneous/profiles.rs

/// Pre-defined processor profiles
///
/// Assumes `ProcessorCapabilities` and its sub-structs derive `Default` so the
/// `..Default::default()` spreads below compile.
pub struct ProcessorProfiles;

impl ProcessorProfiles {
    /// NVIDIA H100 SXM profile
    pub fn nvidia_h100() -> ProcessorCapabilities {
        ProcessorCapabilities {
            compute: ComputeThroughput {
                fp64_tflops: 67.0, // Tensor Core FP64; vector FP64 is 34
                fp32_tflops: 67.0,
                fp16_tflops: 1979.0, // With 2:4 structured sparsity
                bf16_tflops: 1979.0,
                int8_tops: 3958.0,
                int4_tops: 7916.0,
                sparsity_speedup: 2.0,
            },
            memory: MemorySpecs {
                capacity_gb: 80,
                bandwidth_gbps: 3350,
                type_: MemoryType::Hbm3,
            },
            operations: [
                OperationType::MatMul,
                OperationType::Conv2d,
                OperationType::SelfAttention,
                OperationType::FlashAttention,
                // ... all GPU operations
            ].into_iter().collect(),
            optimal_for: vec![
                WorkloadCharacteristic::HighlyParallel,
                WorkloadCharacteristic::LargeBatch,
                WorkloadCharacteristic::ComputeBound,
            ],
            ..Default::default()
        }
    }

    /// Google TPU v5p profile (per chip)
    pub fn google_tpu_v5p() -> ProcessorCapabilities {
        ProcessorCapabilities {
            compute: ComputeThroughput {
                bf16_tflops: 459.0,
                int8_tops: 918.0,
                ..Default::default()
            },
            memory: MemorySpecs {
                capacity_gb: 95,
                bandwidth_gbps: 2765,
                type_: MemoryType::Hbm2e,
            },
            optimal_for: vec![
                WorkloadCharacteristic::HighlyParallel,
                WorkloadCharacteristic::ComputeBound,
                WorkloadCharacteristic::FixedShape,
                WorkloadCharacteristic::LargeBatch,
            ],
            ..Default::default()
        }
    }

    /// Groq LPU profile
    pub fn groq_lpu() -> ProcessorCapabilities {
        ProcessorCapabilities {
            compute: ComputeThroughput {
                int8_tops: 750.0,
                ..Default::default()
            },
            memory: MemorySpecs {
                capacity_gb: 0, // ~230 MB of on-chip SRAM per chip (MB, not GB)
                bandwidth_gbps: 80_000, // 80 TB/s on-chip SRAM bandwidth
                type_: MemoryType::Sram,
            },
            optimal_for: vec![
                WorkloadCharacteristic::Sequential,
                WorkloadCharacteristic::SmallBatch,
                WorkloadCharacteristic::VariableLength,
                WorkloadCharacteristic::LowLatency,
            ],
            ..Default::default()
        }
    }

    /// Apple M3 Max Neural Engine profile
    pub fn apple_neural_engine_m3() -> ProcessorCapabilities {
        ProcessorCapabilities {
            compute: ComputeThroughput {
                int8_tops: 18.0,
                ..Default::default()
            },
            memory: MemorySpecs {
                capacity_gb: 0, // Uses unified memory
                bandwidth_gbps: 400,
                type_: MemoryType::Unified,
            },
            optimal_for: vec![
                WorkloadCharacteristic::LowPower,
                WorkloadCharacteristic::LowLatency,
                WorkloadCharacteristic::SmallBatch,
            ],
            ..Default::default()
        }
    }

    /// AMD EPYC 9654 CPU profile
    pub fn amd_epyc_9654() -> ProcessorCapabilities {
        ProcessorCapabilities {
            compute: ComputeThroughput {
                fp64_tflops: 5.4,
                fp32_tflops: 10.8,
                ..Default::default()
            },
            memory: MemorySpecs {
                capacity_gb: 6144, // 6 TB max
                bandwidth_gbps: 460,
                type_: MemoryType::Ddr5,
            },
            operations: [
                OperationType::DataLoad,
                OperationType::DataPreprocess,
                OperationType::Tokenization,
                // ... other sequential operations
            ].into_iter().collect(),
            optimal_for: vec![
                WorkloadCharacteristic::Sequential,
                WorkloadCharacteristic::MemoryBound,
            ],
            ..Default::default()
        }
    }
}
```
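A scheduler consumes these profiles by comparing the relevant throughput column for the operation at hand. A minimal sketch with a simplified stand-in struct (not the full `ProcessorCapabilities` above; peak numbers are illustrative):

```rust
/// Simplified stand-in for a capability profile.
struct Profile {
    name: &'static str,
    bf16_tflops: f64,
    int8_tops: f64,
}

/// Pick the device with the highest BF16 throughput for a compute-bound matmul.
fn best_for_bf16(profiles: &[Profile]) -> &Profile {
    profiles.iter()
        .max_by(|a, b| a.bf16_tflops.partial_cmp(&b.bf16_tflops).unwrap())
        .unwrap()
}
```

The same ranking over `int8_tops` would route quantized inference; the profile struct exists precisely so each operation type can sort devices by its own column.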
---
## Part 2: Task Decomposition Engine
### 2.1 Workload Analyzer
```rust
// synor-compute/src/heterogeneous/analyzer.rs
use std::collections::HashMap;

/// Analyzes workloads and decomposes them into optimal subtasks
pub struct WorkloadAnalyzer {
    /// Operation cost models for each processor type
    cost_models: HashMap<ProcessorType, OperationCostModel>,
    /// Dependency graph builder
    graph_builder: DependencyGraphBuilder,
    /// ML model for workload prediction
    predictor: WorkloadPredictor,
}

impl WorkloadAnalyzer {
    /// Analyze a computation graph and decompose it into subtasks
    pub async fn analyze(&self, graph: &ComputationGraph) -> WorkloadAnalysis {
        // 1. Build dependency graph
        let deps = self.graph_builder.build(graph);
        // 2. Identify operation types
        let operations = self.identify_operations(graph);
        // 3. Estimate costs for each processor type
        let cost_matrix = self.estimate_costs(&operations);
        // 4. Find optimal assignment
        let assignment = self.optimize_assignment(&deps, &cost_matrix);
        // 5. Create execution plan (compute the speedup before `assignment`
        //    is moved into the struct)
        WorkloadAnalysis {
            operations,
            dependencies: deps,
            cost_matrix,
            estimated_speedup: self.calculate_speedup(&assignment),
            optimal_assignment: assignment,
        }
    }

    /// Estimate operation costs across all processor types
    ///
    /// Note: iteration order over a HashMap is arbitrary; a real implementation
    /// would fix a processor ordering for the matrix columns.
    fn estimate_costs(&self, operations: &[Operation]) -> CostMatrix {
        let mut matrix = CostMatrix::new(operations.len(), self.cost_models.len());
        for (op_idx, op) in operations.iter().enumerate() {
            for (proc_idx, (_proc_type, model)) in self.cost_models.iter().enumerate() {
                let cost = if model.can_execute(op) {
                    model.estimate_cost(op)
                } else {
                    f64::INFINITY // Can't execute on this processor
                };
                matrix.set(op_idx, proc_idx, cost);
            }
        }
        matrix
    }

    /// Optimize task-to-processor assignment
    fn optimize_assignment(
        &self,
        deps: &DependencyGraph,
        costs: &CostMatrix,
    ) -> TaskAssignment {
        // Use ILP (Integer Linear Programming) or a heuristic to minimize
        // total execution time, considering:
        //   1. Operation costs on each processor
        //   2. Data transfer costs between processors
        //   3. Dependency constraints (ordering)
        //   4. Processor capacity constraints
        let solver = HeterogeneousSchedulingSolver::new();
        solver.solve(deps, costs)
    }
}

/// Cost matrix: operations × processor types
pub struct CostMatrix {
    /// Rows: operations, Cols: processor types
    data: Vec<Vec<f64>>,
    /// Data transfer costs between processor types (seconds per byte)
    transfer_costs: HashMap<(ProcessorType, ProcessorType), f64>,
}

impl CostMatrix {
    /// Get cost of operation on processor
    pub fn get(&self, op: usize, proc: usize) -> f64 {
        self.data[op][proc]
    }

    /// Get data transfer cost (seconds) between processors
    pub fn transfer_cost(&self, from: ProcessorType, to: ProcessorType, bytes: u64) -> f64 {
        if from == to {
            0.0 // Same processor type, no transfer
        } else {
            let per_byte = self.transfer_costs
                .get(&(from, to))
                .unwrap_or(&1e-9); // Default: 1 ns per byte ≈ 1 GB/s
            *per_byte * bytes as f64
        }
    }
}
```
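The `1e-9` default in `transfer_cost` amounts to roughly 1 GB/s. The same per-byte arithmetic, standalone (the link bandwidths in the usage below are illustrative assumptions, not measured values):

```rust
/// Seconds to move `bytes` over a link with the given bandwidth in GB/s.
fn transfer_seconds(bytes: u64, gb_per_s: f64) -> f64 {
    bytes as f64 / (gb_per_s * 1e9)
}
```

This is the term the assignment optimizer adds whenever a dependency edge crosses processor types; when it dominates the compute saving, the optimizer keeps both ops on the same device.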
### 2.2 AI Training Decomposition Example
```rust
// synor-compute/src/heterogeneous/training.rs

/// Decompose AI training into heterogeneous subtasks
pub struct TrainingDecomposer;

impl TrainingDecomposer {
    /// Decompose a training iteration into processor-specific tasks
    pub fn decompose_iteration(
        &self,
        model: &Model,
        batch: &Batch,
        available_processors: &[ProcessorInfo],
    ) -> DecomposedIteration {
        let mut tasks = Vec::new();
        let cpu = ProcessorType::Cpu(CpuVariant::X86_64 { avx: AvxSupport::Avx512 });
        let gpu = ProcessorType::Gpu(GpuVariant::NvidiaCuda {
            compute_capability: (9, 0), // H100
        });

        // ═══════════════════════════════════════════════════════════════
        // PHASE 1: DATA LOADING & PREPROCESSING → CPU
        // ═══════════════════════════════════════════════════════════════
        tasks.push(Task {
            id: TaskId::new(),
            operation: Operation::DataLoad {
                batch_ids: batch.ids.clone(),
                shuffle: true,
            },
            optimal_processor: cpu.clone(),
            fallback_processor: None,
            priority: TaskPriority::High,
            dependencies: vec![],
        });
        tasks.push(Task {
            id: TaskId::new(),
            operation: Operation::DataPreprocess {
                transforms: vec![
                    Transform::Normalize,
                    Transform::Augment,
                    Transform::ToTensor,
                ],
            },
            optimal_processor: cpu.clone(),
            fallback_processor: None,
            priority: TaskPriority::High,
            dependencies: vec![tasks[0].id],
        });

        // ═══════════════════════════════════════════════════════════════
        // PHASE 2: TOKENIZATION (for LLMs) → CPU or NPU
        // ═══════════════════════════════════════════════════════════════
        if model.model_type == ModelType::Llm {
            tasks.push(Task {
                id: TaskId::new(),
                operation: Operation::Tokenization {
                    vocab_size: model.vocab_size,
                    max_length: model.max_seq_len,
                },
                optimal_processor: cpu.clone(),
                fallback_processor: None,
                priority: TaskPriority::High,
                dependencies: vec![tasks[1].id],
            });
        }

        // ═══════════════════════════════════════════════════════════════
        // PHASE 3: EMBEDDING LOOKUP → GPU (memory bandwidth bound)
        // ═══════════════════════════════════════════════════════════════
        tasks.push(Task {
            id: TaskId::new(),
            operation: Operation::Embedding {
                vocab_size: model.vocab_size,
                embedding_dim: model.embedding_dim,
            },
            optimal_processor: gpu.clone(),
            fallback_processor: None,
            priority: TaskPriority::High,
            dependencies: vec![tasks.last().unwrap().id],
        });

        // ═══════════════════════════════════════════════════════════════
        // PHASE 4: TRANSFORMER LAYERS → TPU or GPU (compute bound)
        // ═══════════════════════════════════════════════════════════════
        let embedding_task_id = tasks.last().unwrap().id;
        for layer_idx in 0..model.num_layers {
            // Self-attention → TPU optimal (large matrix multiplies)
            tasks.push(Task {
                id: TaskId::new(),
                operation: Operation::SelfAttention {
                    layer: layer_idx,
                    num_heads: model.num_heads,
                    head_dim: model.head_dim,
                    use_flash: true,
                },
                optimal_processor: ProcessorType::Tpu(TpuVersion::V5p),
                fallback_processor: Some(gpu.clone()),
                priority: TaskPriority::Critical,
                dependencies: vec![
                    if layer_idx == 0 { embedding_task_id } else { tasks.last().unwrap().id }
                ],
            });
            // FFN (Feed-Forward Network) → GPU optimal
            tasks.push(Task {
                id: TaskId::new(),
                operation: Operation::FeedForward {
                    layer: layer_idx,
                    hidden_dim: model.ffn_dim,
                    activation: Activation::SiLU,
                },
                optimal_processor: gpu.clone(),
                fallback_processor: None,
                priority: TaskPriority::Critical,
                dependencies: vec![tasks.last().unwrap().id],
            });
        }

        // ═══════════════════════════════════════════════════════════════
        // PHASE 5: OUTPUT PROJECTION & LOSS → GPU
        // ═══════════════════════════════════════════════════════════════
        tasks.push(Task {
            id: TaskId::new(),
            operation: Operation::OutputProjection {
                vocab_size: model.vocab_size,
            },
            optimal_processor: gpu.clone(),
            fallback_processor: None,
            priority: TaskPriority::High,
            dependencies: vec![tasks.last().unwrap().id],
        });
        tasks.push(Task {
            id: TaskId::new(),
            operation: Operation::CrossEntropyLoss {},
            optimal_processor: gpu.clone(),
            fallback_processor: None,
            priority: TaskPriority::High,
            dependencies: vec![tasks.last().unwrap().id],
        });

        // ═══════════════════════════════════════════════════════════════
        // PHASE 6: BACKWARD PASS → Same processors as forward, reversed
        // ═══════════════════════════════════════════════════════════════
        let forward_tasks = tasks.clone();
        for task in forward_tasks.iter().rev() {
            if let Some(backward_op) = task.operation.backward() {
                tasks.push(Task {
                    id: TaskId::new(),
                    operation: backward_op,
                    optimal_processor: task.optimal_processor.clone(),
                    fallback_processor: task.fallback_processor.clone(),
                    priority: task.priority,
                    dependencies: vec![tasks.last().unwrap().id],
                });
            }
        }

        // ═══════════════════════════════════════════════════════════════
        // PHASE 7: GRADIENT AGGREGATION → CPU (network I/O) + GPU (compute)
        // ═══════════════════════════════════════════════════════════════
        tasks.push(Task {
            id: TaskId::new(),
            operation: Operation::AllReduce {
                algorithm: AllReduceAlgorithm::RingAllReduce,
            },
            optimal_processor: cpu.clone(),
            fallback_processor: None,
            priority: TaskPriority::Critical,
            dependencies: vec![tasks.last().unwrap().id],
        });

        // ═══════════════════════════════════════════════════════════════
        // PHASE 8: OPTIMIZER STEP → GPU
        // ═══════════════════════════════════════════════════════════════
        tasks.push(Task {
            id: TaskId::new(),
            operation: Operation::OptimizerStep {
                optimizer: OptimizerType::AdamW,
                learning_rate: 1e-4,
            },
            optimal_processor: gpu.clone(),
            fallback_processor: None,
            priority: TaskPriority::High,
            dependencies: vec![tasks.last().unwrap().id],
        });

        // ═══════════════════════════════════════════════════════════════
        // PHASE 9: CHECKPOINTING → CPU (I/O)
        // ═══════════════════════════════════════════════════════════════
        tasks.push(Task {
            id: TaskId::new(),
            operation: Operation::Checkpoint {
                async_: true,
            },
            optimal_processor: cpu,
            fallback_processor: None,
            priority: TaskPriority::Low,
            dependencies: vec![tasks.last().unwrap().id],
        });

        // Compute estimates before `tasks` is moved into the struct
        let estimated_time = self.estimate_total_time(&tasks);
        let processor_utilization = self.estimate_utilization(&tasks);
        DecomposedIteration {
            tasks,
            estimated_time,
            processor_utilization,
        }
    }
}
```
---
## Part 3: Heterogeneous Scheduler
### 3.1 Multi-Queue Scheduler
```rust
// synor-compute/src/heterogeneous/scheduler.rs

/// Scheduler that manages tasks across all processor types
pub struct HeterogeneousScheduler {
    /// Per-processor-type task queues
    queues: HashMap<ProcessorType, TaskQueue>,
    /// Available processors
    processors: Vec<Arc<dyn Processor>>,
    /// Task dependency tracker
    dependencies: DependencyTracker,
    /// Load balancer
    load_balancer: LoadBalancer,
    /// Data placement optimizer
    data_placement: DataPlacementOptimizer,
}

impl HeterogeneousScheduler {
    /// Schedule a decomposed workload
    pub async fn schedule(&self, workload: DecomposedWorkload) -> Result<ScheduleResult, Error> {
        // 1. Build execution graph
        let graph = self.dependencies.build_graph(&workload.tasks);
        // 2. Assign tasks to processors
        let assignment = self.assign_tasks(&workload.tasks, &graph).await?;
        // 3. Optimize data placement
        let data_plan = self.data_placement.optimize(&assignment).await?;
        // 4. Create execution schedule (estimate before `schedule` is moved)
        let schedule = self.create_schedule(&assignment, &data_plan, &graph)?;
        Ok(ScheduleResult {
            estimated_makespan: self.estimate_makespan(&schedule),
            processor_utilization: self.estimate_utilization(&schedule),
            schedule,
            data_plan,
        })
    }

    /// Assign tasks to optimal processors
    async fn assign_tasks(
        &self,
        tasks: &[Task],
        graph: &DependencyGraph,
    ) -> Result<TaskAssignment, Error> {
        let mut assignment = TaskAssignment::new();
        // Sort tasks by priority and dependencies (topological sort)
        let sorted_tasks = graph.topological_sort(tasks);
        for task in sorted_tasks {
            // Find the best processor for this task
            let best_processor = self.find_best_processor(&task).await?;
            // Let the load balancer override the choice if queues are uneven
            let final_processor = self.load_balancer
                .maybe_rebalance(&task, best_processor, &assignment)
                .await?;
            assignment.assign(task.id, final_processor);
        }
        Ok(assignment)
    }

    /// Find the best processor for a task
    async fn find_best_processor(&self, task: &Task) -> Result<ProcessorId, Error> {
        let mut best_score = f64::NEG_INFINITY;
        let mut best_processor = None;
        for processor in &self.processors {
            if !processor.can_execute(&task.operation) {
                continue;
            }
            // Score = 1 / ((execution_time + data_transfer_time) * load_factor)
            let exec_time = processor.estimate_time(&task.operation);
            let transfer_time = self.estimate_data_transfer_time(task, processor.as_ref());
            let total_time = exec_time + transfer_time;
            // Penalize processors that are already busy
            let load_factor = 1.0 + processor.utilization();
            let adjusted_time = total_time.as_secs_f64() * load_factor;
            let score = 1.0 / adjusted_time;
            if score > best_score {
                best_score = score;
                best_processor = Some(processor.id());
            }
        }
        best_processor.ok_or(Error::NoSuitableProcessor)
    }

    /// Execute the schedule
    pub async fn execute(&self, schedule: &Schedule) -> Result<ExecutionResult, Error> {
        let results = Arc::new(Mutex::new(HashMap::new()));
        let completed = Arc::new(AtomicUsize::new(0));
        // Create execution contexts for each processor
        let contexts: HashMap<ProcessorId, ExecutionContext> = self.processors
            .iter()
            .map(|p| (p.id(), ExecutionContext::new(p.clone())))
            .collect();
        // Execute tasks in schedule order
        for stage in &schedule.stages {
            // Execute all tasks in this stage in parallel
            let stage_handles: Vec<_> = stage.tasks
                .iter()
                .map(|task_id| {
                    let task = schedule.get_task(*task_id);
                    let processor_id = schedule.get_assignment(*task_id);
                    let context = contexts.get(&processor_id).unwrap().clone();
                    let results = results.clone();
                    let completed = completed.clone();
                    tokio::spawn(async move {
                        // Wait for dependencies
                        task.wait_for_dependencies(&results).await;
                        // Execute on the assigned processor
                        let result = context.execute(&task).await?;
                        // Store the result
                        results.lock().await.insert(task.id, result);
                        completed.fetch_add(1, Ordering::SeqCst);
                        Ok::<_, Error>(())
                    })
                })
                .collect();
            // Wait for all tasks in this stage to complete
            for handle in stage_handles {
                handle.await??;
            }
        }
        Ok(ExecutionResult {
            results: Arc::try_unwrap(results).unwrap().into_inner(),
            total_time: schedule.estimated_makespan,
            processor_utilization: self.measure_utilization(&contexts),
        })
    }
}
```
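The scoring rule in `find_best_processor` is worth seeing in isolation: penalizing a device by its current load can flip the choice toward a nominally slower but idle device. A standalone version with made-up numbers (the `Candidate` struct is hypothetical, not a type from the scheduler above):

```rust
/// A device candidate with estimated times (seconds) and current load.
struct Candidate {
    name: &'static str,
    exec_s: f64,
    transfer_s: f64,
    utilization: f64, // 0.0 (idle) .. 1.0 (saturated)
}

/// Same shape as the scheduler's score: 1 / ((exec + transfer) * (1 + load)).
fn score(c: &Candidate) -> f64 {
    1.0 / ((c.exec_s + c.transfer_s) * (1.0 + c.utilization))
}

/// Highest score wins.
fn best(candidates: &[Candidate]) -> &Candidate {
    candidates.iter()
        .max_by(|a, b| score(a).partial_cmp(&score(b)).unwrap())
        .unwrap()
}
```

With a busy GPU (load 0.9) against an idle TPU that is 40% slower end-to-end, the idle TPU wins; without the load factor the GPU would.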
### 3.2 Work Stealing for Load Balancing
```rust
// synor-compute/src/heterogeneous/work_stealing.rs

/// Work-stealing scheduler for load balancing
pub struct WorkStealingScheduler {
    /// Per-processor work queues (deques for work stealing)
    queues: HashMap<ProcessorId, WorkQueue>,
    /// Stealing policy
    policy: StealingPolicy,
}

impl WorkStealingScheduler {
    /// Try to steal work for an idle processor
    pub async fn try_steal(&self, idle_processor: ProcessorId) -> Option<Task> {
        let idle_type = self.get_processor_type(idle_processor);
        // Find the most loaded processor with compatible tasks
        let mut best_victim = None;
        let mut best_load = 0;
        for (proc_id, queue) in &self.queues {
            if *proc_id == idle_processor {
                continue;
            }
            // Check whether this queue has tasks compatible with the idle processor
            let compatible_count = queue.count_compatible(&idle_type);
            if compatible_count > best_load {
                best_load = compatible_count;
                best_victim = Some(*proc_id);
            }
        }
        // Steal from the most loaded compatible queue
        if let Some(victim) = best_victim {
            let victim_queue = self.queues.get(&victim)?;
            victim_queue.steal_compatible(&idle_type).await
        } else {
            None
        }
    }

    /// Rebalance when processor utilization is uneven
    pub async fn rebalance(&self) -> Vec<TaskMigration> {
        let mut migrations = Vec::new();
        // Calculate average utilization
        let total_util: f64 = self.queues.values().map(|q| q.utilization()).sum();
        let avg_util = total_util / self.queues.len() as f64;
        // Find overloaded and underloaded processors
        let mut overloaded: Vec<_> = self.queues.iter()
            .filter(|(_, q)| q.utilization() > avg_util * 1.2)
            .collect();
        let mut underloaded: Vec<_> = self.queues.iter()
            .filter(|(_, q)| q.utilization() < avg_util * 0.8)
            .collect();
        // Sort: most overloaded first, least loaded first
        overloaded.sort_by(|a, b| b.1.utilization().partial_cmp(&a.1.utilization()).unwrap());
        underloaded.sort_by(|a, b| a.1.utilization().partial_cmp(&b.1.utilization()).unwrap());
        // Migrate tasks from overloaded to underloaded processors
        for (over_id, over_queue) in overloaded {
            for (under_id, _under_queue) in &underloaded {
                if over_queue.utilization() <= avg_util {
                    break;
                }
                let under_type = self.get_processor_type(**under_id);
                // Find tasks that can be migrated
                if let Some(task) = over_queue.find_migratable(&under_type) {
                    migrations.push(TaskMigration {
                        task_id: task.id,
                        from: *over_id,
                        to: **under_id,
                    });
                }
            }
        }
        migrations
    }
}

/// Work queue backed by crossbeam's lock-free deques
pub struct WorkQueue {
    /// Global injector queue (FIFO, shared by all workers)
    deque: crossbeam_deque::Injector<Task>,
    /// Local LIFO queues, one per worker
    local: Vec<crossbeam_deque::Worker<Task>>,
    /// Stealers for the other workers' local queues
    stealers: Vec<crossbeam_deque::Stealer<Task>>,
    /// Current utilization
    utilization: AtomicU64,
}

impl WorkQueue {
    /// Push a task into the shared injector
    pub fn push(&self, task: Task) {
        self.deque.push(task);
    }

    /// Pop a task from the shared injector
    pub fn pop(&self) -> Option<Task> {
        self.deque.steal().success()
    }

    /// Steal a task compatible with the given processor type
    pub async fn steal_compatible(&self, processor_type: &ProcessorType) -> Option<Task> {
        // Bound the scan: an unbounded "push it back and retry" loop would spin
        // forever if the queue held only incompatible tasks.
        let limit = self.deque.len();
        let mut skipped = Vec::new();
        let stolen = loop {
            if skipped.len() >= limit {
                break None;
            }
            match self.deque.steal() {
                crossbeam_deque::Steal::Success(task) => {
                    if task.is_compatible_with(processor_type) {
                        break Some(task);
                    }
                    skipped.push(task); // set aside, keep scanning
                }
                crossbeam_deque::Steal::Empty => break None,
                crossbeam_deque::Steal::Retry => continue,
            }
        };
        // Return the incompatible tasks to the queue
        for task in skipped {
            self.deque.push(task);
        }
        stolen
    }
}
```
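The compatibility-filtered steal is the part that differs from classic homogeneous work stealing: a thief must skip tasks it cannot run. Stripped of the lock-free machinery, the idea fits in a few lines of std-only Rust (`MiniTask` and the `needs_gpu` flag are illustrative stand-ins for the real `Task` and capability check):

```rust
use std::collections::VecDeque;

#[derive(Debug, PartialEq)]
struct MiniTask {
    id: u32,
    needs_gpu: bool,
}

/// Remove the oldest task the idle processor can actually run,
/// leaving incompatible tasks in place.
fn steal_compatible(queue: &mut VecDeque<MiniTask>, idle_has_gpu: bool) -> Option<MiniTask> {
    let pos = queue.iter().position(|t| idle_has_gpu || !t.needs_gpu)?;
    queue.remove(pos)
}
```

Taking the oldest compatible task (front of the deque) keeps the newest, cache-warm work with its owner, which is the standard work-stealing heuristic.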
### 3.3 Pipeline Parallelism Across Processors
```rust
// synor-compute/src/heterogeneous/pipeline.rs
/// Pipeline parallelism across heterogeneous processors
pub struct HeterogeneousPipeline {
/// Pipeline stages
stages: Vec<PipelineStage>,
/// Inter-stage buffers
buffers: Vec<PipelineBuffer>,
/// Synchronization
sync: PipelineSync,
}
/// A stage in the pipeline assigned to a processor type
pub struct PipelineStage {
pub stage_id: usize,
pub operations: Vec<Operation>,
pub processor_type: ProcessorType,
pub processors: Vec<ProcessorId>, // Multiple processors for parallelism
}
impl HeterogeneousPipeline {
/// Create a pipeline for LLM inference
pub fn create_llm_pipeline(
model: &LlmModel,
available_processors: &ProcessorRegistry,
) -> Self {
let mut stages = Vec::new();
// Stage 1: Tokenization → CPU
stages.push(PipelineStage {
stage_id: 0,
operations: vec![Operation::Tokenization { .. }],
processor_type: ProcessorType::Cpu(CpuVariant::X86_64 { .. }),
processors: available_processors.get_type(ProcessorType::Cpu(..)),
});
// Stage 2: Embedding → GPU (memory bound)
stages.push(PipelineStage {
stage_id: 1,
operations: vec![Operation::Embedding { .. }],
processor_type: ProcessorType::Gpu(GpuVariant::NvidiaCuda { .. }),
processors: available_processors.get_type(ProcessorType::Gpu(..)),
});
// Stage 3: Transformer layers → TPU (if available) or GPU
let transformer_processor = if available_processors.has_tpu() {
ProcessorType::Tpu(TpuVersion::V5p)
} else {
ProcessorType::Gpu(GpuVariant::NvidiaCuda { compute_capability: (9, 0) })
};
stages.push(PipelineStage {
stage_id: 2,
operations: model.layers.iter().flat_map(|l| l.operations()).collect(),
processor_type: transformer_processor,
processors: available_processors.get_type(transformer_processor),
});
// Stage 4: Token generation → LPU (if available, best for sequential) or GPU
let generation_processor = if available_processors.has_lpu() {
ProcessorType::Lpu
} else {
ProcessorType::Gpu(GpuVariant::NvidiaCuda { compute_capability: (9, 0) })
};
stages.push(PipelineStage {
stage_id: 3,
operations: vec![
Operation::OutputProjection { .. },
Operation::Sampling { .. },
],
processor_type: generation_processor,
processors: available_processors.get_type(generation_processor),
});
// Stage 5: Detokenization → CPU
stages.push(PipelineStage {
stage_id: 4,
operations: vec![Operation::Detokenization { .. }],
processor_type: ProcessorType::Cpu(CpuVariant::X86_64 { .. }),
processors: available_processors.get_type(ProcessorType::Cpu(..)),
});
// Create inter-stage buffers
let buffers = (0..stages.len() - 1)
.map(|i| PipelineBuffer::new(
stages[i].processor_type,
stages[i + 1].processor_type,
))
.collect();
Self {
stages,
buffers,
sync: PipelineSync::new(),
}
}
/// Execute pipeline with micro-batching
pub async fn execute_stream(
&self,
input_stream: impl Stream<Item = Request>,
) -> impl Stream<Item = Response> {
let (tx, rx) = mpsc::channel(1024);
// Start pipeline stages
for (i, stage) in self.stages.iter().enumerate() {
let input_buffer = if i == 0 {
None
} else {
Some(self.buffers[i - 1].clone())
};
let output_buffer = if i == self.stages.len() - 1 {
None
} else {
Some(self.buffers[i].clone())
};
let stage = stage.clone();
let tx = tx.clone();
tokio::spawn(async move {
stage.run(input_buffer, output_buffer, tx).await;
});
}
// Feed input stream to first stage
let first_buffer = self.buffers[0].clone();
tokio::spawn(async move {
pin_mut!(input_stream);
while let Some(request) = input_stream.next().await {
first_buffer.push(request.into()).await;
}
});
ReceiverStream::new(rx)
}
}
/// Buffer between pipeline stages with automatic data transfer
pub struct PipelineBuffer {
/// Source processor type
source_type: ProcessorType,
/// Destination processor type
dest_type: ProcessorType,
/// Data queue
queue: Arc<ArrayQueue<PipelineData>>,
/// Transfer strategy
transfer: DataTransferStrategy,
}
impl PipelineBuffer {
/// Push data from source stage
pub async fn push(&self, data: PipelineData) {
// Transfer data if processors have different memory spaces
let transferred = if self.needs_transfer() {
self.transfer.transfer(&data, self.source_type, self.dest_type).await
} else {
data
};
        // ArrayQueue is bounded: back off instead of panicking when the
        // downstream stage falls behind (push hands the item back on Err)
        let mut item = transferred;
        while let Err(rejected) = self.queue.push(item) {
            item = rejected;
            tokio::task::yield_now().await;
        }
}
/// Pop data for destination stage
pub async fn pop(&self) -> Option<PipelineData> {
self.queue.pop()
}
fn needs_transfer(&self) -> bool {
!self.source_type.shares_memory_with(&self.dest_type)
}
}
```
---
## Part 4: Data Movement Optimization
### 4.1 Unified Memory Management
```rust
// synor-compute/src/heterogeneous/memory.rs
/// Unified memory manager across all processor types
pub struct UnifiedMemoryManager {
/// Memory allocators per processor type
allocators: HashMap<ProcessorType, Box<dyn MemoryAllocator>>,
/// Data location tracker
locations: DataLocationTracker,
/// Transfer scheduler
transfer_scheduler: TransferScheduler,
/// Prefetch predictor
prefetcher: PrefetchPredictor,
}
impl UnifiedMemoryManager {
/// Allocate tensor with optimal placement
pub async fn allocate_tensor(
&self,
shape: &[usize],
dtype: DataType,
hint: PlacementHint,
) -> Result<TensorHandle, Error> {
// Determine optimal initial placement
let location = match hint {
PlacementHint::Processor(proc_type) => proc_type,
PlacementHint::Operation(op) => self.optimal_location_for_op(&op),
PlacementHint::Auto => self.predict_optimal_location(shape, dtype),
};
// Allocate on chosen processor
        // HashMap::get returns an Option; surface a missing allocator as an
        // error (variant name illustrative)
        let allocator = self
            .allocators
            .get(&location)
            .ok_or(Error::NoAllocatorFor(location))?;
let ptr = allocator.allocate(shape.iter().product::<usize>() * dtype.size())?;
// Register with location tracker
let handle = TensorHandle::new(ptr, shape.to_vec(), dtype);
self.locations.register(&handle, location);
Ok(handle)
}
/// Ensure tensor is available on specified processor
pub async fn ensure_on(
&self,
tensor: &TensorHandle,
target: ProcessorType,
) -> Result<TensorView, Error> {
let current_location = self.locations.get(tensor)?;
if current_location == target {
// Already on target, return view
return Ok(TensorView::new(tensor, target));
}
// Check if already cached on target
if let Some(cached) = self.locations.get_cached(tensor, target) {
return Ok(cached);
}
// Need to transfer
let transfer = self.transfer_scheduler.schedule_transfer(
tensor,
current_location,
target,
).await?;
// Execute transfer
transfer.execute().await?;
// Register new location
self.locations.add_copy(tensor, target);
Ok(TensorView::new(tensor, target))
}
/// Prefetch data before it's needed
pub async fn prefetch(&self, tensor: &TensorHandle, target: ProcessorType) {
// Don't wait, just schedule the transfer
let _ = self.transfer_scheduler.schedule_transfer_async(
tensor,
self.locations.get(tensor).unwrap_or(ProcessorType::Cpu(Default::default())),
target,
).await;
}
}
/// Optimized data transfer between processors
pub struct TransferScheduler {
/// Direct transfer paths (e.g., NVLink, PCIe P2P)
direct_paths: HashMap<(ProcessorType, ProcessorType), TransferPath>,
/// Transfer queue
queue: TransferQueue,
}
impl TransferScheduler {
/// Schedule optimal transfer
pub async fn schedule_transfer(
&self,
tensor: &TensorHandle,
from: ProcessorType,
to: ProcessorType,
) -> Result<Transfer, Error> {
// Find optimal path
let path = self.find_optimal_path(from, to, tensor.size_bytes());
// Create transfer
let transfer = Transfer {
tensor: tensor.clone(),
path,
size: tensor.size_bytes(),
};
// Add to queue (batching similar transfers)
self.queue.enqueue(transfer.clone()).await;
Ok(transfer)
}
fn find_optimal_path(
&self,
from: ProcessorType,
to: ProcessorType,
size: usize,
) -> TransferPath {
// Check for direct path first
if let Some(direct) = self.direct_paths.get(&(from, to)) {
return direct.clone();
}
// Check for direct path in reverse (bidirectional)
if let Some(direct) = self.direct_paths.get(&(to, from)) {
return direct.clone();
}
// Fall back to CPU-mediated transfer
TransferPath::CpuMediated { from, to }
}
}
/// Available transfer paths
#[derive(Clone, Debug)]
pub enum TransferPath {
/// Direct GPU-to-GPU (NVLink, NVSwitch)
NvLink { bandwidth_gbps: u32 },
/// PCIe peer-to-peer
PciePeerToPeer { gen: u8, lanes: u8 },
/// Through CPU memory (slowest)
CpuMediated { from: ProcessorType, to: ProcessorType },
/// Unified memory (Apple, some AMD APUs)
UnifiedMemory,
/// Network transfer (for distributed)
Network { protocol: NetworkProtocol },
}
```
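The relative cost of these paths can be checked with a back-of-envelope latency model. The bandwidth figures below are illustrative assumptions (roughly NVLink 4 per-direction vs PCIe Gen4 x16), not measured values:

```rust
/// Rough transfer latency in microseconds for `size_bytes` over a path
/// sustaining `bandwidth_gb_s` gigabytes per second (ignores setup cost).
fn transfer_us(size_bytes: usize, bandwidth_gb_s: f64) -> f64 {
    size_bytes as f64 / (bandwidth_gb_s * 1e9) * 1e6
}

// For a 64 MiB activation tensor (assumed bandwidths):
//   transfer_us(64 * 1024 * 1024, 450.0)  ≈ 149 µs   (NVLink-class direct path)
//   transfer_us(64 * 1024 * 1024, 32.0)   ≈ 2097 µs  (PCIe P2P)
```

The order-of-magnitude gap for large tensors is why `find_optimal_path` checks `direct_paths` before falling back to a CPU-mediated copy.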
---
## Part 5: Example: Heterogeneous LLM Inference
### 5.1 Complete Example Flow
```rust
// synor-compute/src/examples/heterogeneous_llm.rs
/// Example: Running LLM inference across CPU + GPU + TPU + LPU
pub async fn run_heterogeneous_inference(
prompt: &str,
model: &LlmModel,
processors: &ProcessorRegistry,
) -> Result<String, Error> {
let scheduler = HeterogeneousScheduler::new(processors);
// ═══════════════════════════════════════════════════════════════
// STEP 1: TOKENIZATION (CPU)
// CPU is optimal for string processing and variable-length operations
// ═══════════════════════════════════════════════════════════════
let cpu = processors.get_best(ProcessorType::Cpu(..))?;
let tokens = cpu.execute(Operation::Tokenization {
text: prompt.to_string(),
vocab: model.vocab.clone(),
}).await?;
println!("✓ Tokenization complete on CPU: {} tokens", tokens.len());
// ═══════════════════════════════════════════════════════════════
// STEP 2: EMBEDDING LOOKUP (GPU)
// GPU is optimal for memory-bandwidth-bound operations
// ═══════════════════════════════════════════════════════════════
let gpu = processors.get_best(ProcessorType::Gpu(..))?;
let embeddings = gpu.execute(Operation::Embedding {
tokens: tokens.clone(),
embedding_table: model.embedding_table.clone(),
}).await?;
println!("✓ Embedding complete on GPU");
// ═══════════════════════════════════════════════════════════════
// STEP 3: PREFILL (PARALLEL ATTENTION) → TPU or GPU
// TPU excels at large matrix multiplications with fixed shapes
// ═══════════════════════════════════════════════════════════════
let prefill_processor = processors
.get_best(ProcessorType::Tpu(..))
.or_else(|_| processors.get_best(ProcessorType::Gpu(..)))?;
let mut hidden_states = embeddings;
for layer_idx in 0..model.num_layers {
hidden_states = prefill_processor.execute(Operation::TransformerLayer {
layer: layer_idx,
input: hidden_states,
attention_mask: None,
kv_cache: None, // No cache for prefill
}).await?;
}
println!("✓ Prefill complete on {:?}", prefill_processor.processor_type());
// ═══════════════════════════════════════════════════════════════
// STEP 4: DECODE (SEQUENTIAL TOKEN GENERATION) → LPU or GPU
// LPU excels at sequential, low-batch operations (autoregressive)
// ═══════════════════════════════════════════════════════════════
let decode_processor = processors
.get_best(ProcessorType::Lpu)
.or_else(|_| processors.get_best(ProcessorType::Gpu(..)))?;
let mut generated_tokens = Vec::new();
let mut kv_cache = KvCache::new();
for _ in 0..model.max_new_tokens {
// Run one decode step
let logits = decode_processor.execute(Operation::DecodeStep {
hidden_states: hidden_states.last_token(),
kv_cache: &mut kv_cache,
layers: &model.layers,
}).await?;
// Sample next token
let next_token = decode_processor.execute(Operation::Sampling {
logits,
temperature: 0.7,
top_p: 0.9,
}).await?;
if next_token == model.eos_token {
break;
}
generated_tokens.push(next_token);
// Get embedding for next iteration
hidden_states = gpu.execute(Operation::Embedding {
tokens: vec![next_token],
embedding_table: model.embedding_table.clone(),
}).await?;
}
println!("✓ Decode complete on {:?}: {} tokens generated",
decode_processor.processor_type(),
generated_tokens.len());
// ═══════════════════════════════════════════════════════════════
// STEP 5: DETOKENIZATION (CPU)
// CPU handles string operations and variable-length output
// ═══════════════════════════════════════════════════════════════
let output = cpu.execute(Operation::Detokenization {
tokens: generated_tokens,
vocab: model.vocab.clone(),
}).await?;
println!("✓ Detokenization complete on CPU");
Ok(output)
}
```
### 5.2 Utilization Report
```
╔═══════════════════════════════════════════════════════════════════════════╗
║ HETEROGENEOUS INFERENCE REPORT ║
╠═══════════════════════════════════════════════════════════════════════════╣
║ ║
║ Model: Llama-70B ║
║ Input: 512 tokens ║
║ Output: 256 tokens ║
║ ║
║ ┌────────────────────────────────────────────────────────────────────┐ ║
║ │ PROCESSOR UTILIZATION │ ║
║ ├────────────┬──────────┬──────────┬──────────┬────────────────────┤ ║
║ │ Processor │ Time │ Util % │ Tasks │ Operations │ ║
║ ├────────────┼──────────┼──────────┼──────────┼────────────────────┤ ║
║ │ CPU │ 15ms │ 8% │ 2 │ Token, Detoken │ ║
║ │ GPU (H100) │ 120ms │ 65% │ 257 │ Embedding (×257) │ ║
║ │ TPU v5p │ 200ms │ 95% │ 80 │ Prefill layers │ ║
║ │ LPU (Groq) │ 450ms │ 92% │ 256 │ Decode steps │ ║
║ └────────────┴──────────┴──────────┴──────────┴────────────────────┘ ║
║ ║
║ Total Time: 785ms (vs 2400ms GPU-only = 3.1x speedup) ║
║ Zero Idle Processors: ✓ ║
║ ║
║ ┌────────────────────────────────────────────────────────────────────┐ ║
║ │ TIMELINE │ ║
║ ├────────────────────────────────────────────────────────────────────┤ ║
║ │ │ ║
║ │ CPU ██░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░██ │ ║
║ │ │Tok Detok│ │ ║
║ │ │ ║
║ │ GPU ░░██████████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ │ ║
║ │ │Embed×512 │ │ ║
║ │ │ ║
║ │ TPU ░░░░░░░░░░░░░░██████████████████████████░░░░░░░░░░░░░░░░░░░░ │ ║
║ │ │Prefill (80 layers) │ │ ║
║ │ │ ║
║ │ LPU ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░██████████████████████████ │ ║
║ │ │Decode (256 steps) │ │ ║
║ │ │ ║
║ │ 0ms 200ms 400ms 600ms 800ms │ ║
║ └────────────────────────────────────────────────────────────────────┘ ║
║ ║
╚═══════════════════════════════════════════════════════════════════════════╝
```
---
## Summary: Multi-Processor Advantages
### Processor-Task Mapping
| Task Type | Best Processor | Why |
|-----------|----------------|-----|
| Data loading, I/O | **CPU** | Sequential, system calls |
| Tokenization/Detokenization | **CPU** | String processing |
| Embedding lookup | **GPU** | Memory bandwidth |
| Matrix multiply (large) | **TPU** | Dedicated MXU units |
| Attention (prefill) | **TPU/GPU** | Parallel, compute-bound |
| Token generation (decode) | **LPU** | Sequential, low latency |
| On-device inference | **NPU** | Power efficient |
| Browser compute | **WebGPU** | Platform agnostic |
| Cryptography | **FPGA** | Custom bit operations |
| Signal processing | **DSP** | Specialized math |
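The mapping above can be expressed as an ordered preference list per task kind, with CPU as the universal fallback. The enum and task names here are illustrative stand-ins for the crate's `ProcessorType` and `Operation`, not its actual API:

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum Proc { Cpu, Gpu, Tpu, Lpu, Npu }

/// Ordered preference per task kind, mirroring the table above.
fn preferences(task: &str) -> &'static [Proc] {
    match task {
        "tokenize" | "detokenize" | "io" => &[Proc::Cpu],
        "embedding" => &[Proc::Gpu, Proc::Cpu],
        "prefill" | "matmul" => &[Proc::Tpu, Proc::Gpu],
        "decode" => &[Proc::Lpu, Proc::Gpu],
        "edge_inference" => &[Proc::Npu, Proc::Gpu, Proc::Cpu],
        _ => &[Proc::Cpu],
    }
}

/// First preferred processor that is actually installed; CPU always works.
fn dispatch(task: &str, available: &[Proc]) -> Proc {
    preferences(task)
        .iter()
        .copied()
        .find(|p| available.contains(p))
        .unwrap_or(Proc::Cpu)
}

// dispatch("decode", &[Proc::Cpu, Proc::Gpu]) → Proc::Gpu (no LPU installed)
```

This is the same fallback chain the inference example uses, e.g. `get_best(Tpu).or_else(|_| get_best(Gpu))` for prefill.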
### Expected Speedups
| Workload | GPU-Only | Heterogeneous | Speedup |
|----------|----------|---------------|---------|
| LLM Training | 1x | 1.5-2x | +50-100% |
| LLM Inference | 1x | 2-4x | +100-300% |
| Image Generation | 1x | 1.3-1.8x | +30-80% |
| RAG Pipeline | 1x | 2-3x | +100-200% |
| Real-time Video | 1x | 3-5x | +200-400% |
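A rough model behind these streaming numbers: with perfect stage overlap, steady-state throughput is limited by the slowest stage, so the ideal speedup over serial execution is the ratio of total stage time to the bottleneck stage time. This is an upper bound that ignores transfer overhead and pipeline fill/drain:

```rust
/// Ideal streaming pipeline speedup: serial time over bottleneck stage time.
fn pipeline_speedup(stage_ms: &[f64]) -> f64 {
    let serial: f64 = stage_ms.iter().sum();
    let bottleneck = stage_ms.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    serial / bottleneck
}

// Stage times from the inference report (CPU, GPU, TPU, LPU):
// pipeline_speedup(&[15.0, 120.0, 200.0, 450.0]) ≈ 1.74
```

The single-request 3.1x figure in the report is a different quantity (heterogeneous vs GPU-only time for one request); the streaming bound above applies once many requests flow through the pipeline concurrently.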
### Zero Idle Guarantee
The heterogeneous scheduler ensures:
1. **Parallel execution** across processor types
2. **Pipeline overlap** between stages
3. **Work stealing** when processors become idle
4. **Predictive prefetching** of data
5. **Dynamic rebalancing** based on actual throughput
This architecture maximizes hardware utilization and minimizes total execution time by using EVERY available processor simultaneously.
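The work-stealing rule in point 3 can be sketched as follows. Plain `VecDeque`s stand in for the lock-free deques (e.g. crossbeam-deque) a production scheduler would use:

```rust
use std::collections::VecDeque;

type Task = u32;

/// Next task for a worker: drain local work first, then steal the oldest
/// task from the peer with the deepest backlog.
fn next_task(own: &mut VecDeque<Task>, peers: &mut [VecDeque<Task>]) -> Option<Task> {
    if let Some(t) = own.pop_front() {
        return Some(t); // local work: best cache locality
    }
    // Steal from the busiest peer's opposite end to limit contention
    let victim = peers.iter_mut().max_by_key(|q| q.len())?;
    victim.pop_back()
}

// With own = [] and peers = [[1, 2], [3, 4, 5]], the idle worker steals
// from the deeper queue's tail: next_task(...) → Some(5)
```

Stealing from the tail while the owner pops from the head is the standard trick that keeps owner and thief touching opposite ends of the deque, so contention stays low even without locks.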