quic_ecs_dt/CLAUDE.md
2026-05-12 11:21:40 -04:00

# quic_ecs_dt — Project Guide for Claude
## What & why
Source repo for **"QUIC + ECS as Complementary Transport and Runtime Substrates for Industrial Digital Twins"** — UCAmI 2026 (Plantevin & Francillette, UQAC). Third paper in a sequence; the first two are at IEEE SWC 2026:
- `plantevin2026ecs` — ECS as runtime substrate for industrial DT (200k assets @ 114 Hz on Pi 5).
- `plantevin2026quic` — QUIC partial reliability for DT sensor streams (94% P99 reduction vs TCP at 5% loss).
**UCAmI hypothesis (the composition question):** prior work shows ECS and QUIC each work as substrates *independently*. Does integrating real QUIC traffic into a Bevy ECS ingest path introduce coupling that degrades either one's claimed properties? The paper argues no, and measures it.
## Architecture
Three-tier QUIC ↔ ECS bridge, headless Bevy runtime:
| Tier | QUIC primitive | Use case | Channel cap | Tx newtype |
|------|----------------|----------|-------------|------------|
| T1 | Unreliable datagrams (RFC 9221) | High-freq ephemeral telemetry; drops OK | 1024 | `T1Sender::send_lossy` (try_send, drop on full) |
| T2 | Unidirectional streams | Ordered threshold events; reliable | 512 | `T2Sender::send` (await, backpressure) |
| T3 | Bidirectional streams | Actuator commands w/ ACK; per-command oneshot reply | 256 | `T3Sender::send` of `T3Inbound { command, reply }` |

The QUIC server runs on a dedicated OS thread with a Tokio multi-thread runtime. It pushes each decoded `QuicMessage` (UUID + sensor_id + f64 + ts + seq, plus the `sensor_type: u8` added in M5; 39 B fixed LE) into a per-tier `tokio::sync::mpsc` channel via the `T1Sender / T2Sender / T3Sender` newtypes (in [substrate/src/transport/mod.rs](substrate/src/transport/mod.rs)), so using the wrong send semantics for a tier is a type error. Bevy's `ingest_system` drains the channels in `PreUpdate`, gated by `run_if(in_state(ServerState::Started))`. The pattern is in [substrate/src/transport/ecs.rs](substrate/src/transport/ecs.rs).
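
The "misuse is a type error" idea can be sketched without any async runtime. The block below is a minimal, std-only illustration, not the repo's code: it swaps `tokio::sync::mpsc` for `std::sync::mpsc::sync_channel`, and the `Msg` payload and the `t1_channel` / `t2_channel` constructors are hypothetical names introduced for the example.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender};

/// Stand-in payload (hypothetical); the real `QuicMessage` carries more fields.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Msg {
    pub seq: u32,
    pub value: f64,
}

/// T1 is lossy by construction: the only send method never blocks.
pub struct T1Sender(SyncSender<Msg>);

/// T2 is reliable by construction: the only send method waits for room.
pub struct T2Sender(SyncSender<Msg>);

impl T1Sender {
    /// Returns `true` if the message was queued, `false` if it was dropped
    /// because the channel was full or the receiver is gone.
    pub fn send_lossy(&self, msg: Msg) -> bool {
        self.0.try_send(msg).is_ok()
    }
}

impl T2Sender {
    /// Blocks until the channel has room (backpressure). Fails only when the
    /// receiver has been dropped, handing the message back to the caller.
    pub fn send(&self, msg: Msg) -> Result<(), Msg> {
        self.0.send(msg).map_err(|e| e.0)
    }
}

/// Bounded T1 channel with the given capacity.
pub fn t1_channel(cap: usize) -> (T1Sender, Receiver<Msg>) {
    let (tx, rx) = sync_channel(cap);
    (T1Sender(tx), rx)
}

/// Bounded T2 channel with the given capacity.
pub fn t2_channel(cap: usize) -> (T2Sender, Receiver<Msg>) {
    let (tx, rx) = sync_channel(cap);
    (T2Sender(tx), rx)
}
```

Because `T1Sender` exposes no blocking send and `T2Sender` exposes no lossy one, a reader task cannot accidentally apply backpressure to datagrams or silently drop threshold events; the compiler rejects the call.
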
**T3 ack protocol.** A device opens a bi-stream and writes one `QuicMessage` (the command). The demux task reads it, builds a `T3Inbound { command, reply: oneshot::Sender<QuicMessage> }`, and sends it on the T3 mpsc. The ECS handler writes the ack into `reply`; the demux task awaits `reply_rx` and writes the resulting `QuicMessage` back on the bi-stream. Dropping the oneshot signals "no handler" and reaches the client as a stream reset (`send.reset(0)`). That was the placeholder ingest's behaviour before M4 installed real handlers, and it remains the fallback whenever no handler answers.
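
The request/reply handshake above can be sketched with std channels. This is an illustration under stated assumptions: a plain `std::sync::mpsc::channel` stands in for the Tokio oneshot, and `Msg`, `AckOutcome`, `forward_command`, and `handle_one` are hypothetical names, not identifiers from the repo.

```rust
use std::sync::mpsc::{channel, Receiver, Sender, SyncSender};

/// Stand-in payload (hypothetical); the real type is `QuicMessage`.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Msg {
    pub seq: u32,
    pub value: f64,
}

/// One in-flight command plus the single-use channel its ack travels back on.
pub struct T3Inbound {
    pub command: Msg,
    pub reply: Sender<Msg>,
}

#[derive(Debug, PartialEq)]
pub enum AckOutcome {
    /// The handler answered; write this message back on the bi-stream.
    Ack(Msg),
    /// The reply sender was dropped; reset the stream instead.
    NoHandler,
}

/// Demux side: forward one command to the ECS and wait for its reply.
pub fn forward_command(t3: &SyncSender<T3Inbound>, command: Msg) -> AckOutcome {
    let (reply, reply_rx) = channel();
    if t3.send(T3Inbound { command, reply }).is_err() {
        return AckOutcome::NoHandler; // the ECS side of the channel is gone
    }
    match reply_rx.recv() {
        Ok(ack) => AckOutcome::Ack(ack),
        Err(_) => AckOutcome::NoHandler, // `reply` was dropped without a send
    }
}

/// ECS side: answer one command with the latest known value, or drop the
/// reply handle when there is nothing to answer with.
pub fn handle_one(rx: &Receiver<T3Inbound>, latest_value: Option<f64>) {
    if let Ok(T3Inbound { command, reply }) = rx.recv() {
        match latest_value {
            Some(value) => {
                let _ = reply.send(Msg { seq: command.seq, value });
            }
            None => drop(reply),
        }
    }
}
```

The key property is that "no handler" needs no extra message type: dropping the reply handle is itself the signal, so a forgotten handler cannot leave the client waiting on a half-open stream.
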
**Target hardware:** CM5 (BCM2712, Cortex-A76, 4 GB) as DT runtime; M4 Max as traffic generator; 1 Gbps direct Ethernet. Both rigs are in hand.
## Repo map
```
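quic_ecs_dt/
├── paper/                    Quarto + LNCS source: single index.qmd, refs in references.bib
├── substrate/                Rust crate: Bevy 0.18 + Quinn 0.11 + rustls 0.23 + Tokio
│   └── src/
│       ├── main.rs           App::new, MinimalPlugins, EcsQuicTransportPlugin
│       ├── config.rs         figment chain: defaults → config.toml → APP_* env
│       ├── world.rs          ECS components + SensorRegistry
│       ├── observability.rs  ObservabilityPlugin (Prometheus exporter)
│       └── transport/
│           ├── mod.rs        QuicMessage + wire codec, T1Sender/T2Sender/T3Sender
│           ├── state.rs      ServerState { Starting, Started }
│           ├── ecs.rs        Plugin: tokio thread + 3 mpsc + PreUpdate ingest
│           └── server.rs     PEM loader, TransportConfig, handle_incoming + T1/T2/T3 demux
├── simulator/                Rust crate: SimulatorClient lib + CLI driver; Bevy sensor generators pending
├── dashboards/               Grafana dashboards: runtime.json, sensors.json
├── monitoring/               docker-compose stack: VictoriaMetrics + Grafana
├── data/                     (created by M6) loopback/, two_machine/; raw CSVs committed, *_processed ignored
├── config.toml               root config, loaded by substrate
├── Cargo.toml                workspace
└── Makefile                  render, preview, build, build-cm5, deploy-cm5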
```
## Status
| Area | State |
|------|-------|
| `AppConfig` figment loader (defaults → TOML → env) | Done — [substrate/src/config.rs:42](substrate/src/config.rs#L42) |
| 3-tier MPSC bridge scaffolding (Tokio thread + Bevy plugin) | Done — [substrate/src/transport/ecs.rs](substrate/src/transport/ecs.rs) |
| `QuicMessage` struct | Defined: [substrate/src/transport/mod.rs:4](substrate/src/transport/mod.rs#L4); the codec is tracked in the wire-codec row below |
| Quinn server lifecycle | Listener up — `ServerState{Starting,Started}` in [substrate/src/transport/state.rs](substrate/src/transport/state.rs); `OnEnter(Starting)` → bind + accept loop in [substrate/src/transport/ecs.rs](substrate/src/transport/ecs.rs). Explicit `TransportConfig` w/ tuned datagram recv buffer (256 KiB) in [substrate/src/transport/server.rs](substrate/src/transport/server.rs). Per-tier sender newtypes (`T1Sender::send_lossy`, `T2Sender::send`, `T3Sender::send`) in [substrate/src/transport/mod.rs](substrate/src/transport/mod.rs) |
| T1 demux (datagrams → ECS) | Done — `handle_incoming` orchestrator + `read_datagrams` reader in [substrate/src/transport/server.rs](substrate/src/transport/server.rs); decode errors logged but non-fatal; channel-full drops silent at trace; received/dropped/decode_errors counters in the end-of-stream debug line |
| T2 demux (uni streams → ECS) | Done: `read_uni_streams` accepts streams in [substrate/src/transport/server.rs](substrate/src/transport/server.rs) and spawns one task per stream that reads 39 B chunks (38 B before M5) until EOF; a decode failure resets only that stream via `recv.stop(0)`, so one bad stream doesn't kill the connection; `t2.send().await` honours backpressure |
| T3 demux (bi streams ↔ ECS) | Done: `accept_bi_streams` + `read_one_bi_stream` in [substrate/src/transport/server.rs](substrate/src/transport/server.rs); reads the 39 B command (38 B before M5), ships `T3Inbound { command, reply: oneshot::Sender }` to the ECS, awaits the reply, writes the 39 B ack and finishes. If the ECS drops the oneshot (no handler answered, which was always the case with the pre-M4 placeholder ingest), `send.reset(0)` gives the client a clean signal instead of a half-open stream. `handle_incoming` joins all three readers on close |
| TLS / self-signed cert | Done (M1) — `certs/server.{crt,key}` via `make certs`, gitignored. PEM loader in [substrate/src/transport/server.rs:15](substrate/src/transport/server.rs#L15); rustls `aws-lc-rs` default provider installed in [substrate/src/main.rs](substrate/src/main.rs) |
| Wire codec for `QuicMessage` (39 B fixed LE, incl. `sensor_type: u8`) | Done — [substrate/src/transport/mod.rs](substrate/src/transport/mod.rs); 5 unit tests passing. `SensorType` enum: `Generic / Temperature / Humidity / Pressure / Voltage / Current` |
| `tracing-subscriber` init w/ `RUST_LOG` | Done (M1) — [substrate/src/main.rs:8-12](substrate/src/main.rs#L8-L12) |
| ECS components (`RawSensorData`, `SmoothedValue`) + 5 systems (Ingest/Sim/Export/FaultInjection/Diagnostics) | Done — entities = `(DeviceId, SensorId, SensorTypeTag, RawSensorData, SmoothedValue, Asset)` per (device, sensor); `SensorRegistry` upserts via `HashMap<(Uuid, u16), Entity>` in [substrate/src/world.rs](substrate/src/world.rs). `IngestSystem` drains all three tiers; T3 ack preserves command's `sensor_type` and returns the device's most recent `raw_value`. `SimulationSystem` maintains a 16-sample rolling mean per entity and emits `substrate_threshold_crossings_total{type, direction}` when the smoothed mean crosses a per-type threshold (`Changed<RawSensorData>` query so cost scales with ingress, not fleet size). `ExportSystem` samples `substrate_{entities,channel_depth,channel_capacity,rss_bytes}` + `sensor_aggregate{type, stat}` once per second. `FaultInjection` is still a stub awaiting M6. `Diagnostics` logs `tick_hz` once per second |
| Schedule rate-gating | Done (M4) — `MinimalPlugins.set(ScheduleRunnerPlugin::run_loop(1/tick_rate_hz))` in [substrate/src/main.rs](substrate/src/main.rs); replaces the default busy-loop with the configured period |
| Prometheus exporter + Grafana dashboards | Done (M5) — `ObservabilityPlugin` in [substrate/src/observability.rs](substrate/src/observability.rs) installs `metrics-exporter-prometheus` on the existing tokio runtime. **Runtime surface** (paper §Evaluation): counters `substrate_received_total{tier}`, `dropped_total{tier=t1}`, `decode_errors_total{tier}`, `t3_no_handler_total`; latency histograms `substrate_latency_us{tier}`; gauges `substrate_tick_hz`, `substrate_entities`, `substrate_channel_depth{tier}`, `substrate_channel_capacity{tier}`, `substrate_rss_bytes`. **Sensor data surface** (operator dashboard): per-type aggregates `sensor_aggregate{type, stat=count|mean|min|max}` computed once per second over the live world, cardinality bounded by `\|SensorType\| × 4` so it scales to thousands of sensors. Two dashboards: [dashboards/runtime.json](dashboards/runtime.json) and [dashboards/sensors.json](dashboards/sensors.json) (thermometer/gauge/stat panels per type) |
| Simulator (Quinn client + sensor generators) | `SimulatorClient` lib in [simulator/src/client.rs](simulator/src/client.rs) — connects, trusts the substrate's PEM cert via custom `ServerCertVerifier` (sidesteps `CaUsedAsEndEntity`); `send_datagram(QuicMessage)` for T1, `send_uni_stream(&[QuicMessage])` for T2, `request(&QuicMessage) -> QuicMessage` for T3. CLI driver in [simulator/src/main.rs](simulator/src/main.rs) with clap flags (`--addr`, `--rate-hz`, `--t2-rate-hz`, `--t3-rate-hz`, `--t3-timeout-ms`, `--count`, `--devices`, `--sensor-id`, `--sensor-type`, `--profile`, `--cert`, `--server-name`); parallel T1+T2+T3 emitters, per-(device,sensor) sequence counters, type-appropriate waveform generators (sin/cos curves centred on realistic sensor ranges), 1-Hz combined progress logs, Ctrl-C drain. `--profile industrial` fans out to 5 sensors per device (Temperature/Humidity/Pressure/Voltage/Current). Bevy-driven sensor generator still pending |
| End-to-end test harness | Six integration tests across [simulator/tests/end_to_end_t1.rs](simulator/tests/end_to_end_t1.rs), [simulator/tests/end_to_end_t2.rs](simulator/tests/end_to_end_t2.rs), [simulator/tests/end_to_end_t3.rs](simulator/tests/end_to_end_t3.rs): T1 single-datagram round-trip + 32-msg burst order; T2 single-stream order-preservation + 4-stream concurrent per-device ordering; T3 round-trip with fake-ECS handler + no-handler stream-reset. Each test calls `bind_endpoint` + `accept_loop` in-process with channels owned by the test |
| `config.toml` at repo root | Done (M1) — [config.toml](config.toml); loaded by [substrate/src/main.rs:9](substrate/src/main.rs#L9) |
| Benchmark harness (sweep + CSV writer) | Missing |
| CM5 cross-compile / deploy | Wired in [Makefile:30](Makefile#L30); not exercised |

`cargo run -p substrate` boots, prints the loaded config, binds the Quinn listener, and waits for connections. Since M4 the ECS schedule is rate-gated to `tick_rate_hz` instead of busy-looping under the `MinimalPlugins` default.
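
As a reference point for the wire-codec row in the Status table, here is a minimal std-only sketch of a 39 B fixed little-endian layout. It is an assumption-laden illustration rather than the repo's codec: the field order (in particular placing `sensor_type` last), the field names, the `WireError::BadLength` variant, and the `take` helper are all hypothetical, and the UUID is held as raw bytes to avoid a crate dependency.

```rust
/// Total encoded size: 16 + 2 + 8 + 8 + 4 + 1 bytes.
pub const WIRE_LEN: usize = 39;

#[derive(Debug, Clone, Copy, PartialEq)]
pub struct QuicMessage {
    pub device_id: [u8; 16], // UUID as raw bytes
    pub sensor_id: u16,
    pub value: f64,
    pub ts_us: u64,
    pub seq: u32,
    pub sensor_type: u8, // assumed to sit at the end of the frame
}

#[derive(Debug, PartialEq)]
pub enum WireError {
    BadLength { expected: usize, got: usize },
}

/// Copy `N` bytes starting at `at` into a fixed-size array.
fn take<const N: usize>(bytes: &[u8], at: usize) -> [u8; N] {
    let mut out = [0u8; N];
    out.copy_from_slice(&bytes[at..at + N]);
    out
}

impl QuicMessage {
    /// Encode into a fixed 39-byte little-endian frame.
    pub fn encode(&self) -> [u8; WIRE_LEN] {
        let mut buf = [0u8; WIRE_LEN];
        buf[0..16].copy_from_slice(&self.device_id);
        buf[16..18].copy_from_slice(&self.sensor_id.to_le_bytes());
        buf[18..26].copy_from_slice(&self.value.to_le_bytes());
        buf[26..34].copy_from_slice(&self.ts_us.to_le_bytes());
        buf[34..38].copy_from_slice(&self.seq.to_le_bytes());
        buf[38] = self.sensor_type;
        buf
    }

    /// Decode a frame; any length other than 39 bytes is rejected.
    pub fn decode(bytes: &[u8]) -> Result<Self, WireError> {
        if bytes.len() != WIRE_LEN {
            return Err(WireError::BadLength { expected: WIRE_LEN, got: bytes.len() });
        }
        Ok(QuicMessage {
            device_id: take(bytes, 0),
            sensor_id: u16::from_le_bytes(take(bytes, 16)),
            value: f64::from_le_bytes(take(bytes, 18)),
            ts_us: u64::from_le_bytes(take(bytes, 26)),
            seq: u32::from_le_bytes(take(bytes, 34)),
            sensor_type: bytes[38],
        })
    }
}
```

A fixed-size frame keeps the stream readers simple: each one reads exactly `WIRE_LEN` bytes per message, with no length prefix or delimiter to parse.
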
## Roadmap
Each milestone has one verification gate. Update Status here as we go.
- **M1 — Wire codec & root config.** ✅ Done 2026-05-04. Hand-rolled little-endian codec on `QuicMessage` (38 B fixed: 16 UUID + 2 stream_id + 8 f64 + 8 ts_us + 4 seq) with roundtrip + layout + length-error tests; `config.toml` at repo root; dev TLS via `make certs`; structured `tracing-subscriber` init reads `RUST_LOG` (default `info`).
- **M2 — Quinn server + self-signed TLS.** ✅ Done 2026-05-06. Listener up under `ServerState::Starting/Started`; type-system tier semantics + T3 oneshot ack protocol; per-connection `handle_incoming` orchestrator joining T1 datagram, T2 uni-stream, and T3 bi-stream readers. T1 has dropped/decoded counters; T2 resets a stream on decode failure without killing the connection; T3 ships `T3Inbound { command, reply }` to the ECS and resets the stream when no handler answers. End-to-end coverage: 6 integration tests in [simulator/tests/](simulator/tests/) plus 4 codec unit tests, all green.
- **M3 — Simulator client.** Replace [simulator/src/main.rs](simulator/src/main.rs) with a Bevy app: Quinn client, N synthetic devices, configurable per-tier rates. *Verify:* end-to-end loopback drains messages on all three tiers. **Status (2026-05-05):** simulator made into a lib + bin; `SimulatorClient::{connect,send_datagram,close}` plus a manual smoke runner in `simulator/src/main.rs`. Two integration tests in `simulator/tests/end_to_end_t1.rs` exercise the full T1 path against an in-process substrate. Bevy-driven generator + T2/T3 helpers + load profiles still pending.
- **M4 — ECS world.** ✅ Done. `Asset` + `DeviceId` + `SensorId` + `SensorTypeTag` + `RawSensorData` + `SmoothedValue` components in [substrate/src/world.rs](substrate/src/world.rs); `SensorRegistry` resource for O(1) `(Uuid, u16) → Entity`. `IngestSystem` drains all three tiers (T1 batched, T2/T3 fully); T3 handler returns the latest sensor value as ack. `SimulationSystem` runs a per-entity 16-sample rolling mean and emits `substrate_threshold_crossings_total{type, direction}` on per-type threshold crossings — gives the ECS observable digital-twin work, not just write-through ingest. `ExportSystem` samples `substrate_{entities,channel_depth,channel_capacity,rss_bytes}` + `sensor_aggregate{type, stat}` once per second. `FaultInjection` still a stub (M6). `DiagnosticsSystem` logs tick rate once per second. Schedule rate-gated via `ScheduleRunnerPlugin::run_loop(1/tick_rate_hz)`. 8 unit tests passing (entity create, in-place update, T3 ack, SmoothedValue push/window/non-finite/full-roll, threshold-crossing transition).
- **M5 — Observability (VictoriaMetrics + Grafana).** ✅ Done. Wire format extended to carry `sensor_type: u8` (38 → 39 B, decoded into `SensorType` enum). Two metric surfaces over `metrics-exporter-prometheus`:
  - **Runtime** (paper §Evaluation): `substrate_received_total{tier}`, `dropped_total{tier=t1}`, `decode_errors_total{tier}`, `t3_no_handler_total`, `latency_us{tier}` histograms, `tick_hz` / `entities` / `channel_depth{tier}` / `rss_bytes` gauges.
  - **Sensor data** (operator surface): `sensor_aggregate{type, stat=count|mean|min|max}` aggregated per second across the live ECS world. Cardinality bounded to `|SensorType| × 4` series (6 × 4 = 24 with the current enum), independent of physical sensor count.
  - Dashboards: [dashboards/runtime.json](dashboards/runtime.json) + [dashboards/sensors.json](dashboards/sensors.json).
  - Verified: `--profile industrial --devices 2 --count 200` yields 10 entities and all 5 type aggregates with realistic values (T=20.5°C, RH=51%, P=1018 hPa, V=230.2 V, I=12 A).
- **M6 — Benchmark harness.** Sweep `entity_count ∈ {10k, 50k, 100k, 200k}` × `loss_rate ∈ {0%, 1%, 5%}` with 2k warmup + 5k measurement ticks. Loss via `tc netem` or in-app injection. Writes `data/loopback/final_table.csv`. *Verify:* one full sweep on M4 Max produces a CSV the Quarto figures consume.
- **M7 — CM5 cross-compile & deploy.** Exercise [Makefile:30](Makefile#L30) (`build-cm5`, `deploy-cm5`); set real `CM5_HOST`. *Verify:* binary runs on CM5 with a feed from M4 Max over 1 Gbps Ethernet.
- **M8 — Two-machine run + paper render.** Sweep with simulator on M4 Max → substrate on CM5; populate `data/two_machine/final_table.csv`; `make render` produces a PDF. **Update §Evaluation prose to reflect actual numbers.** Current paper figures (241 Hz, 64 µs / 15.8 ms P99, 2.6 µs jitter, 1.02 MB/1k, R²=0.9999) are **aspirational placeholders** — they may move and the conclusions may shift; that's expected.
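
The M4 smoothing step (a 16-sample rolling mean per entity, with a counter bump when the mean crosses a threshold) can be sketched as follows. This is a std-only illustration under assumptions: the ring-buffer layout, the choice to ignore non-finite samples, the `Direction` enum, and the `crossing` function are hypothetical and may differ from what `SimulationSystem` actually does.

```rust
pub const WINDOW: usize = 16;

/// Fixed-size ring buffer holding the most recent finite samples.
#[derive(Debug, Clone)]
pub struct SmoothedValue {
    samples: [f64; WINDOW],
    len: usize,  // number of valid samples, saturates at WINDOW
    next: usize, // index the next sample is written to
}

impl SmoothedValue {
    pub fn new() -> Self {
        SmoothedValue { samples: [0.0; WINDOW], len: 0, next: 0 }
    }

    /// Record a sample. Non-finite values (NaN, ±inf) are ignored so a single
    /// bad reading cannot poison the mean (an assumption of this sketch).
    pub fn push(&mut self, value: f64) {
        if !value.is_finite() {
            return;
        }
        self.samples[self.next] = value;
        self.next = (self.next + 1) % WINDOW;
        if self.len < WINDOW {
            self.len += 1;
        }
    }

    /// Mean over the valid samples, or `None` before the first sample.
    pub fn mean(&self) -> Option<f64> {
        if self.len == 0 {
            return None;
        }
        let sum: f64 = self.samples[..self.len].iter().sum();
        Some(sum / self.len as f64)
    }
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Direction {
    Rising,
    Falling,
}

/// Report a crossing only on the tick where the mean changes sides.
pub fn crossing(prev_mean: f64, new_mean: f64, threshold: f64) -> Option<Direction> {
    if prev_mean < threshold && new_mean >= threshold {
        Some(Direction::Rising)
    } else if prev_mean >= threshold && new_mean < threshold {
        Some(Direction::Falling)
    } else {
        None
    }
}
```

Detecting the transition rather than the level means a sensor that stays above its threshold contributes one counter increment, not one per tick.
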
## Conventions
- **Rust:** edition 2024; workspace at root with `simulator` + `substrate`; `opt-level=1` dev, `opt-level=3` for deps.
- **Pinned crates:** Bevy 0.18, Quinn 0.11, rustls 0.23, Tokio 1 (full), figment 0.10 (toml + env), uuid 1.23 (v4), serde 1.
- **Config:** `figment` chain — defaults in [substrate/src/config.rs:25](substrate/src/config.rs#L25) → `config.toml` → env `APP_*` (double-underscore for nesting, e.g. `APP_NETWORK__SERVER_PORT=9000`).
- **Bevy:** headless — `MinimalPlugins` only; do not pull rendering plugins.
- **Tokio↔Bevy:** keep the dedicated-thread + mpsc pattern in [substrate/src/transport/ecs.rs:49](substrate/src/transport/ecs.rs#L49); do not block the ECS schedule on async work.
- **Paper:** Quarto + LNCS template ([paper/_extensions/template.tex](paper/_extensions/template.tex), [paper/_quarto.yml](paper/_quarto.yml)). **Never commit `llncs.cls` or `splncs04.bst`** — CTAN licensing; download per [README.md:25-34](README.md#L25-L34).
- **Data:** raw CSVs under `data/` are committed; `*_processed.csv` is gitignored. Paper figures consume `data/loopback/final_table.csv` and `data/two_machine/final_table.csv`.
- **Build artifacts:** `target/`, `paper/_output/`, `paper/figures/`, `paper/.quarto/`, `paper/index.tex` all gitignored.
- **Errors:** `anyhow` (with `.context()`) for internal startup paths where the error type is uninteresting; `thiserror` for boundary types we want to match against (e.g. `WireError` in the codec).
- **Warnings:** let real warnings show. No `#[allow(dead_code)]`, `_var` blanket suppression, or `PhantomData` shims to silence the compiler — warnings are honest TODO markers and disappear when the consuming code lands. See [feedback memory](../../.claude/projects/-Users-vplantevin-Projects-Research-quic-ecs-dt/memory/feedback_no_warning_hacks.md).
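
To make the config convention concrete, the sketch below models the two rules it states: the `APP_` prefix with double-underscore nesting, and the precedence order defaults → `config.toml` → env. It is a std-only model, not figment itself (in figment this is typically expressed with `Env::prefixed("APP_").split("__")`, though whether the repo uses exactly that call is an assumption). `env_key_to_path` and `resolve` are hypothetical helper names.

```rust
/// Map an environment variable name to a nested config path.
/// Returns `None` when the name does not carry the `APP_` prefix.
pub fn env_key_to_path(name: &str) -> Option<Vec<String>> {
    let rest = name.strip_prefix("APP_")?;
    if rest.is_empty() {
        return None;
    }
    // A double underscore separates nesting levels; keys are lowercased.
    Some(rest.split("__").map(|part| part.to_lowercase()).collect())
}

/// Resolve one setting: env beats the TOML file, which beats the default.
pub fn resolve<T>(default: T, toml: Option<T>, env: Option<T>) -> T {
    env.or(toml).unwrap_or(default)
}
```

Under this model `APP_NETWORK__SERVER_PORT=9000` addresses `network.server_port`, and it overrides whatever `config.toml` or the built-in defaults say.
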
## Known deferrals
- **Channel ownership is per-host, not per-connection.** All connections share the same three mpsc channels. Fairness under N-device load relies on tokio scheduling. Acceptable for the "one ECS world per host" model the paper describes; revisit if many-device benchmarks show starvation.
- **No graceful shutdown.** The `quic-runtime` thread is parked on `pending()`; spawned tasks (accept loop, per-conn demux) are orphaned at process exit. Fine for research runs; we'll need an `OnExit(Started)` (or a `Stopping` state) when M5 observability needs clean drain or M8 wants finalised CSV writes.
- **Bind failure is fatal.** `OnEnter(Starting)` panics if `bind_endpoint` fails. A `ServerState::Failed` variant joins when we wire proper error surfacing.
- **T3 ack semantics are minimal.** The current handler echoes the device's most recent `raw_value` with a server timestamp — adequate for "read sensor" commands, not for actuator-write semantics. A future iteration may introduce an `ActuatorState` component and a setpoint-apply path; for now T3 is best framed as "reliable read/query RPC" in the paper.
- **`FaultInjectionSystem` is still empty.** Runs on schedule but does nothing. M6 fills it with rate-controlled in-app drop so loss sweeps don't depend on external `tc netem`.
- **Schedule rate-gating is approximate.** `ScheduleRunnerPlugin::run_loop(period)` honours `period` as a minimum; observed `tick_hz` runs ~85% of target on macOS dev (target 60 → ~50). Should be tighter on the CM5; revisit if M6 sweeps depend on a steady tick.
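
For the rate-gating deferral, a quick way to reason about the shortfall is to convert the observed rate into extra time per tick. The sketch assumes a simple model in which every tick lasts the configured period plus a fixed overshoot (for example, sleep granularity); that model and both function names are assumptions for illustration, not measurements.

```rust
/// Extra milliseconds per tick implied by running at `observed_hz`
/// when the schedule was configured for `target_hz`.
pub fn implied_overshoot_ms(target_hz: f64, observed_hz: f64) -> f64 {
    (1.0 / observed_hz - 1.0 / target_hz) * 1000.0
}

/// Fraction of the target rate actually achieved.
pub fn achieved_fraction(target_hz: f64, observed_hz: f64) -> f64 {
    observed_hz / target_hz
}
```

With the figures quoted above (target 60 Hz, roughly 50 Hz observed), the period is about 16.7 ms and each tick takes about 20 ms, so the model attributes roughly 3.3 ms per tick to overshoot.
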
## Run / verify
```bash
make certs # generate certs/server.{crt,key} (ECDSA P-256, SAN: localhost/cm5.local/127.0.0.1/::1)
make build # cargo build --release (native, depends on certs)
make build-cm5 # aarch64 cross-build for the CM5 (depends on certs)
make deploy-cm5 # scp to $CM5_HOST (set in env or override Makefile var)
make render # build the paper PDF
make preview # live-reload paper preview at :4848
make clean # cargo clean + drop generated paper outputs
```
`certs/` is gitignored; `make build` regenerates the dev cert if missing. From the repo root: `cargo run -p substrate` boots, prints the loaded `AppConfig`, and idles. `config.toml` and cert paths are resolved relative to the cwd — always launch from the repo root.
**Tests.** `cargo test --workspace` runs the unit tests in `substrate` (codec and ECS world) plus the end-to-end integration tests in [simulator/tests/](simulator/tests/). Each integration test calls `bind_endpoint` + `accept_loop` in-process on `127.0.0.1:0` (OS-assigned port), connects a `SimulatorClient` against it, and asserts on what arrives on the test-owned channel for the tier under test. T1, T2, and T3 each have their own `simulator/tests/end_to_end_*.rs` file; add another for any new wire path.
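
The `127.0.0.1:0` trick is what lets the integration tests run in parallel without port clashes. A minimal std-only sketch of the idea, using a plain UDP socket (QUIC runs over UDP) in place of the repo's `bind_endpoint`; `bind_ephemeral` is a hypothetical helper name:

```rust
use std::net::{SocketAddr, UdpSocket};

/// Bind to loopback on port 0 so the OS picks a free port, then report
/// the address that was actually assigned.
pub fn bind_ephemeral() -> std::io::Result<(UdpSocket, SocketAddr)> {
    let socket = UdpSocket::bind("127.0.0.1:0")?;
    let addr = socket.local_addr()?;
    Ok((socket, addr))
}
```

The test then hands the returned address to the client instead of a hard-coded port, so two tests bound at the same time never collide.
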
**Metrics scrape.** With `metrics_enabled = true` (default), the substrate exposes a Prometheus-format endpoint:
```bash
curl http://127.0.0.1:9100/metrics
```
A docker-compose stack under [monitoring/](monitoring/) brings up VictoriaMetrics + Grafana auto-provisioned: `make monitoring-up` then Grafana at <http://localhost:3000> (admin / admin), both dashboards under the `quic_ecs_dt` folder. The compose mounts [dashboards/](dashboards/) directly so any edit to the JSON files re-imports within 10 s.
Two Grafana dashboards under [dashboards/](dashboards/):
- [`runtime.json`](dashboards/runtime.json) — tick rate, RSS, per-tier received/dropped/latency, channel depth (paper §Evaluation surface).
- [`sensors.json`](dashboards/sensors.json) — thermometer + gauges + stat panels per `SensorType`, driven by `sensor_aggregate{type, stat}` (operator-facing surface).
Both use the `${datasource}` template variable so you can point them at any Prometheus-compatible source.
**Manual two-process run.** From the repo root, in two shells:
```bash
# shell 1 — server (use RUST_LOG=substrate=debug to see the per-conn summary)
cargo run -p substrate
# shell 2 — client; --help shows all flags
cargo run -p simulator -- --rate-hz 100 --count 0 --devices 4
```
Simulator flags (see `cargo run -p simulator -- --help`): `--addr`, `--server-name`, `--cert`, `--rate-hz` (T1 datagram rate; `0` disables T1), `--t2-rate-hz` / `--t3-rate-hz` (per-tier event rate; `0` disables), `--t3-timeout-ms` (T3 ack wait, default `2000`), `--count` (T1 count; `0` = until Ctrl-C), `--devices`, `--sensor-id`, `--sensor-type` (one of `generic|temperature|humidity|pressure|voltage|current`), `--profile` (`single` or `industrial` — 5 sensors per device on ids 0..4 covering all types). The client logs a one-second `progress` line with `t1_sent`/`t2_sent`/`t3_sent`/`t3_timeouts`/per-tier observed Hz, and a final `simulator done` line with elapsed time on exit.
## Key references
- Prior self-citations: `plantevin2026ecs`, `plantevin2026quic` (both IEEE SWC 2026, "to appear").
- QUIC: RFC 9000 (core), RFC 9221 (unreliable datagrams).
- DT foundations: Tao et al. 2019; Grieves & Vickers 2017; Minerva et al. 2020.
- ECS: Nystrom 2014, *Game Programming Patterns*.
- Mixed-reliability transport: Peeck et al. (W2RP for DDS).
- DT sync metrics: Çakır et al. 2023 (Twin Alignment Ratio); Bellavista et al. 2023 (ODTE).
- Industrial QUIC/IIoT: Fernández et al. 2021; Boeding et al. 2025.
- Full bibliography: [paper/references.bib](paper/references.bib).