Update to the text and small demo

Valère Plantevin
2026-05-13 17:22:10 -04:00
parent 872bbb8c2c
commit a7b8065739
4 changed files with 168 additions and 109 deletions

CLAUDE.md

@@ -2,12 +2,12 @@
## What & why
Source repo for **"QUIC + ECS as Complementary Transport and Runtime Substrates for Industrial Digital Twins"** — UCAmI 2026 (Plantevin & Francillette, UQAC). Third paper in a sequence; the first two are at IEEE SWC 2026:
Source repo for **"QUIC and ECS as Complementary Transport and Runtime Substrates for Industrial Digital Twins: An Integrated Empirical Study"** — submitted to **UCAmI 2026** (Track 2: *Internet of EveryThing (IoT, People & Processes) and Sensors*; primary topic *IoE interoperability, integration and performance*, secondary topic *IoE experimental results and deployment scenarios*). Single-author (Plantevin, UQAC). Third paper in a sequence; the first two are at IEEE SWC 2026:
- `plantevin2026ecs` — ECS as runtime substrate for industrial DT (200k assets @ 114 Hz on Pi 5).
- `plantevin2026quic` — QUIC partial reliability for DT sensor streams (94% P99 reduction vs TCP at 5% loss).
**UCAmI hypothesis (the composition question):** prior work shows ECS and QUIC each work as substrates *independently*. Does integrating real QUIC traffic into a Bevy ECS ingest path introduce coupling that degrades either one's claimed properties? The paper argues no, and measures it.
**UCAmI hypothesis (the composition question):** prior work shows ECS and QUIC each work as substrates *independently*. Does integrating real QUIC traffic into a Bevy ECS ingest path introduce coupling that degrades either one's claimed properties? The paper argues no, and measures it on a real CM5 ↔ M4 Max two-machine deployment.
## Architecture
@@ -19,13 +19,13 @@ Three-tier QUIC ↔ ECS bridge, headless Bevy runtime. **T1/T2 are inbound (devi
| T2 | Unidirectional streams | device → substrate | Ordered threshold events; reliable | 512 | `T2Sender::send` (await, backpressure) |
| T3 | Bidirectional streams | **substrate → device** | Actuator commands w/ ACK | 256 | `T3OutboundSender::try_send` of `OutboundT3 { target_device, sensor_id, raw_value, sensor_type }` |
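
The producer APIs in the table differ mainly in what happens when a channel is full. The following is a minimal sketch of that contrast, not the code in `substrate/src/transport/mod.rs`: it assumes blocking `std::sync::mpsc` channels as a stand-in for `tokio::sync::mpsc`, and the constructor names `t2_channel` and `t3_outbound_channel` are hypothetical.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// T2: ordered, reliable events. A full channel blocks the producer (backpressure).
pub struct T2Sender<M>(SyncSender<M>);

/// T3: outbound actuator commands. A full channel rejects the command immediately.
pub struct T3OutboundSender<M>(SyncSender<M>);

impl<M> T2Sender<M> {
    /// Blocks until there is room; returns the message only if the receiver is gone.
    pub fn send(&self, msg: M) -> Result<(), M> {
        self.0.send(msg).map_err(|e| e.0)
    }
}

impl<M> T3OutboundSender<M> {
    /// Never blocks; hands the message back when the channel is full or closed,
    /// so the caller can count it as dropped.
    pub fn try_send(&self, msg: M) -> Result<(), M> {
        self.0.try_send(msg).map_err(|e| match e {
            TrySendError::Full(m) | TrySendError::Disconnected(m) => m,
        })
    }
}

/// Hypothetical constructor: bounded T2 channel (the table lists capacity 512).
pub fn t2_channel<M>(capacity: usize) -> (T2Sender<M>, Receiver<M>) {
    let (tx, rx) = sync_channel(capacity);
    (T2Sender(tx), rx)
}

/// Hypothetical constructor: bounded T3 outbound channel (the table lists capacity 256).
pub fn t3_outbound_channel<M>(capacity: usize) -> (T3OutboundSender<M>, Receiver<M>) {
    let (tx, rx) = sync_channel(capacity);
    (T3OutboundSender(tx), rx)
}
```

Because each tier has its own newtype, passing a `T2Sender` where a `T3OutboundSender` is expected fails to compile, which is the property the newtypes are there to provide.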
QUIC server runs on a dedicated OS thread with a Tokio multi-thread runtime. T1/T2 decoded `QuicMessage`s (39 B fixed LE: UUID + sensor_id + f64 + ts + seq + sensor_type) flow into per-tier `tokio::sync::mpsc` channels and are drained by Bevy's `ingest_system` in `PreUpdate`, gated by `run_if(in_state(ServerState::Started))`. T3 flows the other way: `automation_system` constructs `OutboundT3` items and the tokio-side `drain_outbound_t3` task opens bi-streams to the target device. The per-tier sender newtypes (in [substrate/src/transport/mod.rs](substrate/src/transport/mod.rs)) make tier mixups a type error. Pattern is in [substrate/src/transport/ecs.rs](substrate/src/transport/ecs.rs).
QUIC server runs on a dedicated OS thread with a Tokio multi-thread runtime. T1/T2 decoded `QuicMessage`s (39 B fixed LE: 16 UUID + 2 sensor_id + 8 f64 + 8 ts + 4 seq + 1 sensor_type) flow into per-tier `tokio::sync::mpsc` channels and are drained by Bevy's `ingest_system` in `PreUpdate`, gated by `run_if(in_state(ServerState::Started))`. T3 flows the other way: `automation_system` constructs `OutboundT3` items and the tokio-side `drain_outbound_t3` task opens bi-streams to the target device. The per-tier sender newtypes (in [substrate/src/transport/mod.rs](substrate/src/transport/mod.rs)) make tier mixups a type error. Pattern in [substrate/src/transport/ecs.rs](substrate/src/transport/ecs.rs).
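The 39 B layout can be illustrated with a std-only sketch. It assumes the fields are packed in the order listed above with no padding; `WireMessage`, `DecodeError`, `take`, and the field names `device` and `ts_us` are placeholders, and a raw `[u8; 16]` stands in for the UUID type.

```rust
/// Fixed wire size: 16 + 2 + 8 + 8 + 4 + 1 = 39 bytes.
pub const WIRE_LEN: usize = 39;

#[derive(Debug, Clone, Copy, PartialEq)]
pub struct WireMessage {
    pub device: [u8; 16], // UUID bytes (placeholder for a real Uuid type)
    pub sensor_id: u16,
    pub raw_value: f64,
    pub ts_us: u64,
    pub seq: u32,
    pub sensor_type: u8,
}

#[derive(Debug, PartialEq)]
pub enum DecodeError {
    /// Input was not exactly WIRE_LEN bytes; carries the actual length.
    BadLength(usize),
}

/// Copy N bytes starting at `at` into a fixed-size array.
fn take<const N: usize>(bytes: &[u8], at: usize) -> [u8; N] {
    let mut out = [0u8; N];
    out.copy_from_slice(&bytes[at..at + N]);
    out
}

impl WireMessage {
    /// Serialise every numeric field as little-endian at a fixed offset.
    pub fn encode(&self) -> [u8; WIRE_LEN] {
        let mut buf = [0u8; WIRE_LEN];
        buf[0..16].copy_from_slice(&self.device);
        buf[16..18].copy_from_slice(&self.sensor_id.to_le_bytes());
        buf[18..26].copy_from_slice(&self.raw_value.to_le_bytes());
        buf[26..34].copy_from_slice(&self.ts_us.to_le_bytes());
        buf[34..38].copy_from_slice(&self.seq.to_le_bytes());
        buf[38] = self.sensor_type;
        buf
    }

    /// Reject anything that is not exactly 39 bytes, then read the same offsets back.
    pub fn decode(bytes: &[u8]) -> Result<Self, DecodeError> {
        if bytes.len() != WIRE_LEN {
            return Err(DecodeError::BadLength(bytes.len()));
        }
        Ok(WireMessage {
            device: take::<16>(bytes, 0),
            sensor_id: u16::from_le_bytes(take::<2>(bytes, 16)),
            raw_value: f64::from_le_bytes(take::<8>(bytes, 18)),
            ts_us: u64::from_le_bytes(take::<8>(bytes, 26)),
            seq: u32::from_le_bytes(take::<4>(bytes, 34)),
            sensor_type: bytes[38],
        })
    }
}
```

A fixed-size frame lets the T2 reader pull 39 B chunks from a stream with no length prefix, and lets a datagram be validated by length alone.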
**T3 actuator-command protocol.** The substrate's `automation_system` decides to actuate (e.g. Presence < 1.0 ⇒ Relay = stop) and pushes an `OutboundT3` onto the outbound channel. The tokio drain task pops it, looks up the target device's `quinn::Connection` in a `ConnectionRegistry` (populated by `read_datagrams` / `read_one_uni_stream` on first sight of each device UUID), then **spawns one task per command** to do `conn.open_bi() → write 39 B → finish → read 39 B ack`. Per-task spawning means a single stuck `read_exact` can't stall the pipeline. Latency from `open_bi()` to ack-receipt is recorded as `substrate_latency_us{tier="t3"}` and a successful ack increments `substrate_received_total{tier="t3"}`. Misses (`substrate_t3_outbound_no_route_total`), drops (`substrate_t3_outbound_dropped_total`), and bi-stream errors (`substrate_t3_outbound_errors_total`) each have their own counter.
**T3 actuator-command protocol.** The substrate's `automation_system` decides to actuate (e.g. Presence < 1.0 ⇒ Relay = stop) and pushes an `OutboundT3` onto the outbound channel. The tokio `drain_outbound_t3` pops it, looks up the target device's `quinn::Connection` in a `ConnectionRegistry` (populated by `read_datagrams` / `read_one_uni_stream` on first sight of each device UUID), then **spawns one task per command** to do `conn.open_bi() → write 39 B → finish → read 39 B ack`. Per-task spawning means a single stuck `read_exact` can't stall the pipeline. Latency from `open_bi()` to ack-receipt is recorded as `substrate_latency_us{tier="t3"}` and a successful ack increments `substrate_received_total{tier="t3"}`. Misses (`substrate_t3_outbound_no_route_total`), drops (`substrate_t3_outbound_dropped_total`), and bi-stream errors (`substrate_t3_outbound_errors_total`) each have their own counter.
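The reason for spawning one task per command can be shown with a std-only sketch. It uses OS threads and `std::sync::mpsc` in place of tokio tasks and QUIC bi-streams; `Command`, `ack_gate`, and `drain_outbound` are hypothetical names, and the gate only simulates a device that is slow to acknowledge.

```rust
use std::sync::mpsc::{Receiver, Sender};
use std::thread;

/// One outbound command. `ack_gate` simulates a device whose ack is delayed:
/// the worker waits on it before reporting completion.
pub struct Command {
    pub id: u32,
    pub ack_gate: Option<Receiver<()>>,
}

/// Drain loop: pops commands and hands each one to its own worker, so the
/// loop itself never waits on an ack.
pub fn drain_outbound(commands: Receiver<Command>, acks: Sender<u32>) {
    for cmd in commands {
        let acks = acks.clone();
        thread::spawn(move || {
            // Stand-in for: open_bi -> write 39 B -> finish -> read 39 B ack.
            if let Some(gate) = cmd.ack_gate {
                let _ = gate.recv(); // blocks only this worker
            }
            let _ = acks.send(cmd.id);
        });
    }
}
```

If the drain loop awaited each ack inline, the first slow device would hold up every command queued behind it; with one worker per command, later commands complete while the slow one is still pending.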
**Connection registry.** `Arc<std::sync::RwLock<HashMap<Uuid, quinn::Connection>>>`. `quinn::Connection` is internally `Arc`; one simulator process commonly hosts 7 device UUIDs sharing one connection. Registry insert is idempotent (`ensure_registered`). On `conn.closed().await` returning, `handle_incoming` purges every key whose `Connection::stable_id()` matches the closed connection.
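A minimal sketch of the registry's insert and purge behaviour follows. `FakeConnection` stands in for `quinn::Connection` (only `stable_id()` is modelled), `[u8; 16]` stands in for `Uuid`, and `purge_connection` is a hypothetical helper name for the cleanup that `handle_incoming` performs. Keeping the first registered connection when a device is seen again is an assumption of this sketch.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

pub type DeviceId = [u8; 16]; // stand-in for Uuid

/// Stand-in for quinn::Connection: cheap to clone, identified by a stable id.
#[derive(Clone, Debug)]
pub struct FakeConnection {
    stable_id: usize,
}

impl FakeConnection {
    pub fn new(stable_id: usize) -> Self {
        FakeConnection { stable_id }
    }
    pub fn stable_id(&self) -> usize {
        self.stable_id
    }
}

pub type ConnectionRegistry = Arc<RwLock<HashMap<DeviceId, FakeConnection>>>;

pub fn new_connection_registry() -> ConnectionRegistry {
    Arc::new(RwLock::new(HashMap::new()))
}

/// Idempotent insert: returns true only the first time a device is seen.
/// The read-lock fast path keeps the hot ingest path off the write lock.
pub fn ensure_registered(reg: &ConnectionRegistry, device: DeviceId, conn: &FakeConnection) -> bool {
    let known = reg.read().unwrap().contains_key(&device);
    if known {
        return false;
    }
    let mut map = reg.write().unwrap();
    if map.contains_key(&device) {
        return false; // another reader registered it between the two locks
    }
    map.insert(device, conn.clone());
    true
}

/// Remove every device routed through the closed connection; returns how many.
pub fn purge_connection(reg: &ConnectionRegistry, closed_id: usize) -> usize {
    let mut map = reg.write().unwrap();
    let before = map.len();
    map.retain(|_, conn| conn.stable_id() != closed_id);
    before - map.len()
}
```

Purging by connection id rather than by device id matters because several device UUIDs can share one connection, so a single close has to remove all of them.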
**Target hardware:** CM5 (BCM2712, Cortex-A76, 4 GB) as DT runtime; M4 Max as traffic generator; 1 Gbps direct Ethernet. Both rigs are in hand.
**Target hardware:** CM5 (BCM2712, Cortex-A76, 4 GB) as DT runtime; M4 Max as traffic generator; 1 Gbps direct Ethernet. Both rigs are in hand; benchmark sweeps live on the CM5.
## Repo map
@@ -34,125 +34,175 @@ quic_ecs_dt/
├── paper/ Quarto + LNCS source — single index.qmd, refs in references.bib
├── substrate/ Rust crate: Bevy 0.18 + Quinn 0.11 + rustls 0.23 + Tokio
│ └── src/
│ ├── main.rs App::new, MinimalPlugins, EcsQuicTransportPlugin
│ ├── config.rs figment chain: defaults → config.toml → APP_* env
│ └── transport/
│ ├── mod.rs QuicMessage struct
│ ├── ecs.rs Plugin: tokio thread + 3 mpsc + PreUpdate ingest
│ └── server.rs run_substrate_server (EMPTY STUB)
├── simulator/ Rust crate: stub today; will be Quinn client + Bevy sensor generators
├── data/ (created by M6) loopback/, two_machine/ — raw CSVs committed, *_processed ignored
├── Cargo.toml workspace
└── Makefile render, preview, build, build-cm5, deploy-cm5
│ ├── main.rs App::new, MinimalPlugins, EcsQuicTransportPlugin, ObservabilityPlugin
│ ├── lib.rs re-exports
│ ├── config.rs figment chain: defaults → config.toml → APP_* env (split on "__")
│ ├── observability.rs metrics-exporter-prometheus on :9100
│ ├── transport/
│ │ ├── mod.rs QuicMessage codec + tier sender newtypes + OutboundT3
│ │ ├── ecs.rs EcsQuicTransportPlugin: tokio thread + bridge + registry + drain spawn
│ │ ├── server.rs bind_endpoint + accept_loop + read_datagrams + read_uni_streams
│ │ │ + drain_outbound_t3 + synthetic_t3_driver + ConnectionRegistry
│ │ └── state.rs ServerState{Starting, Started}
│ └── world/
│ ├── mod.rs WorldPlugin (5 systems wired into Pre/Update/Post)
│ ├── components.rs Asset, DeviceId, SensorId, SensorTypeTag, RawSensorData, SmoothedValue, threshold_for
│ ├── resources.rs SensorRegistry, DiagnosticsState, ExportSampleState
│ ├── systems.rs ingest, simulation, automation, export, diagnostics
│ └── tests.rs 8 unit tests inc. automation_dispatches_relay_stop
├── simulator/ Rust crate: Quinn client + sensor generators + T3 receiver
│ ├── src/
│ │ ├── main.rs CLI driver + HTTP-trigger task + T1 inline loop
│ │ ├── lib.rs module exports
│ │ ├── client.rs SimulatorClient (connect, send_datagram, send_uni_stream, request, close)
│ │ ├── commands.rs run_command_receiver (substrate → device T3 accept-bi loop)
│ │ ├── emitters.rs run_t2_emitter (T1 lives inline in main.rs)
│ │ └── profile.rs SensorProfile (single | industrial), generate_value
│ └── tests/ T1, T2, end-to-end full-loop integration tests
├── data/
│ ├── two_machine/ CM5 ↔ M4 Max sweep — final_table.csv (load-bearing for the paper)
│ └── local/ loopback sweeps (scaling.csv, cross_tier.csv)
├── scripts/
│ ├── bench-loss.sh M6 sweep entities×loss → data/two_machine/final_table.csv
│ ├── bench-scaling.sh T1 rate sweep + optional synthetic-T3 cross-tier mode
│ ├── bench-client.sh M8 client driver (run from Mac when substrate is on CM5)
│ ├── demo.sh full-stack demo: certs + build + VM/Grafana + sub + sim
│ ├── setup-cm5.sh CM5 provisioning (apt + cargo install)
│ └── verify-netem.sh confirm tc-netem is shaping in the right direction (BIDI=1 for ifb mode)
├── monitoring/ docker-compose: VictoriaMetrics + Grafana auto-provisioned
├── dashboards/ runtime.json + sensors.json
├── certs/ gitignored, regenerated by `make certs`
├── Cargo.toml workspace
└── Makefile render, preview, build, build-cm5, deploy-cm5, monitoring-up
```
## Status
**Code (substrate + simulator):**
| Area | State |
|------|-------|
| `AppConfig` figment loader (defaults → TOML → env, `__` split) | Done — [substrate/src/config.rs](substrate/src/config.rs) |
| Inbound bridge scaffolding (Tokio thread + Bevy plugin) | Done — [substrate/src/transport/ecs.rs](substrate/src/transport/ecs.rs) |
| `QuicMessage` struct + 39 B LE codec | Done — [substrate/src/transport/mod.rs](substrate/src/transport/mod.rs); 5 unit tests passing |
| Quinn server lifecycle | Listener up — `ServerState{Starting,Started}` in [substrate/src/transport/state.rs](substrate/src/transport/state.rs); `OnEnter(Starting)` → bind + accept loop in [substrate/src/transport/ecs.rs](substrate/src/transport/ecs.rs). Explicit `TransportConfig` w/ tuned datagram recv buffer (256 KiB) in [substrate/src/transport/server.rs](substrate/src/transport/server.rs). Per-tier sender newtypes (`T1Sender::send_lossy`, `T2Sender::send`, `T3OutboundSender::try_send`) in [substrate/src/transport/mod.rs](substrate/src/transport/mod.rs) |
| T1 demux (datagrams → ECS) | Done. `handle_incoming` orchestrator + `read_datagrams` reader in [substrate/src/transport/server.rs](substrate/src/transport/server.rs); decode errors logged but non-fatal; channel-full drops silent at trace; received/dropped/decode_errors counters in the end-of-stream debug line. Calls `ensure_registered` on first decode so outbound T3 can route to this device |
| T2 demux (uni streams → ECS) | Done. `read_uni_streams` accepts streams in [substrate/src/transport/server.rs](substrate/src/transport/server.rs), spawns one task per stream that reads 39 B chunks until EOF; decode failure resets the stream via `recv.stop(0)` (one bad stream doesn't kill the connection); `t2.send().await` honours backpressure; first decode also calls `ensure_registered` |
| T3 outbound (ECS → device, substrate-initiated) | Done — `drain_outbound_t3` task in [substrate/src/transport/server.rs](substrate/src/transport/server.rs) pops `OutboundT3` items, looks up the target device's `Connection` in `ConnectionRegistry`, **spawns one task per command** to do `open_bi → write 39 B → finish → read ack`. Per-task spawning ensures one stuck ack can't stall the pipeline. Records `substrate_latency_us{tier="t3"}` on success; counts no-route, dropped, and error cases separately. The old simulator-initiated T3 inbound path (`T3Sender` / `T3Inbound` / `accept_bi_streams`) is **gone** as of this refactor |
| Connection registry (Uuid → Connection) | Done. `Arc<RwLock<HashMap<Uuid, quinn::Connection>>>` populated by readers; purged in `handle_incoming` after `conn.closed().await` using `Connection::stable_id()`. Constructor `new_connection_registry`; idempotent insert via `ensure_registered` |
| TLS / self-signed cert | Done (M1) — `certs/server.{crt,key}` via `make certs`, gitignored. PEM loader in [substrate/src/transport/server.rs:15](substrate/src/transport/server.rs#L15); rustls `aws-lc-rs` default provider installed in [substrate/src/main.rs](substrate/src/main.rs) |
| Wire codec for `QuicMessage` (39 B fixed LE, incl. `sensor_type: u8`) | Done — [substrate/src/transport/mod.rs](substrate/src/transport/mod.rs); 5 unit tests passing. `SensorType` enum: `Generic / Temperature / Humidity / Pressure / Voltage / Current` |
| `tracing-subscriber` init w/ `RUST_LOG` | Done (M1) — [substrate/src/main.rs:8-12](substrate/src/main.rs#L8-L12) |
| ECS components (`RawSensorData`, `SmoothedValue`) + 4 systems (Ingest/Sim/Export/Diagnostics) | Done — entities = `(DeviceId, SensorId, SensorTypeTag, RawSensorData, SmoothedValue, Asset)` per (device, sensor); `SensorRegistry` upserts via `HashMap<(Uuid, u16), Entity>` in [substrate/src/world.rs](substrate/src/world.rs). `IngestSystem` drains all three tiers; T3 ack preserves command's `sensor_type` and returns the device's most recent `raw_value`. `SimulationSystem` maintains a 16-sample rolling mean per entity and emits `substrate_threshold_crossings_total{type, direction}` when the smoothed mean crosses a per-type threshold (`Changed<RawSensorData>` query so cost scales with ingress, not fleet size). `ExportSystem` samples `substrate_{entities,channel_depth,channel_capacity,rss_bytes}` + `sensor_aggregate{type, stat}` once per second. `Diagnostics` logs `tick_hz` once per second |
| Schedule rate-gating | Done (M4) — `MinimalPlugins.set(ScheduleRunnerPlugin::run_loop(1/tick_rate_hz))` in [substrate/src/main.rs](substrate/src/main.rs); replaces the default busy-loop with the configured period |
| Prometheus exporter + Grafana dashboards | Done (M5) — `ObservabilityPlugin` in [substrate/src/observability.rs](substrate/src/observability.rs) installs `metrics-exporter-prometheus` on the existing tokio runtime. **Runtime surface** (paper §Evaluation): counters `substrate_received_total{tier}`, `dropped_total{tier=t1}`, `decode_errors_total{tier}`, `t3_no_handler_total`; latency histograms `substrate_latency_us{tier}`; gauges `substrate_tick_hz`, `substrate_entities`, `substrate_channel_depth{tier}`, `substrate_channel_capacity{tier}`, `substrate_rss_bytes`. **Sensor data surface** (operator dashboard): per-type aggregates `sensor_aggregate{type, stat=count|mean|min|max}` computed once per second over the live world, cardinality bounded by `\|SensorType\| × 4` so it scales to thousands of sensors. Two dashboards: [dashboards/runtime.json](dashboards/runtime.json) and [dashboards/sensors.json](dashboards/sensors.json) (thermometer/gauge/stat panels per type) |
| Simulator (Quinn client + sensor generators) | `SimulatorClient` lib in [simulator/src/client.rs](simulator/src/client.rs) — connects, trusts the substrate's PEM cert via custom `ServerCertVerifier` (sidesteps `CaUsedAsEndEntity`); `send_datagram(QuicMessage)` for T1, `send_uni_stream(&[QuicMessage])` for T2. `SimulatorClient::request` exists for ad-hoc tests but the binary no longer initiates T3. CLI driver in [simulator/src/main.rs](simulator/src/main.rs) with clap flags (`--addr`, `--rate-hz`, `--t2-rate-hz`, `--count`, `--devices`, `--sensor-id`, `--sensor-type`, `--profile`, `--cert`, `--server-name`). `--profile industrial` fans out to **7 sensors per device** (Temperature/Humidity/Pressure/Voltage/Current/Presence/Relay). T1/T2 emitters check `engine_running` per-tick — Voltage stays at ~230 V regardless; Current drops to ~0 when stopped. HTTP trigger on `:9002` (`POST /trigger`) pushes a Presence=0 reading via T2 for Grafana-driven demos |
| Simulator command receiver (substrate → device T3) | Done — `run_command_receiver` in [simulator/src/commands.rs](simulator/src/commands.rs) loops on `conn.accept_bi()`, decodes 39 B, sets `engine_running` from `raw_value` when `sensor_type == Relay`, writes 39 B ack. Spawned by `main.rs` post-connect. `new_engine_state()` constructor exported for integration tests |
| End-to-end test harness | 18 tests across [simulator/tests/end_to_end_t1.rs](simulator/tests/end_to_end_t1.rs), [simulator/tests/end_to_end_t2.rs](simulator/tests/end_to_end_t2.rs), [simulator/tests/end_to_end_full_loop.rs](simulator/tests/end_to_end_full_loop.rs): T1 single-datagram + 32-msg burst order; T2 single-stream + 4-stream concurrent ordering; **full closed loop** (Presence < 1.0 → substrate T3 → simulator `engine_running` flips, then Presence > 1.0 → flips back). Plus codec + world unit tests including `automation_dispatches_relay_stop_when_presence_drops` |
| `config.toml` at repo root | Done — [config.toml](config.toml); loaded by [substrate/src/main.rs](substrate/src/main.rs); env override via `APP_*` with `__` split (`Env::prefixed("APP_").split("__")`) actually works now |
| Benchmark harness (sweep + CSV writer) | Done — [scripts/bench-loss.sh](scripts/bench-loss.sh) for entity×loss → `data/two_machine/final_table.csv`; [scripts/bench-scaling.sh](scripts/bench-scaling.sh) for T1 rate sweep with optional substrate-side synthetic T3 (`T3_RATE_HZ=100 ./scripts/bench-scaling.sh` enables `APP_NETWORK__SYNTHETIC_T3_RATE_HZ`) → `data/local/cross_tier.csv`. The synthetic driver lives in `accept_loop` and pushes through the same outbound channel `automation_system` uses |
| CM5 cross-compile / deploy | Wired in [Makefile:30](Makefile#L30); first trial run completed (commit `272d3b3`); [scripts/setup-cm5.sh](scripts/setup-cm5.sh) provisions the Pi |
| `AppConfig` figment loader (defaults → TOML → env with `__` split) | Done — [substrate/src/config.rs](substrate/src/config.rs). Env override actually works (`Env::prefixed("APP_").split("__")`); discovered late that the previous chain silently ignored env vars |
| 39 B wire codec | Done — [substrate/src/transport/mod.rs](substrate/src/transport/mod.rs), 5 unit tests |
| Quinn server lifecycle + TLS | Done — `bind_endpoint` + `accept_loop` in [substrate/src/transport/server.rs](substrate/src/transport/server.rs); `ServerState{Starting, Started}` in [state.rs](substrate/src/transport/state.rs); explicit `TransportConfig` w/ 256 KiB datagram recv buffer; dev cert via `make certs`, rustls `aws-lc-rs` provider installed in [main.rs](substrate/src/main.rs) |
| T1 demux (datagrams → ECS) | Done. `read_datagrams` reader; decode errors non-fatal; channel-full drops silent; per-stream counters in debug summary. Calls `ensure_registered` on first decode so outbound T3 can route to this device |
| T2 demux (uni streams → ECS) | Done. `read_uni_streams` accepts streams, spawns one task per stream that reads 39 B chunks until EOF; decode failure resets the stream via `recv.stop(0)`; `t2.send().await` honours backpressure; first decode also calls `ensure_registered` |
| T3 outbound (ECS → device) | Done. `drain_outbound_t3` task pops `OutboundT3` items, looks up the target device's `Connection` in `ConnectionRegistry`, **spawns one task per command** to do `open_bi → write 39 B → finish → read ack`. Per-task spawning prevents a single stuck `read_exact` from stalling the pipeline. Records `substrate_latency_us{tier="t3"}` on success; counts no-route / dropped / errors separately. The old simulator-initiated T3 inbound path (`T3Sender` / `T3Inbound` / `accept_bi_streams`) is **gone** |
| Connection registry (Uuid → Connection) | Done — `Arc<RwLock<HashMap<Uuid, quinn::Connection>>>`; idempotent insert via `ensure_registered`; purged in `handle_incoming` after `conn.closed().await` using `Connection::stable_id()` |
| Synthetic T3 driver (bench only) | Done. `synthetic_t3_driver` task in [server.rs](substrate/src/transport/server.rs) spawned by `accept_loop` when `APP_NETWORK__SYNTHETIC_T3_RATE_HZ > 0`. Round-robins over registered devices, toggles `raw_value` between 0/1, pushes through the same outbound channel `automation_system` uses |
| ECS components + 5 systems | Done — [world/](substrate/src/world/). Entities = `(Asset, DeviceId, SensorId, SensorTypeTag, RawSensorData, SmoothedValue)` per (device, sensor). 5 systems: `ingest` (PreUpdate, drains T1+T2), `simulation` (Update, rolling mean + threshold-crossings counter), `automation` (Update, Presence-cross → `t3_out.try_send(OutboundT3{Relay setpoint})` + local mirror), `export` (PostUpdate, per-second metric sample), `diagnostics` (PostUpdate, per-second `tick_hz` log) |
| Schedule rate-gating | Done — `MinimalPlugins.set(ScheduleRunnerPlugin::run_loop(1/tick_rate_hz))` in [main.rs](substrate/src/main.rs) |
| Prometheus exporter + Grafana | Done. `metrics-exporter-prometheus` on :9100 via `ObservabilityPlugin`. Runtime metrics: `substrate_received_total{tier}`, `substrate_dropped_total{tier=t1}`, `substrate_decode_errors_total{tier}`, `substrate_t3_outbound_*_total`, `substrate_latency_us{tier}` histograms, `substrate_tick_hz`, `substrate_entities`, `substrate_channel_depth{tier}`, `substrate_rss_bytes`. Sensor data: `sensor_aggregate{type, stat=count\|mean\|min\|max}`. Dashboards: [dashboards/runtime.json](dashboards/runtime.json) + [dashboards/sensors.json](dashboards/sensors.json) |
| Simulator binary | Done — [simulator/src/main.rs](simulator/src/main.rs). Clap flags: `--addr`, `--server-name`, `--cert`, `--profile {single, industrial}`, `--sensor-type`, `--sensor-id`, `--rate-hz`, `--t2-rate-hz`, `--count`, `--devices`. `industrial` profile fans out to **7 sensors per device** on ids 0..6 (Temperature/Humidity/Pressure/Voltage/Current/Presence/Relay). HTTP trigger on `:9002` (`POST /trigger`) pushes Presence=0 over T2 — operator-facing demo entry point. T1/T2 emitters check `engine_running` per tick; when `false`, Current waveform drops to ~0 while Voltage stays at ~230 V |
| Simulator command receiver | Done — [simulator/src/commands.rs](simulator/src/commands.rs). `run_command_receiver` loops on `conn.accept_bi()`, decodes 39 B, flips `engine_running` on `sensor_type == Relay` setpoints, writes 39 B ack. Spawned by `main.rs` post-connect. `new_engine_state()` constructor exported for integration tests |
| End-to-end test harness | **18 tests, all green.** 5 codec unit tests; 8 world unit tests (incl. `automation_dispatches_relay_stop_when_presence_drops`); 2 T1 + 2 T2 integration tests; 1 **full closed-loop** test (`simulator/tests/end_to_end_full_loop.rs`: Presence < 1.0 → substrate T3 → `engine_running` flips to false; then Presence > 1.0 → flips back) |
| Benchmark scripts | Done. [bench-loss.sh](scripts/bench-loss.sh) — entity × loss sweep, **bidirectional `tc-netem` via `ifb` on the CM5** (BIDI=1 default). [bench-scaling.sh](scripts/bench-scaling.sh) — T1 rate sweep + optional substrate-side `APP_NETWORK__SYNTHETIC_T3_RATE_HZ`. [verify-netem.sh](scripts/verify-netem.sh) — sanity-check netem on the right interface in the right direction (BIDI=1 mode covers ingress via ifb) |
| CM5 deploy | Done — `make build-cm5 && make deploy-cm5`; [setup-cm5.sh](scripts/setup-cm5.sh) provisions deps. Bench has been run end-to-end on CM5; data lives in [data/two_machine/final_table.csv](data/two_machine/final_table.csv) |
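
The `simulation` system's rolling mean and threshold-crossing logic can be sketched without Bevy. The 16-sample window comes from the status notes; ignoring non-finite samples, the `Crossing` enum, and the exact comparison used to define a crossing are assumptions of this sketch rather than the behaviour of `world/systems.rs`.

```rust
pub const WINDOW: usize = 16;

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Crossing {
    Rising,
    Falling,
}

/// Ring buffer holding the most recent WINDOW finite samples.
pub struct SmoothedValue {
    samples: [f64; WINDOW],
    len: usize,  // number of valid samples, saturates at WINDOW
    next: usize, // slot that the next sample overwrites
}

impl SmoothedValue {
    pub fn new() -> Self {
        SmoothedValue { samples: [0.0; WINDOW], len: 0, next: 0 }
    }

    /// Record a sample. Assumption: NaN and infinities are skipped so one bad
    /// reading cannot poison the mean.
    pub fn push(&mut self, value: f64) {
        if !value.is_finite() {
            return;
        }
        self.samples[self.next] = value;
        self.next = (self.next + 1) % WINDOW;
        if self.len < WINDOW {
            self.len += 1;
        }
    }

    /// Mean over the valid samples, or None before the first sample.
    pub fn mean(&self) -> Option<f64> {
        if self.len == 0 {
            return None;
        }
        Some(self.samples[..self.len].iter().sum::<f64>() / self.len as f64)
    }
}

/// Compare the previous and current smoothed means against a threshold.
/// A crossing is reported only on the tick where the side changes.
pub fn crossing(prev: Option<f64>, curr: f64, threshold: f64) -> Option<Crossing> {
    let prev = prev?;
    if prev < threshold && curr >= threshold {
        Some(Crossing::Rising)
    } else if prev >= threshold && curr < threshold {
        Some(Crossing::Falling)
    } else {
        None
    }
}
```

Detecting the crossing on the smoothed mean rather than on raw samples means a single noisy reading near the threshold does not produce a burst of crossing events.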
`cargo run -p substrate` boots, prints the loaded config, and idles on the (still-empty) Quinn server. `MinimalPlugins` busy-loops the ECS schedule by default — expected, will gate to `tick_rate_hz` in M4.
**Paper:**
| Area | State |
|------|-------|
| Track + topics chosen | Done — UCAmI Track 2 (IoE and Sensors); primary *IoE interoperability, integration and performance*; secondary *IoE experimental results and deployment scenarios* |
| Abstract | Done. Honest framing: "tick rate remains an order of magnitude above the cadence required" (not "stable"), mixed-reliability isolation as the T1-vs-T3 story, 0.12 MB/1k slope |
| Tables 2/3/4 from real CM5 data | Done. Native markdown tables driven by inline `{python}` values reading from `data/two_machine/final_table.csv`; cross-refs (`@tbl-latency`, `@tbl-throughput`, `@tbl-t3-rtt`) resolve in the LNCS LaTeX output. Earlier `display(Markdown(...))` approach didn't register with Quarto's cross-ref filter; switched to native md tables with inline-python cells |
| `fig-isolation` | **Dropped.** Cross-tier story now told by `tbl-latency` + `tbl-t3-rtt` (T1 flat under loss, T3 absorbs ~38 ms retransmit). Cleaner than the loopback fig. `data/local/cross_tier.csv` is still on disk but the paper no longer reads it |
| Architecture §3 + Table 1 | Updated for substrate-initiated T3. Table 1 T3 row reads "OutboundT3 enqueue + ack \| Bidirectional stream (server-initiated)"; the connection-registry / per-device routing is described in the prose |
| Implementation §4 Automation paragraph | Updated for the new outbound T3 path; describes the per-device registry, the per-command bi-stream, and the simulator-side `run_command_receiver` engine-state flip |
| Discussion + Conclusion | Honest now: drops the unbacked "<5% IngestSystem drain" and "Grafana adds no overhead" claims; conclusion populates both 0%-loss and 5%-loss Hz from data |
| Render | Clean against LNCS LaTeX template (`make render` → 10-page PDF, no Quarto warnings) |
## Roadmap
Each milestone has one verification gate. Update Status here as we go.
Treat the milestone log as historical. The paper-side work below tracks what's *left* before camera-ready.
- **M1 — Wire codec & root config.** ✅ Done 2026-05-04. Hand-rolled little-endian codec on `QuicMessage` (38 B fixed: 16 UUID + 2 stream_id + 8 f64 + 8 ts_us + 4 seq) with roundtrip + layout + length-error tests; `config.toml` at repo root; dev TLS via `make certs`; structured `tracing-subscriber` init reads `RUST_LOG` (default `info`).
- **M2 — Quinn server + self-signed TLS.** ✅ Done 2026-05-06. Listener up under `ServerState::Starting/Started`; type-system tier semantics + T3 oneshot ack protocol; per-connection `handle_incoming` orchestrator joining T1 datagram, T2 uni-stream, and T3 bi-stream readers. T1 has dropped/decoded counters; T2 resets a stream on decode failure without killing the connection; T3 ships `T3Inbound { command, reply }` to the ECS and resets the stream when no handler answers. End-to-end coverage: 6 integration tests in [simulator/tests/](simulator/tests/) plus 4 codec unit tests, all green.
- **M3 — Simulator client.** Replace [simulator/src/main.rs](simulator/src/main.rs) with a Bevy app: Quinn client, N synthetic devices, configurable per-tier rates. *Verify:* end-to-end loopback drains messages on all three tiers. **Status (2026-05-05):** simulator made into a lib + bin; `SimulatorClient::{connect,send_datagram,close}` plus a manual smoke runner in `simulator/src/main.rs`. Two integration tests in `simulator/tests/end_to_end_t1.rs` exercise the full T1 path against an in-process substrate. Bevy-driven generator + T2/T3 helpers + load profiles still pending.
- **M4 — ECS world.** ✅ Done. `Asset` + `DeviceId` + `SensorId` + `SensorTypeTag` + `RawSensorData` + `SmoothedValue` components in [substrate/src/world.rs](substrate/src/world.rs); `SensorRegistry` resource for O(1) `(Uuid, u16) → Entity`. `IngestSystem` drains all three tiers (T1 batched, T2/T3 fully); T3 handler returns the latest sensor value as ack. `SimulationSystem` runs a per-entity 16-sample rolling mean and emits `substrate_threshold_crossings_total{type, direction}` on per-type threshold crossings — gives the ECS observable digital-twin work, not just write-through ingest. `ExportSystem` samples `substrate_{entities,channel_depth,channel_capacity,rss_bytes}` + `sensor_aggregate{type, stat}` once per second. `DiagnosticsSystem` logs tick rate once per second. Schedule rate-gated via `ScheduleRunnerPlugin::run_loop(1/tick_rate_hz)`. 8 unit tests passing (entity create, in-place update, T3 ack, SmoothedValue push/window/non-finite/full-roll, threshold-crossing transition).
- **M5 — Observability (VictoriaMetrics + Grafana).** ✅ Done. Wire format extended to carry `sensor_type: u8` (38 → 39 B, decoded into `SensorType` enum). Two metric surfaces over `metrics-exporter-prometheus`:
- **Runtime** (paper §Evaluation): `substrate_received_total{tier}`, `dropped_total{tier=t1}`, `decode_errors_total{tier}`, `t3_no_handler_total`, `latency_us{tier}` histograms, `tick_hz` / `entities` / `channel_depth{tier}` / `rss_bytes` gauges.
- **Sensor data** (operator surface): `sensor_aggregate{type, stat=count|mean|min|max}` aggregated per second across the live ECS world. Cardinality bounded to `\|SensorType\| × 4` series independent of physical sensor count.
- Dashboards: [dashboards/runtime.json](dashboards/runtime.json) + [dashboards/sensors.json](dashboards/sensors.json).
- Verified: `--profile industrial --devices 2 --count 200` yields 10 entities and all 5 type aggregates with realistic values (T=20.5°C, RH=51%, P=1018 hPa, V=230.2 V, I=12 A).
- **M6 — Benchmark harness.** Sweep `entity_count ∈ {10k, 50k, 100k, 200k}` × `loss_rate ∈ {0%, 1%, 5%}` with 2k warmup + 5k measurement ticks. Loss via `tc netem`. Writes `data/loopback/final_table.csv`. *Verify:* one full sweep on M4 Max produces a CSV the Quarto figures consume.
- **M7 — CM5 cross-compile & deploy.** Exercise [Makefile:30](Makefile#L30) (`build-cm5`, `deploy-cm5`); set real `CM5_HOST`. *Verify:* binary runs on CM5 with a feed from M4 Max over 1 Gbps Ethernet.
- **M8 — Two-machine run + paper render.** Sweep with simulator on M4 Max → substrate on CM5; populate `data/two_machine/final_table.csv`; `make render` produces a PDF. **Update §Evaluation prose to reflect actual numbers.** Current paper figures (241 Hz, 64 µs / 15.8 ms P99, 2.6 µs jitter, 1.02 MB/1k, R²=0.9999) are **aspirational placeholders** — they may move and the conclusions may shift; that's expected.
- **M1 — Wire codec & root config.** ✅ 2026-05-04.
- **M2 — Quinn server + TLS.** ✅ 2026-05-06.
- **M3 — Simulator client.** ✅ Done. `SimulatorClient` + CLI driver + waveform profiles + HTTP trigger + closed-loop command receiver.
- **M4 — ECS world.** ✅ Done. 5 systems wired; automation closes the T3 loop.
- **M5 — Observability.** ✅ Done. Both dashboards live; metrics exposed via prometheus scrape.
- **M6 — Benchmark harness.** ✅ Done. `bench-loss.sh` + `bench-scaling.sh` + `verify-netem.sh` (last one added when egress-only netem was masking the inbound T1 loss path; now `ifb` ingress shaping is default).
- **M7 — CM5 cross-compile & deploy.** ✅ Done. Multiple sweeps shipped from CM5.
- **M8 — Two-machine run + paper render.** ✅ Done. Paper renders against [data/two_machine/final_table.csv](data/two_machine/final_table.csv); all inline scalars and tables populate from real numbers.
- **M9 — T3 inversion (substrate-initiated actuator commands).** ✅ 2026-05-13. The paper's Table 1 said T3 was "actuator commands" but the code had it inverted (device → substrate RPC). Refactored to match the paper: substrate opens bi-streams, simulator's `run_command_receiver` accepts. Full closed-loop integration test landed.
- **M10 — Abstract submission polish.** ⏳ In progress. Top-of-paper fixes shipped (abstract framing, contributions paragraph, Table 1 T3 row, Architecture §3 backpressure paragraph, author affiliation, `(author?)` cite markers). Remaining polish is full-paper-only (Implementation §4 module-list lies, code listing with fake types, Observability §4.2 push-vs-pull mismatch, Experimental Setup §5.1 stale tc-netem / tick counts / loopback-vs-two-machine sentence). None block abstract submission.
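
The bounded-cardinality claim for the M5 sensor data surface (`sensor_aggregate{type, stat}`) can be illustrated with a small sketch. It is not the `export` system: sensor types are plain `u8` tags here, non-finite readings are skipped by assumption, and `aggregate_by_type` and `series_count` are hypothetical names.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Aggregate {
    pub count: u64,
    pub mean: f64,
    pub min: f64,
    pub max: f64,
}

/// Fold (sensor_type, value) readings into one aggregate per sensor type.
pub fn aggregate_by_type(readings: &[(u8, f64)]) -> BTreeMap<u8, Aggregate> {
    // Accumulator per type: (count, sum, min, max).
    let mut acc: BTreeMap<u8, (u64, f64, f64, f64)> = BTreeMap::new();
    for &(sensor_type, value) in readings {
        if !value.is_finite() {
            continue; // assumption: skip NaN / infinite readings
        }
        let e = acc
            .entry(sensor_type)
            .or_insert((0, 0.0, f64::INFINITY, f64::NEG_INFINITY));
        e.0 += 1;
        e.1 += value;
        e.2 = e.2.min(value);
        e.3 = e.3.max(value);
    }
    acc.into_iter()
        .map(|(ty, (count, sum, min, max))| {
            (ty, Aggregate { count, mean: sum / count as f64, min, max })
        })
        .collect()
}

/// Exported series: four stats (count, mean, min, max) per sensor type seen.
pub fn series_count(aggregates: &BTreeMap<u8, Aggregate>) -> usize {
    aggregates.len() * 4
}
```

Adding more sensors of an existing type changes the values but not the number of series, which is why the metric surface stays small as the fleet grows.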
**Open polish items** (not blocking abstract submission):
- §4.1 *Integrated Prototype* still lists six systems including a non-existent `FaultInjection`; module list says `transport.rs` / `world.rs` / `metrics.rs` / `main.rs` but the actual layout is `transport/`, `world/`, `observability.rs`, `config.rs`, `main.rs`, `lib.rs` plus a separate `simulator` crate.
- §4.1 code listing uses fictional types (`AssetId`, `EntityMap`, `TickDiagnostics`). Easier to drop the listing than to rewrite faithfully.
- §4.2 *Observability Stack* describes a push model with InfluxDB line protocol; actual code uses `metrics-exporter-prometheus` exposing `/metrics` for VM scrape.
- §5.1 *Experimental Setup* needs three updates: tc-netem direction (now bidirectional via `ifb`), "2,000 warmup ticks and 5,000 measurement ticks" → "20 s warmup + 50 s window (wall-clock)", and drop the "loopback for latency / two-machine for throughput" sentence (all numbers are from the two-machine sweep now).
## Conventions
- **Rust:** edition 2024; workspace at root with `simulator` + `substrate`; `opt-level=1` dev, `opt-level=3` for deps.
- **Rust:** edition 2024; workspace at root with `simulator` + `substrate`.
- **Pinned crates:** Bevy 0.18, Quinn 0.11, rustls 0.23, Tokio 1 (full), figment 0.10 (toml + env), uuid 1.23 (v4), serde 1.
- **Config:** `figment` chain — defaults in [substrate/src/config.rs:25](substrate/src/config.rs#L25) → `config.toml` → env `APP_*` (double-underscore for nesting, e.g. `APP_NETWORK__SERVER_PORT=9000`).
- **Config:** `figment` chain — defaults → `config.toml` → env `APP_*` with `__` nesting (e.g. `APP_NETWORK__SERVER_PORT=9000`, `APP_NETWORK__SYNTHETIC_T3_RATE_HZ=100`).
- **Bevy:** headless — `MinimalPlugins` only; do not pull rendering plugins.
- **Tokio↔Bevy:** keep the dedicated-thread + mpsc pattern in [substrate/src/transport/ecs.rs:49](substrate/src/transport/ecs.rs#L49); do not block the ECS schedule on async work.
- **Paper:** Quarto + LNCS template ([paper/_extensions/template.tex](paper/_extensions/template.tex), [paper/_quarto.yml](paper/_quarto.yml)). **Never commit `llncs.cls` or `splncs04.bst`** — CTAN licensing; download per [README.md:25-34](README.md#L25-L34).
- **Data:** raw CSVs under `data/` are committed; `*_processed.csv` is gitignored. Paper figures consume `data/loopback/final_table.csv` and `data/two_machine/final_table.csv`.
- **Build artifacts:** `target/`, `paper/_output/`, `paper/figures/`, `paper/.quarto/`, `paper/index.tex` all gitignored.
- **Errors:** `anyhow` (with `.context()`) for internal startup paths where the error type is uninteresting; `thiserror` for boundary types we want to match against (e.g. `WireError` in the codec).
- **Warnings:** let real warnings show. No `#[allow(dead_code)]`, `_var` blanket suppression, or `PhantomData` shims to silence the compiler — warnings are honest TODO markers and disappear when the consuming code lands. See [feedback memory](../../.claude/projects/-Users-vplantevin-Projects-Research-quic-ecs-dt/memory/feedback_no_warning_hacks.md).
- **Tokio↔Bevy:** keep the dedicated-thread + mpsc pattern in [transport/ecs.rs](substrate/src/transport/ecs.rs); do not block the ECS schedule on async work.
- **Paper:** Quarto + LNCS template ([paper/_extensions/template.tex](paper/_extensions/template.tex), [paper/_quarto.yml](paper/_quarto.yml)). **Never commit `llncs.cls` or `splncs04.bst`** — CTAN licensing; download per [README.md](README.md). For tables in LaTeX target, use native markdown tables with `: Caption {#tbl-foo}` syntax and inline `{python}` cells, **not** `display(Markdown(...))` chunks — Quarto's cross-ref filter doesn't pick the latter up in LaTeX output.
- **Data:** raw CSVs under `data/` are committed; `*_processed.csv` is gitignored. Paper figures consume `data/two_machine/final_table.csv` exclusively (the previous `data/loopback/` was renamed to `data/two_machine/` once it became the real CM5 sweep).
- **Errors:** `anyhow` (with `.context()`) for internal startup paths; `thiserror` for boundary types we want to match against (e.g. `WireError` in the codec).
- **Warnings:** let real warnings show. No `#[allow(dead_code)]`, `_var` blanket suppression, or `PhantomData` shims to silence the compiler — warnings are honest TODO markers and disappear when the consuming code lands.
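
The Tokio↔Bevy convention above (dedicated I/O thread, channel hand-off, never block the schedule) can be sketched with the standard library alone. This is a minimal illustration, not the code in `transport/ecs.rs`: `Sample`, `drain_tick`, the channel capacity, and the drop-on-full policy are all hypothetical, and `std::sync::mpsc` stands in for the async channel the substrate uses.

```rust
use std::sync::mpsc::{sync_channel, Receiver, TryRecvError, TrySendError};
use std::thread;

/// Hypothetical message type standing in for a decoded sensor sample.
#[derive(Debug, PartialEq)]
struct Sample {
    sensor_id: u32,
    value: f32,
}

/// Drain what is queued right now without ever blocking the caller.
/// Returns the drained samples and whether the producer side has hung up.
fn drain_tick(rx: &Receiver<Sample>, max_per_tick: usize) -> (Vec<Sample>, bool) {
    let mut out = Vec::new();
    while out.len() < max_per_tick {
        match rx.try_recv() {
            Ok(sample) => out.push(sample),
            Err(TryRecvError::Empty) => return (out, false),
            Err(TryRecvError::Disconnected) => return (out, true),
        }
    }
    (out, false)
}

fn main() {
    // Bounded channel; the capacity of 8 is illustrative, not the repo's value.
    let (tx, rx) = sync_channel::<Sample>(8);

    // Dedicated I/O thread. In the substrate this thread would host the async runtime.
    let io = thread::spawn(move || {
        let mut dropped = 0u32;
        for i in 0..32u32 {
            match tx.try_send(Sample { sensor_id: i, value: i as f32 }) {
                Ok(()) => {}
                // Assumed policy for the sketch: drop when the queue is full.
                Err(TrySendError::Full(_)) => dropped += 1,
                Err(TrySendError::Disconnected(_)) => break,
            }
        }
        dropped
    });

    // Join first so the demo output is deterministic, then run one "tick".
    let dropped = io.join().unwrap();
    let (samples, closed) = drain_tick(&rx, 64);
    println!("drained {} samples, dropped {dropped}, closed = {closed}", samples.len());
    if let Some(first) = samples.first() {
        println!("first sample: sensor {} = {}", first.sensor_id, first.value);
    }
}
```

The point of the shape is that the tick side only ever calls `try_recv`, so a slow or stalled I/O thread costs the schedule nothing beyond an empty drain.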
## Known deferrals
- **Channel ownership is per-host, not per-connection.** All connections share the same inbound mpsc channels and the same outbound T3 channel. Fairness under N-device load relies on tokio scheduling. Acceptable for the "one ECS world per host" model the paper describes; revisit if many-device benchmarks show starvation.
- **No graceful shutdown.** The `quic-runtime` thread is parked on `pending()`; spawned tasks (accept loop, per-conn demux, outbound drain, per-command T3 spawns) are orphaned at process exit. Fine for research runs.
- **Bind failure is fatal.** `OnEnter(Starting)` panics if `bind_endpoint` fails. A `ServerState::Failed` variant joins when we wire proper error surfacing.
- **T3 outbound concurrency is unbounded.** `drain_outbound_t3` spawns one task per command (so a stuck `read_exact` can't stall the pipeline). Under sustained T1 ingest beyond ~10k msg/s the per-command tasks queue behind the tokio scheduler and T3 P99 latency climbs into the hundreds of ms while throughput holds. If we need true latency isolation under load, add a `tokio::Semaphore` cap or a dedicated runtime/thread for T3.
- **Schedule rate-gating is approximate.** `ScheduleRunnerPlugin::run_loop(period)` honours `period` as a minimum; observed `tick_hz` runs ~85% of target on macOS dev (target 60 → ~50). Should be tighter on the CM5; revisit if M6 sweeps depend on a steady tick.
- **Channel ownership is per-host, not per-connection.** All connections share the same inbound mpsc channels and the outbound T3 channel. Fairness under N-device load relies on tokio scheduling. Acceptable for "one ECS world per host".
- **No graceful shutdown.** The `quic-runtime` thread parks on `pending()`; spawned tasks orphan at process exit. Fine for research runs.
- **Bind failure is fatal.** `OnEnter(Starting)` panics if `bind_endpoint` fails.
- **T3 outbound concurrency is unbounded.** `drain_outbound_t3` spawns one task per command. Under sustained T1 ingest beyond ~10k msg/s the per-command tasks queue behind the tokio scheduler and T3 P99 climbs into the hundreds of ms (throughput still holds). If we ever need strict T3 latency isolation under heavy T1 load, add a `tokio::Semaphore` cap or a dedicated runtime/thread for T3.
- **NTP drift over a long bench shifts the across-row T1 P99 baseline.** Visible in `tbl-latency` (47 ms at 50k → 28 ms at 200k). The within-row Δ is what speaks to isolation; the across-row absolutes don't. Paper caption explains this.
- **Schedule rate-gating is approximate.** Observed `tick_hz` runs ~85% of target on macOS dev; tighter on the CM5.
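
The concurrency cap mentioned in the T3 deferral above could look roughly like the following. It is a std-only sketch of the idea rather than the substrate's code: the hand-rolled `Semaphore`, `acquire`, and `run_capped` are hypothetical names, threads stand in for tokio tasks, and the 5 ms sleep stands in for a command round trip. In the async code the equivalent would presumably be a `tokio::sync::Semaphore` whose owned permit is held for the lifetime of each per-command task.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// Minimal counting semaphore (std-only stand-in for an async semaphore).
struct Semaphore {
    permits: Mutex<usize>,
    available: Condvar,
}

impl Semaphore {
    fn new(permits: usize) -> Self {
        Self { permits: Mutex::new(permits), available: Condvar::new() }
    }
}

/// RAII guard: the permit returns to the pool when the guard is dropped.
struct Permit {
    sem: Arc<Semaphore>,
}

impl Drop for Permit {
    fn drop(&mut self) {
        *self.sem.permits.lock().unwrap() += 1;
        self.sem.available.notify_one();
    }
}

/// Block until a permit is free, then take it.
fn acquire(sem: &Arc<Semaphore>) -> Permit {
    let mut free = sem.permits.lock().unwrap();
    while *free == 0 {
        free = sem.available.wait(free).unwrap();
    }
    *free -= 1;
    Permit { sem: Arc::clone(sem) }
}

/// Run `jobs` commands with at most `cap` in flight; returns the peak concurrency seen.
fn run_capped(jobs: usize, cap: usize) -> usize {
    let sem = Arc::new(Semaphore::new(cap));
    let in_flight = Arc::new(AtomicUsize::new(0));
    let peak = Arc::new(AtomicUsize::new(0));

    let handles: Vec<_> = (0..jobs)
        .map(|_| {
            let sem = Arc::clone(&sem);
            let in_flight = Arc::clone(&in_flight);
            let peak = Arc::clone(&peak);
            thread::spawn(move || {
                let _permit = acquire(&sem);
                let now = in_flight.fetch_add(1, Ordering::SeqCst) + 1;
                peak.fetch_max(now, Ordering::SeqCst);
                thread::sleep(Duration::from_millis(5)); // stand-in for the command round trip
                in_flight.fetch_sub(1, Ordering::SeqCst);
                // `_permit` drops here, after the counter is decremented.
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
    peak.load(Ordering::SeqCst)
}

fn main() {
    let peak = run_capped(16, 4);
    println!("peak concurrency with cap 4: {peak}");
}
```

A cap trades queueing inside the scheduler for queueing at the permit, which keeps the number of in-flight commands predictable but does not by itself reduce the wait; the dedicated runtime/thread option addresses the scheduling contention instead.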
## Run / verify
```bash
make certs # generate certs/server.{crt,key} (ECDSA P-256, SAN: localhost/cm5.local/127.0.0.1/::1)
make build # cargo build --release (native, depends on certs)
make build-cm5 # aarch64 cross-build for the CM5 (depends on certs)
make deploy-cm5 # scp to $CM5_HOST (set in env or override Makefile var)
make render # build the paper PDF
make preview # live-reload paper preview at :4848
make clean # cargo clean + drop generated paper outputs
make certs # dev TLS (ECDSA P-256, SAN: localhost/cm5.local/127.0.0.1/::1)
make build # cargo build --release native
make build-cm5 # aarch64 cross-build
make deploy-cm5 # scp to $CM5_HOST
make render # paper PDF
make preview # live-reload paper at :4848
make monitoring-up # docker-compose VM + Grafana
```
`certs/` is gitignored; `make build` regenerates the dev cert if missing. From the repo root: `cargo run -p substrate` boots, prints the loaded `AppConfig`, and idles. `config.toml` and cert paths are resolved relative to the cwd — always launch from the repo root.
**Tests.** `cargo test --workspace` runs codec unit tests + world unit tests + 5 integration tests (T1, T2, full closed-loop) in [simulator/tests/](simulator/tests/). Each integration test calls `bind_endpoint` + `accept_loop` in-process on `127.0.0.1:0`. The full-loop test stands up the real outbound machinery (`accept_loop` + `drain_outbound_t3`) and asserts the engine-state flag flips in both directions.
**Tests.** `cargo test --workspace` runs the codec unit tests in `substrate` plus the end-to-end integration tests in [simulator/tests/](simulator/tests/). Each integration test calls `bind_endpoint` + `accept_loop` in-process on `127.0.0.1:0` (OS-assigned port), connects a `SimulatorClient` against it, and asserts what arrives on the test-owned T1 receiver. Add a new `simulator/tests/end_to_end_*.rs` for each new wire path (T2 uni, T3 bi) as the substrate-side demux lands.
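
The `127.0.0.1:0` trick the integration tests rely on can be shown in isolation with a plain UDP socket, which is the layer QUIC runs on. This is a sketch only: `bind_ephemeral` is a hypothetical helper, and the real tests go through `bind_endpoint` and a `SimulatorClient` rather than raw sockets.

```rust
use std::net::{SocketAddr, UdpSocket};
use std::time::Duration;

/// Bind on loopback with port 0 so the OS picks a free port, then report
/// the address that was actually assigned.
fn bind_ephemeral() -> std::io::Result<(UdpSocket, SocketAddr)> {
    let socket = UdpSocket::bind("127.0.0.1:0")?;
    let addr = socket.local_addr()?;
    Ok((socket, addr))
}

fn main() -> std::io::Result<()> {
    let (server, server_addr) = bind_ephemeral()?;
    let (client, client_addr) = bind_ephemeral()?;
    println!("server on {server_addr}, client on {client_addr}");

    // The client learns the port from `local_addr`, never from a hard-coded constant.
    client.send_to(b"ping", server_addr)?;

    // Time out instead of hanging if the datagram never arrives.
    server.set_read_timeout(Some(Duration::from_secs(2)))?;
    let mut buf = [0u8; 16];
    let (len, from) = server.recv_from(&mut buf)?;
    println!("received {:?} from {from}", &buf[..len]);
    Ok(())
}
```

Because every test asks the OS for its own port, tests can run in parallel under `cargo test` without colliding on a fixed listen address.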
**Metrics scrape.** With `metrics_enabled = true` (default), the substrate exposes a Prometheus-format endpoint:
**Metrics scrape.** With `metrics_enabled = true` (default):
```bash
curl http://127.0.0.1:9100/metrics
```
A docker-compose stack under [monitoring/](monitoring/) brings up VictoriaMetrics + Grafana auto-provisioned: `make monitoring-up` then Grafana at <http://localhost:3000> (admin / admin), both dashboards under the `quic_ecs_dt` folder. The compose mounts [dashboards/](dashboards/) directly so any edit to the JSON files re-imports within 10 s.
`make monitoring-up` brings up VictoriaMetrics + Grafana auto-provisioned at <http://localhost:3000> (admin / admin); the dashboards mount live from [dashboards/](dashboards/) so JSON edits re-import within ~10 s.
Two Grafana dashboards under [dashboards/](dashboards/):
- [`runtime.json`](dashboards/runtime.json) — tick rate, RSS, per-tier received/dropped/latency, channel depth (paper §Evaluation surface).
- [`sensors.json`](dashboards/sensors.json) — thermometer + gauges + stat panels per `SensorType`, driven by `sensor_aggregate{type, stat}` (operator-facing surface).
Both use the `${datasource}` template variable so you can point them at any Prometheus-compatible source.
**Manual two-process run.** From the repo root, in two shells:
**Full-stack demo.** [scripts/demo.sh](scripts/demo.sh) brings up certs + cargo build + monitoring stack + substrate + simulator and tails the simulator's progress log. Industrial profile by default; Presence dips below threshold every few seconds, triggering substrate-initiated T3 Relay setpoints, visible on the operator dashboard as Current collapsing to ~0 A while Voltage holds.
```bash
# shell 1 — server (use RUST_LOG=substrate=debug to see the per-conn summary)
cargo run -p substrate
# shell 2 — client; --help shows all flags
cargo run -p simulator -- --rate-hz 100 --count 0 --devices 4
./scripts/demo.sh # defaults
PROFILE=single RATE_HZ=100 DEVICES=20 ./scripts/demo.sh
KEEP_MONITORING=1 ./scripts/demo.sh # leave VM + Grafana running on exit
```
Simulator flags (see `cargo run -p simulator -- --help`): `--addr`, `--server-name`, `--cert`, `--rate-hz` (T1 datagram rate; `0` disables T1), `--t2-rate-hz` / `--t3-rate-hz` (per-tier event rate; `0` disables), `--t3-timeout-ms` (T3 ack wait, default `2000`), `--count` (T1 count; `0` = until Ctrl-C), `--devices`, `--sensor-id`, `--sensor-type` (one of `generic|temperature|humidity|pressure|voltage|current`), `--profile` (`single` or `industrial` — 5 sensors per device on ids 0..4 covering all types). The client logs a one-second `progress` line with `t1_sent`/`t2_sent`/`t3_sent`/`t3_timeouts`/per-tier observed Hz, and a final `simulator done` line with elapsed time on exit.
**Manual two-process run.** From the repo root:
```bash
# shell 1 — server
cargo run -p substrate
# shell 2 — client
cargo run -p simulator -- --profile industrial --rate-hz 100 --count 0 --devices 4
```
Simulator flags (see `cargo run -p simulator -- --help`): `--addr`, `--server-name`, `--cert`, `--profile {single, industrial}`, `--sensor-type`, `--sensor-id`, `--rate-hz` (T1 datagram rate; `0` disables T1), `--t2-rate-hz` (T2 event rate; `0` disables T2), `--count` (T1 count; `0` = until Ctrl-C), `--devices`. **No simulator-side T3 flag** — T3 is substrate-initiated. Per-second `progress` lines show `t1_sent`/`t2_sent`/`engine={running,stopped}`.
**Bidirectional netem on the CM5.** [scripts/bench-loss.sh](scripts/bench-loss.sh) applies `tc netem loss N%` bidirectionally via an `ifb` ingress-redirect (`BIDI=1` default). [scripts/verify-netem.sh](scripts/verify-netem.sh) confirms it lands on the right interface:
```bash
./scripts/verify-netem.sh <peer-ip> end0 5 # egress only
BIDI=1 ./scripts/verify-netem.sh <peer-ip> end0 5 # both directions via ifb
```
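
A rough way to see why the shaping direction matters for any request/reply exchange: if each direction drops packets independently at rate `p`, a single unretransmitted round trip survives with probability `(1 - p)^2`, whereas with egress-only shaping one of the two legs sees no loss at all. The sketch below only does that arithmetic. `round_trip_delivery` is a hypothetical helper, the independence assumption is mine, and the figure describes raw packets before any QUIC retransmission, not measured results from the sweeps.

```rust
/// Probability that a request and its reply both survive, assuming each
/// direction drops packets independently with probability `loss` and
/// nothing is retransmitted.
fn round_trip_delivery(loss: f64) -> f64 {
    let one_way = 1.0 - loss;
    one_way * one_way
}

fn main() {
    for pct in [1.0_f64, 5.0, 10.0] {
        let loss = pct / 100.0;
        println!(
            "loss {pct:>4.1}% per direction: one-way {:.4}, round trip {:.4}",
            1.0 - loss,
            round_trip_delivery(loss)
        );
    }
}
```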
## Key references