quic_ecs_dt/CLAUDE.md
2026-05-12 13:24:03 -04:00

quic_ecs_dt — Project Guide for Claude

What & why

Source repo for "QUIC + ECS as Complementary Transport and Runtime Substrates for Industrial Digital Twins" — UCAmI 2026 (Plantevin & Francillette, UQAC). Third paper in a sequence; the first two are at IEEE SWC 2026:

  • plantevin2026ecs — ECS as runtime substrate for industrial DT (200k assets @ 114 Hz on Pi 5).
  • plantevin2026quic — QUIC partial reliability for DT sensor streams (94% P99 reduction vs TCP at 5% loss).

UCAmI hypothesis (the composition question): prior work shows ECS and QUIC each work as substrates independently. Does integrating real QUIC traffic into a Bevy ECS ingest path introduce coupling that degrades either one's claimed properties? The paper argues no, and measures it.

Architecture

Three-tier QUIC ↔ ECS bridge, headless Bevy runtime:

Tier | QUIC primitive | Use case | Channel cap | Tx newtype
T1 | Unreliable datagrams (RFC 9221) | High-freq ephemeral telemetry; drops OK | 1024 | T1Sender::send_lossy (try_send, drop on full)
T2 | Unidirectional streams | Ordered threshold events; reliable | 512 | T2Sender::send (await, backpressure)
T3 | Bidirectional streams | Actuator commands w/ ACK; per-command oneshot reply | 256 | T3Sender::send of T3Inbound { command, reply }
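
The per-tier send semantics in the table can be sketched with the standard library alone. This is a minimal illustration, not the code in substrate/src/transport/mod.rs: std::sync::mpsc::sync_channel stands in for tokio::sync::mpsc, a blocking send stands in for the awaited one, and Msg, t1_channel and t2_channel are hypothetical names. The point is that each newtype exposes only the send method its tier allows.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// Hypothetical stand-in for the decoded message (the real QuicMessage has more fields).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Msg {
    pub seq: u32,
    pub value: f64,
}

/// T1: lossy tier. The only send method never blocks.
pub struct T1Sender(SyncSender<Msg>);

/// T2: reliable tier. The only send method waits for capacity.
pub struct T2Sender(SyncSender<Msg>);

impl T1Sender {
    /// Returns true if the message was queued, false if it was dropped
    /// because the channel was full or the receiver is gone.
    pub fn send_lossy(&self, msg: Msg) -> bool {
        match self.0.try_send(msg) {
            Ok(()) => true,
            Err(TrySendError::Full(_)) | Err(TrySendError::Disconnected(_)) => false,
        }
    }
}

impl T2Sender {
    /// Blocks until the bounded channel has room (backpressure).
    /// Fails only when the receiver has been dropped, handing the message back.
    pub fn send(&self, msg: Msg) -> Result<(), Msg> {
        self.0.send(msg).map_err(|e| e.0)
    }
}

/// Build a bounded T1 channel with the given capacity.
pub fn t1_channel(cap: usize) -> (T1Sender, Receiver<Msg>) {
    let (tx, rx) = sync_channel(cap);
    (T1Sender(tx), rx)
}

/// Build a bounded T2 channel with the given capacity.
pub fn t2_channel(cap: usize) -> (T2Sender, Receiver<Msg>) {
    let (tx, rx) = sync_channel(cap);
    (T2Sender(tx), rx)
}
```

Because T1Sender has no blocking send and T2Sender has no lossy one, calling the wrong method for a tier fails to compile rather than failing at runtime.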

The QUIC server runs on a dedicated OS thread with a Tokio multi-thread runtime. It pushes each decoded QuicMessage (UUID + sensor_id + sensor_type + f64 + ts + seq; 39 B fixed LE since M5 added sensor_type: u8) into a per-tier tokio::sync::mpsc channel through the T1Sender / T2Sender / T3Sender newtypes (in substrate/src/transport/mod.rs), so using the wrong send semantics for a tier is a type error. The Bevy ingest_system drains the channels in PreUpdate, gated by run_if(in_state(ServerState::Started)). The pattern lives in substrate/src/transport/ecs.rs.
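
A fixed-size little-endian codec of this kind might look like the sketch below. It follows the 39-byte layout this guide reports after M5 added sensor_type: u8 (16 UUID + 2 sensor_id + 1 sensor_type + 8 f64 + 8 ts_us + 4 seq). The field order, the field names, the raw [u8; 16] standing in for the UUID, and the BadLength variant are assumptions for illustration; the real codec lives in substrate/src/transport/mod.rs.

```rust
pub const WIRE_LEN: usize = 39;

#[derive(Debug, Clone, Copy, PartialEq)]
pub struct QuicMessage {
    pub device: [u8; 16], // UUID bytes (assumed representation)
    pub sensor_id: u16,
    pub sensor_type: u8,
    pub value: f64,
    pub ts_us: u64,
    pub seq: u32,
}

#[derive(Debug, PartialEq)]
pub enum WireError {
    BadLength { expected: usize, got: usize },
}

/// Copy a slice of exactly N bytes into an array.
fn arr<const N: usize>(s: &[u8]) -> [u8; N] {
    let mut a = [0u8; N];
    a.copy_from_slice(s);
    a
}

impl QuicMessage {
    /// Serialise into exactly WIRE_LEN bytes, integers and floats little-endian.
    pub fn encode(&self) -> [u8; WIRE_LEN] {
        let mut buf = [0u8; WIRE_LEN];
        buf[0..16].copy_from_slice(&self.device);
        buf[16..18].copy_from_slice(&self.sensor_id.to_le_bytes());
        buf[18] = self.sensor_type;
        buf[19..27].copy_from_slice(&self.value.to_le_bytes());
        buf[27..35].copy_from_slice(&self.ts_us.to_le_bytes());
        buf[35..39].copy_from_slice(&self.seq.to_le_bytes());
        buf
    }

    /// Parse a buffer; any length other than WIRE_LEN is rejected.
    pub fn decode(bytes: &[u8]) -> Result<Self, WireError> {
        if bytes.len() != WIRE_LEN {
            return Err(WireError::BadLength { expected: WIRE_LEN, got: bytes.len() });
        }
        Ok(Self {
            device: arr(&bytes[0..16]),
            sensor_id: u16::from_le_bytes(arr(&bytes[16..18])),
            sensor_type: bytes[18],
            value: f64::from_le_bytes(arr(&bytes[19..27])),
            ts_us: u64::from_le_bytes(arr(&bytes[27..35])),
            seq: u32::from_le_bytes(arr(&bytes[35..39])),
        })
    }
}
```

A fixed length keeps T2 and T3 framing trivial: a reader can consume WIRE_LEN-byte chunks without a length prefix.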

T3 ack protocol. A device opens a bi-stream and writes one QuicMessage (the command). The demux task reads it, builds a T3Inbound { command, reply: oneshot::Sender<QuicMessage> }, and sends it on the T3 mpsc. The ECS handler writes the ack into reply; the demux task awaits reply_rx and writes the resulting QuicMessage back on the bi-stream. Dropping the oneshot signals "no handler": the demux task resets the stream (send.reset(0)) so the client sees a clean failure instead of a half-open stream. The placeholder ingest relied on this path until the M4 handler landed.
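
The reply-channel handshake can be sketched without Tokio. Here a one-message std::sync::mpsc channel stands in for tokio::sync::oneshot; Msg, T3Outcome, submit, await_reply, handle_echo and handle_placeholder are hypothetical names used only for this illustration. It shows the two outcomes described above: an ack when the handler replies, and a "no handler" signal when the reply sender is dropped.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

/// Hypothetical stand-in for QuicMessage.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Msg {
    pub seq: u32,
    pub value: f64,
}

/// A command plus the channel on which the ECS side must answer.
pub struct T3Inbound {
    pub command: Msg,
    pub reply: Sender<Msg>,
}

#[derive(Debug, PartialEq)]
pub enum T3Outcome {
    Ack(Msg),
    NoHandler, // the demux task would reset the stream here
}

/// Demux side: ship the command to the ECS and keep the reply receiver.
pub fn submit(t3_tx: &Sender<T3Inbound>, command: Msg) -> Receiver<Msg> {
    let (reply, reply_rx) = channel();
    // If the ECS receiver is gone, the returned error owns the T3Inbound and
    // drops it (and its reply sender) right here, which reads as "no handler".
    let _ = t3_tx.send(T3Inbound { command, reply });
    reply_rx
}

/// Demux side: wait for the ack; a dropped reply sender means no handler.
pub fn await_reply(reply_rx: Receiver<Msg>) -> T3Outcome {
    match reply_rx.recv() {
        Ok(ack) => T3Outcome::Ack(ack),
        Err(_) => T3Outcome::NoHandler,
    }
}

/// ECS side: answer with the latest known value for the sensor.
pub fn handle_echo(inbound: T3Inbound, latest: f64) {
    let ack = Msg { seq: inbound.command.seq, value: latest };
    let _ = inbound.reply.send(ack);
}

/// ECS side placeholder: drop the request without answering.
pub fn handle_placeholder(inbound: T3Inbound) {
    drop(inbound);
}
```

Tying "no handler" to the drop of the reply sender means the demux side needs no timeout to detect an unanswered command while the ECS is still running.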

Target hardware: CM5 (BCM2712, Cortex-A76, 4 GB) as DT runtime; M4 Max as traffic generator; 1 Gbps direct Ethernet. Both rigs are in hand.

Repo map

quic_ecs_dt/
├── paper/                Quarto + LNCS source — single index.qmd, refs in references.bib
├── substrate/            Rust crate: Bevy 0.18 + Quinn 0.11 + rustls 0.23 + Tokio
│   └── src/
│       ├── main.rs       App::new, MinimalPlugins, EcsQuicTransportPlugin
│       ├── config.rs     figment chain: defaults → config.toml → APP_* env
│       └── transport/
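│           ├── mod.rs    QuicMessage + wire codec, SensorType, T1Sender/T2Sender/T3Sender newtypes
│           ├── ecs.rs    Plugin: tokio thread + 3 mpsc + PreUpdate ingest
│           └── server.rs PEM loader, TransportConfig, handle_incoming + T1/T2/T3 demux readers
├── simulator/            Rust crate: SimulatorClient lib (Quinn client) + clap CLI driver; Bevy sensor generator pending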
├── data/                 (created by M6) loopback/, two_machine/ — raw CSVs committed, *_processed ignored
├── Cargo.toml            workspace
└── Makefile              render, preview, build, build-cm5, deploy-cm5

Status

Area | State
AppConfig figment loader (defaults → TOML → env) | Done: substrate/src/config.rs:42
3-tier MPSC bridge scaffolding (Tokio thread + Bevy plugin) | Done: substrate/src/transport/ecs.rs
QuicMessage struct | Defined: substrate/src/transport/mod.rs:4
Quinn server lifecycle | Listener up: ServerState{Starting,Started} in substrate/src/transport/state.rs; OnEnter(Starting) → bind + accept loop in substrate/src/transport/ecs.rs. Explicit TransportConfig w/ tuned datagram recv buffer (256 KiB) in substrate/src/transport/server.rs. Per-tier sender newtypes (T1Sender::send_lossy, T2Sender::send, T3Sender::send) in substrate/src/transport/mod.rs
T1 demux (datagrams → ECS) | Done: handle_incoming orchestrator + read_datagrams reader in substrate/src/transport/server.rs; decode errors are logged but non-fatal; channel-full drops are logged only at trace level; received/dropped/decode_errors counters appear in the end-of-stream debug line
T2 demux (uni streams → ECS) | Done: read_uni_streams accepts streams in substrate/src/transport/server.rs and spawns one task per stream that reads 39 B chunks until EOF; a decode failure resets the stream via recv.stop(0) (one bad stream doesn't kill the connection); t2.send().await honours backpressure
T3 demux (bi streams ↔ ECS) | Done: accept_bi_streams + read_one_bi_stream in substrate/src/transport/server.rs; reads a 39 B command, ships T3Inbound { command, reply: oneshot::Sender } to the ECS, awaits the reply, writes a 39 B ack and finishes. If the ECS drops the oneshot (no handler installed, as with the placeholder ingest that preceded M4), send.reset(0) gives the client a clean signal instead of a half-open stream. handle_incoming joins all three readers on close
TLS / self-signed cert | Done (M1): certs/server.{crt,key} via make certs, gitignored. PEM loader in substrate/src/transport/server.rs:15; rustls aws-lc-rs default provider installed in substrate/src/main.rs
Wire codec for QuicMessage (39 B fixed LE, incl. sensor_type: u8) | Done: substrate/src/transport/mod.rs; 5 unit tests passing. SensorType enum: Generic / Temperature / Humidity / Pressure / Voltage / Current
tracing-subscriber init w/ RUST_LOG | Done (M1): substrate/src/main.rs:8-12
ECS components (RawSensorData, SmoothedValue) + 4 systems (Ingest/Sim/Export/Diagnostics) | Done: entities = (DeviceId, SensorId, SensorTypeTag, RawSensorData, SmoothedValue, Asset) per (device, sensor); SensorRegistry upserts via HashMap<(Uuid, u16), Entity> in substrate/src/world.rs. IngestSystem drains all three tiers; the T3 ack preserves the command's sensor_type and returns the device's most recent raw_value. SimulationSystem maintains a 16-sample rolling mean per entity and emits substrate_threshold_crossings_total{type, direction} when the smoothed mean crosses a per-type threshold (Changed<RawSensorData> query, so cost scales with ingress, not fleet size). ExportSystem samples substrate_{entities,channel_depth,channel_capacity,rss_bytes} + sensor_aggregate{type, stat} once per second. Diagnostics logs tick_hz once per second
Schedule rate-gating | Done (M4): MinimalPlugins.set(ScheduleRunnerPlugin::run_loop(1/tick_rate_hz)) in substrate/src/main.rs; replaces the default busy-loop with the configured period
Prometheus exporter + Grafana dashboards | Done (M5): ObservabilityPlugin in substrate/src/observability.rs installs metrics-exporter-prometheus on the existing tokio runtime. Runtime surface (paper §Evaluation): counters substrate_received_total{tier}, dropped_total{tier=t1}, decode_errors_total{tier}, t3_no_handler_total; latency histograms substrate_latency_us{tier}; gauges substrate_tick_hz, substrate_entities, substrate_channel_depth{tier}, substrate_channel_capacity{tier}, substrate_rss_bytes. Sensor data surface (operator dashboard): per-type aggregates sensor_aggregate{type, stat=count|mean|min|max}
Simulator (Quinn client + sensor generators) | SimulatorClient lib in simulator/src/client.rs: connects, trusts the substrate's PEM cert via a custom ServerCertVerifier (sidesteps CaUsedAsEndEntity); send_datagram(QuicMessage) for T1, send_uni_stream(&[QuicMessage]) for T2, request(&QuicMessage) -> QuicMessage for T3. CLI driver in simulator/src/main.rs with clap flags (--addr, --rate-hz, --t2-rate-hz, --t3-rate-hz, --t3-timeout-ms, --count, --devices, --sensor-id, --sensor-type, --profile, --cert, --server-name); parallel T1+T2+T3 emitters, per-(device,sensor) sequence counters, type-appropriate waveform generators (sin/cos curves centred on realistic sensor ranges), 1-Hz combined progress logs, Ctrl-C drain. --profile industrial fans out to 5 sensors per device (Temperature/Humidity/Pressure/Voltage/Current). Bevy-driven sensor generator still pending
End-to-end test harness | Six integration tests across simulator/tests/end_to_end_t1.rs, simulator/tests/end_to_end_t2.rs, simulator/tests/end_to_end_t3.rs: T1 single-datagram round-trip + 32-msg burst order; T2 single-stream order-preservation + 4-stream concurrent per-device ordering; T3 round-trip with fake-ECS handler + no-handler stream-reset. Each test calls bind_endpoint + accept_loop in-process with channels owned by the test
config.toml at repo root | Done (M1): config.toml; loaded by substrate/src/main.rs:9
Benchmark harness (sweep + CSV writer) | Missing
CM5 cross-compile / deploy | Wired in Makefile:30; not exercised
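
The SimulationSystem row describes a 16-sample rolling mean with threshold-crossing detection. The sketch below shows one way such a component could behave, assuming a ring buffer and assuming that non-finite samples are ignored (the guide only says non-finite input is covered by a unit test, not how it is handled). The field names, the Crossing enum and the crossing function are hypothetical; the real component is in substrate/src/world.rs.

```rust
pub const WINDOW: usize = 16;

/// Rolling window over the most recent WINDOW finite samples.
#[derive(Debug)]
pub struct SmoothedValue {
    samples: [f64; WINDOW],
    len: usize,  // number of valid samples, at most WINDOW
    next: usize, // slot the next sample overwrites
}

impl SmoothedValue {
    pub fn new() -> Self {
        Self { samples: [0.0; WINDOW], len: 0, next: 0 }
    }

    /// Record a sample and return the current mean.
    /// Assumption: NaN and infinities are skipped rather than stored.
    pub fn push(&mut self, v: f64) -> Option<f64> {
        if v.is_finite() {
            self.samples[self.next] = v;
            self.next = (self.next + 1) % WINDOW;
            self.len = (self.len + 1).min(WINDOW);
        }
        self.mean()
    }

    /// Mean of the stored samples, or None before the first finite sample.
    pub fn mean(&self) -> Option<f64> {
        if self.len == 0 {
            None
        } else {
            Some(self.samples[..self.len].iter().sum::<f64>() / self.len as f64)
        }
    }
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Crossing {
    Rising,
    Falling,
}

/// Report a crossing only when the smoothed mean changes side of the threshold.
pub fn crossing(prev_mean: Option<f64>, new_mean: f64, threshold: f64) -> Option<Crossing> {
    let prev = prev_mean?;
    match (prev > threshold, new_mean > threshold) {
        (false, true) => Some(Crossing::Rising),
        (true, false) => Some(Crossing::Falling),
        _ => None,
    }
}
```

Comparing the previous and new mean, rather than the raw sample, means a single noisy reading does not increment the crossing counter unless it moves the 16-sample average across the threshold.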

cargo run -p substrate boots, prints the loaded config, brings up the Quinn listener, and ticks the ECS schedule at the configured tick_rate_hz (rate-gated since M4, replacing the MinimalPlugins default busy-loop).

Roadmap

Each milestone has one verification gate. Update Status here as we go.

  • M1 — Wire codec & root config. Done 2026-05-04. Hand-rolled little-endian codec on QuicMessage (38 B fixed: 16 UUID + 2 stream_id + 8 f64 + 8 ts_us + 4 seq) with roundtrip + layout + length-error tests; config.toml at repo root; dev TLS via make certs; structured tracing-subscriber init reads RUST_LOG (default info).
  • M2 — Quinn server + self-signed TLS. Done 2026-05-06. Listener up under ServerState::Starting/Started; type-system tier semantics + T3 oneshot ack protocol; per-connection handle_incoming orchestrator joining T1 datagram, T2 uni-stream, and T3 bi-stream readers. T1 has dropped/decoded counters; T2 resets a stream on decode failure without killing the connection; T3 ships T3Inbound { command, reply } to the ECS and resets the stream when no handler answers. End-to-end coverage: 6 integration tests in simulator/tests/ plus 4 codec unit tests, all green.
  • M3 — Simulator client. Replace simulator/src/main.rs with a Bevy app: Quinn client, N synthetic devices, configurable per-tier rates. Verify: end-to-end loopback drains messages on all three tiers. Status (2026-05-05): simulator made into a lib + bin; SimulatorClient::{connect,send_datagram,close} plus a manual smoke runner in simulator/src/main.rs. Two integration tests in simulator/tests/end_to_end_t1.rs exercise the full T1 path against an in-process substrate. Bevy-driven generator + T2/T3 helpers + load profiles still pending.
  • M4 — ECS world. Done. Asset + DeviceId + SensorId + SensorTypeTag + RawSensorData + SmoothedValue components in substrate/src/world.rs; SensorRegistry resource for O(1) (Uuid, u16) → Entity. IngestSystem drains all three tiers (T1 batched, T2/T3 fully); T3 handler returns the latest sensor value as ack. SimulationSystem runs a per-entity 16-sample rolling mean and emits substrate_threshold_crossings_total{type, direction} on per-type threshold crossings — gives the ECS observable digital-twin work, not just write-through ingest. ExportSystem samples substrate_{entities,channel_depth,channel_capacity,rss_bytes} + sensor_aggregate{type, stat} once per second. DiagnosticsSystem logs tick rate once per second. Schedule rate-gated via ScheduleRunnerPlugin::run_loop(1/tick_rate_hz). 8 unit tests passing (entity create, in-place update, T3 ack, SmoothedValue push/window/non-finite/full-roll, threshold-crossing transition).
  • M5 — Observability (VictoriaMetrics + Grafana). Done. Wire format extended to carry sensor_type: u8 (38 → 39 B, decoded into SensorType enum). Two metric surfaces over metrics-exporter-prometheus:
    • Runtime (paper §Evaluation): substrate_received_total{tier}, dropped_total{tier=t1}, decode_errors_total{tier}, t3_no_handler_total, latency_us{tier} histograms, tick_hz / entities / channel_depth{tier} / rss_bytes gauges.
    • Sensor data (operator surface): sensor_aggregate{type, stat=count|mean|min|max} aggregated per second across the live ECS world. Cardinality bounded to |SensorType| × 4 series independent of physical sensor count.
    • Dashboards: dashboards/runtime.json + dashboards/sensors.json.
    • Verified: --profile industrial --devices 2 --count 200 yields 10 entities and all 5 type aggregates with realistic values (T=20.5°C, RH=51%, P=1018 hPa, V=230.2 V, I=12 A).
  • M6 — Benchmark harness. Sweep entity_count ∈ {10k, 50k, 100k, 200k} × loss_rate ∈ {0%, 1%, 5%} with 2k warmup + 5k measurement ticks. Loss via tc netem. Writes data/loopback/final_table.csv. Verify: one full sweep on M4 Max produces a CSV the Quarto figures consume.
  • M7 — CM5 cross-compile & deploy. Exercise Makefile:30 (build-cm5, deploy-cm5); set real CM5_HOST. Verify: binary runs on CM5 with a feed from M4 Max over 1 Gbps Ethernet.
  • M8 — Two-machine run + paper render. Sweep with simulator on M4 Max → substrate on CM5; populate data/two_machine/final_table.csv; make render produces a PDF. Update §Evaluation prose to reflect actual numbers. Current paper figures (241 Hz, 64 µs / 15.8 ms P99, 2.6 µs jitter, 1.02 MB/1k, R²=0.9999) are aspirational placeholders — they may move and the conclusions may shift; that's expected.
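
The M6 sweep grid above is small enough to enumerate directly. The sketch below lists the parameter combinations and the tick budget they imply; SweepPoint, sweep_points and total_ticks are hypothetical names, and the benchmark harness itself does not exist yet according to the Status table, so this only fixes the arithmetic.

```rust
/// One benchmark configuration (hypothetical type for illustration).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct SweepPoint {
    pub entity_count: u32,
    pub loss_pct: u8,
}

// Values taken from the M6 roadmap entry.
pub const ENTITY_COUNTS: [u32; 4] = [10_000, 50_000, 100_000, 200_000];
pub const LOSS_PCTS: [u8; 3] = [0, 1, 5];
pub const WARMUP_TICKS: u32 = 2_000;
pub const MEASURE_TICKS: u32 = 5_000;

/// Cartesian product of entity counts and loss rates, entity count outermost.
pub fn sweep_points() -> Vec<SweepPoint> {
    let mut points = Vec::with_capacity(ENTITY_COUNTS.len() * LOSS_PCTS.len());
    for &entity_count in &ENTITY_COUNTS {
        for &loss_pct in &LOSS_PCTS {
            points.push(SweepPoint { entity_count, loss_pct });
        }
    }
    points
}

/// Total ticks for one full sweep, warmup included.
pub fn total_ticks() -> u64 {
    sweep_points().len() as u64 * (WARMUP_TICKS + MEASURE_TICKS) as u64
}
```

One full sweep is 4 × 3 = 12 configurations at 7,000 ticks each, i.e. 84,000 ticks in total.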

Conventions

  • Rust: edition 2024; workspace at root with simulator + substrate; opt-level=1 dev, opt-level=3 for deps.
  • Pinned crates: Bevy 0.18, Quinn 0.11, rustls 0.23, Tokio 1 (full), figment 0.10 (toml + env), uuid 1.23 (v4), serde 1.
  • Config: figment chain of defaults in substrate/src/config.rs:25 → config.toml → env APP_* (double underscore for nesting, e.g. APP_NETWORK__SERVER_PORT=9000).
  • Bevy: headless — MinimalPlugins only; do not pull rendering plugins.
  • Tokio↔Bevy: keep the dedicated-thread + mpsc pattern in substrate/src/transport/ecs.rs:49; do not block the ECS schedule on async work.
  • Paper: Quarto + LNCS template (paper/_extensions/template.tex, paper/_quarto.yml). Never commit llncs.cls or splncs04.bst — CTAN licensing; download per README.md:25-34.
  • Data: raw CSVs under data/ are committed; *_processed.csv is gitignored. Paper figures consume data/loopback/final_table.csv and data/two_machine/final_table.csv.
  • Build artifacts: target/, paper/_output/, paper/figures/, paper/.quarto/, paper/index.tex all gitignored.
  • Errors: anyhow (with .context()) for internal startup paths where the error type is uninteresting; thiserror for boundary types we want to match against (e.g. WireError in the codec).
  • Warnings: let real warnings show. No #[allow(dead_code)], _var blanket suppression, or PhantomData shims to silence the compiler — warnings are honest TODO markers and disappear when the consuming code lands. See feedback memory.
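
The Config convention (later layers override earlier ones, and a double underscore in an APP_* variable marks nesting) can be illustrated with plain functions. This sketch is not figment's implementation and does not call figment; env_key_to_path and layered are hypothetical helpers, and lowercasing the key segments is an assumption about how the variables map to field names.

```rust
/// Map an environment variable name to a nested config path.
/// "APP_NETWORK__SERVER_PORT" becomes ["network", "server_port"].
/// Returns None for names without the APP_ prefix or with nothing after it.
pub fn env_key_to_path(key: &str) -> Option<Vec<String>> {
    let rest = key.strip_prefix("APP_")?;
    if rest.is_empty() {
        return None;
    }
    Some(rest.split("__").map(|s| s.to_ascii_lowercase()).collect())
}

/// Resolve one setting across the three layers: env beats file, file beats default.
pub fn layered<T>(default: T, from_file: Option<T>, from_env: Option<T>) -> T {
    from_env.or(from_file).unwrap_or(default)
}
```

With these rules, a single underscore stays inside a field name (server_port) while a double underscore descends one level (network, then server_port).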

Known deferrals

  • Channel ownership is per-host, not per-connection. All connections share the same three mpsc channels. Fairness under N-device load relies on tokio scheduling. Acceptable for the "one ECS world per host" model the paper describes; revisit if many-device benchmarks show starvation.

  • No graceful shutdown. The quic-runtime thread is parked on pending(); spawned tasks (accept loop, per-conn demux) are orphaned at process exit. Fine for research runs; we'll need an OnExit(Started) (or a Stopping state) when M5 observability needs clean drain or M8 wants finalised CSV writes.

  • Bind failure is fatal. OnEnter(Starting) panics if bind_endpoint fails. A ServerState::Failed variant joins when we wire proper error surfacing.

  • T3 ack semantics are minimal. The current handler echoes the device's most recent raw_value with a server timestamp — adequate for "read sensor" commands, not for actuator-write semantics. A future iteration may introduce an ActuatorState component and a setpoint-apply path; for now T3 is best framed as "reliable read/query RPC" in the paper.

  • Schedule rate-gating is approximate. ScheduleRunnerPlugin::run_loop(period) honours period as a minimum; observed tick_hz runs ~85% of target on macOS dev (target 60 → ~50). Should be tighter on the CM5; revisit if M6 sweeps depend on a steady tick.
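
The rate-gating deferral can be put in numbers. If the configured period is treated as a lower bound and every tick takes the period plus a fixed extra delay, the extra delay follows from the target and observed rates. That fixed-overhead model is an assumption made for this sketch, not a measured property of ScheduleRunnerPlugin, and the function names are hypothetical.

```rust
use std::time::Duration;

/// Period passed to the schedule runner for a target tick rate.
pub fn tick_period(tick_rate_hz: f64) -> Duration {
    Duration::from_secs_f64(1.0 / tick_rate_hz)
}

/// Tick rate seen if each tick takes the full period plus `overhead`.
pub fn observed_hz(tick_rate_hz: f64, overhead: Duration) -> f64 {
    1.0 / (tick_period(tick_rate_hz) + overhead).as_secs_f64()
}

/// Per-tick overhead implied by a target rate and a (lower) observed rate.
/// Panics if observed_hz exceeds target_hz, since the difference would be negative.
pub fn implied_overhead(target_hz: f64, observed_hz: f64) -> Duration {
    Duration::from_secs_f64(1.0 / observed_hz - 1.0 / target_hz)
}
```

For the macOS figures quoted above (target 60 Hz, observed about 50 Hz), the model gives 1/50 s minus 1/60 s, roughly 3.3 ms of extra delay per tick on top of the 16.7 ms period.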

Run / verify

make certs              # generate certs/server.{crt,key} (ECDSA P-256, SAN: localhost/cm5.local/127.0.0.1/::1)
make build              # cargo build --release (native, depends on certs)
make build-cm5          # aarch64 cross-build for the CM5 (depends on certs)
make deploy-cm5         # scp to $CM5_HOST (set in env or override Makefile var)
make render             # build the paper PDF
make preview            # live-reload paper preview at :4848
make clean              # cargo clean + drop generated paper outputs

certs/ is gitignored; make build regenerates the dev cert if missing. From the repo root: cargo run -p substrate boots, prints the loaded AppConfig, and idles. config.toml and cert paths are resolved relative to the cwd — always launch from the repo root.

Tests. cargo test --workspace runs the unit tests in substrate (codec and ECS world) plus the end-to-end integration tests in simulator/tests/. Each integration test calls bind_endpoint + accept_loop in-process on 127.0.0.1:0 (OS-assigned port), connects a SimulatorClient against it, and asserts what arrives on the test-owned receivers (T1, T2 and T3 are each covered). Add a new simulator/tests/end_to_end_*.rs for each new wire path.
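
The port-0 pattern those tests rely on can be shown with the standard library. In this sketch a plain UdpSocket stands in for the Quinn endpoint, so it demonstrates only the "bind to 127.0.0.1:0, read back the assigned address, point the client at it" step, not QUIC itself. bind_ephemeral and round_trip are hypothetical helpers.

```rust
use std::net::{SocketAddr, UdpSocket};
use std::time::Duration;

/// Bind to an OS-assigned loopback port and report the address actually chosen.
pub fn bind_ephemeral() -> std::io::Result<(UdpSocket, SocketAddr)> {
    let sock = UdpSocket::bind("127.0.0.1:0")?;
    // A read timeout keeps a failing test from hanging forever.
    sock.set_read_timeout(Some(Duration::from_secs(2)))?;
    let addr = sock.local_addr()?;
    Ok((sock, addr))
}

/// Send one datagram from a throwaway client to an in-process "server" socket
/// and return what the server received (payloads over 64 bytes are truncated).
pub fn round_trip(payload: &[u8]) -> std::io::Result<Vec<u8>> {
    let (server, addr) = bind_ephemeral()?;
    let client = UdpSocket::bind("127.0.0.1:0")?;
    client.send_to(payload, addr)?;
    let mut buf = [0u8; 64];
    let (n, _from) = server.recv_from(&mut buf)?;
    Ok(buf[..n].to_vec())
}
```

Letting the OS pick the port allows several integration tests to run in parallel in one process without colliding on a fixed address.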

Metrics scrape. With metrics_enabled = true (default), the substrate exposes a Prometheus-format endpoint:

curl http://127.0.0.1:9100/metrics

A docker-compose stack under monitoring/ brings up VictoriaMetrics and an auto-provisioned Grafana: run make monitoring-up, then open Grafana at http://localhost:3000 (admin / admin); both dashboards appear under the quic_ecs_dt folder. The compose file mounts dashboards/ directly, so any edit to the JSON files re-imports within 10 s.

Two Grafana dashboards under dashboards/:

  • runtime.json — tick rate, RSS, per-tier received/dropped/latency, channel depth (paper §Evaluation surface).
  • sensors.json — thermometer + gauges + stat panels per SensorType, driven by sensor_aggregate{type, stat} (operator-facing surface).

Both use the ${datasource} template variable so you can point them at any Prometheus-compatible source.

Manual two-process run. From the repo root, in two shells:

# shell 1 — server (use RUST_LOG=substrate=debug to see the per-conn summary)
cargo run -p substrate

# shell 2 — client; --help shows all flags
cargo run -p simulator -- --rate-hz 100 --count 0 --devices 4

Simulator flags (see cargo run -p simulator -- --help): --addr, --server-name, --cert, --rate-hz (T1 datagram rate; 0 disables T1), --t2-rate-hz / --t3-rate-hz (per-tier event rate; 0 disables), --t3-timeout-ms (T3 ack wait, default 2000), --count (T1 count; 0 = until Ctrl-C), --devices, --sensor-id, --sensor-type (one of generic|temperature|humidity|pressure|voltage|current), --profile (single or industrial — 5 sensors per device on ids 0..4 covering all types). The client logs a one-second progress line with t1_sent/t2_sent/t3_sent/t3_timeouts/per-tier observed Hz, and a final simulator done line with elapsed time on exit.

Key references

  • Prior self-citations: plantevin2026ecs, plantevin2026quic (both IEEE SWC 2026, "to appear").
  • QUIC: RFC 9000 (core), RFC 9221 (unreliable datagrams).
  • DT foundations: Tao et al. 2019; Grieves & Vickers 2017; Minerva et al. 2020.
  • ECS: Nystrom 2014, Game Programming Patterns.
  • Mixed-reliability transport: Peeck et al. (W2RP for DDS).
  • DT sync metrics: Çakır et al. 2023 (Twin Alignment Ratio); Bellavista et al. 2023 (ODTE).
  • Industrial QUIC/IIoT: Fernández et al. 2021; Boeding et al. 2025.
  • Full bibliography: paper/references.bib.