quic_ecs_dt — Project Guide for Claude
What & why
Source repo for "QUIC and ECS as Complementary Transport and Runtime Substrates for Industrial Digital Twins: An Integrated Empirical Study" — submitted to UCAmI 2026 (Track 2: Internet of EveryThing (IoT, People & Processes) and Sensors; primary topic IoE interoperability, integration and performance, secondary topic IoE experimental results and deployment scenarios). Single-author (Plantevin, UQAC). Third paper in a sequence; the first two are at IEEE SWC 2026:
- `plantevin2026ecs` — ECS as runtime substrate for industrial DT (200k assets @ 114 Hz on Pi 5).
- `plantevin2026quic` — QUIC partial reliability for DT sensor streams (94% P99 reduction vs TCP at 5% loss).
UCAmI hypothesis (the composition question): prior work shows ECS and QUIC each work as substrates independently. Does integrating real QUIC traffic into a Bevy ECS ingest path introduce coupling that degrades either one's claimed properties? The paper argues no, and measures it on a real CM5 ↔ M4 Max two-machine deployment.
Architecture
Three-tier QUIC ↔ ECS bridge, headless Bevy runtime. T1/T2 are inbound (device → substrate); T3 is outbound (substrate → device, actuator commands):
| Tier | QUIC primitive | Direction | Use case | Channel cap | Sender |
|---|---|---|---|---|---|
| T1 | Unreliable datagrams (RFC 9221) | device → substrate | High-freq ephemeral telemetry; drops OK | 1024 | T1Sender::send_lossy (try_send, drop on full) |
| T2 | Unidirectional streams | device → substrate | Ordered threshold events; reliable | 512 | T2Sender::send (await, backpressure) |
| T3 | Bidirectional streams | substrate → device | Actuator commands w/ ACK | 256 | T3OutboundSender::try_send of OutboundT3 { target_device, sensor_id, raw_value, sensor_type } |
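The T1 and T2 rows above differ only in what happens when the channel is full: T1 drops, T2 waits. A minimal sketch of that contrast, assuming `std::sync::mpsc::sync_channel` as a stand-in for the crate's `tokio::sync::mpsc` channels and a bare `u64` payload in place of `QuicMessage`; the `tier_channels` helper and the `bool` return values are hypothetical:

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// Newtype per tier, so passing a T1 sender where a T2 sender is expected is a type error.
pub struct T1Sender(SyncSender<u64>);
pub struct T2Sender(SyncSender<u64>);

impl T1Sender {
    /// Lossy send: never blocks. Returns false when the message was dropped
    /// because the channel was full (or the receiver is gone).
    pub fn send_lossy(&self, msg: u64) -> bool {
        match self.0.try_send(msg) {
            Ok(()) => true,
            Err(TrySendError::Full(_)) | Err(TrySendError::Disconnected(_)) => false,
        }
    }
}

impl T2Sender {
    /// Reliable send: blocks the caller until there is room (backpressure).
    /// Returns false only when the receiver is gone.
    pub fn send(&self, msg: u64) -> bool {
        self.0.send(msg).is_ok()
    }
}

/// Hypothetical helper: builds both bounded channels with the given capacities.
pub fn tier_channels(
    t1_cap: usize,
    t2_cap: usize,
) -> (T1Sender, Receiver<u64>, T2Sender, Receiver<u64>) {
    let (t1_tx, t1_rx) = sync_channel(t1_cap);
    let (t2_tx, t2_rx) = sync_channel(t2_cap);
    (T1Sender(t1_tx), t1_rx, T2Sender(t2_tx), t2_rx)
}
```

With a T1 capacity of 2, the third `send_lossy` before any drain returns `false` and the message is gone; the same situation on T2 would block the sender instead.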
QUIC server runs on a dedicated OS thread with a Tokio multi-thread runtime. T1/T2 decoded QuicMessages (39 B fixed LE: 16 UUID + 2 sensor_id + 8 f64 + 8 ts + 4 seq + 1 sensor_type) flow into per-tier tokio::sync::mpsc channels and are drained by Bevy's ingest_system in PreUpdate, gated by run_if(in_state(ServerState::Started)). T3 flows the other way: automation_system constructs OutboundT3 items and the tokio-side drain_outbound_t3 task opens bi-streams to the target device. The per-tier sender newtypes (in substrate/src/transport/mod.rs) make tier mixups a type error. Pattern in substrate/src/transport/ecs.rs.
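The 39 B layout can be sketched with the standard library alone. This is an illustration, not the code in substrate/src/transport/mod.rs: the field names, the `u64` timestamp type, the `BadLength` variant, and the assumption that fields sit on the wire in the order listed above are all guesses.

```rust
use std::convert::TryInto;

/// 16 (UUID) + 2 (sensor_id) + 8 (f64 value) + 8 (timestamp) + 4 (seq) + 1 (sensor_type)
pub const WIRE_LEN: usize = 39;

#[derive(Debug, Clone, Copy, PartialEq)]
pub struct QuicMessage {
    pub device: [u8; 16], // raw UUID bytes (assumed representation)
    pub sensor_id: u16,
    pub value: f64,
    pub ts: u64, // assumed unsigned timestamp
    pub seq: u32,
    pub sensor_type: u8,
}

#[derive(Debug, PartialEq)]
pub enum WireError {
    /// Hypothetical variant: the buffer was not exactly WIRE_LEN bytes.
    BadLength(usize),
}

impl QuicMessage {
    /// Serialises every multi-byte field as little-endian at a fixed offset.
    pub fn encode(&self) -> [u8; WIRE_LEN] {
        let mut buf = [0u8; WIRE_LEN];
        buf[0..16].copy_from_slice(&self.device);
        buf[16..18].copy_from_slice(&self.sensor_id.to_le_bytes());
        buf[18..26].copy_from_slice(&self.value.to_le_bytes());
        buf[26..34].copy_from_slice(&self.ts.to_le_bytes());
        buf[34..38].copy_from_slice(&self.seq.to_le_bytes());
        buf[38] = self.sensor_type;
        buf
    }

    /// Rejects any buffer that is not exactly WIRE_LEN bytes, then reads each field back.
    pub fn decode(bytes: &[u8]) -> Result<QuicMessage, WireError> {
        if bytes.len() != WIRE_LEN {
            return Err(WireError::BadLength(bytes.len()));
        }
        // Length was checked above, so the fixed-size conversions cannot fail.
        Ok(QuicMessage {
            device: bytes[0..16].try_into().unwrap(),
            sensor_id: u16::from_le_bytes(bytes[16..18].try_into().unwrap()),
            value: f64::from_le_bytes(bytes[18..26].try_into().unwrap()),
            ts: u64::from_le_bytes(bytes[26..34].try_into().unwrap()),
            seq: u32::from_le_bytes(bytes[34..38].try_into().unwrap()),
            sensor_type: bytes[38],
        })
    }
}
```

A fixed-size frame is what lets the T2 reader consume a stream in 39 B chunks until EOF without any length prefix.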
T3 actuator-command protocol. The substrate's automation_system decides to actuate (e.g. Presence < 1.0 ⇒ Relay = stop) and pushes an OutboundT3 onto the outbound channel. The tokio drain_outbound_t3 pops it, looks up the target device's quinn::Connection in a ConnectionRegistry (populated by read_datagrams / read_one_uni_stream on first sight of each device UUID), then spawns one task per command to do conn.open_bi() → write 39 B → finish → read 39 B ack. Per-task spawning means a single stuck read_exact can't stall the pipeline. Latency from open_bi() to ack-receipt is recorded as substrate_latency_us{tier="t3"} and a successful ack increments substrate_received_total{tier="t3"}. Misses (substrate_t3_outbound_no_route_total), drops (substrate_t3_outbound_dropped_total), and bi-stream errors (substrate_t3_outbound_errors_total) each have their own counter.
Connection registry. Arc<std::sync::RwLock<HashMap<Uuid, quinn::Connection>>>. quinn::Connection is internally Arc; one simulator process commonly hosts 7 device UUIDs sharing one connection. Registry insert is idempotent (ensure_registered). On conn.closed().await returning, handle_incoming purges every key whose Connection::stable_id() matches the closed connection.
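The registry behaviour described above (idempotent insert, purge by connection identity) can be sketched without Quinn. Here `Conn` is a hypothetical stand-in for `quinn::Connection` that carries only a `stable_id`, device UUIDs are raw 16-byte arrays, and `purge_connection` and `lookup` are illustrative names. Keeping the first registration when a device is seen again is also an assumption.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Stand-in for quinn::Connection: cheap to clone, identified by a stable id.
#[derive(Clone, Debug, PartialEq)]
pub struct Conn {
    pub stable_id: usize,
}

pub type DeviceId = [u8; 16];

#[derive(Clone, Default)]
pub struct ConnectionRegistry(Arc<RwLock<HashMap<DeviceId, Conn>>>);

impl ConnectionRegistry {
    /// Idempotent insert. Returns true only the first time a device is seen.
    pub fn ensure_registered(&self, device: DeviceId, conn: &Conn) -> bool {
        // Fast path: most calls hit an already-known device and need only a read lock.
        let known = self.0.read().unwrap().contains_key(&device);
        if known {
            return false;
        }
        let mut map = self.0.write().unwrap();
        // Re-check under the write lock: another reader task may have won the race.
        if map.contains_key(&device) {
            return false;
        }
        map.insert(device, conn.clone());
        true
    }

    /// Route lookup used by the outbound side; a miss would count as "no route".
    pub fn lookup(&self, device: &DeviceId) -> Option<Conn> {
        self.0.read().unwrap().get(device).cloned()
    }

    /// Removes every device that was routed over the closed connection.
    /// Returns how many keys were purged.
    pub fn purge_connection(&self, stable_id: usize) -> usize {
        let mut map = self.0.write().unwrap();
        let before = map.len();
        map.retain(|_, c| c.stable_id != stable_id);
        before - map.len()
    }
}
```

Purging by `stable_id` rather than by device key is what lets one closed connection evict all of the device UUIDs that shared it in a single pass.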
Target hardware: CM5 (BCM2712, Cortex-A76, 4 GB) as DT runtime; M4 Max as traffic generator; 1 Gbps direct Ethernet. Both rigs are in hand; benchmark sweeps live on the CM5.
Repo map
quic_ecs_dt/
├── paper/ Quarto + LNCS source — single index.qmd, refs in references.bib
├── substrate/ Rust crate: Bevy 0.18 + Quinn 0.11 + rustls 0.23 + Tokio
│ └── src/
│ ├── main.rs App::new, MinimalPlugins, EcsQuicTransportPlugin, ObservabilityPlugin
│ ├── lib.rs re-exports
│ ├── config.rs figment chain: defaults → config.toml → APP_* env (split on "__")
│ ├── observability.rs metrics-exporter-prometheus on :9100
│ ├── transport/
│ │ ├── mod.rs QuicMessage codec + tier sender newtypes + OutboundT3
│ │ ├── ecs.rs EcsQuicTransportPlugin: tokio thread + bridge + registry + drain spawn
│ │ ├── server.rs bind_endpoint + accept_loop + read_datagrams + read_uni_streams
│ │ │ + drain_outbound_t3 + synthetic_t3_driver + ConnectionRegistry
│ │ └── state.rs ServerState{Starting, Started}
│ └── world/
│ ├── mod.rs WorldPlugin (5 systems wired into Pre/Update/Post)
│ ├── components.rs Asset, DeviceId, SensorId, SensorTypeTag, RawSensorData, SmoothedValue, threshold_for
│ ├── resources.rs SensorRegistry, DiagnosticsState, ExportSampleState
│ ├── systems.rs ingest, simulation, automation, export, diagnostics
│ └── tests.rs 8 unit tests inc. automation_dispatches_relay_stop
├── simulator/ Rust crate: Quinn client + sensor generators + T3 receiver
│ ├── src/
│ │ ├── main.rs CLI driver + HTTP-trigger task + T1 inline loop
│ │ ├── lib.rs module exports
│ │ ├── client.rs SimulatorClient (connect, send_datagram, send_uni_stream, request, close)
│ │ ├── commands.rs run_command_receiver (substrate → device T3 accept-bi loop)
│ │ ├── emitters.rs run_t2_emitter (T1 lives inline in main.rs)
│ │ └── profile.rs SensorProfile (single | industrial), generate_value
│ └── tests/ T1, T2, end-to-end full-loop integration tests
├── data/
│ ├── two_machine/ CM5 ↔ M4 Max sweep — final_table.csv (load-bearing for the paper)
│ └── local/ loopback sweeps (scaling.csv, cross_tier.csv)
├── scripts/
│ ├── bench-loss.sh M6 sweep entities×loss → data/two_machine/final_table.csv
│ ├── bench-scaling.sh T1 rate sweep + optional synthetic-T3 cross-tier mode
│ ├── bench-client.sh M8 client driver (run from Mac when substrate is on CM5)
│ ├── demo.sh full-stack demo: certs + build + VM/Grafana + sub + sim
│ ├── setup-cm5.sh CM5 provisioning (apt + cargo install)
│ └── verify-netem.sh confirm tc-netem is shaping in the right direction (BIDI=1 for ifb mode)
├── monitoring/ docker-compose: VictoriaMetrics + Grafana auto-provisioned
├── dashboards/ runtime.json + sensors.json
├── certs/ gitignored, regenerated by `make certs`
├── Cargo.toml workspace
└── Makefile render, preview, build, build-cm5, deploy-cm5, monitoring-up
Status
Code (substrate + simulator):
| Area | State |
|---|---|
| `AppConfig` figment loader (defaults → TOML → env with `__` split) | Done — substrate/src/config.rs. Env override actually works (Env::prefixed("APP_").split("__")); discovered late that the previous chain silently ignored env vars |
| 39 B wire codec | Done — substrate/src/transport/mod.rs, 5 unit tests |
| Quinn server lifecycle + TLS | Done — bind_endpoint + accept_loop in substrate/src/transport/server.rs; ServerState{Starting, Started} in state.rs; explicit TransportConfig w/ 256 KiB datagram recv buffer; dev cert via make certs, rustls aws-lc-rs provider installed in main.rs |
| T1 demux (datagrams → ECS) | Done. read_datagrams reader; decode errors non-fatal; channel-full drops silent; per-stream counters in debug summary. Calls ensure_registered on first decode so outbound T3 can route to this device |
| T2 demux (uni streams → ECS) | Done. read_uni_streams accepts streams, spawns one task per stream that reads 39 B chunks until EOF; decode failure resets the stream via recv.stop(0); t2.send().await honours backpressure; first decode also calls ensure_registered |
| T3 outbound (ECS → device) | Done. drain_outbound_t3 task pops OutboundT3 items, looks up the target device's Connection in ConnectionRegistry, spawns one task per command to do open_bi → write 39 B → finish → read ack. Per-task spawning prevents a single stuck read_exact from stalling the pipeline. Records substrate_latency_us{tier="t3"} on success; counts no-route / dropped / errors separately. The old simulator-initiated T3 inbound path (T3Sender / T3Inbound / accept_bi_streams) is gone |
| Connection registry (Uuid → Connection) | Done — Arc<RwLock<HashMap<Uuid, quinn::Connection>>>; idempotent insert via ensure_registered; purged in handle_incoming after conn.closed().await using Connection::stable_id() |
| Synthetic T3 driver (bench only) | Done. synthetic_t3_driver task in server.rs spawned by accept_loop when APP_NETWORK__SYNTHETIC_T3_RATE_HZ > 0. Round-robins over registered devices, toggles raw_value between 0/1, pushes through the same outbound channel automation_system uses |
| ECS components + 5 systems | Done — world/. Entities = (Asset, DeviceId, SensorId, SensorTypeTag, RawSensorData, SmoothedValue) per (device, sensor). 5 systems: ingest (PreUpdate, drains T1+T2), simulation (Update, rolling mean + threshold-crossings counter), automation (Update, Presence-cross → t3_out.try_send(OutboundT3{Relay setpoint}) + local mirror), export (PostUpdate, per-second metric sample), diagnostics (PostUpdate, per-second tick_hz log) |
| Schedule rate-gating | Done — MinimalPlugins.set(ScheduleRunnerPlugin::run_loop(1/tick_rate_hz)) in main.rs |
| Prometheus exporter + Grafana | Done. metrics-exporter-prometheus on :9100 via ObservabilityPlugin. Runtime metrics: substrate_received_total{tier}, substrate_dropped_total{tier=t1}, substrate_decode_errors_total{tier}, substrate_t3_outbound_*_total, substrate_latency_us{tier} histograms, substrate_tick_hz, substrate_entities, substrate_channel_depth{tier}, substrate_rss_bytes. Sensor data: sensor_aggregate{type, stat=count|mean|min|max}. Dashboards: dashboards/runtime.json + dashboards/sensors.json |
| Simulator binary | Done — simulator/src/main.rs. Clap flags: --addr, --server-name, --cert, --profile {single, industrial}, --sensor-type, --sensor-id, --rate-hz, --t2-rate-hz, --count, --devices. industrial profile fans out to 7 sensors per device on ids 0..6 (Temperature/Humidity/Pressure/Voltage/Current/Presence/Relay). HTTP trigger on :9002 (POST /trigger) pushes Presence=0 over T2 — operator-facing demo entry point. T1/T2 emitters check engine_running per tick; when false, Current waveform drops to ~0 while Voltage stays at ~230 V |
| Simulator command receiver | Done — simulator/src/commands.rs. run_command_receiver loops on conn.accept_bi(), decodes 39 B, flips engine_running on sensor_type == Relay setpoints, writes 39 B ack. Spawned by main.rs post-connect. new_engine_state() constructor exported for integration tests |
| End-to-end test harness | 18 tests, all green. 5 codec unit tests; 8 world unit tests (incl. automation_dispatches_relay_stop_when_presence_drops); 2 T1 + 2 T2 integration tests; 1 full closed-loop test (simulator/tests/end_to_end_full_loop.rs: Presence < 1.0 → substrate T3 → engine_running flips to false; then Presence > 1.0 → flips back) |
| Benchmark scripts | Done. bench-loss.sh — entity × loss sweep, bidirectional tc-netem via ifb on the CM5 (BIDI=1 default). bench-scaling.sh — T1 rate sweep + optional substrate-side APP_NETWORK__SYNTHETIC_T3_RATE_HZ. verify-netem.sh — sanity-check netem on the right interface in the right direction (BIDI=1 mode covers ingress via ifb) |
| CM5 deploy | Done — make build-cm5 && make deploy-cm5; setup-cm5.sh provisions deps. Bench has been run end-to-end on CM5; data lives in data/two_machine/final_table.csv |
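The simulation system's "rolling mean + threshold-crossings counter" from the table above can be illustrated in isolation. This is a sketch under stated assumptions, not the code in world/systems.rs: the `Smoothed` type, the fixed window length, counting crossings on the smoothed value rather than the raw one, and treating a value equal to the threshold as "above" are all hypothetical choices.

```rust
use std::collections::VecDeque;

/// Rolling mean over the last `cap` samples, plus a count of threshold crossings.
pub struct Smoothed {
    window: VecDeque<f64>,
    cap: usize,
    sum: f64,
    above: Option<bool>, // side of the threshold after the previous sample
    pub crossings: u64,
}

impl Smoothed {
    pub fn new(cap: usize) -> Smoothed {
        assert!(cap > 0, "window must hold at least one sample");
        Smoothed {
            window: VecDeque::with_capacity(cap),
            cap: cap,
            sum: 0.0,
            above: None,
            crossings: 0,
        }
    }

    /// Adds one raw sample, returns the updated mean, and bumps `crossings`
    /// whenever the mean changes side relative to `threshold`.
    pub fn push(&mut self, raw: f64, threshold: f64) -> f64 {
        if self.window.len() == self.cap {
            if let Some(old) = self.window.pop_front() {
                self.sum -= old; // evict the oldest sample from the running sum
            }
        }
        self.window.push_back(raw);
        self.sum += raw;
        let mean = self.sum / self.window.len() as f64;

        let now_above = mean >= threshold;
        if let Some(prev) = self.above {
            if prev != now_above {
                self.crossings += 1;
            }
        }
        self.above = Some(now_above);
        mean
    }
}
```

Keeping a running sum makes each update O(1) regardless of window length, which matters when the system runs over every entity on every tick.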
Paper:
| Area | State |
|---|---|
| Track + topics chosen | Done — UCAmI Track 2 (IoE and Sensors); primary IoE interoperability, integration and performance; secondary IoE experimental results and deployment scenarios |
| Abstract | Done. Honest framing: "tick rate remains an order of magnitude above the cadence required" (not "stable"), mixed-reliability isolation as the T1-vs-T3 story, 0.12 MB/1k slope |
| Tables 2/3/4 from real CM5 data | Done. Native markdown tables driven by inline {python} values reading from data/two_machine/final_table.csv; cross-refs (@tbl-latency, @tbl-throughput, @tbl-t3-rtt) resolve in the LNCS LaTeX output. Earlier display(Markdown(...)) approach didn't register with Quarto's cross-ref filter; switched to native md tables with inline-python cells |
| `fig-isolation` | Dropped. Cross-tier story now told by tbl-latency + tbl-t3-rtt (T1 flat under loss, T3 absorbs ~38 ms retransmit). Cleaner than the loopback fig. data/local/cross_tier.csv is still on disk but the paper no longer reads it |
| Architecture §3 + Table 1 | Updated for substrate-initiated T3. Table 1 T3 row reads "OutboundT3 enqueue + ack | Bidirectional stream (server-initiated)"; the connection-registry / per-device routing is described in the prose |
| Implementation §4 Automation paragraph | Updated for the new outbound T3 path; describes the per-device registry, the per-command bi-stream, and the simulator-side run_command_receiver engine-state flip |
| Discussion + Conclusion | Honest now: drops the unbacked "<5% IngestSystem drain" and "Grafana adds no overhead" claims; conclusion populates both 0%-loss and 5%-loss Hz from data |
| Render | Clean against LNCS LaTeX template (make render → 10-page PDF, no Quarto warnings) |
Roadmap
Treat the milestone log as historical. The paper-side work below tracks what's left before camera-ready.
- M1 — Wire codec & root config. ✅ 2026-05-04.
- M2 — Quinn server + TLS. ✅ 2026-05-06.
- M3 — Simulator client. ✅ Done. `SimulatorClient` + CLI driver + waveform profiles + HTTP trigger + closed-loop command receiver.
- M4 — ECS world. ✅ Done. 5 systems wired; automation closes the T3 loop.
- M5 — Observability. ✅ Done. Both dashboards live; metrics exposed via prometheus scrape.
- M6 — Benchmark harness. ✅ Done. `bench-loss.sh` + `bench-scaling.sh` + `verify-netem.sh` (last one added when egress-only netem was masking the inbound T1 loss path; now `ifb` ingress shaping is default).
- M7 — CM5 cross-compile & deploy. ✅ Done. Multiple sweeps shipped from CM5.
- M8 — Two-machine run + paper render. ✅ Done. Paper renders against data/two_machine/final_table.csv; all inline scalars and tables populate from real numbers.
- M9 — T3 inversion (substrate-initiated actuator commands). ✅ 2026-05-13. The paper's Table 1 said T3 was "actuator commands" but the code had it inverted (device → substrate RPC). Refactored to match the paper: substrate opens bi-streams, simulator's `run_command_receiver` accepts. Full closed-loop integration test landed.
- M10 — Abstract submission polish. ⏳ In progress. Top-of-paper fixes shipped (abstract framing, contributions paragraph, Table 1 T3 row, Architecture §3 backpressure paragraph, author affiliation, `(author?)` cite markers). Remaining polish is full-paper-only (Implementation §4 module-list lies, code listing with fake types, Observability §4.2 push-vs-pull mismatch, Experimental Setup §5.1 stale tc-netem / tick counts / loopback-vs-two-machine sentence). None block abstract submission.
Open polish items (not blocking abstract submission):
- §4.1 Integrated Prototype still lists six systems including a non-existent `FaultInjection`; module list says `transport.rs` / `world.rs` / `metrics.rs` / `main.rs` but the actual layout is `transport/`, `world/`, `observability.rs`, `config.rs`, `main.rs`, `lib.rs` plus a separate `simulator` crate.
- §4.1 code listing uses fictional types (`AssetId`, `EntityMap`, `TickDiagnostics`). Easier to drop the listing than to rewrite faithfully.
- §4.2 Observability Stack describes a push model with InfluxDB line protocol; actual code uses `metrics-exporter-prometheus` exposing `/metrics` for VM scrape.
- §5.1 Experimental Setup needs three updates: tc-netem direction (now bidirectional via `ifb`), "2,000 warmup ticks and 5,000 measurement ticks" → "20 s warmup + 50 s window (wall-clock)", and drop the "loopback for latency / two-machine for throughput" sentence (all numbers are from the two-machine sweep now).
Conventions
- Rust: edition 2024; workspace at root with `simulator` + `substrate`.
- Pinned crates: Bevy 0.18, Quinn 0.11, rustls 0.23, Tokio 1 (full), figment 0.10 (toml + env), uuid 1.23 (v4), serde 1.
- Config: `figment` chain — defaults → `config.toml` → env `APP_*` with `__` nesting (e.g. `APP_NETWORK__SERVER_PORT=9000`, `APP_NETWORK__SYNTHETIC_T3_RATE_HZ=100`).
- Bevy: headless — `MinimalPlugins` only; do not pull rendering plugins.
- Tokio↔Bevy: keep the dedicated-thread + mpsc pattern in transport/ecs.rs; do not block the ECS schedule on async work.
- Paper: Quarto + LNCS template (paper/_extensions/template.tex, paper/_quarto.yml). Never commit `llncs.cls` or `splncs04.bst` — CTAN licensing; download per README.md. For tables in LaTeX target, use native markdown tables with `: Caption {#tbl-foo}` syntax and inline `{python}` cells, not `display(Markdown(...))` chunks — Quarto's cross-ref filter doesn't pick the latter up in LaTeX output.
- Data: raw CSVs under `data/` are committed; `*_processed.csv` is gitignored. Paper figures consume `data/two_machine/final_table.csv` exclusively (the previous `data/loopback/` was renamed to `data/two_machine/` once it became the real CM5 sweep).
- Errors: `anyhow` (with `.context()`) for internal startup paths; `thiserror` for boundary types we want to match against (e.g. `WireError` in the codec).
- Warnings: let real warnings show. No `#[allow(dead_code)]`, `_var` blanket suppression, or `PhantomData` shims to silence the compiler — warnings are honest TODO markers and disappear when the consuming code lands.
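To make the `APP_*` / `__` convention in the Config item concrete, the function below sketches how an environment key is expected to map onto a nested config path. It is a std-only illustration, not figment itself: `env_to_path` is a hypothetical helper, and it assumes the provider strips the prefix, lowercases the remainder, and treats each `__` as one nesting level. Unlike a real provider, the prefix match here is case-sensitive.

```rust
/// Maps an environment variable name to a nested config path.
/// Returns None when the key does not carry the prefix or nothing follows it.
pub fn env_to_path(key: &str, prefix: &str, sep: &str) -> Option<Vec<String>> {
    let rest = key.strip_prefix(prefix)?;
    if rest.is_empty() {
        return None;
    }
    // Each separator occurrence opens one level of nesting; segments are lowercased.
    Some(rest.split(sep).map(|seg| seg.to_ascii_lowercase()).collect())
}
```

Under these assumptions `APP_NETWORK__SERVER_PORT` resolves to the path `network` → `server_port`. The single underscore inside `SERVER_PORT` stays part of the field name, which is why the nesting separator has to be the double underscore.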
Known deferrals
- Channel ownership is per-host, not per-connection. All connections share the same inbound mpsc channels and the outbound T3 channel. Fairness under N-device load relies on tokio scheduling. Acceptable for "one ECS world per host".
- No graceful shutdown. The `quic-runtime` thread parks on `pending()`; spawned tasks orphan at process exit. Fine for research runs.
- Bind failure is fatal. `OnEnter(Starting)` panics if `bind_endpoint` fails.
- T3 outbound concurrency is unbounded. `drain_outbound_t3` spawns one task per command. Under sustained T1 ingest beyond ~10k msg/s the per-command tasks queue behind the tokio scheduler and T3 P99 climbs into the hundreds of ms (throughput still holds). If we ever need strict T3 latency isolation under heavy T1 load, add a `tokio::sync::Semaphore` cap or a dedicated runtime/thread for T3.
- NTP drift over a long bench shifts the across-row T1 P99 baseline. Visible in `tbl-latency` (47 ms at 50k → 28 ms at 200k). The within-row Δ is what speaks to isolation; the across-row absolutes don't. Paper caption explains this.
- Schedule rate-gating is approximate. Observed `tick_hz` runs ~85% of target on macOS dev; tighter on the CM5.
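The semaphore cap suggested for T3 outbound concurrency can be illustrated with OS threads. This is an analogue under stated assumptions, not a patch for `drain_outbound_t3`: the real code would use tokio tasks and `tokio::sync::Semaphore`, whereas `Permits` and `run_capped` below are hypothetical std-only stand-ins, and the 2 ms sleep merely simulates a bi-stream round trip.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// Minimal counting semaphore built from a Mutex and a Condvar.
pub struct Permits {
    free: Mutex<usize>,
    cv: Condvar,
}

impl Permits {
    pub fn new(n: usize) -> Arc<Permits> {
        Arc::new(Permits { free: Mutex::new(n), cv: Condvar::new() })
    }

    /// Blocks until a permit is available, then takes it.
    pub fn acquire(&self) {
        let mut free = self.free.lock().unwrap();
        while *free == 0 {
            free = self.cv.wait(free).unwrap();
        }
        *free -= 1;
    }

    /// Returns a permit and wakes one waiter.
    pub fn release(&self) {
        *self.free.lock().unwrap() += 1;
        self.cv.notify_one();
    }
}

/// Dispatches `commands` simulated T3 commands with at most `cap` in flight.
/// Returns the peak number of commands that ran concurrently.
pub fn run_capped(commands: usize, cap: usize) -> usize {
    let permits = Permits::new(cap);
    let in_flight = Arc::new(Mutex::new((0usize, 0usize))); // (current, peak)
    let mut handles = Vec::new();
    for _ in 0..commands {
        // The dispatcher stalls here once `cap` commands are outstanding.
        permits.acquire();
        let permits = Arc::clone(&permits);
        let in_flight = Arc::clone(&in_flight);
        handles.push(thread::spawn(move || {
            {
                let mut g = in_flight.lock().unwrap();
                g.0 += 1;
                if g.0 > g.1 {
                    g.1 = g.0;
                }
            }
            thread::sleep(Duration::from_millis(2)); // simulated open_bi, write, ack
            in_flight.lock().unwrap().0 -= 1;
            permits.release();
        }));
    }
    for h in handles {
        h.join().unwrap();
    }
    let peak = in_flight.lock().unwrap().1;
    peak
}
```

The trade-off is that a cap moves queueing from the scheduler into the dispatcher: commands wait for a permit instead of piling up as spawned tasks, so the outbound channel fills and `try_send` drops become the visible overload signal.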
Run / verify
make certs # dev TLS (ECDSA P-256, SAN: localhost/cm5.local/127.0.0.1/::1)
make build # cargo build --release native
make build-cm5 # aarch64 cross-build
make deploy-cm5 # scp to $CM5_HOST
make render # paper PDF
make preview # live-reload paper at :4848
make monitoring-up # docker-compose VM + Grafana
Tests. cargo test --workspace runs codec unit tests + world unit tests + 5 integration tests (T1, T2, full closed-loop) in simulator/tests/. Each integration test calls bind_endpoint + accept_loop in-process on 127.0.0.1:0. The full-loop test stands up the real outbound machinery (accept_loop + drain_outbound_t3) and asserts the engine-state flag flips in both directions.
Metrics scrape. With metrics_enabled = true (default):
curl http://127.0.0.1:9100/metrics
make monitoring-up brings up VictoriaMetrics + Grafana auto-provisioned at http://localhost:3000 (admin / admin); the dashboards mount live from dashboards/ so JSON edits re-import within ~10 s.
Full-stack demo. scripts/demo.sh brings up certs + cargo build + monitoring stack + substrate + simulator and tails the simulator's progress log. Industrial profile by default; Presence dips below threshold every few seconds, triggering substrate-initiated T3 Relay setpoints, visible on the operator dashboard as Current collapsing to ~0 A while Voltage holds.
./scripts/demo.sh # defaults
PROFILE=single RATE_HZ=100 DEVICES=20 ./scripts/demo.sh
KEEP_MONITORING=1 ./scripts/demo.sh # leave VM + Grafana running on exit
Manual two-process run. From the repo root:
# shell 1 — server
cargo run -p substrate
# shell 2 — client
cargo run -p simulator -- --profile industrial --rate-hz 100 --count 0 --devices 4
Simulator flags (see cargo run -p simulator -- --help): --addr, --server-name, --cert, --profile {single, industrial}, --sensor-type, --sensor-id, --rate-hz (T1 datagram rate; 0 disables T1), --t2-rate-hz (T2 event rate; 0 disables T2), --count (T1 count; 0 = until Ctrl-C), --devices. No simulator-side T3 flag — T3 is substrate-initiated. Per-second progress lines show t1_sent/t2_sent/engine={running,stopped}.
Bidirectional netem on the CM5. scripts/bench-loss.sh applies tc netem loss N% bidirectionally via an ifb ingress-redirect (BIDI=1 default). scripts/verify-netem.sh confirms it lands on the right interface:
./scripts/verify-netem.sh <peer-ip> end0 5 # egress only
BIDI=1 ./scripts/verify-netem.sh <peer-ip> end0 5 # both directions via ifb
Key references
- Prior self-citations: `plantevin2026ecs`, `plantevin2026quic` (both IEEE SWC 2026, "to appear").
- QUIC: RFC 9000 (core), RFC 9221 (unreliable datagrams).
- DT foundations: Tao et al. 2019; Grieves & Vickers 2017; Minerva et al. 2020.
- ECS: Nystrom 2014, Game Programming Patterns.
- Mixed-reliability transport: Peeck et al. (W2RP for DDS).
- DT sync metrics: Çakır et al. 2023 (Twin Alignment Ratio); Bellavista et al. 2023 (ODTE).
- Industrial QUIC/IIoT: Fernández et al. 2021; Boeding et al. 2025.
- Full bibliography: paper/references.bib.