quic_ecs_dt — Project Guide for Claude
What & why
Source repo for "QUIC + ECS as Complementary Transport and Runtime Substrates for Industrial Digital Twins" — UCAmI 2026 (Plantevin & Francillette, UQAC). Third paper in a sequence; the first two are at IEEE SWC 2026:
- plantevin2026ecs: ECS as runtime substrate for industrial DT (200k assets @ 114 Hz on Pi 5).
- plantevin2026quic: QUIC partial reliability for DT sensor streams (94% P99 reduction vs TCP at 5% loss).
UCAmI hypothesis (the composition question): prior work shows ECS and QUIC each work as substrates independently. Does integrating real QUIC traffic into a Bevy ECS ingest path introduce coupling that degrades either one's claimed properties? The paper argues no, and measures it.
Architecture
Three-tier QUIC ↔ ECS bridge, headless Bevy runtime. T1/T2 are inbound (device → substrate); T3 is outbound (substrate → device, actuator commands):
| Tier | QUIC primitive | Direction | Use case | Channel cap | Sender |
|---|---|---|---|---|---|
| T1 | Unreliable datagrams (RFC 9221) | device → substrate | High-freq ephemeral telemetry; drops OK | 1024 | T1Sender::send_lossy (try_send, drop on full) |
| T2 | Unidirectional streams | device → substrate | Ordered threshold events; reliable | 512 | T2Sender::send (await, backpressure) |
| T3 | Bidirectional streams | substrate → device | Actuator commands w/ ACK | 256 | T3OutboundSender::try_send of OutboundT3 { target_device, sensor_id, raw_value, sensor_type } |
QUIC server runs on a dedicated OS thread with a Tokio multi-thread runtime. T1/T2 decoded QuicMessages (39 B fixed LE: UUID + sensor_id + f64 + ts + seq + sensor_type) flow into per-tier tokio::sync::mpsc channels and are drained by Bevy's ingest_system in PreUpdate, gated by run_if(in_state(ServerState::Started)). T3 flows the other way: automation_system constructs OutboundT3 items and the tokio-side drain_outbound_t3 task opens bi-streams to the target device. The per-tier sender newtypes (in substrate/src/transport/mod.rs) make tier mixups a type error. Pattern is in substrate/src/transport/ecs.rs.
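A minimal sketch of how per-tier sender newtypes can turn tier mixups into type errors. It is not the code in substrate/src/transport/mod.rs: `std::sync::mpsc::sync_channel` stands in for `tokio::sync::mpsc` so the example needs no external crates, a blocking `send` stands in for `send(...).await`, and `Msg`, `t1_channel`, and `t2_channel` are hypothetical names.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender};

/// Stand-in for the decoded wire message (hypothetical, trimmed down).
#[derive(Debug, Clone, PartialEq)]
pub struct Msg {
    pub sensor_id: u16,
    pub raw_value: f64,
}

/// T1: lossy telemetry. The only send method is `send_lossy`.
pub struct T1Sender(SyncSender<Msg>);

/// T2: reliable events. The only send method is `send`, which waits for room.
pub struct T2Sender(SyncSender<Msg>);

impl T1Sender {
    /// Returns `true` if the message was queued, `false` if it was dropped
    /// because the channel was full or the receiver is gone.
    pub fn send_lossy(&self, msg: Msg) -> bool {
        self.0.try_send(msg).is_ok()
    }
}

impl T2Sender {
    /// Blocks until the receiver makes room (backpressure). Returns `false`
    /// only if the receiver has been dropped.
    pub fn send(&self, msg: Msg) -> bool {
        self.0.send(msg).is_ok()
    }
}

/// Bounded T1 channel; the tier table lists a capacity of 1024 for T1.
pub fn t1_channel(cap: usize) -> (T1Sender, Receiver<Msg>) {
    let (tx, rx) = sync_channel(cap);
    (T1Sender(tx), rx)
}

/// Bounded T2 channel; the tier table lists a capacity of 512 for T2.
pub fn t2_channel(cap: usize) -> (T2Sender, Receiver<Msg>) {
    let (tx, rx) = sync_channel(cap);
    (T2Sender(tx), rx)
}
```

Because `T1Sender` has no `send` and `T2Sender` has no `send_lossy`, code that tries to push a threshold event through the lossy path does not compile.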
T3 actuator-command protocol. The substrate's automation_system decides to actuate (e.g. Presence < 1.0 ⇒ Relay = stop) and pushes an OutboundT3 onto the outbound channel. The tokio drain task pops it, looks up the target device's quinn::Connection in a ConnectionRegistry (populated by read_datagrams / read_one_uni_stream on first sight of each device UUID), then spawns one task per command to do conn.open_bi() → write 39 B → finish → read 39 B ack. Per-task spawning means a single stuck read_exact can't stall the pipeline. Latency from open_bi() to ack-receipt is recorded as substrate_latency_us{tier="t3"} and a successful ack increments substrate_received_total{tier="t3"}. Misses (substrate_t3_outbound_no_route_total), drops (substrate_t3_outbound_dropped_total), and bi-stream errors (substrate_t3_outbound_errors_total) each have their own counter.
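The actuation decision can be sketched as a pure function that maps a Presence reading to an optional `OutboundT3`. This is an illustration under stated assumptions, not the real `automation_system`: `decide_relay`, `RELAY_STOP`, and `RELAY_RUN` are hypothetical, the raw-value encoding (0.0 = stop, 1.0 = run) is assumed, the device UUID is modelled as raw bytes, and the behaviour at exactly 1.0 is a guess.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SensorType {
    Presence,
    Relay,
}

/// Same field names as the tier table; the UUID is raw bytes here.
#[derive(Debug, Clone, PartialEq)]
pub struct OutboundT3 {
    pub target_device: [u8; 16],
    pub sensor_id: u16,
    pub raw_value: f64,
    pub sensor_type: SensorType,
}

/// Assumed encoding of the relay command value.
pub const RELAY_STOP: f64 = 0.0;
pub const RELAY_RUN: f64 = 1.0;

/// Presence < 1.0 yields a stop command, Presence > 1.0 yields a run
/// command. Exactly 1.0 and NaN yield no command (an assumption).
pub fn decide_relay(
    target_device: [u8; 16],
    relay_sensor_id: u16,
    presence: f64,
) -> Option<OutboundT3> {
    let raw_value = if presence < 1.0 {
        RELAY_STOP
    } else if presence > 1.0 {
        RELAY_RUN
    } else {
        return None;
    };
    Some(OutboundT3 {
        target_device,
        sensor_id: relay_sensor_id,
        raw_value,
        sensor_type: SensorType::Relay,
    })
}
```

Keeping the decision free of channel and QUIC types makes it testable without a running server; the caller would hand the result to `T3OutboundSender::try_send`.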
Connection registry. Arc<std::sync::RwLock<HashMap<Uuid, quinn::Connection>>>. quinn::Connection is internally Arc; one simulator process commonly hosts 7 device UUIDs sharing one connection. Registry insert is idempotent (ensure_registered). On conn.closed().await returning, handle_incoming purges every key whose Connection::stable_id() matches the closed connection.
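A std-only sketch of the registry shape described above. `FakeConn` stands in for `quinn::Connection` (only its `stable_id` matters here), `DeviceId` stands in for `uuid::Uuid`, and `purge_connection` is a hypothetical name for the cleanup that `handle_incoming` performs; the real signatures may differ.

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Stand-in for `uuid::Uuid`.
pub type DeviceId = u128;

/// Stand-in for `quinn::Connection`; cloning is cheap in both cases.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct FakeConn {
    pub stable_id: usize,
}

pub type ConnectionRegistry = Arc<RwLock<HashMap<DeviceId, FakeConn>>>;

pub fn new_connection_registry() -> ConnectionRegistry {
    Arc::new(RwLock::new(HashMap::new()))
}

/// Idempotent insert. Returns `true` only when the device was newly added.
pub fn ensure_registered(reg: &ConnectionRegistry, device: DeviceId, conn: &FakeConn) -> bool {
    // Fast path: a read lock is enough for devices we have already seen.
    let known = reg.read().expect("registry poisoned").contains_key(&device);
    if known {
        return false;
    }
    // Slow path: re-check under the write lock, another reader task may
    // have registered the device in between.
    let mut map = reg.write().expect("registry poisoned");
    match map.entry(device) {
        Entry::Vacant(slot) => {
            slot.insert(conn.clone());
            true
        }
        Entry::Occupied(_) => false,
    }
}

/// Removes every device routed through the closed connection and returns
/// how many keys were purged.
pub fn purge_connection(reg: &ConnectionRegistry, closed_stable_id: usize) -> usize {
    let mut map = reg.write().expect("registry poisoned");
    let before = map.len();
    map.retain(|_, conn| conn.stable_id != closed_stable_id);
    before - map.len()
}
```

With seven device UUIDs sharing one connection, a single `purge_connection` call drops all seven routes and leaves devices on other connections untouched.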
Target hardware: CM5 (BCM2712, Cortex-A76, 4 GB) as DT runtime; M4 Max as traffic generator; 1 Gbps direct Ethernet. Both rigs are in hand.
Repo map
quic_ecs_dt/
├── paper/ Quarto + LNCS source — single index.qmd, refs in references.bib
├── substrate/ Rust crate: Bevy 0.18 + Quinn 0.11 + rustls 0.23 + Tokio
│   └── src/
│       ├── main.rs        App::new, MinimalPlugins, EcsQuicTransportPlugin
│       ├── config.rs      figment chain: defaults → config.toml → APP_* env
│       └── transport/
│           ├── mod.rs     QuicMessage struct + codec, per-tier sender newtypes
│           ├── ecs.rs     Plugin: tokio thread + 3 mpsc + PreUpdate ingest
│           └── server.rs  handle_incoming, T1/T2 readers, drain_outbound_t3
├── simulator/ Rust crate: SimulatorClient lib (Quinn client) + CLI driver with sensor generators
├── data/ (created by M6) loopback/, two_machine/ — raw CSVs committed, *_processed ignored
├── Cargo.toml workspace
└── Makefile render, preview, build, build-cm5, deploy-cm5
Status
| Area | State |
|---|---|
| AppConfig figment loader (defaults → TOML → env, __ split) | Done: substrate/src/config.rs |
| Inbound bridge scaffolding (Tokio thread + Bevy plugin) | Done — substrate/src/transport/ecs.rs |
| QuicMessage struct + 39 B LE codec | Done: substrate/src/transport/mod.rs; 5 unit tests passing |
| Quinn server lifecycle | Listener up — ServerState{Starting,Started} in substrate/src/transport/state.rs; OnEnter(Starting) → bind + accept loop in substrate/src/transport/ecs.rs. Explicit TransportConfig w/ tuned datagram recv buffer (256 KiB) in substrate/src/transport/server.rs. Per-tier sender newtypes (T1Sender::send_lossy, T2Sender::send, T3OutboundSender::try_send) in substrate/src/transport/mod.rs |
| T1 demux (datagrams → ECS) | Done — handle_incoming orchestrator + read_datagrams reader in substrate/src/transport/server.rs; decode errors logged but non-fatal; channel-full drops silent at trace; received/dropped/decode_errors counters in the end-of-stream debug line. Calls ensure_registered on first decode so outbound T3 can route to this device |
| T2 demux (uni streams → ECS) | Done — read_uni_streams accepts streams in substrate/src/transport/server.rs, spawns one task per stream that reads 39 B chunks until EOF; decode failure resets the stream via recv.stop(0) (one bad stream doesn't kill the connection); t2.send().await honours backpressure; first decode also calls ensure_registered |
| T3 outbound (ECS → device, substrate-initiated) | Done — drain_outbound_t3 task in substrate/src/transport/server.rs pops OutboundT3 items, looks up the target device's Connection in ConnectionRegistry, spawns one task per command to do open_bi → write 39 B → finish → read ack. Per-task spawning ensures one stuck ack can't stall the pipeline. Records substrate_latency_us{tier="t3"} on success; counts no-route, dropped, and error cases separately. The old simulator-initiated T3 inbound path (T3Sender / T3Inbound / accept_bi_streams) is gone as of this refactor |
| Connection registry (Uuid → Connection) | Done — Arc<RwLock<HashMap<Uuid, quinn::Connection>>> populated by readers; purged in handle_incoming after conn.closed().await using Connection::stable_id(). Constructor new_connection_registry; idempotent insert via ensure_registered |
| TLS / self-signed cert | Done (M1) — certs/server.{crt,key} via make certs, gitignored. PEM loader in substrate/src/transport/server.rs:15; rustls aws-lc-rs default provider installed in substrate/src/main.rs |
| Wire codec for QuicMessage (39 B fixed LE, incl. sensor_type: u8) | Done: substrate/src/transport/mod.rs; 5 unit tests passing. SensorType enum: Generic / Temperature / Humidity / Pressure / Voltage / Current |
| tracing-subscriber init w/ RUST_LOG | Done (M1): substrate/src/main.rs:8-12 |
| ECS components (RawSensorData, SmoothedValue) + 4 systems (Ingest/Sim/Export/Diagnostics) | Done: entities = (DeviceId, SensorId, SensorTypeTag, RawSensorData, SmoothedValue, Asset) per (device, sensor); SensorRegistry upserts via HashMap<(Uuid, u16), Entity> in substrate/src/world.rs. IngestSystem drains all three tiers; T3 ack preserves command's sensor_type and returns the device's most recent raw_value. SimulationSystem maintains a 16-sample rolling mean per entity and emits substrate_threshold_crossings_total{type, direction} when the smoothed mean crosses a per-type threshold (Changed<RawSensorData> query so cost scales with ingress, not fleet size). ExportSystem samples substrate_{entities,channel_depth,channel_capacity,rss_bytes} + sensor_aggregate{type, stat} once per second. Diagnostics logs tick_hz once per second |
| Schedule rate-gating | Done (M4) — MinimalPlugins.set(ScheduleRunnerPlugin::run_loop(1/tick_rate_hz)) in substrate/src/main.rs; replaces the default busy-loop with the configured period |
| Prometheus exporter + Grafana dashboards | Done (M5) — ObservabilityPlugin in substrate/src/observability.rs installs metrics-exporter-prometheus on the existing tokio runtime. Runtime surface (paper §Evaluation): counters substrate_received_total{tier}, dropped_total{tier=t1}, decode_errors_total{tier}, t3_no_handler_total; latency histograms substrate_latency_us{tier}; gauges substrate_tick_hz, substrate_entities, substrate_channel_depth{tier}, substrate_channel_capacity{tier}, substrate_rss_bytes. Sensor data surface (operator dashboard): per-type aggregates `sensor_aggregate{type, stat=count |
| Simulator (Quinn client + sensor generators) | SimulatorClient lib in simulator/src/client.rs — connects, trusts the substrate's PEM cert via custom ServerCertVerifier (sidesteps CaUsedAsEndEntity); send_datagram(QuicMessage) for T1, send_uni_stream(&[QuicMessage]) for T2. SimulatorClient::request exists for ad-hoc tests but the binary no longer initiates T3. CLI driver in simulator/src/main.rs with clap flags (--addr, --rate-hz, --t2-rate-hz, --count, --devices, --sensor-id, --sensor-type, --profile, --cert, --server-name). --profile industrial fans out to 7 sensors per device (Temperature/Humidity/Pressure/Voltage/Current/Presence/Relay). T1/T2 emitters check engine_running per-tick — Voltage stays at ~230 V regardless; Current drops to ~0 when stopped. HTTP trigger on :9002 (POST /trigger) pushes a Presence=0 reading via T2 for Grafana-driven demos |
| Simulator command receiver (substrate → device T3) | Done — run_command_receiver in simulator/src/commands.rs loops on conn.accept_bi(), decodes 39 B, sets engine_running from raw_value when sensor_type == Relay, writes 39 B ack. Spawned by main.rs post-connect. new_engine_state() constructor exported for integration tests |
| End-to-end test harness | 18 tests across simulator/tests/end_to_end_t1.rs, simulator/tests/end_to_end_t2.rs, simulator/tests/end_to_end_full_loop.rs: T1 single-datagram + 32-msg burst order; T2 single-stream + 4-stream concurrent ordering; full closed loop (Presence < 1.0 → substrate T3 → simulator engine_running flips, then Presence > 1.0 → flips back). Plus codec + world unit tests including automation_dispatches_relay_stop_when_presence_drops |
| config.toml at repo root | Done: config.toml; loaded by substrate/src/main.rs; env override via APP_* with __ split (Env::prefixed("APP_").split("__")) now works |
| Benchmark harness (sweep + CSV writer) | Done — scripts/bench-loss.sh for entity×loss → data/two_machine/final_table.csv; scripts/bench-scaling.sh for T1 rate sweep with optional substrate-side synthetic T3 (T3_RATE_HZ=100 ./scripts/bench-scaling.sh enables APP_NETWORK__SYNTHETIC_T3_RATE_HZ) → data/local/cross_tier.csv. The synthetic driver lives in accept_loop and pushes through the same outbound channel automation_system uses |
| CM5 cross-compile / deploy | Wired in Makefile:30; first trial run completed (commit 272d3b3); scripts/setup-cm5.sh provisions the Pi |
cargo run -p substrate boots, prints the loaded config, and listens on the Quinn server. The ECS schedule is rate-gated to tick_rate_hz via ScheduleRunnerPlugin::run_loop (M4) instead of busy-looping.
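The SimulationSystem row above describes a 16-sample rolling mean with threshold-crossing detection. A minimal sketch of that idea follows; the struct layout, the decision to ignore non-finite samples, and the `crossing` helper are assumptions for illustration, not the code in substrate/src/world.rs.

```rust
pub const WINDOW: usize = 16;

/// Fixed-size ring buffer holding the most recent `WINDOW` samples.
#[derive(Debug, Clone)]
pub struct SmoothedValue {
    samples: [f64; WINDOW],
    len: usize,
    next: usize,
}

impl SmoothedValue {
    pub fn new() -> Self {
        Self { samples: [0.0; WINDOW], len: 0, next: 0 }
    }

    /// Adds a sample, overwriting the oldest once the window is full.
    /// Non-finite values are ignored (assumed behaviour).
    pub fn push(&mut self, value: f64) {
        if !value.is_finite() {
            return;
        }
        self.samples[self.next] = value;
        self.next = (self.next + 1) % WINDOW;
        if self.len < WINDOW {
            self.len += 1;
        }
    }

    /// Mean of the samples currently in the window, `None` when empty.
    pub fn mean(&self) -> Option<f64> {
        if self.len == 0 {
            return None;
        }
        Some(self.samples[..self.len].iter().sum::<f64>() / self.len as f64)
    }
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Direction {
    Rising,
    Falling,
}

/// Reports a crossing only on the transition, not while the smoothed
/// mean stays on one side of the threshold.
pub fn crossing(prev_mean: f64, curr_mean: f64, threshold: f64) -> Option<Direction> {
    if prev_mean < threshold && curr_mean >= threshold {
        Some(Direction::Rising)
    } else if prev_mean >= threshold && curr_mean < threshold {
        Some(Direction::Falling)
    } else {
        None
    }
}
```

Recomputing the sum on each `mean()` call costs 16 additions but avoids the floating-point drift a running sum would accumulate over long runs.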
Roadmap
Each milestone has one verification gate. Update Status here as we go.
- M1: Wire codec & root config. ✅ Done 2026-05-04. Hand-rolled little-endian codec on `QuicMessage` (38 B fixed: 16 UUID + 2 stream_id + 8 f64 + 8 ts_us + 4 seq) with roundtrip + layout + length-error tests; `config.toml` at repo root; dev TLS via `make certs`; structured `tracing-subscriber` init reads `RUST_LOG` (default `info`).
- M2: Quinn server + self-signed TLS. ✅ Done 2026-05-06. Listener up under `ServerState::Starting/Started`; type-system tier semantics + T3 oneshot ack protocol; per-connection `handle_incoming` orchestrator joining T1 datagram, T2 uni-stream, and T3 bi-stream readers. T1 has dropped/decoded counters; T2 resets a stream on decode failure without killing the connection; T3 ships `T3Inbound { command, reply }` to the ECS and resets the stream when no handler answers. End-to-end coverage: 6 integration tests in simulator/tests/ plus 4 codec unit tests, all green.
- M3: Simulator client. Replace simulator/src/main.rs with a Bevy app: Quinn client, N synthetic devices, configurable per-tier rates. Verify: end-to-end loopback drains messages on all three tiers. Status (2026-05-05): simulator made into a lib + bin; `SimulatorClient::{connect,send_datagram,close}` plus a manual smoke runner in `simulator/src/main.rs`. Two integration tests in `simulator/tests/end_to_end_t1.rs` exercise the full T1 path against an in-process substrate. Bevy-driven generator + T2/T3 helpers + load profiles still pending.
- M4: ECS world. ✅ Done. `Asset` + `DeviceId` + `SensorId` + `SensorTypeTag` + `RawSensorData` + `SmoothedValue` components in substrate/src/world.rs; `SensorRegistry` resource for O(1) `(Uuid, u16) → Entity`. `IngestSystem` drains all three tiers (T1 batched, T2/T3 fully); T3 handler returns the latest sensor value as ack. `SimulationSystem` runs a per-entity 16-sample rolling mean and emits `substrate_threshold_crossings_total{type, direction}` on per-type threshold crossings, which gives the ECS observable digital-twin work rather than write-through ingest alone. `ExportSystem` samples `substrate_{entities,channel_depth,channel_capacity,rss_bytes}` + `sensor_aggregate{type, stat}` once per second. `DiagnosticsSystem` logs tick rate once per second. Schedule rate-gated via `ScheduleRunnerPlugin::run_loop(1/tick_rate_hz)`. 8 unit tests passing (entity create, in-place update, T3 ack, SmoothedValue push/window/non-finite/full-roll, threshold-crossing transition).
- M5: Observability (VictoriaMetrics + Grafana). ✅ Done. Wire format extended to carry `sensor_type: u8` (38 → 39 B, decoded into `SensorType` enum). Two metric surfaces over `metrics-exporter-prometheus`:
  - Runtime (paper §Evaluation): `substrate_received_total{tier}`, `dropped_total{tier=t1}`, `decode_errors_total{tier}`, `t3_no_handler_total`, `latency_us{tier}` histograms, `tick_hz` / `entities` / `channel_depth{tier}` / `rss_bytes` gauges.
  - Sensor data (operator surface): `sensor_aggregate{type, stat=count|mean|min|max}` aggregated per second across the live ECS world. Cardinality bounded to |SensorType| × 4 series independent of physical sensor count.
  - Dashboards: dashboards/runtime.json + dashboards/sensors.json.
  - Verified: `--profile industrial --devices 2 --count 200` yields 10 entities and all 5 type aggregates with realistic values (T=20.5°C, RH=51%, P=1018 hPa, V=230.2 V, I=12 A).
- M6: Benchmark harness. Sweep `entity_count ∈ {10k, 50k, 100k, 200k}` × `loss_rate ∈ {0%, 1%, 5%}` with 2k warmup + 5k measurement ticks. Loss via `tc netem`. Writes `data/loopback/final_table.csv`. Verify: one full sweep on M4 Max produces a CSV the Quarto figures consume.
- M7: CM5 cross-compile & deploy. Exercise Makefile:30 (`build-cm5`, `deploy-cm5`); set real `CM5_HOST`. Verify: binary runs on CM5 with a feed from M4 Max over 1 Gbps Ethernet.
- M8: Two-machine run + paper render. Sweep with simulator on M4 Max → substrate on CM5; populate `data/two_machine/final_table.csv`; `make render` produces a PDF. Update §Evaluation prose to reflect actual numbers. Current paper figures (241 Hz, 64 µs / 15.8 ms P99, 2.6 µs jitter, 1.02 MB/1k, R²=0.9999) are aspirational placeholders; they may move and the conclusions may shift, which is expected.
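To make the size of the M6 sweep concrete, the grid can be enumerated as below. The constants come from the M6 bullet; `SweepCell`, `sweep_cells`, and `total_ticks` are hypothetical names, and the actual harness lives in the shell scripts listed under Status.

```rust
pub const ENTITY_COUNTS: [u32; 4] = [10_000, 50_000, 100_000, 200_000];
pub const LOSS_RATES_PCT: [u32; 3] = [0, 1, 5];
pub const WARMUP_TICKS: u64 = 2_000;
pub const MEASURE_TICKS: u64 = 5_000;

/// One (entity_count, loss_rate) combination of the sweep.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SweepCell {
    pub entity_count: u32,
    pub loss_pct: u32,
}

/// Cartesian product of entity counts and loss rates, entity count outermost.
pub fn sweep_cells() -> Vec<SweepCell> {
    let mut cells = Vec::with_capacity(ENTITY_COUNTS.len() * LOSS_RATES_PCT.len());
    for &entity_count in &ENTITY_COUNTS {
        for &loss_pct in &LOSS_RATES_PCT {
            cells.push(SweepCell { entity_count, loss_pct });
        }
    }
    cells
}

/// Ticks for a full sweep, counting warmup and measurement in every cell.
pub fn total_ticks() -> u64 {
    sweep_cells().len() as u64 * (WARMUP_TICKS + MEASURE_TICKS)
}
```

The grid has 4 × 3 = 12 cells at 7,000 ticks each, so one full sweep runs 84,000 ticks before any per-cell setup time.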
Conventions
- Rust: edition 2024; workspace at root with `simulator` + `substrate`; `opt-level=1` dev, `opt-level=3` for deps.
- Pinned crates: Bevy 0.18, Quinn 0.11, rustls 0.23, Tokio 1 (full), figment 0.10 (toml + env), uuid 1.23 (v4), serde 1.
- Config: `figment` chain: defaults in substrate/src/config.rs:25 → `config.toml` → env `APP_*` (double-underscore for nesting, e.g. `APP_NETWORK__SERVER_PORT=9000`).
- Bevy: headless, `MinimalPlugins` only; do not pull rendering plugins.
- Tokio↔Bevy: keep the dedicated-thread + mpsc pattern in substrate/src/transport/ecs.rs:49; do not block the ECS schedule on async work.
- Paper: Quarto + LNCS template (paper/_extensions/template.tex, paper/_quarto.yml). Never commit `llncs.cls` or `splncs04.bst` (CTAN licensing); download per README.md:25-34.
- Data: raw CSVs under `data/` are committed; `*_processed.csv` is gitignored. Paper figures consume `data/loopback/final_table.csv` and `data/two_machine/final_table.csv`.
- Build artifacts: `target/`, `paper/_output/`, `paper/figures/`, `paper/.quarto/`, `paper/index.tex` all gitignored.
- Errors: `anyhow` (with `.context()`) for internal startup paths where the error type is uninteresting; `thiserror` for boundary types we want to match against (e.g. `WireError` in the codec).
- Warnings: let real warnings show. No `#[allow(dead_code)]`, `_var` blanket suppression, or `PhantomData` shims to silence the compiler; warnings are honest TODO markers and disappear when the consuming code lands. See feedback memory.
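The double-underscore nesting rule in the Config bullet can be illustrated with a small std-only function that maps an environment variable name to a nested config path. This only mimics what `Env::prefixed("APP_").split("__")` is understood to do (strip the prefix, split on `__`, lowercase the segments); `env_key_to_path` is a hypothetical helper and not part of figment.

```rust
/// Maps e.g. `APP_NETWORK__SERVER_PORT` to `["network", "server_port"]`.
/// Returns `None` for variables without the `APP_` prefix or with
/// nothing after it.
pub fn env_key_to_path(key: &str) -> Option<Vec<String>> {
    let rest = key.strip_prefix("APP_")?;
    if rest.is_empty() {
        return None;
    }
    Some(
        rest.split("__")
            .map(|segment| segment.to_ascii_lowercase())
            .collect(),
    )
}
```

A single underscore stays inside a segment, so `SERVER_PORT` remains one key (`server_port`) under the `network` table rather than becoming a further nesting level.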
Known deferrals
- Channel ownership is per-host, not per-connection. All connections share the same inbound mpsc channels and the same outbound T3 channel. Fairness under N-device load relies on tokio scheduling. Acceptable for the "one ECS world per host" model the paper describes; revisit if many-device benchmarks show starvation.
- No graceful shutdown. The `quic-runtime` thread is parked on `pending()`; spawned tasks (accept loop, per-conn demux, outbound drain, per-command T3 spawns) are orphaned at process exit. Fine for research runs.
- Bind failure is fatal. `OnEnter(Starting)` panics if `bind_endpoint` fails. A `ServerState::Failed` variant joins when we wire proper error surfacing.
- T3 outbound concurrency is unbounded. `drain_outbound_t3` spawns one task per command (so a stuck `read_exact` can't stall the pipeline). Under sustained T1 ingest beyond ~10k msg/s the per-command tasks queue behind the tokio scheduler and T3 P99 latency climbs into the hundreds of ms while throughput holds. If we need true latency isolation under load, add a `tokio::sync::Semaphore` cap or a dedicated runtime/thread for T3.
- Schedule rate-gating is approximate. `ScheduleRunnerPlugin::run_loop(period)` honours `period` as a minimum; observed `tick_hz` runs ~85% of target on macOS dev (target 60 → ~50). Should be tighter on the CM5; revisit if M6 sweeps depend on a steady tick.
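For the rate-gating deferral, the period handed to the schedule runner and the shortfall against target can be computed as below. `tick_period` and `achieved_ratio` are hypothetical helpers; only `std::time::Duration` is used, and how the real main.rs derives the period is not shown in this guide.

```rust
use std::time::Duration;

/// Converts a tick rate in Hz to a loop period. Returns `None` for zero,
/// negative, or non-finite rates, where 1/hz is not a valid duration.
pub fn tick_period(tick_rate_hz: f64) -> Option<Duration> {
    Duration::try_from_secs_f64(1.0 / tick_rate_hz).ok()
}

/// Fraction of the target rate actually achieved.
pub fn achieved_ratio(observed_hz: f64, target_hz: f64) -> f64 {
    observed_hz / target_hz
}
```

Because the period is treated as a minimum, per-tick work and sleep overshoot only ever lengthen a tick, which is consistent with an observed rate below target (50 Hz against a 60 Hz target is a ratio of about 0.83).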
Run / verify
make certs # generate certs/server.{crt,key} (ECDSA P-256, SAN: localhost/cm5.local/127.0.0.1/::1)
make build # cargo build --release (native, depends on certs)
make build-cm5 # aarch64 cross-build for the CM5 (depends on certs)
make deploy-cm5 # scp to $CM5_HOST (set in env or override Makefile var)
make render # build the paper PDF
make preview # live-reload paper preview at :4848
make clean # cargo clean + drop generated paper outputs
certs/ is gitignored; make build regenerates the dev cert if missing. From the repo root: cargo run -p substrate boots, prints the loaded AppConfig, and idles. config.toml and cert paths are resolved relative to the cwd — always launch from the repo root.
Tests. cargo test --workspace runs the codec unit tests in substrate plus the end-to-end integration tests in simulator/tests/. Each integration test calls bind_endpoint + accept_loop in-process on 127.0.0.1:0 (OS-assigned port), connects a SimulatorClient against it, and asserts what arrives on the test-owned T1 receiver. Add a new simulator/tests/end_to_end_*.rs for each new wire path (T2 uni, T3 bi) as the substrate-side demux lands.
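For orientation, a self-contained sketch of a 39 B fixed little-endian codec in the field order given under Architecture (UUID + sensor_id + f64 + ts + seq + sensor_type). It is not the code in substrate/src/transport/mod.rs: the UUID is raw bytes here, the integer widths (u16, u64, u32, u8) are inferred from the byte counts in the roadmap, and the `WireError` variant is an assumption.

```rust
pub const WIRE_LEN: usize = 39;

#[derive(Debug, Clone, Copy, PartialEq)]
pub struct QuicMessage {
    pub device: [u8; 16], // stand-in for uuid::Uuid
    pub sensor_id: u16,
    pub raw_value: f64,
    pub ts_us: u64,
    pub seq: u32,
    pub sensor_type: u8,
}

#[derive(Debug, PartialEq, Eq)]
pub enum WireError {
    BadLength { expected: usize, got: usize },
}

/// Copies a slice of exactly N bytes into an array.
fn array<const N: usize>(slice: &[u8]) -> [u8; N] {
    let mut out = [0u8; N];
    out.copy_from_slice(slice);
    out
}

impl QuicMessage {
    /// Layout: 16 UUID | 2 sensor_id | 8 value | 8 ts | 4 seq | 1 type.
    pub fn encode(&self) -> [u8; WIRE_LEN] {
        let mut buf = [0u8; WIRE_LEN];
        buf[0..16].copy_from_slice(&self.device);
        buf[16..18].copy_from_slice(&self.sensor_id.to_le_bytes());
        buf[18..26].copy_from_slice(&self.raw_value.to_le_bytes());
        buf[26..34].copy_from_slice(&self.ts_us.to_le_bytes());
        buf[34..38].copy_from_slice(&self.seq.to_le_bytes());
        buf[38] = self.sensor_type;
        buf
    }

    /// Rejects any input that is not exactly `WIRE_LEN` bytes long.
    pub fn decode(bytes: &[u8]) -> Result<Self, WireError> {
        if bytes.len() != WIRE_LEN {
            return Err(WireError::BadLength { expected: WIRE_LEN, got: bytes.len() });
        }
        Ok(Self {
            device: array(&bytes[0..16]),
            sensor_id: u16::from_le_bytes(array(&bytes[16..18])),
            raw_value: f64::from_le_bytes(array(&bytes[18..26])),
            ts_us: u64::from_le_bytes(array(&bytes[26..34])),
            seq: u32::from_le_bytes(array(&bytes[34..38])),
            sensor_type: bytes[38],
        })
    }
}
```

The roundtrip, layout, and length-error tests mentioned in the roadmap map directly onto this shape: encode then decode, check fixed offsets, and feed a short buffer.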
Metrics scrape. With metrics_enabled = true (default), the substrate exposes a Prometheus-format endpoint:
curl http://127.0.0.1:9100/metrics
A docker-compose stack under monitoring/ brings up VictoriaMetrics + Grafana auto-provisioned: make monitoring-up then Grafana at http://localhost:3000 (admin / admin), both dashboards under the quic_ecs_dt folder. The compose mounts dashboards/ directly so any edit to the JSON files re-imports within 10 s.
Two Grafana dashboards under dashboards/:
- runtime.json: tick rate, RSS, per-tier received/dropped/latency, channel depth (paper §Evaluation surface).
- sensors.json: thermometer + gauges + stat panels per SensorType, driven by sensor_aggregate{type, stat} (operator-facing surface).
Both use the ${datasource} template variable so you can point them at any Prometheus-compatible source.
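The bounded-cardinality property behind `sensor_aggregate{type, stat}` can be sketched as a per-type reduction: however many physical sensors report, each type contributes four series (count, mean, min, max). The function below is an illustration with hypothetical names (`Aggregate`, `aggregate`, `series_count`) and an assumed policy of skipping non-finite readings; it does not reproduce ExportSystem.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum SensorType {
    Generic,
    Temperature,
    Humidity,
    Pressure,
    Voltage,
    Current,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Aggregate {
    pub count: u64,
    pub mean: f64,
    pub min: f64,
    pub max: f64,
}

/// Reduces any number of readings to one `Aggregate` per sensor type.
pub fn aggregate(readings: &[(SensorType, f64)]) -> BTreeMap<SensorType, Aggregate> {
    // Accumulator per type: (count, sum, min, max).
    let mut acc: BTreeMap<SensorType, (u64, f64, f64, f64)> = BTreeMap::new();
    for &(ty, value) in readings {
        if !value.is_finite() {
            continue; // assumed policy: non-finite readings are skipped
        }
        let e = acc
            .entry(ty)
            .or_insert((0, 0.0, f64::INFINITY, f64::NEG_INFINITY));
        e.0 += 1;
        e.1 += value;
        e.2 = e.2.min(value);
        e.3 = e.3.max(value);
    }
    acc.into_iter()
        .map(|(ty, (count, sum, min, max))| {
            (ty, Aggregate { count, mean: sum / count as f64, min, max })
        })
        .collect()
}

/// Number of exported series: four stats per type that has data.
pub fn series_count(aggs: &BTreeMap<SensorType, Aggregate>) -> usize {
    aggs.len() * 4
}
```

With six variants in the enum the exporter can never emit more than 24 series on this surface, which keeps dashboard queries cheap at 200k entities.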
Manual two-process run. From the repo root, in two shells:
# shell 1 — server (use RUST_LOG=substrate=debug to see the per-conn summary)
cargo run -p substrate
# shell 2 — client; --help shows all flags
cargo run -p simulator -- --rate-hz 100 --count 0 --devices 4
Simulator flags (see cargo run -p simulator -- --help): --addr, --server-name, --cert, --rate-hz (T1 datagram rate; 0 disables T1), --t2-rate-hz / --t3-rate-hz (per-tier event rate; 0 disables), --t3-timeout-ms (T3 ack wait, default 2000), --count (T1 count; 0 = until Ctrl-C), --devices, --sensor-id, --sensor-type (one of generic|temperature|humidity|pressure|voltage|current), --profile (single or industrial — 5 sensors per device on ids 0..4 covering all types). The client logs a one-second progress line with t1_sent/t2_sent/t3_sent/t3_timeouts/per-tier observed Hz, and a final simulator done line with elapsed time on exit.
Key references
- Prior self-citations: plantevin2026ecs, plantevin2026quic (both IEEE SWC 2026, "to appear").
- QUIC: RFC 9000 (core), RFC 9221 (unreliable datagrams).
- DT foundations: Tao et al. 2019; Grieves & Vickers 2017; Minerva et al. 2020.
- ECS: Nystrom 2014, Game Programming Patterns.
- Mixed-reliability transport: Peeck et al. (W2RP for DDS).
- DT sync metrics: Çakır et al. 2023 (Twin Alignment Ratio); Bellavista et al. 2023 (ODTE).
- Industrial QUIC/IIoT: Fernández et al. 2021; Boeding et al. 2025.
- Full bibliography: paper/references.bib.