quic_ecs_dt/CLAUDE.md
2026-05-13 17:22:10 -04:00

quic_ecs_dt — Project Guide for Claude

What & why

Source repo for "QUIC and ECS as Complementary Transport and Runtime Substrates for Industrial Digital Twins: An Integrated Empirical Study" — submitted to UCAmI 2026 (Track 2: Internet of EveryThing (IoT, People & Processes) and Sensors; primary topic IoE interoperability, integration and performance, secondary topic IoE experimental results and deployment scenarios). Single-author (Plantevin, UQAC). Third paper in a sequence; the first two are at IEEE SWC 2026:

  • plantevin2026ecs — ECS as runtime substrate for industrial DT (200k assets @ 114 Hz on Pi 5).
  • plantevin2026quic — QUIC partial reliability for DT sensor streams (94% P99 reduction vs TCP at 5% loss).

UCAmI hypothesis (the composition question): prior work shows ECS and QUIC each work as substrates independently. Does integrating real QUIC traffic into a Bevy ECS ingest path introduce coupling that degrades either one's claimed properties? The paper argues no, and measures it on a real CM5 ↔ M4 Max two-machine deployment.

Architecture

Three-tier QUIC ↔ ECS bridge, headless Bevy runtime. T1/T2 are inbound (device → substrate); T3 is outbound (substrate → device, actuator commands):

| Tier | QUIC primitive | Direction | Use case | Channel cap | Sender |
|---|---|---|---|---|---|
| T1 | Unreliable datagrams (RFC 9221) | device → substrate | High-freq ephemeral telemetry; drops OK | 1024 | T1Sender::send_lossy (try_send, drop on full) |
| T2 | Unidirectional streams | device → substrate | Ordered threshold events; reliable | 512 | T2Sender::send (await, backpressure) |
| T3 | Bidirectional streams | substrate → device | Actuator commands w/ ACK | 256 | T3OutboundSender::try_send of OutboundT3 { target_device, sensor_id, raw_value, sensor_type } |
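
The Sender column can be illustrated with a small std-only sketch. It swaps `tokio::sync::mpsc` for `std::sync::mpsc::sync_channel`, so `T2Sender::send` blocks a thread instead of awaiting, and `Msg`, `t1_channel`, `t2_channel`, and `forward_datagram` are hypothetical names used only for illustration. The point is the shape: one newtype per tier, lossy `try_send` on T1, blocking backpressure on T2.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SendError, SyncSender, TrySendError};

/// Hypothetical stand-in for the decoded message type.
#[derive(Debug, Clone, PartialEq)]
pub struct Msg(pub u32);

/// T1: lossy telemetry. A full channel drops the message.
pub struct T1Sender(SyncSender<Msg>);

/// T2: reliable events. A full channel blocks the caller (backpressure).
pub struct T2Sender(SyncSender<Msg>);

impl T1Sender {
    /// Returns `true` if queued, `false` if dropped (channel full or closed).
    pub fn send_lossy(&self, msg: Msg) -> bool {
        match self.0.try_send(msg) {
            Ok(()) => true,
            Err(TrySendError::Full(_)) | Err(TrySendError::Disconnected(_)) => false,
        }
    }
}

impl T2Sender {
    /// Blocks until there is room; errors only if the receiver is gone.
    pub fn send(&self, msg: Msg) -> Result<(), SendError<Msg>> {
        self.0.send(msg)
    }
}

/// Bounded T1 channel with the given capacity.
pub fn t1_channel(cap: usize) -> (T1Sender, Receiver<Msg>) {
    let (tx, rx) = sync_channel(cap);
    (T1Sender(tx), rx)
}

/// Bounded T2 channel with the given capacity.
pub fn t2_channel(cap: usize) -> (T2Sender, Receiver<Msg>) {
    let (tx, rx) = sync_channel(cap);
    (T2Sender(tx), rx)
}

/// Accepts only the T1 newtype; passing a `T2Sender` here fails to compile.
pub fn forward_datagram(t1: &T1Sender, msg: Msg) -> bool {
    t1.send_lossy(msg)
}
```

With a capacity of 2, the third `send_lossy` call before any drain returns `false`; the message is dropped rather than queued.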

QUIC server runs on a dedicated OS thread with a Tokio multi-thread runtime. T1/T2 decoded QuicMessages (39 B fixed LE: 16 UUID + 2 sensor_id + 8 f64 + 8 ts + 4 seq + 1 sensor_type) flow into per-tier tokio::sync::mpsc channels and are drained by Bevy's ingest_system in PreUpdate, gated by run_if(in_state(ServerState::Started)). T3 flows the other way: automation_system constructs OutboundT3 items and the tokio-side drain_outbound_t3 task opens bi-streams to the target device. The per-tier sender newtypes (in substrate/src/transport/mod.rs) make tier mixups a type error. Pattern in substrate/src/transport/ecs.rs.
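
As a rough illustration of the 39 B layout, the sketch below packs the fields in the order listed above using little-endian integers and floats. The struct name `WireMsg`, its field names, the `BadLength` variant, and the `take` helper are assumptions made for this example; the real codec lives in substrate/src/transport/mod.rs and may differ in detail.

```rust
pub const WIRE_LEN: usize = 39;

/// Hypothetical mirror of the wire message: 16 + 2 + 8 + 8 + 4 + 1 = 39 bytes.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct WireMsg {
    pub device: [u8; 16], // raw UUID bytes
    pub sensor_id: u16,
    pub value: f64,
    pub ts: u64, // timestamp; the unit is not specified in this sketch
    pub seq: u32,
    pub sensor_type: u8,
}

#[derive(Debug, PartialEq)]
pub enum WireError {
    /// Buffer was not exactly `WIRE_LEN` bytes (hypothetical variant).
    BadLength(usize),
}

/// Copies `N` bytes starting at `at` into a fixed-size array.
fn take<const N: usize>(buf: &[u8], at: usize) -> [u8; N] {
    let mut out = [0u8; N];
    out.copy_from_slice(&buf[at..at + N]);
    out
}

impl WireMsg {
    /// Serialises into the fixed 39 B little-endian frame.
    pub fn encode(&self) -> [u8; WIRE_LEN] {
        let mut b = [0u8; WIRE_LEN];
        b[0..16].copy_from_slice(&self.device);
        b[16..18].copy_from_slice(&self.sensor_id.to_le_bytes());
        b[18..26].copy_from_slice(&self.value.to_le_bytes());
        b[26..34].copy_from_slice(&self.ts.to_le_bytes());
        b[34..38].copy_from_slice(&self.seq.to_le_bytes());
        b[38] = self.sensor_type;
        b
    }

    /// Parses one frame; rejects any buffer that is not exactly 39 bytes.
    pub fn decode(buf: &[u8]) -> Result<WireMsg, WireError> {
        if buf.len() != WIRE_LEN {
            return Err(WireError::BadLength(buf.len()));
        }
        Ok(WireMsg {
            device: take(buf, 0),
            sensor_id: u16::from_le_bytes(take(buf, 16)),
            value: f64::from_le_bytes(take(buf, 18)),
            ts: u64::from_le_bytes(take(buf, 26)),
            seq: u32::from_le_bytes(take(buf, 34)),
            sensor_type: buf[38],
        })
    }
}
```

A fixed-size frame keeps T1 decoding allocation-free and lets the T2 reader consume a stream in 39 B chunks without a length prefix.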

T3 actuator-command protocol. The substrate's automation_system decides to actuate (e.g. Presence < 1.0 ⇒ Relay = stop) and pushes an OutboundT3 onto the outbound channel. The tokio drain_outbound_t3 pops it, looks up the target device's quinn::Connection in a ConnectionRegistry (populated by read_datagrams / read_one_uni_stream on first sight of each device UUID), then spawns one task per command to do conn.open_bi() → write 39 B → finish → read 39 B ack. Per-task spawning means a single stuck read_exact can't stall the pipeline. Latency from open_bi() to ack-receipt is recorded as substrate_latency_us{tier="t3"} and a successful ack increments substrate_received_total{tier="t3"}. Misses (substrate_t3_outbound_no_route_total), drops (substrate_t3_outbound_dropped_total), and bi-stream errors (substrate_t3_outbound_errors_total) each have their own counter.
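
The per-command exchange can be sketched without QUIC by treating the bi-stream as any `Read + Write` value. This is a blocking, std-only approximation: the `open` closure stands in for `conn.open_bi()`, `flush` stands in for finishing the send side, and `EchoDevice` is a hypothetical in-memory device that acks by echoing the command (the real ack contents are not specified here). The timer starts before `open` so the measured window matches the open-to-ack latency described above.

```rust
use std::io::{self, Read, Write};
use std::time::{Duration, Instant};

pub const FRAME_LEN: usize = 39;

/// Opens a stream, writes one command frame, then waits for a full ack frame.
/// Returns the ack bytes and the elapsed time from open to ack receipt.
pub fn command_round_trip<S, F>(open: F, cmd: &[u8; FRAME_LEN]) -> io::Result<([u8; FRAME_LEN], Duration)>
where
    S: Read + Write,
    F: FnOnce() -> io::Result<S>,
{
    let start = Instant::now();
    let mut stream = open()?;
    stream.write_all(cmd)?;
    stream.flush()?; // stand-in for finishing the send half of the stream
    let mut ack = [0u8; FRAME_LEN];
    stream.read_exact(&mut ack)?; // fails with UnexpectedEof if no full ack arrives
    Ok((ack, start.elapsed()))
}

/// Hypothetical in-memory device: whatever is written can be read back as the ack.
pub struct EchoDevice {
    written: Vec<u8>,
    read_pos: usize,
}

impl EchoDevice {
    pub fn new() -> Self {
        EchoDevice { written: Vec::new(), read_pos: 0 }
    }
}

impl Write for EchoDevice {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.written.extend_from_slice(buf);
        Ok(buf.len())
    }
    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

impl Read for EchoDevice {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let avail = &self.written[self.read_pos..];
        let n = avail.len().min(buf.len());
        buf[..n].copy_from_slice(&avail[..n]);
        self.read_pos += n;
        Ok(n)
    }
}
```

In the async original each call of this shape runs in its own spawned task, so one device that never acks only ties up that task.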

Connection registry. Arc<std::sync::RwLock<HashMap<Uuid, quinn::Connection>>>. quinn::Connection is internally Arc; one simulator process commonly hosts 7 device UUIDs sharing one connection. Registry insert is idempotent (ensure_registered). On conn.closed().await returning, handle_incoming purges every key whose Connection::stable_id() matches the closed connection.
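
A minimal sketch of the registry shape, assuming a hypothetical `Conn` stand-in for `quinn::Connection` (cheap to clone, identified by a stable id) and raw 16-byte arrays in place of `Uuid`. The method names `lookup` and `purge_connection` are illustrative, and "first registration wins" is an assumption of this sketch.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

pub type DeviceId = [u8; 16];

/// Hypothetical stand-in for a connection handle.
#[derive(Debug, Clone, PartialEq)]
pub struct Conn {
    pub stable_id: usize,
}

#[derive(Clone, Default)]
pub struct Registry(Arc<RwLock<HashMap<DeviceId, Conn>>>);

impl Registry {
    /// Idempotent insert: returns `true` only on first sight of `device`.
    pub fn ensure_registered(&self, device: DeviceId, conn: &Conn) -> bool {
        // Fast path under the read lock; the guard is released at the end of this statement.
        let known = self.0.read().unwrap().contains_key(&device);
        if known {
            return false;
        }
        let mut map = self.0.write().unwrap();
        // Re-check: another reader task may have registered the device in between.
        if map.contains_key(&device) {
            return false;
        }
        map.insert(device, conn.clone());
        true
    }

    /// Returns a clone of the connection handle for `device`, if registered.
    pub fn lookup(&self, device: &DeviceId) -> Option<Conn> {
        self.0.read().unwrap().get(device).cloned()
    }

    /// Removes every device routed over the closed connection; returns how many were purged.
    pub fn purge_connection(&self, stable_id: usize) -> usize {
        let mut map = self.0.write().unwrap();
        let before = map.len();
        map.retain(|_, c| c.stable_id != stable_id);
        before - map.len()
    }
}
```

The read-then-write pattern keeps the hot path (a device that is already known) on the shared read lock, which matters because every decoded datagram passes through it.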

Target hardware: CM5 (BCM2712, Cortex-A76, 4 GB) as DT runtime; M4 Max as traffic generator; 1 Gbps direct Ethernet. Both rigs are in hand; benchmark sweeps live on the CM5.

Repo map

quic_ecs_dt/
├── paper/                Quarto + LNCS source — single index.qmd, refs in references.bib
├── substrate/            Rust crate: Bevy 0.18 + Quinn 0.11 + rustls 0.23 + Tokio
│   └── src/
│       ├── main.rs        App::new, MinimalPlugins, EcsQuicTransportPlugin, ObservabilityPlugin
│       ├── lib.rs         re-exports
│       ├── config.rs      figment chain: defaults → config.toml → APP_* env (split on "__")
│       ├── observability.rs  metrics-exporter-prometheus on :9100
│       ├── transport/
│       │   ├── mod.rs     QuicMessage codec + tier sender newtypes + OutboundT3
│       │   ├── ecs.rs     EcsQuicTransportPlugin: tokio thread + bridge + registry + drain spawn
│       │   ├── server.rs  bind_endpoint + accept_loop + read_datagrams + read_uni_streams
│       │   │              + drain_outbound_t3 + synthetic_t3_driver + ConnectionRegistry
│       │   └── state.rs   ServerState{Starting, Started}
│       └── world/
│           ├── mod.rs         WorldPlugin (5 systems wired into Pre/Update/Post)
│           ├── components.rs  Asset, DeviceId, SensorId, SensorTypeTag, RawSensorData, SmoothedValue, threshold_for
│           ├── resources.rs   SensorRegistry, DiagnosticsState, ExportSampleState
│           ├── systems.rs     ingest, simulation, automation, export, diagnostics
│           └── tests.rs       8 unit tests inc. automation_dispatches_relay_stop
├── simulator/            Rust crate: Quinn client + sensor generators + T3 receiver
│   ├── src/
│   │   ├── main.rs        CLI driver + HTTP-trigger task + T1 inline loop
│   │   ├── lib.rs         module exports
│   │   ├── client.rs      SimulatorClient (connect, send_datagram, send_uni_stream, request, close)
│   │   ├── commands.rs    run_command_receiver (substrate → device T3 accept-bi loop)
│   │   ├── emitters.rs    run_t2_emitter (T1 lives inline in main.rs)
│   │   └── profile.rs     SensorProfile (single | industrial), generate_value
│   └── tests/             T1, T2, end-to-end full-loop integration tests
├── data/
│   ├── two_machine/       CM5 ↔ M4 Max sweep — final_table.csv (load-bearing for the paper)
│   └── local/             loopback sweeps (scaling.csv, cross_tier.csv)
├── scripts/
│   ├── bench-loss.sh      M6 sweep entities×loss → data/two_machine/final_table.csv
│   ├── bench-scaling.sh   T1 rate sweep + optional synthetic-T3 cross-tier mode
│   ├── bench-client.sh    M8 client driver (run from Mac when substrate is on CM5)
│   ├── demo.sh            full-stack demo: certs + build + VM/Grafana + sub + sim
│   ├── setup-cm5.sh       CM5 provisioning (apt + cargo install)
│   └── verify-netem.sh    confirm tc-netem is shaping in the right direction (BIDI=1 for ifb mode)
├── monitoring/            docker-compose: VictoriaMetrics + Grafana auto-provisioned
├── dashboards/            runtime.json + sensors.json
├── certs/                 gitignored, regenerated by `make certs`
├── Cargo.toml             workspace
└── Makefile               render, preview, build, build-cm5, deploy-cm5, monitoring-up

Status

Code (substrate + simulator):

| Area | State |
|---|---|
| AppConfig figment loader (defaults → TOML → env with __ split) | Done: substrate/src/config.rs. Env override actually works (Env::prefixed("APP_").split("__")); discovered late that the previous chain silently ignored env vars |
| 39 B wire codec | Done: substrate/src/transport/mod.rs, 5 unit tests |
| Quinn server lifecycle + TLS | Done: bind_endpoint + accept_loop in substrate/src/transport/server.rs; ServerState{Starting, Started} in state.rs; explicit TransportConfig w/ 256 KiB datagram recv buffer; dev cert via make certs, rustls aws-lc-rs provider installed in main.rs |
| T1 demux (datagrams → ECS) | Done. read_datagrams reader; decode errors non-fatal; channel-full drops silent; per-stream counters in debug summary. Calls ensure_registered on first decode so outbound T3 can route to this device |
| T2 demux (uni streams → ECS) | Done. read_uni_streams accepts streams, spawns one task per stream that reads 39 B chunks until EOF; decode failure resets the stream via recv.stop(0); t2.send().await honours backpressure; first decode also calls ensure_registered |
| T3 outbound (ECS → device) | Done. drain_outbound_t3 task pops OutboundT3 items, looks up the target device's Connection in ConnectionRegistry, spawns one task per command to do open_bi → write 39 B → finish → read ack. Per-task spawning prevents a single stuck read_exact from stalling the pipeline. Records substrate_latency_us{tier="t3"} on success; counts no-route / dropped / errors separately. The old simulator-initiated T3 inbound path (T3Sender / T3Inbound / accept_bi_streams) is gone |
| Connection registry (Uuid → Connection) | Done: Arc<RwLock<HashMap<Uuid, quinn::Connection>>>; idempotent insert via ensure_registered; purged in handle_incoming after conn.closed().await using Connection::stable_id() |
| Synthetic T3 driver (bench only) | Done. synthetic_t3_driver task in server.rs spawned by accept_loop when APP_NETWORK__SYNTHETIC_T3_RATE_HZ > 0. Round-robins over registered devices, toggles raw_value between 0/1, pushes through the same outbound channel automation_system uses |
| ECS components + 5 systems | Done: world/. Entities = (Asset, DeviceId, SensorId, SensorTypeTag, RawSensorData, SmoothedValue) per (device, sensor). 5 systems: ingest (PreUpdate, drains T1+T2), simulation (Update, rolling mean + threshold-crossings counter), automation (Update, Presence-cross → t3_out.try_send(OutboundT3{Relay setpoint}) + local mirror), export (PostUpdate, per-second metric sample), diagnostics (PostUpdate, per-second tick_hz log) |
| Schedule rate-gating | Done: MinimalPlugins.set(ScheduleRunnerPlugin::run_loop(1/tick_rate_hz)) in main.rs |
| Prometheus exporter + Grafana | Done. metrics-exporter-prometheus on :9100 via ObservabilityPlugin. Runtime metrics: substrate_received_total{tier}, substrate_dropped_total{tier=t1}, substrate_decode_errors_total{tier}, substrate_t3_outbound_*_total, substrate_latency_us{tier} histograms, substrate_tick_hz, substrate_entities, substrate_channel_depth{tier}, substrate_rss_bytes. Sensor data: sensor_aggregate{type, stat=count\|mean\|min\|max}. Dashboards: dashboards/runtime.json + dashboards/sensors.json |
| Simulator binary | Done: simulator/src/main.rs. Clap flags: --addr, --server-name, --cert, --profile {single, industrial}, --sensor-type, --sensor-id, --rate-hz, --t2-rate-hz, --count, --devices. industrial profile fans out to 7 sensors per device on ids 0..6 (Temperature/Humidity/Pressure/Voltage/Current/Presence/Relay). HTTP trigger on :9002 (POST /trigger) pushes Presence=0 over T2; this is the operator-facing demo entry point. T1/T2 emitters check engine_running per tick; when false, Current waveform drops to ~0 while Voltage stays at ~230 V |
| Simulator command receiver | Done: simulator/src/commands.rs. run_command_receiver loops on conn.accept_bi(), decodes 39 B, flips engine_running on sensor_type == Relay setpoints, writes 39 B ack. Spawned by main.rs post-connect. new_engine_state() constructor exported for integration tests |
| End-to-end test harness | 18 tests, all green. 5 codec unit tests; 8 world unit tests (incl. automation_dispatches_relay_stop_when_presence_drops); 2 T1 + 2 T2 integration tests; 1 full closed-loop test (simulator/tests/end_to_end_full_loop.rs: Presence < 1.0 → substrate T3 → engine_running flips to false; then Presence > 1.0 → flips back) |
| Benchmark scripts | Done. bench-loss.sh: entity × loss sweep, bidirectional tc-netem via ifb on the CM5 (BIDI=1 default). bench-scaling.sh: T1 rate sweep + optional substrate-side APP_NETWORK__SYNTHETIC_T3_RATE_HZ. verify-netem.sh: sanity-check netem on the right interface in the right direction (BIDI=1 mode covers ingress via ifb) |
| CM5 deploy | Done: make build-cm5 && make deploy-cm5; setup-cm5.sh provisions deps. Bench has been run end-to-end on CM5; data lives in data/two_machine/final_table.csv |
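
For the simulation system's "rolling mean + threshold-crossings counter", one plausible reading is a fixed-window mean whose crossings are counted on the smoothed value. The sketch below assumes that reading; the window size, the `>=` comparison, and the `Smoother` type itself are hypothetical and not taken from world/systems.rs.

```rust
use std::collections::VecDeque;

/// Hypothetical per-sensor smoother: fixed-window mean plus a crossing counter.
pub struct Smoother {
    window: VecDeque<f64>,
    cap: usize,
    sum: f64,
    threshold: f64,
    above: Option<bool>, // side of the threshold after the previous sample
    pub crossings: u64,
}

impl Smoother {
    pub fn new(cap: usize, threshold: f64) -> Self {
        assert!(cap > 0, "window must hold at least one sample");
        Smoother {
            window: VecDeque::with_capacity(cap),
            cap,
            sum: 0.0,
            threshold,
            above: None,
            crossings: 0,
        }
    }

    /// Adds a raw sample, returns the updated rolling mean, and counts a
    /// crossing whenever the mean changes side relative to the threshold.
    pub fn push(&mut self, raw: f64) -> f64 {
        if self.window.len() == self.cap {
            if let Some(old) = self.window.pop_front() {
                self.sum -= old;
            }
        }
        self.window.push_back(raw);
        self.sum += raw;
        let mean = self.sum / self.window.len() as f64;

        let now_above = mean >= self.threshold;
        if let Some(prev) = self.above {
            if prev != now_above {
                self.crossings += 1;
            }
        }
        self.above = Some(now_above);
        mean
    }
}
```

Counting crossings on the smoothed value rather than the raw one means a single noisy sample does not necessarily trigger an actuation; whether the real automation system keys off raw or smoothed Presence is not stated here.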

Paper:

| Area | State |
|---|---|
| Track + topics chosen | Done: UCAmI Track 2 (IoE and Sensors); primary IoE interoperability, integration and performance; secondary IoE experimental results and deployment scenarios |
| Abstract | Done. Honest framing: "tick rate remains an order of magnitude above the cadence required" (not "stable"), mixed-reliability isolation as the T1-vs-T3 story, 0.12 MB/1k slope |
| Tables 2/3/4 from real CM5 data | Done. Native markdown tables driven by inline {python} values reading from data/two_machine/final_table.csv; cross-refs (@tbl-latency, @tbl-throughput, @tbl-t3-rtt) resolve in the LNCS LaTeX output. Earlier display(Markdown(...)) approach didn't register with Quarto's cross-ref filter; switched to native md tables with inline-python cells |
| fig-isolation | Dropped. Cross-tier story now told by tbl-latency + tbl-t3-rtt (T1 flat under loss, T3 absorbs ~38 ms retransmit). Cleaner than the loopback fig. data/local/cross_tier.csv is still on disk but the paper no longer reads it |
| Architecture §3 + Table 1 | Updated for substrate-initiated T3. Table 1 T3 row reads "OutboundT3 enqueue + ack \| Bidirectional stream (server-initiated)"; the connection-registry / per-device routing is described in the prose |
| Implementation §4 Automation paragraph | Updated for the new outbound T3 path; describes the per-device registry, the per-command bi-stream, and the simulator-side run_command_receiver engine-state flip |
| Discussion + Conclusion | Honest now: drops the unbacked "<5% IngestSystem drain" and "Grafana adds no overhead" claims; conclusion populates both 0%-loss and 5%-loss Hz from data |
| Render | Clean against LNCS LaTeX template (make render → 10-page PDF, no Quarto warnings) |

Roadmap

Treat the milestone log as historical. The paper-side work below tracks what's left before camera-ready.

  • M1 — Wire codec & root config. 2026-05-04.
  • M2 — Quinn server + TLS. 2026-05-06.
  • M3 — Simulator client. Done. SimulatorClient + CLI driver + waveform profiles + HTTP trigger + closed-loop command receiver.
  • M4 — ECS world. Done. 5 systems wired; automation closes the T3 loop.
  • M5 — Observability. Done. Both dashboards live; metrics exposed via prometheus scrape.
  • M6 — Benchmark harness. Done. bench-loss.sh + bench-scaling.sh + verify-netem.sh (last one added when egress-only netem was masking the inbound T1 loss path; now ifb ingress shaping is default).
  • M7 — CM5 cross-compile & deploy. Done. Multiple sweeps shipped from CM5.
  • M8 — Two-machine run + paper render. Done. Paper renders against data/two_machine/final_table.csv; all inline scalars and tables populate from real numbers.
  • M9 — T3 inversion (substrate-initiated actuator commands). 2026-05-13. The paper's Table 1 said T3 was "actuator commands" but the code had it inverted (device → substrate RPC). Refactored to match the paper: substrate opens bi-streams, simulator's run_command_receiver accepts. Full closed-loop integration test landed.
  • M10 — Abstract submission polish. In progress. Top-of-paper fixes shipped (abstract framing, contributions paragraph, Table 1 T3 row, Architecture §3 backpressure paragraph, author affiliation, (author?) cite markers). Remaining polish is full-paper-only (Implementation §4 module-list lies, code listing with fake types, Observability §4.2 push-vs-pull mismatch, Experimental Setup §5.1 stale tc-netem / tick counts / loopback-vs-two-machine sentence). None block abstract submission.

Open polish items (not blocking abstract submission):

  • §4.1 Integrated Prototype still lists six systems including a non-existent FaultInjection; module list says transport.rs / world.rs / metrics.rs / main.rs but the actual layout is transport/, world/, observability.rs, config.rs, main.rs, lib.rs plus a separate simulator crate.
  • §4.1 code listing uses fictional types (AssetId, EntityMap, TickDiagnostics). Easier to drop the listing than to rewrite faithfully.
  • §4.2 Observability Stack describes a push model with InfluxDB line protocol; actual code uses metrics-exporter-prometheus exposing /metrics for VM scrape.
  • §5.1 Experimental Setup needs three updates: tc-netem direction (now bidirectional via ifb), "2,000 warmup ticks and 5,000 measurement ticks" → "20 s warmup + 50 s window (wall-clock)", and drop the "loopback for latency / two-machine for throughput" sentence (all numbers are from the two-machine sweep now).

Conventions

  • Rust: edition 2024; workspace at root with simulator + substrate.
  • Pinned crates: Bevy 0.18, Quinn 0.11, rustls 0.23, Tokio 1 (full), figment 0.10 (toml + env), uuid 1.23 (v4), serde 1.
  • Config: figment chain — defaults → config.toml → env APP_* with __ nesting (e.g. APP_NETWORK__SERVER_PORT=9000, APP_NETWORK__SYNTHETIC_T3_RATE_HZ=100).
  • Bevy: headless — MinimalPlugins only; do not pull rendering plugins.
  • Tokio↔Bevy: keep the dedicated-thread + mpsc pattern in transport/ecs.rs; do not block the ECS schedule on async work.
  • Paper: Quarto + LNCS template (paper/_extensions/template.tex, paper/_quarto.yml). Never commit llncs.cls or splncs04.bst — CTAN licensing; download per README.md. For tables in LaTeX target, use native markdown tables with : Caption {#tbl-foo} syntax and inline {python} cells, not display(Markdown(...)) chunks — Quarto's cross-ref filter doesn't pick the latter up in LaTeX output.
  • Data: raw CSVs under data/ are committed; *_processed.csv is gitignored. Paper figures consume data/two_machine/final_table.csv exclusively (the previous data/loopback/ was renamed to data/two_machine/ once it became the real CM5 sweep).
  • Errors: anyhow (with .context()) for internal startup paths; thiserror for boundary types we want to match against (e.g. WireError in the codec).
  • Warnings: let real warnings show. No #[allow(dead_code)], _var blanket suppression, or PhantomData shims to silence the compiler — warnings are honest TODO markers and disappear when the consuming code lands.
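
To make the Config convention concrete, the sketch below shows roughly how an `APP_*` variable name maps onto a nested key path and how later layers override earlier ones. It is a std-only illustration, not figment itself; `env_key_path` and `resolve` are hypothetical helpers, and the lowercasing mirrors what is assumed to be figment's default behaviour for environment keys.

```rust
/// Maps an environment variable name to a nested config key path.
/// Returns `None` if the variable does not carry the prefix.
pub fn env_key_path(prefix: &str, var: &str) -> Option<Vec<String>> {
    let rest = var.strip_prefix(prefix)?;
    if rest.is_empty() {
        return None;
    }
    // "__" separates nesting levels; a single "_" stays inside the key name.
    Some(rest.split("__").map(|part| part.to_lowercase()).collect())
}

/// Resolves one setting across layers ordered defaults → config.toml → env.
/// The last layer that provides a value wins.
pub fn resolve<'a>(layers: &[Option<&'a str>]) -> Option<&'a str> {
    layers.iter().rev().find_map(|layer| *layer)
}
```

Under these assumptions `APP_NETWORK__SERVER_PORT` resolves to the path `network` → `server_port`, and a value set in the environment takes precedence over both config.toml and the built-in default.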

Known deferrals

  • Channel ownership is per-host, not per-connection. All connections share the same inbound mpsc channels and the outbound T3 channel. Fairness under N-device load relies on tokio scheduling. Acceptable for "one ECS world per host".
  • No graceful shutdown. The quic-runtime thread parks on pending(); spawned tasks orphan at process exit. Fine for research runs.
  • Bind failure is fatal. OnEnter(Starting) panics if bind_endpoint fails.
  • T3 outbound concurrency is unbounded. drain_outbound_t3 spawns one task per command. Under sustained T1 ingest beyond ~10k msg/s the per-command tasks queue behind the tokio scheduler and T3 P99 climbs into the hundreds of ms (throughput still holds). If we ever need strict T3 latency isolation under heavy T1 load, add a tokio::sync::Semaphore cap or a dedicated runtime/thread for T3.
  • NTP drift over a long bench shifts the across-row T1 P99 baseline. Visible in tbl-latency (47 ms at 50k → 28 ms at 200k). The within-row Δ is what speaks to isolation; the across-row absolutes don't. Paper caption explains this.
  • Schedule rate-gating is approximate. Observed tick_hz runs ~85% of target on macOS dev; tighter on the CM5.
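
The semaphore cap suggested for T3 outbound concurrency could look roughly like the following. This is a std-only analogue using threads, a `Mutex`, and a `Condvar` rather than tokio tasks and `tokio::sync::Semaphore`; `Semaphore`, `run_capped`, and the 2 ms sleep standing in for the write-and-ack wait are all hypothetical.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// Minimal counting semaphore built on Mutex + Condvar.
pub struct Semaphore {
    permits: Mutex<usize>,
    cv: Condvar,
}

impl Semaphore {
    pub fn new(permits: usize) -> Self {
        Semaphore { permits: Mutex::new(permits), cv: Condvar::new() }
    }

    /// Blocks until a permit is available, then takes it.
    pub fn acquire(&self) {
        let mut p = self.permits.lock().unwrap();
        while *p == 0 {
            p = self.cv.wait(p).unwrap();
        }
        *p -= 1;
    }

    /// Returns a permit and wakes one waiter.
    pub fn release(&self) {
        *self.permits.lock().unwrap() += 1;
        self.cv.notify_one();
    }
}

/// Runs `commands` workers with at most `cap` in flight.
/// Returns (completed count, peak number of concurrent workers).
pub fn run_capped(commands: usize, cap: usize) -> (usize, usize) {
    assert!(cap > 0, "a zero cap would never admit a command");
    let sem = Arc::new(Semaphore::new(cap));
    let in_flight = Arc::new(AtomicUsize::new(0));
    let peak = Arc::new(AtomicUsize::new(0));
    let done = Arc::new(AtomicUsize::new(0));
    let mut handles = Vec::new();

    for _ in 0..commands {
        // Acquire before spawning so the drain loop itself stalls at the cap.
        sem.acquire();
        let (sem, in_flight, peak, done) =
            (sem.clone(), in_flight.clone(), peak.clone(), done.clone());
        handles.push(thread::spawn(move || {
            let now = in_flight.fetch_add(1, Ordering::SeqCst) + 1;
            peak.fetch_max(now, Ordering::SeqCst);
            thread::sleep(Duration::from_millis(2)); // stand-in for write + ack wait
            in_flight.fetch_sub(1, Ordering::SeqCst);
            done.fetch_add(1, Ordering::SeqCst);
            sem.release();
        }));
    }
    for h in handles {
        h.join().unwrap();
    }
    (done.load(Ordering::SeqCst), peak.load(Ordering::SeqCst))
}
```

Acquiring the permit in the drain loop, before the spawn, turns the cap into backpressure on the outbound channel instead of an ever-growing set of parked tasks. The trade-off is that a slow device can now delay commands for other devices once the cap is reached.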

Run / verify

make certs              # dev TLS (ECDSA P-256, SAN: localhost/cm5.local/127.0.0.1/::1)
make build              # cargo build --release native
make build-cm5          # aarch64 cross-build
make deploy-cm5         # scp to $CM5_HOST
make render             # paper PDF
make preview            # live-reload paper at :4848
make monitoring-up      # docker-compose VM + Grafana

Tests. cargo test --workspace runs codec unit tests + world unit tests + 5 integration tests (T1, T2, full closed-loop) in simulator/tests/. Each integration test calls bind_endpoint + accept_loop in-process on 127.0.0.1:0. The full-loop test stands up the real outbound machinery (accept_loop + drain_outbound_t3) and asserts the engine-state flag flips in both directions.

Metrics scrape. With metrics_enabled = true (default):

curl http://127.0.0.1:9100/metrics

make monitoring-up brings up VictoriaMetrics + Grafana auto-provisioned at http://localhost:3000 (admin / admin); the dashboards mount live from dashboards/ so JSON edits re-import within ~10 s.

Full-stack demo. scripts/demo.sh brings up certs + cargo build + monitoring stack + substrate + simulator and tails the simulator's progress log. Industrial profile by default; Presence dips below threshold every few seconds, triggering substrate-initiated T3 Relay setpoints, visible on the operator dashboard as Current collapsing to ~0 A while Voltage holds.

./scripts/demo.sh                                    # defaults
PROFILE=single RATE_HZ=100 DEVICES=20 ./scripts/demo.sh
KEEP_MONITORING=1 ./scripts/demo.sh                  # leave VM + Grafana running on exit

Manual two-process run. From the repo root:

# shell 1 — server
cargo run -p substrate

# shell 2 — client
cargo run -p simulator -- --profile industrial --rate-hz 100 --count 0 --devices 4

Simulator flags (see cargo run -p simulator -- --help): --addr, --server-name, --cert, --profile {single, industrial}, --sensor-type, --sensor-id, --rate-hz (T1 datagram rate; 0 disables T1), --t2-rate-hz (T2 event rate; 0 disables T2), --count (T1 count; 0 = until Ctrl-C), --devices. No simulator-side T3 flag — T3 is substrate-initiated. Per-second progress lines show t1_sent/t2_sent/engine={running,stopped}.

Bidirectional netem on the CM5. scripts/bench-loss.sh applies tc netem loss N% bidirectionally via an ifb ingress-redirect (BIDI=1 default). scripts/verify-netem.sh confirms it lands on the right interface:

./scripts/verify-netem.sh <peer-ip> end0 5          # egress only
BIDI=1 ./scripts/verify-netem.sh <peer-ip> end0 5   # both directions via ifb

Key references

  • Prior self-citations: plantevin2026ecs, plantevin2026quic (both IEEE SWC 2026, "to appear").
  • QUIC: RFC 9000 (core), RFC 9221 (unreliable datagrams).
  • DT foundations: Tao et al. 2019; Grieves & Vickers 2017; Minerva et al. 2020.
  • ECS: Nystrom 2014, Game Programming Patterns.
  • Mixed-reliability transport: Peeck et al. (W2RP for DDS).
  • DT sync metrics: Çakır et al. 2023 (Twin Alignment Ratio); Bellavista et al. 2023 (ODTE).
  • Industrial QUIC/IIoT: Fernández et al. 2021; Boeding et al. 2025.
  • Full bibliography: paper/references.bib.