wifi-densepose/docs/adr/ADR-119-mlp-classifier.md

# ADR-119 — MLP Replaces Logistic Regression in Adaptive Classifier

**Status**: Accepted
**Date**: 2026-05-18
**Scope**: `v2/crates/wifi-densepose-sensing-server/src/adaptive_classifier.rs`
(new `MlpModel` struct, `train_mlp_classifier`, `eval_mlp`; modified
`AdaptiveModel::classify` + `train_from_recordings`).

## Context

After ADR-118 (feature decorrelation + multi-node extractor) the adaptive
classifier reached **49.58% accuracy** on a 6-node, 7-class, 151,329-frame
training set. Per-feature audit showed `n6_std` sep_ratio = 0.60 — i.e. the
underlying signal *can* separate the classes — but logistic regression was
limited to linear decision boundaries and couldn't model interactions like:

* `walking`: `n2_std` high **AND** `n6_std` high **AND** `dom_hz ≈ 3 Hz`
* `waving`: `n1_std` high **BUT** `n2_std` low (only close sensors fire)
* `sitting` vs `standing`: same global features, differ in `n6_std` pattern

LogReg sums weighted features; it cannot represent "AND/BUT" combinations.
A small MLP can: hidden units learn intermediate concepts, then the output
layer combines them.

## Decisions

### D1 — Single-hidden-layer MLP, 22 → 32 → 6

* Input: the same 22-feature vector from ADR-118.
* Hidden: 32 ReLU units. ~3k weights, enough capacity for 6 classes but
  small enough to train in seconds on the 151k-frame set.
* Output: softmax over `n_classes` (discovered dynamically at train time).
* Z-score normalisation: identical to the LogReg path — same
  `global_mean` / `global_std` populated by `train_from_recordings`.

### D2 — Manual backprop, no external ML crate

`tch` (LibTorch) or `candle` would pull in ~50-200 MB of native deps for a
~3k-parameter network. The forward + backward passes are ~150 LoC of pure
Rust; SGD + momentum + cosine LR decay another ~30. Built-in `f64`
arithmetic is fast enough — full train completes in ~10 seconds on M1
Mac.

Optimiser: SGD with momentum 0.9, weight decay 1e-4, base LR 0.05 with
half-cosine decay to 0, batch size 64, 30 epochs. He initialisation
(`N(0, sqrt(2/fan_in))`) on weights, zero on biases.

### D3 — MLP wins over LogReg at classify time, LogReg kept as fallback

`AdaptiveModel` carries both:

```rust
pub weights: Vec<Vec<f64>>,   // legacy LogReg, still trained for rollback
pub mlp: MlpModel,            // ADR-119 — preferred when is_trained() == true
```

`classify()` checks `self.mlp.is_trained()`; if yes uses MLP forward pass,
otherwise falls back to LogReg softmax. Old `data/adaptive_model.json`
files (15-feature LogReg) loaded with `#[serde(default)]` on `mlp` →
`MlpModel::default()` returns empty fields → `is_trained() == false` →
graceful degradation to LogReg path.

### D4 — Train both, report better number

`train_from_recordings` runs the existing LogReg loop first (unchanged),
then trains MLP on the same z-normalised samples, evaluates both on the
training set, and reports `training_accuracy = mlp_acc.max(logreg_acc)`.
Per-class accuracy from both classifiers is logged side-by-side for
diagnostic comparison.

## Verified Acceptance

```
LogReg:    49.58% overall
MLP:       53.53% overall  (+3.95 pts)

Per-class (LogReg → MLP):
  absent          40% → 41%   (+1)
  present_still   99% → 99%   (tied — 2× sample count)
  transition      29% → 36%   (+7)
  active          22% → 30%   (+8)
  waving          34% → 38%   (+4)
  present_moving  24% → 33%   (+9)
```

Notes:

* `present_still` class is a merged bucket: both `train_standing_*` and
  `train_sitting_*` map to `present_still` via `classify_recording_name`.
  Hence 43,242 samples vs 21,500 average for the other classes — the
  classifier biases strongly toward this dominant class. The 99% is
  honest but partially inflated by class imbalance.
* The +3.95 pts is concentrated on motion classes — exactly where the
  hypothesis predicted MLP would help (non-linear combinations of per-
  node features differentiate similar motion types).
* MLP loss flatlined around 1.15 after epoch 10. Suggests the current
  22-feature representation has hit its information ceiling for frame-
  level classification. Going higher needs temporal context (sliding
  window classifier, LSTM, TCN) — see Open Items.

Total improvement since the start of this session:

```
2-node, 15 features, LogReg:    40.4%   (baseline)
6-node, 15 features, LogReg:    44.4%   +4.0 from more data
6-node, 22 features, LogReg:    49.58%  +5.2 from feature engineering (ADR-118)
6-node, 22 features, MLP:       53.53%  +3.95 from non-linear classifier (ADR-119)
                                ─────
Total cumulative:               +13.1 percentage points
```

## Files Touched

```
v2/crates/wifi-densepose-sensing-server/src/adaptive_classifier.rs:
  + const MLP_HIDDEN: usize = 32
  + pub struct MlpModel { w1, b1, w2, b2, n_classes } + serde
  + impl MlpModel { is_trained, forward }
  + AdaptiveModel.mlp field (serde-default for backward compat)
  + AdaptiveModel::classify prefers MLP when trained
  + train_mlp_classifier (~150 LoC manual backprop)
  + eval_mlp helper
  + train_from_recordings calls MLP path and picks max accuracy
docs/adr/ADR-119-mlp-classifier.md  (this)
```

`data/adaptive_model.json` removed at deploy time — the MLP fields need
populating, the old file has none.

## Out of Scope / Follow-ups

* **Temporal classifier (sliding window LSTM/TCN)** — loss flatlines at
  ~1.15 with the current feature set; this is the frame-level ceiling.
  A model that consumes a 1-second window (10-20 frames) would catch
  the temporal signature of `transition` (sit-stand cycle ≈ 0.5 Hz),
  `walking` (step rate ≈ 2 Hz), `active` (bursty), `waving` (limb
  cadence ≈ 1-2 Hz). Estimated +15-25 pts realistic for these
  inherently-temporal classes. ~3-4 hours of code.
* **Class imbalance fix** — `present_still` has 2× samples. Either
  oversample the minority classes during training, or weight loss by
  inverse class frequency. Marginal — ~2-3 pts.
* **Drop dead features** — 6 entropy features (sep_ratio 0.01-0.02) and
  3 weak globals (`mean_rssi`, `dom_hz`, `change_pts` all <0.11)
  contribute noise. Reducing 22 → ~13 features would simplify training
  but probably not move accuracy more than 1-2 pts.
* **Hidden size sweep** — tried only 32. Could try 16 (faster, less
  overfitting risk) or 64 (more capacity). Cosmetic.
* **Split `sitting` and `standing` into separate classes** — they're
  physically distinct RF signatures but currently merged. Adding them as
  separate classes would test whether the model can disambiguate them.
  Likely lowers `present_still` accuracy but separates a useful
  distinction. Experiment-grade.

## References

* ADR-118 — feature decorrelation + multi-node extractor (the 22-feature
  basis this ADR uses)
* ADR-117 — earlier process hygiene pass; introduced standardisation
  (`global_mean`/`global_std`) that this ADR's MLP also relies on
* ADR-101 — raw amplitude classifier (the runtime path that calls
  `AdaptiveModel::classify`)