Northline Robotics · Research · NLR · Field Notes

Data & MLOps

From mowing runs to models

Every pass a robot cuts is also a recording. The hard part is not collecting that data — it is turning a torrent of raw sensor logs into labeled examples, sharper models, and a measurably better robot, on a loop that compounds. Here is how the data flywheel actually turns.

A robotic mower spends its summer doing the same thing thousands of times: drive a line, sense the world, decide, repeat. Most of that effort is thrown away the instant it happens. But it does not have to be. Every run is also a high-rate recording of a robot interacting with the messy real world — exactly the data that, handled well, makes the next robot better. The discipline of capturing that exhaust and feeding it back into the models is what separates a clever demo from a system that improves every season without anyone reinventing it.

This article is the data-processing cornerstone of the series. It walks through the loop a fleet runs to convert operational runs into labeled data, models, and improved behavior — the engineering pipeline that ingests sensor logs, the labeling tricks that make annotation affordable, the active-learning math that decides what is worth a human's attention, and the MLOps that ships a new model to the field safely. None of it is mower-specific. It is the same flywheel for any robot fleet, on the ground or in the air.

01 The data flywheel

The "flywheel" is a deliberately mechanical metaphor: a heavy wheel that is hard to start but, once spinning, stores momentum and resists slowing. In a data product the wheel has four stations, and each turn feeds the next:

  • Operations — the fleet runs in the field and generates sensor logs, outcomes, and interventions.
  • Data — those runs are ingested, curated, and turned into labeled training examples.
  • Models — better data trains better perception, prediction, and policy models.
  • Better operations — improved models reduce failures and unlock new sites, which generates more and more diverse runs — and the wheel comes back around heavier.

The compounding is the point. A robot that handles more conditions gets deployed more widely; wider deployment surfaces rarer edge cases; those edge cases, once labeled and learned, raise the floor again. The hard truth underneath the metaphor is that the wheel does not turn for free. Raw data is not knowledge; it is a liability until processed, and most of the engineering lies in the hand-offs between stations, not in the stations themselves.

[Figure 1 diagram: four stations in a loop. Operations (the fleet runs) → Data (ingest · label · curate) → Models (train · evaluate) → Better ops (fewer failures · new sites). Active-learning spoke: uncertain → label.]
Fig. 1 — The flywheel. Operations produce data; data trains models; better models improve operations, which produce more and richer data. Active learning (the red spoke) routes the model's own uncertainty back to human labelers so each turn spends effort where it pays.

02 From robot to repository: the data pipeline

Before any learning happens there is a sizeable data-engineering problem: getting bytes off the robot, intact and intelligible, at sane cost. A single platform might carry a camera or two, an IMU at hundreds of hertz, wheel encoders, GNSS, and sometimes lidar — together easily tens to hundreds of gigabytes per operating hour of raw streams. You cannot ship all of it, and you cannot keep all of it forever.

The pipeline has a few hard requirements:

  • Time synchronization. Sensors run on different clocks at different rates. Fusing a camera frame with the IMU sample that was true at the instant of exposure requires a common, monotonic timebase and careful timestamping at the source — hardware-triggered where possible. A label is only as good as the time alignment beneath it; a 50 ms skew between image and pose can silently corrupt a traversed-path label.
  • Storage formats. Logs are typically captured in a container such as ROS bag or MCAP for fidelity, then transcoded into columnar, query-friendly formats (Parquet for tabular telemetry, sharded records for imagery) so that training jobs can read slices without rehydrating whole sessions.
  • Tiering and triage. The cost of raw sensor data is dominated by the boring streams — long stretches of nominal driving that look like every other nominal stretch. The fleet uploads compressed summaries and metadata by default, and only the interesting windows at full fidelity: interventions, near-misses, high model uncertainty, novel scenes. This event-triggered upload is the first, crudest filter of the flywheel, and it can cut bandwidth and storage by an order of magnitude.
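The time-synchronization requirement above can be sketched as a nearest-timestamp match with a skew budget. This is a minimal illustration, assuming both streams are already stamped on one common timebase; the function name `match_nearest` and the 5 ms default tolerance are hypothetical choices for the example, not values from the pipeline described here.

```python
import numpy as np

def match_nearest(frame_ts, imu_ts, max_skew_s=0.005):
    """For each camera timestamp, return the index of the nearest IMU sample,
    or -1 when even the nearest sample is further away than max_skew_s.

    Assumes imu_ts is sorted ascending, has at least two samples, and shares
    a timebase (seconds) with frame_ts.
    """
    frame_ts = np.asarray(frame_ts, dtype=float)
    imu_ts = np.asarray(imu_ts, dtype=float)

    # Index of the first IMU sample at or after each frame, clipped so that
    # both a left and a right neighbour always exist.
    right = np.clip(np.searchsorted(imu_ts, frame_ts), 1, len(imu_ts) - 1)
    left = right - 1

    # Choose whichever neighbour is closer in time.
    pick_left = (frame_ts - imu_ts[left]) <= (imu_ts[right] - frame_ts)
    idx = np.where(pick_left, left, right)

    # Reject matches that exceed the skew budget instead of silently fusing them.
    skew = np.abs(imu_ts[idx] - frame_ts)
    return np.where(skew <= max_skew_s, idx, -1)

# 100 Hz IMU for one second; the last frame lies far outside the IMU log.
imu = np.arange(100) * 0.01
matches = match_nearest([0.101, 0.203, 5.0], imu)
```

Returning -1 rather than the closest available sample is the design choice that matters: a frame with no sufficiently close pose is dropped from labeling instead of receiving a misaligned one.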

The output of this stage is not "data" in the loose sense — it is an indexed, time-aligned, queryable corpus where a curator can ask "show me every frame where the robot was within 30 cm of a bed boundary in low light" and get an answer in seconds. That index is the substrate everything downstream stands on.

03 Auto-labeling and weak supervision

Human annotation is the single most expensive line item in most perception programs, so the highest-leverage move is to not pay for labels you can derive for free. Field robotics is unusually rich in opportunities to do exactly that, because the robot is an embodied agent whose later experience reveals the truth of its earlier observations.

The canonical example is traversed-path-as-label. When a mower drives over a patch of ground without getting stuck, that patch was — by definition — traversable. The robot's own future trajectory, projected back into the camera frame it captured a few seconds earlier, becomes a free positive label for "drivable surface." This is self-supervision in the most literal sense: the system that produces the label is the robot's later state, not a person. Stronger sensors can teach cheaper ones the same way — lidar geometry or RTK-GNSS pose can supervise a camera-only model, so the fielded robot can eventually drop the expensive sensor and still perform.
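As a rough sketch of traversed-path-as-label, the snippet below projects future footprint points into an earlier image with a pinhole model and marks them as positive "drivable" pixels. It assumes the points have already been transformed into the earlier camera's frame (x right, y down, z forward, in metres); `traversed_mask`, the intrinsics matrix `K`, and the `radius_px` dilation are illustrative names and values, not part of the system described here.

```python
import numpy as np

def traversed_mask(points_cam, K, height, width, radius_px=2):
    """Build a boolean mask of pixels the robot later drove over.

    points_cam: (N, 3) future footprint points in the earlier camera frame.
    K:          (3, 3) pinhole intrinsics.
    Only positives are produced; unmarked pixels are unknown, not negative.
    """
    mask = np.zeros((height, width), dtype=bool)
    pts = np.asarray(points_cam, dtype=float)
    pts = pts[pts[:, 2] > 0.0]              # keep points in front of the camera

    uvw = pts @ np.asarray(K, dtype=float).T  # homogeneous pixel coordinates
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)

    for ui, vi in zip(u, v):
        if 0 <= ui < width and 0 <= vi < height:
            # Paint a small square so sparse trajectory points form a patch.
            mask[max(vi - radius_px, 0): vi + radius_px + 1,
                 max(ui - radius_px, 0): ui + radius_px + 1] = True
    return mask

K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 24.0],
              [0.0, 0.0, 1.0]])
points = [(0.0, 0.1, 1.0),    # ahead of the robot, lands inside the image
          (0.0, 0.1, -1.0),   # behind the camera, ignored
          (10.0, 0.0, 1.0)]   # projects outside the image, ignored
mask = traversed_mask(points, K, height=48, width=64)
```

The mask is deliberately one-sided: it records where the robot did drive and says nothing about the rest of the frame.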

These cheap labels are weak: noisy, biased, and incomplete. Traversed-path tells you what was drivable, never what was not; the robot avoids the rose bed, so it never harvests a label proving the bed is off-limits. Weak-supervision practice is to treat such signals as plentiful-but-noisy sources to be combined and denoised, then to spend the scarce human budget on the cases the cheap labels cannot reach. Which raises the obvious question: which cases?

Key idea

The cheapest label is the one you never paid a human for. In field robotics, the robot's future — where it actually drove, what a stronger sensor actually saw — is a free, if noisy, teacher for its past observations. Auto-labeling sets the floor; human effort is reserved for the cases auto-labeling can't reach.

04 Active learning: spending the labeling budget

If you can only afford to label a small fraction of what the fleet records, you should label the fraction that teaches the model the most. That is the premise of active learning: let the model nominate the examples it is most unsure about, label those, and retrain, rather than labeling at random [1]. Settles' survey is the standard map of the query strategies; the workhorse among them is uncertainty sampling.

For a classifier outputting class probabilities \(p_\theta(y\mid x)\), the most general uncertainty score is the predictive Shannon entropy — high when the model spreads probability across many classes, low when it is confident in one:

$$ \mathbb{H}\big[y \mid x\big] = -\sum_{c} p_\theta(y=c \mid x)\,\log p_\theta(y=c \mid x). $$

Active learning then selects, from the unlabeled pool \(\mathcal{U}\), the example that maximizes an acquisition function \(a(x)\) — here, the entropy:

$$ x^* = \arg\max_{x \in \mathcal{U}} \; a(x), \qquad a(x) = \mathbb{H}\big[y \mid x\big]. $$

Entropy is one of a family. Least-confidence sampling scores \(1-\max_c p_\theta(y=c\mid x)\); margin sampling looks at the gap between the top two class probabilities and queries the examples where that gap is smallest. For models that can express epistemic uncertainty (deep ensembles, MC-dropout), BALD-style scores prefer points where the models disagree, not merely where any single model is unsure. The unifying idea is to send the labeler to the frontier of the model's competence.
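The three single-model scores can be written in a few lines of NumPy. This is a minimal sketch, assuming `probs` is an (N, C) array of class probabilities; the function names are illustrative.

```python
import numpy as np

def entropy(probs, eps=1e-12):
    """Predictive Shannon entropy per row; higher means more uncertain."""
    p = np.clip(np.asarray(probs, dtype=float), eps, 1.0)
    return -(p * np.log(p)).sum(axis=1)

def least_confidence(probs):
    """1 - max class probability; higher means more uncertain."""
    return 1.0 - np.asarray(probs, dtype=float).max(axis=1)

def margin(probs):
    """Gap between the top two class probabilities; LOWER means more uncertain."""
    top2 = np.sort(np.asarray(probs, dtype=float), axis=1)[:, -2:]
    return top2[:, 1] - top2[:, 0]

probs = np.array([
    [0.98, 0.01, 0.01],   # confident
    [0.34, 0.33, 0.33],   # spread across all three classes
    [0.50, 0.50, 0.00],   # torn between exactly two classes
])
by_entropy = int(np.argmax(entropy(probs)))   # picks row 1
by_margin = int(np.argmin(margin(probs)))     # picks row 2
```

The strategies need not agree: entropy prefers the row that is spread over every class, while margin prefers the row with a dead tie between two classes. Which behaviour is wanted depends on whether the labeler's time is better spent on broad confusion or on a specific two-way ambiguity.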

Pure uncertainty has a known failure mode: the most uncertain points are often the most useless — ambiguous, corrupted, or mislabeled outliers that no amount of labeling will resolve. A more principled (if costlier) objective is expected error reduction: choose the example whose label is expected to most reduce the model's future error over the whole pool, in expectation over its possible labels,

$$ x^* = \arg\min_{x \in \mathcal{U}} \; \mathbb{E}_{y \sim p_\theta(y\mid x)}\!\left[\, \sum_{x' \in \mathcal{U}} \mathbb{H}\big[y' \mid x';\, \mathcal{L}\cup\{(x,y)\}\big] \right]. $$

This optimizes the thing you actually care about — generalization — rather than a proxy, but it requires a hypothetical retrain per candidate, so in practice it is approximated or reserved for small pools. Most fleets ship uncertainty sampling with a diversity/outlier guard and call it a day.

Pseudocode — the active-learning loop, end to end

# One turn of the flywheel: infer -> score -> select -> label
# -> retrain -> gate -> deploy. Runs on each batch of fleet logs.
def flywheel_turn(model, pool, labeled, budget, val):
    # 1. INFER over unlabeled field data
    probs   = model.predict_proba(pool)            # p_theta(y|x)

    # 2. SCORE uncertainty (entropy) + reject outliers
    score   = entropy(probs)                        # H[y|x]
    score   = score * novelty_ok(pool)              # 0 for garbage/dupes

    # 3. SELECT a diverse, high-score batch
    query   = diverse_topk(pool, score, k=budget)   # spend the budget

    # 4. LABEL: auto where we can, humans where we must
    auto    = self_supervise(query)                 # traversed-path, lidar->cam
    human   = annotate(query - auto.covered)        # the scarce resource
    labeled = labeled + auto + human

    # 5. RETRAIN on the grown, versioned dataset
    cand    = train(base=model, data=labeled)

    # 6. EVAL-GATE: ship only on a held-out, frozen slice
    if regresses(cand, model, val, slices=SAFETY_SLICES):
        return model                          # block: do no harm

    # 7. DEPLOY via canary, then fleet-wide OTA
    return canary_then_rollout(cand)

Notice that the loop never trusts itself blindly: outliers are scored to zero, auto-labels carry the work humans would otherwise do, and nothing reaches the fleet without clearing an evaluation gate. The loop's job is to keep the wheel turning and to refuse to ship a regression.
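The pseudocode leaves `diverse_topk` undefined. One plausible reading, offered as a sketch rather than as the loop's actual implementation, is a greedy pass that walks candidates from highest score down and skips any candidate whose embedding sits too close to one already chosen. The embedding input and the `min_dist` threshold are assumptions made for the example.

```python
import numpy as np

def diverse_topk(embeddings, score, k, min_dist=0.5):
    """Greedy high-score selection with a minimum-distance diversity guard.

    embeddings: (N, D) feature vectors, one per pool example (assumed input).
    score:      (N,) acquisition scores; zero marks a rejected outlier or dupe.
    Returns up to k indices, best first.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    score = np.asarray(score, dtype=float)
    chosen = []
    for i in np.argsort(-score):              # highest score first
        if len(chosen) == k:
            break
        if score[i] <= 0.0:                   # zeroed by the outlier guard
            continue
        if chosen:
            dists = np.linalg.norm(embeddings[chosen] - embeddings[i], axis=1)
            if dists.min() < min_dist:        # near-duplicate of a pick
                continue
        chosen.append(int(i))
    return chosen

emb = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [9.0, 9.0]]
scores = [0.9, 0.8, 0.5, 0.0]
batch = diverse_topk(emb, scores, k=3)
```

Here the second-highest score is skipped because it is nearly identical to the first pick, and the zero-scored example is never selected, so the batch comes back smaller than the budget instead of being padded with junk.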

05 Hard-example mining and sim-to-real

Closely related to active learning is hard-example mining: instead of asking which examples to label, ask which already-labeled examples the model keeps getting wrong, and train on them more. Mining the long tail of failures — the cases where the loss is highest — focuses gradient updates where the model is weakest, and counters the tyranny of the common case in which a million frames of empty lawn drown out the one frame with a child's toy in the grass.
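A minimal sketch of the mining step, assuming a classifier that outputs per-class probabilities and using per-example negative log-likelihood as the loss; the helper names are illustrative.

```python
import numpy as np

def per_example_nll(probs, labels, eps=1e-12):
    """Negative log-likelihood of the true class, one value per example."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    true_class_p = probs[np.arange(len(labels)), labels]
    return -np.log(np.clip(true_class_p, eps, 1.0))

def hardest(probs, labels, k):
    """Indices of the k highest-loss labeled examples, hardest first."""
    loss = per_example_nll(probs, labels)
    return np.argsort(-loss)[:k]

probs = [[0.9, 0.1],   # right and confident: low loss
         [0.2, 0.8],   # confidently wrong: highest loss
         [0.6, 0.4]]   # wrong, less confidently
labels = [0, 0, 1]
hard_idx = hardest(probs, labels, k=2)
```

The selected indices would then be oversampled or up-weighted in the next training pass. One caveat worth keeping in mind: the very highest-loss examples are also where label errors tend to collect, so a quick audit of the top of the list is usually worthwhile before training on it harder.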

Some edge cases are too rare or too dangerous to wait for in the field. Here the flywheel borrows from simulation. Domain randomization trains the model on synthetic scenes whose textures, lighting, camera poses, and clutter are randomized so aggressively that the real world looks, to the model, like just one more variation it has already seen [2]. Tobin et al. showed that an object detector trained only on non-photorealistic simulated images could transfer to the real world well enough to drive robotic grasping, early evidence that you can manufacture rare training data instead of waiting to record it. In a mowing context this means generating obstacles, weather, and sun angles on demand, then mixing that synthetic data with real logs so the model is robust before it ever meets the situation on a lawn.
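In code, domain randomization starts as nothing more than a seeded sampler over scene parameters that a renderer then consumes. The sketch below shows only that sampling step; every field name and range is a made-up placeholder, and no particular simulator is assumed.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class SceneParams:
    """Hypothetical knobs a synthetic-scene renderer might expose."""
    sun_elevation_deg: float
    grass_hue_shift: float
    camera_pitch_deg: float
    n_obstacles: int
    wet: bool

def sample_scene(rng):
    """Draw one randomized scene; ranges are illustrative, not calibrated."""
    return SceneParams(
        sun_elevation_deg=rng.uniform(5.0, 85.0),
        grass_hue_shift=rng.uniform(-0.15, 0.15),
        camera_pitch_deg=rng.uniform(-25.0, -5.0),
        n_obstacles=rng.randint(0, 8),      # inclusive on both ends
        wet=rng.random() < 0.3,
    )

# A seeded generator makes the synthetic batch itself reproducible.
rng = random.Random(42)
scenes = [sample_scene(rng) for _ in range(1000)]
```

Seeding the generator is the detail that ties this back to the rest of the loop: a synthetic batch that can be regenerated from a seed can be versioned like any other dataset.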

06 Versioning, lineage, and reproducibility

Once data is feeding models on a loop, an uncomfortable question appears: which data produced which model? Without an answer you cannot reproduce a result, cannot audit a regression, and cannot roll back safely. The flywheel is only trustworthy if it is reproducible, and that demands the same rigor for datasets that source control brought to code.

  • Dataset versioning. Every training set is an immutable, content-addressed snapshot — a hash over the exact examples and labels — so "model v37 was trained on dataset d91" is a precise, verifiable claim, not a vibe.
  • Lineage. Each label records its provenance: which run it came from, whether it was human- or auto-generated, which labeling model or rule produced it, and when. When a bad auto-labeling heuristic is discovered, lineage lets you find and purge every example it touched.
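  • Reproducibility. A model artifact is pinned to its dataset hash, code commit, hyperparameters, and random seed. Re-running the recipe should then reproduce the model: bit-for-bit when the training stack is deterministic (nondeterministic GPU kernels and data-loader ordering have to be pinned as well), and within a small tolerance otherwise. That is the precondition for trusting an evaluation at all.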
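A content-addressed dataset snapshot can be as simple as a hash over the sorted (example, label) pairs. This is a minimal sketch using only the standard library; the function name and the ID format are assumptions for illustration, and in a real corpus each example ID would itself typically be a hash of the underlying bytes.

```python
import hashlib
import json

def dataset_hash(examples):
    """SHA-256 over (example_id, label) pairs, independent of input order.

    examples: iterable of (example_id, label) tuples of strings.
    Any added, removed, or relabeled example changes the digest.
    """
    digest = hashlib.sha256()
    for example_id, label in sorted(examples):
        # JSON-encode each pair so delimiters inside IDs cannot collide.
        digest.update(json.dumps([example_id, label]).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()

d_a = dataset_hash([("run7/f001", "drivable"), ("run7/f002", "obstacle")])
d_b = dataset_hash([("run7/f002", "obstacle"), ("run7/f001", "drivable")])
d_c = dataset_hash([("run7/f001", "drivable"), ("run7/f002", "drivable")])
```

`d_a` and `d_b` match because only the order differs, while `d_c` differs because one label changed. That is the property that turns "trained on dataset d91" into a checkable claim.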

This is unglamorous bookkeeping, and it is exactly the kind of "hidden technical debt" that quietly strangles ML systems when skipped [3]. Sculley and colleagues catalogued how data dependencies, undeclared consumers, and feedback loops accrue maintenance cost that dwarfs the modeling itself: a warning that the pipeline, not the model, is where programs live or die.

07 MLOps: gates, canaries, and OTA updates

A retrained model is a hypothesis, not an improvement, until it is proven on data it has never seen and shipped without breaking the fleet. The operational layer that does this — MLOps — is what makes the flywheel safe to run continuously rather than once a year by hand.

  • Evaluation gates. A candidate must beat the incumbent on a frozen, held-out evaluation set and must not regress on any protected slice — night scenes, wet grass, boundaries, the rare-object set. A model that improves the average while quietly getting worse near flowerbeds fails the gate. Aggregate metrics lie; sliced metrics are the gate.
  • Shadow deployment. The candidate runs alongside the production model on live data without acting on its outputs. You compare its decisions to the incumbent's at zero risk, surfacing disagreements as the next batch of examples to label.
  • Canary rollout. When the gate passes, the model goes to a small, representative slice of the fleet first. Its real-world failure and intervention rates are monitored against the rest of the fleet; only if it holds does it roll out widely.
  • Over-the-air updates & monitoring. Models ship to robots over the air, versioned and reversible, so a regression caught in canary is rolled back in minutes, not a truck-roll. Continuous monitoring of live metrics closes the loop — and feeds the next turn.
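The evaluation gate in the first bullet reduces to a small predicate over sliced metrics. The sketch below assumes each model is summarized as a dictionary of higher-is-better scores keyed by slice name, with an "overall" entry; the slice names, the numbers, and the `tolerance` parameter are illustrative.

```python
def passes_gate(candidate, incumbent, protected_slices, tolerance=0.0):
    """Return (ok, offending_slices) for a candidate model.

    The candidate must strictly beat the incumbent overall and must not
    drop by more than `tolerance` on any protected slice.
    """
    if candidate["overall"] <= incumbent["overall"]:
        return False, ["overall"]
    regressed = [name for name in protected_slices
                 if candidate[name] < incumbent[name] - tolerance]
    return (not regressed), regressed

incumbent = {"overall": 0.90, "night": 0.80, "wet_grass": 0.85}
better_on_average = {"overall": 0.93, "night": 0.84, "wet_grass": 0.79}
better_everywhere = {"overall": 0.93, "night": 0.84, "wet_grass": 0.86}

blocked = passes_gate(better_on_average, incumbent, ["night", "wet_grass"])
shipped = passes_gate(better_everywhere, incumbent, ["night", "wet_grass"])
```

The first candidate improves the aggregate and still fails, because it regresses on wet grass. Returning the offending slice names along with the verdict gives the next labeling batch a concrete target.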

08 Watching for drift

The world the model was trained on is not the world it will face next month. Seasons change, grass browns, the fleet enters new regions, the sun sits at new angles. Distribution shift — the slow divergence of live data from training data — silently erodes accuracy, and detecting it early is the difference between a graceful retrain and a field failure. The flywheel must monitor not just model outputs but the statistics of its inputs.

The standard lightweight tool is the Population Stability Index (PSI), a symmetrized form of the Kullback–Leibler divergence. Bin a feature (or a model score) into \(B\) buckets; let \(e_i\) and \(a_i\) be the expected (training) and actual (live) proportions in bucket \(i\). Then:

$$ \mathrm{PSI} = \sum_{i=1}^{B} \big(a_i - e_i\big)\,\ln\!\frac{a_i}{e_i}. $$

The rule of thumb that has survived from credit-risk monitoring into ML: \(\mathrm{PSI}<0.1\) is stable, \(0.1\le \mathrm{PSI}<0.25\) warrants attention, and \(\mathrm{PSI}\ge 0.25\) signals a major shift that likely demands a retrain. For a direct information-theoretic view of how far the live distribution \(P\) has moved from the reference \(Q\), the KL divergence itself is the underlying quantity,

$$ D_{\mathrm{KL}}\!\left(P \,\|\, Q\right) = \sum_{i} P(i)\,\log\frac{P(i)}{Q(i)} \;\ge\; 0, $$

zero only when the distributions match. (With \(P = a\) and \(Q = e\) on the binned proportions, PSI is exactly \(D_{\mathrm{KL}}(P\|Q) + D_{\mathrm{KL}}(Q\|P)\), since \((a_i - e_i)\ln\frac{a_i}{e_i} = a_i\ln\frac{a_i}{e_i} + e_i\ln\frac{e_i}{a_i}\); that is why it is symmetric and convenient as a single dashboard number.) When drift crosses a threshold, the system does not panic; it triggers targeted collection in the shifted region, which is to say it gives the flywheel a reason to turn.
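Both quantities are a few lines of NumPy, which also makes the PSI/KL relationship easy to check numerically. This is a sketch assuming the feature has already been binned into proportions; the `eps` floor that guards against empty buckets is a common practical choice, not something the formula itself specifies.

```python
import numpy as np

def psi(expected, actual, eps=1e-6):
    """Population Stability Index between two binned distributions."""
    e = np.clip(np.asarray(expected, dtype=float), eps, None)
    a = np.clip(np.asarray(actual, dtype=float), eps, None)
    e, a = e / e.sum(), a / a.sum()          # renormalize after flooring
    return float(np.sum((a - e) * np.log(a / e)))

def kl(p, q):
    """D_KL(p || q) for strictly positive binned proportions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

train_bins = [0.25, 0.25, 0.25, 0.25]   # expected (training) proportions e_i
live_bins = [0.10, 0.20, 0.30, 0.40]    # actual (live) proportions a_i

drift = psi(train_bins, live_bins)
symmetric_kl = kl(live_bins, train_bins) + kl(train_bins, live_bins)
```

For these example bins `drift` comes out near 0.23, inside the "warrants attention" band, and it agrees with `symmetric_kl` up to floating-point error.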

09 Centralized vs federated fleet learning

A fleet is a sensor network, and there are two ways to learn from it. The centralized approach hauls (filtered) data back to a data platform, where curation, labeling, training, and evaluation happen with full visibility — the model that comes out is shared by the whole fleet. This is simplest and gives the cleanest control over data quality and versioning; it is the default, and the right default for most ground robots whose data is not privacy-sensitive.

The federated alternative trains across robots without centralizing the raw data: each robot computes a model update on its own logs, and only the updates (gradients or weights), not the data, are aggregated centrally. This matters when data is sensitive (imagery of private property), when bandwidth is the binding constraint, or when regulation forbids moving raw data across borders. The tradeoff is engineering complexity and weaker data-quality control — you cannot inspect what you never centralize. In practice a fleet often runs a hybrid: centralize the rare, valuable, event-triggered slices; keep the bulk on-device; and let federated updates carry what cannot leave the robot.
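The central aggregation step can be illustrated with a sample-weighted average of per-robot updates, in the style of federated averaging (FedAvg). That specific scheme is an assumption for the sketch; the text above does not say which aggregation rule a given fleet uses, and real deployments add details such as clipping and secure aggregation that are omitted here.

```python
import numpy as np

def federated_average(updates, n_samples):
    """Weighted mean of per-robot parameter updates.

    updates:   list of equally shaped arrays, one per robot.
    n_samples: number of local examples behind each update; robots with
               more data get proportionally more weight.
    """
    weights = np.asarray(n_samples, dtype=float)
    weights = weights / weights.sum()
    stacked = np.stack([np.asarray(u, dtype=float) for u in updates])
    # Contract the robot axis: sum_r weights[r] * stacked[r].
    return np.tensordot(weights, stacked, axes=1)

robot_updates = [np.array([1.0, 1.0]), np.array([3.0, 5.0])]
local_counts = [1, 3]      # the second robot saw three times as much data
global_update = federated_average(robot_updates, local_counts)
```

Only the update arrays and the counts cross the network in this sketch; the images that produced them stay on the robot, which is also why nobody can inspect them centrally.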

10 Why the loop is a moat

Models are increasingly commodities; the recipes are published and the open weights are good. What is hard to copy is a fleet that has been running long enough to have seen the long tail of the operating environment — and a pipeline disciplined enough to have captured, labeled, and learned from it. That asset compounds: more deployment yields more data yields better models yields more deployment, and a competitor starting today has to traverse the same loop from a standstill.

The return on each turn is not constant, though. Early data is enormously valuable because the model is bad and almost everything is informative; later, the marginal unlabeled frame is usually redundant, and value concentrates in the rare, the novel, and the failed. This is precisely why active learning, hard-example mining, and drift-triggered collection matter more as the program matures — they are how you keep extracting signal once random data has stopped paying. The flywheel is a moat not because data is scarce, but because the right data, well-processed, on a loop is.

11 How this transfers to drones

The flywheel is indifferent to what kind of robot feeds it. Operations produce data; data trains models; better models improve operations — the wheel turns the same whether the platform rolls or flies, and almost every technique above ports without modification.

↪ Transfers to drones & aerial robots

An aerial fleet is, if anything, a purer data flywheel. The pipeline is the same: time-synced ingest, tiered upload of only the interesting flights, versioned datasets, eval-gated OTA model updates. Active learning on drone imagery is the obvious win: aerial surveys generate enormous image volumes, almost all of them redundant, so entropy/BALD acquisition to find the few frames worth a human's label is not a luxury but a necessity. Self-supervision transfers too: a drone's later, closer pass can label its earlier, distant view, and a higher-resolution sensor can teach a lighter one. Domain randomization [2] carries over as well, although the cited result was demonstrated on a robot-arm grasping task rather than in flight. Drift detection by PSI or KL flags the season, region, or altitude the model has not seen. And federated fleet learning is often more compelling in the air, where individual flights may be bandwidth-limited or privacy-sensitive. Build the data platform for a fleet of mowers and you have built it for a fleet of drones; only the sensors on the front end change.

That is the throughline of NLR's program. We do not think of ourselves as building a mower — we build a fleet and the data platform that learns from it. The environment changes — snow, grass, altitude — but the loop is one loop, and a loop that has been turning longer is the hardest thing in robotics to catch.

Sources & further reading

  1. Settles, B. "Active Learning Literature Survey." Computer Sciences Technical Report 1648, University of Wisconsin–Madison, 2009. The standard survey of query strategies, including uncertainty sampling, entropy acquisition, and expected error reduction.
  2. Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W., & Abbeel, P. "Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World." IEEE/RSJ IROS, 2017. arXiv:1703.06907
  3. Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., Chaudhary, V., Young, M., Crespo, J.-F., & Dennison, D. "Hidden Technical Debt in Machine Learning Systems." Advances in Neural Information Processing Systems (NeurIPS), 2015. The canonical account of data dependencies, feedback loops, and pipeline debt in production ML.
  4. Ratner, A., Bach, S.H., Ehrenberg, H., Fries, J., Wu, S., & Ré, C. "Snorkel: Rapid Training Data Creation with Weak Supervision." Proceedings of the VLDB Endowment, 2017. doi:10.14778/3157794.3157797 — programmatic weak supervision for combining noisy labeling sources.
  5. Gal, Y., Islam, R., & Ghahramani, Z. "Deep Bayesian Active Learning with Image Data." ICML, 2017. arXiv:1703.02910 — BALD-style acquisition with model uncertainty for deep nets.