Seeing the child before the blade

Most engineering problems are negotiations. You trade a little accuracy for a little speed, a little cost for a little margin, and the world forgives you when the balance is slightly off. The problem in this article is not like that. A robotic mower drives a blade tip moving at well over fifty metres per second through the exact spaces where small children play and pets sleep. The requirement (detect a person or animal in the path and stop the blade before contact) has no failure rate that a parent would recognize as acceptable. Everything else NLR builds is in service of getting this one thing right.

This article walks through how that guarantee is actually constructed: the perception pipeline that finds a person or animal in a camera frame, the depth sensing that tells us how far away they are, the latency budget that decides whether we can stop in time, and the layered functional-safety architecture that ensures no single broken sensor or model can ever leave the blade running toward someone. The throughline is a principle worth stating up front: safety is not a feature you add to perception — it is an architecture you wrap around it, because perception alone will eventually be wrong.

01 The one requirement that cannot fail

In detection, two kinds of error exist. A false positive is a phantom: the robot stops for a shadow, a clump of cut grass, a plastic bag. The cost is an annoyed owner and an unfinished stripe. A false negative is the other kind: a real person or animal is in the path and the system does not register them. In a normal vision application these two errors trade off against each other and you pick a sensible balance. Here they are not remotely symmetric. A thousand false stops are a nuisance; a single false negative is a catastrophe. The entire design is therefore deliberately, unapologetically skewed: we accept many nuisance stops to drive the probability of a missed person toward zero.

That asymmetry has a second consequence. Because we can never prove a learned perception model has a zero miss rate, the model is never allowed to be the only thing standing between the blade and a person. The deep network is the first line of defense and the smartest one — but it sits inside layers of simpler, independently-failing mechanisms whose combined miss probability is far smaller than any one of them. That is the idea of defense in depth, and the rest of this article is mostly about building those layers correctly.

Key idea

A learned detector can be excellent and still occasionally wrong. So the safety case never rests on the detector alone. It rests on several diverse mechanisms failing independently — vision, depth, contact, and lift/tilt — arranged so that the blade stops if any of them fires. The math of independent failures is what turns a 99-point-something detector into a system you can put near a child.

02 The detection pipeline

The perception front end has one job: turn a stream of camera frames into a timely, well-localized answer to "is there a person or animal in or near my path, and where?" The dominant modern approach is a convolutional neural network (CNN) trained for object detection — predicting both a class label and a bounding box for every relevant object in the image.

For a robot the choice between detector families is mostly a choice about latency. Two-stage detectors (the R-CNN lineage) first propose regions, then classify them; they are accurate but comparatively slow. One-stage detectors, with YOLO and SSD as the archetypes, collapse detection into a single forward pass that regresses boxes and class probabilities directly from the image. Redmon et al.'s YOLO [1] framed detection as one regression problem over a grid and ran in real time; SSD [2] added multi-scale feature maps so that objects of very different sizes are handled in the same pass. For a moving machine that must decide every few tens of milliseconds, the one-stage family is the natural fit: predictable, bounded inference time matters more than the last point of average precision.

On top of detection we add instance segmentation for the cases that matter most. A bounding box around a crawling child is loose and ambiguous; a pixel-accurate mask tells us precisely which pixels are person and which are lawn, which sharpens the depth estimate and the distance-to-contact. Mask R-CNN [3] is the canonical method here: it adds a segmentation branch to the detector so the system outputs masks alongside boxes. NLR runs detection on every frame and segmentation on the near-field region of interest, where geometry is most safety-critical.

03 Small, occluded, and partly-seen subjects

The easy case — an adult standing upright in good light — is nearly solved. The cases that keep us up at night are the ones that violate the assumptions detectors are usually trained on:

Small and low to the ground. A crawling infant occupies a fraction of the pixels of a standing adult and presents an unfamiliar silhouette. Detectors lose recall on small objects; the standard mitigations — multi-scale feature pyramids, high input resolution near the ground plane, and training data deliberately rich in low-to-the-ground subjects — are not optional here, they are the requirement.
Occluded and partially visible. A child may be half-behind a shrub; a leg or an arm is all that enters the frame. The detector has to fire on a part of a person, not insist on the whole. We train explicitly on partial views and tune the operating point so that a confident detection of a limb is enough to stop.
Camouflaged animals. A tabby cat or a sleeping dog in tall, dappled grass is close to the worst case for a vision model — low contrast, broken outline, texture that mimics the background. This is exactly where a single sensor is not enough, and where depth and contact layers earn their place.

The honest summary is that no detector handles all of these with the reliability a blade demands. That is not a reason to tune the model harder and hope; it is the reason the architecture never depends on the model being right.

04 From a box to a distance

Knowing that a person is in frame is not enough — we need to know how far, because distance sets the time we have. A bounding box is 2-D; recovering range requires depth. Three sensing modes are common, and NLR uses them in combination:

Stereo vision. Two cameras a known baseline $B$ apart see the same point at slightly different horizontal positions. That difference — the disparity $d$ — yields depth by triangulation: \[ Z = \frac{f\,B}{d}, \] where $f$ is the focal length in pixels. Note the consequence: depth error grows with the square of distance for a fixed disparity error, so stereo is most accurate exactly where it matters most — close to the machine.
Time-of-flight (ToF) / depth cameras. These measure range directly by timing emitted light, giving a dense depth image that does not depend on visual texture — valuable against that low-contrast cat in the grass.
Fusing depth with the mask. Project the instance mask onto the depth map and you get the distance to the nearest pixel of a confirmed person or animal, not the distance to the ground in front of them. That nearest-point distance is what the stopping calculation consumes.
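The triangulation formula and the mask-to-depth fusion above can be sketched in a few lines. This is a minimal illustration, not NLR's implementation: the function names, the list-of-lists depth map, and the example numbers (a 700 px focal length, a 0.12 m baseline) are all assumptions chosen for readability. The error function uses the first-order relation $|\Delta Z| \approx \frac{Z^2}{f B}\,|\Delta d|$, which follows from differentiating $Z = fB/d$.

```python
def stereo_depth(f_px, baseline_m, disparity_px):
    """Depth by triangulation: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive to yield a depth")
    return f_px * baseline_m / disparity_px


def depth_error(f_px, baseline_m, z_m, disparity_err_px):
    """First-order depth error for a fixed disparity error: Z^2 / (f * B) * |dd|."""
    return z_m ** 2 / (f_px * baseline_m) * disparity_err_px


def nearest_masked_range(depth_map, mask):
    """Smallest valid depth among pixels the instance mask marks as subject.

    Returns None when no masked pixel has a valid (positive) depth, so the
    caller can treat "no range available" as a fault instead of "no hazard".
    """
    best = None
    for depth_row, mask_row in zip(depth_map, mask):
        for z, is_subject in zip(depth_row, mask_row):
            if is_subject and z is not None and z > 0:
                if best is None or z < best:
                    best = z
    return best


# Hypothetical 2x3 depth image in metres; 0.0 marks an invalid depth pixel.
depth_map = [[3.0, 2.5, 0.0],
             [1.8, 0.9, 2.2]]
# Mask of the confirmed subject; the 0.9 m pixel is lawn, not subject.
mask = [[0, 1, 1],
        [1, 0, 0]]
nearest = nearest_masked_range(depth_map, mask)  # nearest subject pixel, not nearest ground
```

Doubling the range in `depth_error` quadruples the result, which is the quadratic growth described in the stereo item; the mask keeps the closer ground pixel from being mistaken for the subject's range.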

Depth also acts as a sanity check on the detector. An object the detector calls a "person" that the depth map says is two metres tall and ten metres away has a consistent geometry; a "person" that is the size of a leaf at half a metre is a false positive the geometry can reject. Cross-checking class against scale is a cheap, powerful filter.
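One way to sketch that class-versus-scale check is with the pinhole relation $H \approx h_{\text{px}}\,Z / f$, which converts a box height in pixels into a physical height at range $Z$. The height bounds, the focal length, and the function names below are illustrative assumptions, not values from NLR's system. In practice the lower bound would have to be set low enough that partial views (a single limb entering the frame) are never rejected.

```python
# Assumed plausible physical heights in metres per class (illustrative only).
PLAUSIBLE_HEIGHT_M = {
    "person": (0.2, 2.3),
    "animal": (0.1, 1.2),
}


def physical_height_m(box_height_px, range_m, f_px):
    """Pinhole model: physical height = pixel height * range / focal length."""
    return box_height_px * range_m / f_px


def scale_consistent(label, box_height_px, range_m, f_px):
    """Return False only when geometry positively contradicts the class.

    If the range is missing or invalid, or the class has no bounds, the
    detection is kept: the filter may remove false alarms but must never
    discard a detection it cannot check.
    """
    if range_m is None or range_m <= 0 or label not in PLAUSIBLE_HEIGHT_M:
        return True
    low, high = PLAUSIBLE_HEIGHT_M[label]
    return low <= physical_height_m(box_height_px, range_m, f_px) <= high


# A 140 px box at 10 m with f = 700 px is about 2 m tall: consistent.
tall_person_ok = scale_consistent("person", 140, 10.0, 700)
# A 70 px box at 0.5 m is about 5 cm tall: leaf-sized, rejected as a "person".
leaf_ok = scale_consistent("person", 70, 0.5, 700)
```

The asymmetry is deliberate in this sketch: an unverifiable detection passes through to the stop logic, so the filter can only reduce nuisance stops, never create a miss.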

05 The latency budget and time-to-contact

Detection only helps if it happens soon enough. The governing quantity is time-to-contact: how long until the blade reaches the hazard. If the robot closes the gap to a person at relative speed $v$ and the nearest part of that person is at distance $Z$, the time available is

$$ t_{\text{contact}} = \frac{Z}{v}. $$

Against that we must spend a chain of delays before the blade actually stops. Total response time is the sum of the time to sense, the time to infer, the time to decide, and the time to act:

$$ t_{\text{stop}} = t_{\text{sense}} + t_{\text{infer}} + t_{\text{decide}} + t_{\text{actuate}}, $$

where $t_{\text{sense}}$ is camera exposure and readout, $t_{\text{infer}}$ is the network forward pass, $t_{\text{decide}}$ is fusion and gating, and $t_{\text{actuate}}$ is the mechanical time for the blade brake and drive to bring everything to rest. The safety condition is simply that we finish in time, with margin $t_{\text{margin}}$ to spare:

$$ t_{\text{sense}} + t_{\text{infer}} + t_{\text{decide}} + t_{\text{actuate}} \;+\; t_{\text{margin}} \;\le\; \frac{Z}{v}. $$

Read this inequality as a design contract. Each term on the left is a budget you must hold under the worst case, not the average. The right side shrinks as the machine moves faster or the subject appears closer, so the system enforces the inequality by controlling the right side too: it caps travel speed so that there is always enough distance to stop within budget, and it slows further as detected objects approach. A useful way to express the stopping distance directly — reaction distance plus braking distance — is

$$ s_{\text{stop}} = \underbrace{v\,(t_{\text{sense}}+t_{\text{infer}}+t_{\text{decide}})}_{\text{reaction distance}} \;+\; \underbrace{\frac{v^2}{2a}}_{\text{braking distance}}, $$

with $a$ the achievable deceleration of the platform and the blade brake. Setting $s_{\text{stop}}\le Z_{\text{detect}}$ — the range at which we can reliably detect — and solving for $v$ gives the maximum safe speed. The quadratic $v^2/2a$ term is why a small increase in speed costs a disproportionate increase in stopping distance, and why NLR's platforms run deliberately slowly near anything they cannot see through.
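Solving $s_{\text{stop}} \le Z_{\text{detect}}$ for $v$ is a quadratic in $v$. Writing $t_r = t_{\text{sense}}+t_{\text{infer}}+t_{\text{decide}}$, the positive root is $v_{\max} = -a\,t_r + \sqrt{a^2 t_r^2 + 2\,a\,Z_{\text{detect}}}$. The sketch below evaluates it; the numbers (0.15 s reaction time, 2 m/s² deceleration, 1 m detection range) are made-up illustrations, and any safety margin is assumed to be subtracted from the detection range before the call.

```python
import math


def stopping_distance(v, t_react, a):
    """Reaction distance plus braking distance: v * t_r + v^2 / (2a)."""
    return v * t_react + v * v / (2.0 * a)


def max_safe_speed(z_detect, t_react, a):
    """Largest v with stopping_distance(v) <= z_detect.

    Positive root of v^2 / (2a) + v * t_r - Z = 0.
    """
    if a <= 0:
        raise ValueError("deceleration must be positive")
    if z_detect <= 0:
        return 0.0  # nothing can be detected in time: do not move
    return -a * t_react + math.sqrt((a * t_react) ** 2 + 2.0 * a * z_detect)


# Illustrative values only.
v_max = max_safe_speed(z_detect=1.0, t_react=0.15, a=2.0)
check = stopping_distance(v_max, t_react=0.15, a=2.0)  # equals z_detect up to rounding
```

With the same assumed parameters, going from 1 m/s to 2 m/s raises the stopping distance from 0.40 m to 1.30 m, more than triple, which is the disproportionate cost of the quadratic term.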

Pseudocode — detect → classify → brake safety loop

# Runs every cycle. ANY layer can command STOP.
# Fail-safe: unknown / error / timeout  ->  STOP, not GO.
def safety_step(frame, depth, contact, tilt, state):
    # --- diverse-redundant channels, evaluated independently ---
    # Any exception in the perception front end is a fault, and faults stop.
    try:
        dets   = detector(frame)                  # CNN: persons / animals + boxes
        masks  = segment(frame, roi=NEAR_FIELD)   # pixel masks in the danger zone
        hazard = fuse(dets, masks, depth)         # nearest confirmed subject + range Z
    except Exception:
        return STOP("perception_fault")           # confused -> stop, never GO

    # --- watchdog: stale perception is a fault, checked before it is trusted ---
    if now() - dets.timestamp > PERCEPTION_TIMEOUT:
        return STOP("watchdog")

    # --- channel 1: perception + geometry ---
    if hazard and stop_distance(state.v) >= hazard.Z - MARGIN:
        return STOP("perception")        # can't stop in time -> stop now

    # --- channel 2: independent contact bumper ---
    if contact.triggered:
        return STOP("bump")

    # --- channel 3: lift / tilt -> blade must already be stopping ---
    if tilt.angle > TILT_MAX or tilt.lifted:
        return STOP("lift_tilt")      # safety-rated, hardware-backed

    return GO(speed=speed_limit(hazard))   # slow down as subjects near

Two properties of this loop are deliberate. First, every channel can independently command STOP, and nothing can override a stop — stops compose by logical OR. Second, the default on any anomaly (stale frames, an error, a timeout, an unknown reading) is to stop, not to continue. A safety loop that keeps mowing when it is confused is not a safety loop.
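Those two properties can be exercised in isolation with a small combinator. This is a hypothetical sketch, separate from the loop above: `evaluate` and the channel names are invented for illustration, and each channel is modelled as a zero-argument callable that returns `True` (stop) or `False` (clear). Unlike the early-return pseudocode, it evaluates every channel so that all stop reasons are recorded.

```python
def evaluate(channels):
    """OR-compose stop requests from independent channels.

    channels: dict mapping a channel name to a zero-argument callable.
    A channel returns True to demand a stop and False when clear.
    Any other return value, or any exception, is treated as a stop request.
    """
    reasons = []
    for name, check in channels.items():
        try:
            result = check()
        except Exception:
            reasons.append(name + ":fault")    # an error is never "clear"
            continue
        if result is True:
            reasons.append(name)
        elif result is not False:
            reasons.append(name + ":unknown")  # unreadable state is never "clear"
    return ("STOP", reasons) if reasons else ("GO", [])


def offline_sensor():
    raise RuntimeError("sensor offline")


all_clear = evaluate({"vision": lambda: False, "bump": lambda: False})
one_fires = evaluate({"vision": lambda: False, "bump": lambda: True})
confused = evaluate({"vision": offline_sensor, "tilt": lambda: None})
```

`GO` is only reachable when every channel answers an explicit `False`; there is no input that turns a stop request back into motion.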

Fig. 1 — The stopping-distance budget. Perception latency, decision, and actuation must all complete before time-to-contact $Z/v$, with margin to spare. Faster travel or a closer subject moves the contact line left, so the planner caps speed to keep the inequality true.

06 Diverse redundancy: no single failure is dangerous

Redundancy by duplication — two copies of the same camera, the same model — protects against a part breaking but not against the part being wrong in a way both copies share. Two identical detectors are fooled by the same camouflaged cat. The principle that matters for safety is diverse redundancy: combine mechanisms that fail for different reasons, so that whatever defeats one is unlikely to defeat the others. NLR layers four:

Vision — the CNN detector and segmenter. Smart, long-range, but fooled by occlusion, low contrast, and novelty.
Depth — stereo and ToF. Geometry-based, indifferent to texture, cross-checks the vision class against physical scale.
Contact — a physical bumper. Dumb, slow, but utterly independent of any model; if something is touched, the blade stops regardless of what the cameras believe.
Lift / tilt — inertial and switch sensing. If the machine is picked up or tips past a threshold — a child lifting it, a pet bumping it over — the blade is commanded to stop immediately, on a hardware-backed path that does not wait for software.

The power of diversity is quantifiable. If a hazard slips past each independent layer $k$ with probability $p_k$, and the layers fail independently, the probability that all of them miss is the product

$$ P_{\text{miss}} = \prod_{k} p_k. $$

Four layers each missing one time in fifty — individually unremarkable — combine, if truly independent, to a joint miss on the order of $50^{-4}$, roughly one in six million. Independence is the load-bearing word: the engineering work is in choosing mechanisms whose failure modes genuinely do not overlap, and in resisting the false comfort of stacking sensors that all fail in the same fog.
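The arithmetic is easy to check exactly, and the same few lines show how much a shared failure mode costs. The sketch below is illustrative only: it uses exact fractions, and it models the extreme case of two layers that always fail together, which then contribute a single factor to the product instead of two.

```python
import math
from fractions import Fraction


def joint_miss(p_layers):
    """Probability that every layer misses, assuming independent failures."""
    return math.prod(p_layers)


p = Fraction(1, 50)

# Four genuinely independent layers.
independent = joint_miss([p, p, p, p])

# Two of the four layers always fail together (a shared failure mode),
# so only three independent factors remain.
two_correlated = joint_miss([p, p, p])
```

Four independent layers give exactly one in 6,250,000. With one shared failure mode the figure drops to one in 125,000, a fifty-fold loss from a single overlap.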

07 Functional safety: PL, categories, and ISO 18497

So far this is good engineering. Functional safety is the discipline that makes it auditable: a body of standards for how safety-related control systems are designed, rated, and proven. For machinery the central standard is ISO 13849-1 [4], which rates a safety function by its Performance Level (PL), a band from PLa (lowest) to PLe (highest) tied to the probability of a dangerous failure per hour ($\mathrm{PFH_d}$). A blade-stop on a machine that operates around uninvolved people is a high-consequence function, so it targets the upper bands, PLd or PLe, which in turn constrain the architecture.
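For orientation, the $\mathrm{PFH_d}$ bands behind each Performance Level can be written as a small lookup. The band edges below are the commonly tabulated ISO 13849-1 values, reproduced here as a sketch; the function name is hypothetical and the standard itself remains the authority. Falling inside a band is only one condition for claiming a PL, not the whole of it.

```python
# (PL, lower bound inclusive, upper bound exclusive), PFHd in failures per hour.
PL_BANDS = [
    ("a", 1e-5, 1e-4),
    ("b", 3e-6, 1e-5),
    ("c", 1e-6, 3e-6),
    ("d", 1e-7, 1e-6),
    ("e", 1e-8, 1e-7),
]


def pl_band_for(pfhd):
    """Return the PL letter whose PFHd band contains the value, else None.

    None means the value lies outside the tabulated bands; it is not a
    rating by itself.
    """
    for level, low, high in PL_BANDS:
        if low <= pfhd < high:
            return level
    return None


band_d = pl_band_for(5e-7)   # inside the PLd band
band_e = pl_band_for(5e-8)   # inside the PLe band
too_high = pl_band_for(2e-4) # worse than the PLa band
```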

PL is not earned by good intentions; it is determined by concrete design properties: the category of the control architecture (B, 1, 2, 3, 4 — escalating in redundancy and self-diagnosis), the mean time to dangerous failure of components, the diagnostic coverage, and protection against common-cause failure. The highest levels effectively require redundant, self-checking channels with no single point of failure — the same conclusion we reached from the probability argument, now written as a standard. A safety-rated monitored stop means the stopped state is actively supervised, not merely commanded: the system continuously confirms the blade is actually at rest and faults safe if it is not.

For our domain specifically there is ISO 18497 [5], the safety standard for highly automated and autonomous agricultural machinery. Its 2024 revision is a multi-part series; Part 2 addresses the design of obstacle-protection systems, which is exactly the person-and-animal detection problem, and the series sets out verification and validation principles for autonomous operation. ISO 18497 gives us the domain requirements (what an obstacle-protection system must do and how to validate it); ISO 13849 gives us the control-system rating (how reliable the stop function must be). Together they turn "we tried hard" into "here is the rated, documented, independently-assessable safety case."

08 Tuning for the cost of a miss

To engineer the perception layer against this standard we need numbers. Two govern a detector's operating point. Precision is the fraction of alarms that are real; recall (the true-positive rate) is the fraction of real hazards we catch:

$$ \text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}. $$

Every detector exposes a confidence threshold that trades these off, sweeping out a curve. Lower the threshold and recall rises (we catch more real people) while precision falls (more false alarms). For most applications you balance the two — the $F_1$ score, their harmonic mean, is the usual target. For a blade we do not balance them. We weight recall far above precision, using an $F_\beta$ score with $\beta\gg1$ so that missing a person is penalized far more heavily than a false stop:

$$ F_\beta = (1+\beta^2)\,\frac{\text{Precision}\cdot\text{Recall}}{\beta^2\,\text{Precision} + \text{Recall}}. $$

Concretely, we set the operating point at a recall target that approaches one — sweeping the ROC and precision–recall curves to find the threshold that delivers near-total recall on people and animals — and we accept the false-positive rate that comes with it. The nuisance stops are real and we work to reduce them, but never by raising the threshold in a way that risks a miss. The geometry cross-check from Section 04 and the depth and contact layers are what let us run at this aggressive operating point without the machine stopping every few seconds: a vision false positive that depth and scale contradict is filtered before it ever stops the mower, while a true detection is confirmed and obeyed.
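A threshold sweep of this kind can be sketched without any library. The scores, labels, and function names below are invented for illustration and do not come from NLR's data. The sweep walks thresholds from high to low and returns the first, and therefore highest, threshold that meets the recall target, which is the one with the fewest alarms among those that qualify.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when every score >= threshold raises an alarm."""
    tp = fp = fn = 0
    for score, is_hazard in zip(scores, labels):
        alarm = score >= threshold
        if alarm and is_hazard:
            tp += 1
        elif alarm and not is_hazard:
            fp += 1
        elif not alarm and is_hazard:
            fn += 1
    # Convention: with no alarms (or no hazards) the ratio is taken as 1.0.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return precision, recall


def f_beta(precision, recall, beta):
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def threshold_for_recall(scores, labels, target_recall):
    """Highest threshold whose recall meets the target, or None if none does."""
    for t in sorted(set(scores), reverse=True):
        precision, recall = precision_recall(scores, labels, t)
        if recall >= target_recall:
            return t, precision, recall
    return None


# Made-up detector confidences; label 1 marks a real person or animal.
scores = [0.95, 0.90, 0.80, 0.60, 0.40, 0.30, 0.20]
labels = [1, 1, 0, 1, 0, 1, 0]
operating_point = threshold_for_recall(scores, labels, target_recall=1.0)
```

On this toy data the target forces the threshold down to 0.30, catching all four hazards at the price of two false alarms. With $\beta = 3$, a detector at precision 0.5 and recall 1.0 scores higher than one at precision 1.0 and recall 0.5, whereas $F_1$ rates them equally.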

↪ Transfers to drones & aerial robots

An aircraft's version of "see the child before the blade" is detect-and-avoid (also called sense-and-avoid): spot another aircraft, a bird, or an obstacle and manoeuvre clear with performance equivalent to a human pilot. The ingredients are the same (perception under a hard latency budget, range estimation, diverse-redundant sensing, and a fail-safe default), but the consequences of a miss play out in three dimensions and within seconds. Drones add their own safety nets: geofencing to keep the vehicle inside an approved volume, return-to-home and controlled descent as the fail-safe when a link or sensor drops, and a risk-based approval framework, the JARUS SORA (Specific Operations Risk Assessment), that grades an operation's ground and air risk and dictates the mitigations required [6], much as ISO 18497 grades an autonomous ground machine. Detect-and-avoid sensing builds on the same one-stage detectors and depth reasoning described here; the functional-safety mindset (rate the function, remove single points of failure, fail safe) transfers without modification.

09 Validation by scenario coverage

A rated architecture still has to be shown to work, and average accuracy on a benchmark does not show it — the cases that hurt are the rare ones in the tail. NLR validates by scenario coverage: an explicit, growing catalogue of the situations that must be handled, each tested to a pass criterion. The catalogue is deliberately stocked with the hard cases — a child-sized mannequin crawling into the path from behind a shrub; a dark animal lying still in tall grass; a subject in low sun, in rain, in dappled shade; the machine lifted mid-cut; the camera blinded by glare. Each scenario is run across speeds, approach angles, and lighting until the stop is demonstrated with margin, and every field incident or near-miss becomes a new permanent entry, so the suite only ever grows. Crucially, validation spans the whole stack, not just the model: a perception miss that the contact bumper catches is still a pass for the system, because the system, not the network, is what must be safe. This scenario-based discipline is the validation philosophy the ISO 18497 series codifies for autonomous machines.
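As a rough sketch of what such a catalogue might look like in code, the scenario axes can be crossed into an explicit list, and anything without a recorded pass can be reported as unproven. The field names, axis values, and helper functions are hypothetical; the source does not describe NLR's actual test tooling.

```python
from itertools import product


def build_catalogue(subjects, speeds_mps, approach_angles_deg, lighting):
    """Cross every scenario axis so no combination is skipped by accident."""
    return [
        {"subject": s, "speed_mps": v, "angle_deg": angle, "lighting": light}
        for s, v, angle, light in product(
            subjects, speeds_mps, approach_angles_deg, lighting
        )
    ]


def unproven(catalogue, passed_ids):
    """Indices of scenarios with no recorded pass; untested counts as unproven."""
    return [i for i in range(len(catalogue)) if i not in passed_ids]


# Illustrative axes only.
catalogue = build_catalogue(
    subjects=["crawling child mannequin", "dark animal in tall grass"],
    speeds_mps=[0.3, 0.5],
    approach_angles_deg=[0, 45, 90],
    lighting=["low sun", "rain", "dappled shade"],
)
remaining = unproven(catalogue, passed_ids={0, 1})
```

Treating an untested scenario as unproven, instead of assuming it passes, mirrors the fail-safe default used everywhere else in the stack.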

10 How this transfers to drones

The deepest transfer is not a sensor or an algorithm — it is the stance. Build the system so that being wrong is survivable: assume the smart component will fail, surround it with diverse mechanisms that fail for different reasons, make the default action safe, and rate the whole thing against a standard so the safety case is auditable rather than aspirational. That stance is identical whether the moving hazard is a mower blade near a toddler or a quadrotor near a bystander.

NLR treats safety as the part of autonomy that earns the right to deploy everything else. A robot that can localize, map, and plan beautifully but cannot be trusted around a child has not earned its place in a backyard — and the same robot, flying, has not earned its place over a crowd. Get this layer right, on the ground, with a blade and a sleeping dog, and you have built the discipline that lets the rest of the fleet — wheeled or winged — operate where people are.

Sources & further reading

1. Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. "You Only Look Once: Unified, Real-Time Object Detection." IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. doi:10.1109/CVPR.2016.91
2. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., & Berg, A.C. "SSD: Single Shot MultiBox Detector." European Conference on Computer Vision (ECCV), 2016. doi:10.1007/978-3-319-46448-0_2
3. He, K., Gkioxari, G., Dollár, P., & Girshick, R. "Mask R-CNN." IEEE International Conference on Computer Vision (ICCV), 2017. doi:10.1109/ICCV.2017.322
4. ISO 13849-1:2023, Safety of machinery — Safety-related parts of control systems — Part 1: General principles for design. International Organization for Standardization. Defines Performance Levels (PLa–PLe) and architectural categories for safety functions.
5. ISO 18497-2:2024, Agricultural machinery and tractors — Safety of partially automated, semi-autonomous and autonomous machinery — Part 2: Design principles for obstacle protection systems. International Organization for Standardization. (Supersedes ISO 18497:2018, "Safety of highly automated agricultural machines.")
6. JARUS, SORA — Specific Operations Risk Assessment (Joint Authorities for Rulemaking of Unmanned Systems). The risk-based framework for approving drone operations, including ground/air risk classes and detect-and-avoid mitigations. See also RTCA DO-365 on detect-and-avoid system performance.
