Teaching a robot where the lawn ends

Older robotic mowers solved the boundary problem with a shovel. You buried a loop of wire around the lawn, ran a low-frequency current through it, and the mower listened for the magnetic field with a coil in its nose. Cross the wire and the field flips phase; the robot turns around. It is a beautiful, dumb, reliable solution — and it takes a weekend to install, dies when a landscaper's edger nicks the loop, and cannot be changed without trenching the yard again.

Wire-free mowing throws away the shovel. The robot now has to answer the wire's question — am I still on the lawn? — from sensors alone, dozens of times a second, in sun and shade and morning dew. This article works through how a ground robot learns its boundary from a camera: classical cues, learned semantic segmentation, the geometry that turns a labelled image into a metric map, and the planning contract that turns that map into a fence. As with everything in NLR's stack, the machinery is not specific to grass — it is the same perception pipeline that lets a drone delineate a field from the air.

01The wire, and the policy that replaces it

It helps to be precise about what a "boundary" even is. The wire encoded a hard geometric fact: a closed curve in the ground plane. But the thing we actually want to enforce is a policy — a decision rule that says, for every patch of ground the robot might drive onto, whether driving there is allowed. The lawn edge is only the most common reason a patch is forbidden. A flowerbed, a gravel path, a pond, the neighbour's grass, a sleeping dog: all are "not-lawn" for the purposes of the policy, and none of them is a wire.

So the wire-free problem decomposes into three questions, and it is worth keeping them separate:

Perception — what is the ground made of, in every direction I can see? This is a per-pixel classification problem, and it is where semantic segmentation lives.
Geometry — where, in metres on the ground plane, is the line between mowable and forbidden? This is a projection problem that maps image pixels to a bird's-eye coordinate frame.
Policy — given an uncertain, time-varying map of mowable ground, what fence does the planner treat as inviolable? This is where perception meets safety.

The wire collapsed all three into one buried fact. Doing without it means solving each explicitly — and, crucially, carrying uncertainty through all three, because a vision system that is 70% sure a pixel is lawn must not be treated the same as the wire's binary certainty.

Key idea

The boundary is not a line the robot detects — it is a policy the robot enforces. Perception produces a probability that each patch of ground is mowable; geometry places those patches in metric space; planning turns the confident-mowable region into a fence. Treat "find the edge" as one monolithic detector and you will chase phantom boundaries every time a shadow falls across the grass.

02Classical colour and texture cues

The oldest vision approach is to exploit the fact that grass is, well, green and fuzzy. Convert the image out of RGB into a colour space that separates chroma from brightness — HSV or CIE-Lab — and grass occupies a fairly compact region of hue and saturation that is reasonably stable as the sun moves. A simple classifier thresholds each pixel: is its colour inside the "grass" region? Add a texture descriptor — grass has high-frequency, oriented micro-structure that a patch of asphalt or mulch does not — and you can separate surfaces that happen to share a colour. Classic texture features (local binary patterns, Gabor filter responses, the entropy of a local neighbourhood) feed a per-pixel classifier such as an SVM or random forest.

This works astonishingly well in the demo video and falls apart in the field. The grass-colour region drifts with the white balance of the camera, with cloud cover, with the yellow cast of low evening sun, and with the season — spring grass and August drought-grass are different colours. Worse, the failure mode is silent: the threshold keeps producing confident answers, they are just wrong. Hand-tuned colour cues are a fine prior and a terrible policy. The lesson the field learned, here as in localization, is that hand-engineering the decision boundary in feature space does not generalize. What generalizes is learning the decision boundary from labelled examples — which is exactly what semantic segmentation does.

03Semantic segmentation networks

Semantic segmentation is the task of assigning every pixel in an image a class label — lawn, bed, path, obstacle, sky — rather than producing a single label for the whole image. The breakthrough was Long, Shelhamer and Darrell's Fully Convolutional Network (FCN),1 which observed that a classification CNN is already a dense feature extractor: replace its fully-connected layers with convolutions and the network maps an image of any size to a grid of per-pixel class scores. Because pooling shrinks that grid, FCN learns to upsample the coarse prediction back to full resolution with transposed convolutions, and fuses in features from earlier, finer layers so that edges land in the right place.

That encoder–decoder skeleton became the dominant design. The encoder is a downsampling stack that trades spatial resolution for semantic depth — it learns what is in the scene. The decoder upsamples back to pixel resolution and recovers where the boundaries are. Two refinements matter for our problem:

U-Net2 made the encoder–decoder symmetric and added dense skip connections: every decoder stage receives the matching-resolution encoder feature map, concatenated in, so fine boundary detail survives the round trip through the bottleneck. Originally built for biomedical images with very few training examples, U-Net is forgiving with small datasets — which is precisely the regime an outdoor robotics team operates in.
DeepLab3 attacked the resolution loss differently, with atrous (dilated) convolution. A dilated kernel inserts gaps between its taps, enlarging the receptive field without downsampling and without adding parameters. Stacking dilations, plus atrous spatial pyramid pooling to look at several scales at once, lets the network keep a denser internal feature map and reason about context — useful when a thin gravel path must be told apart from a wide one.

A dilated convolution is just a normal convolution that samples its input on a lattice spaced by a dilation rate $r$. In one dimension, with kernel weights $w[k]$ over taps $k$,

$$ y[i] \;=\; \sum_{k} x\big[i + r\,k\big]\,w[k]. $$

Setting $r=1$ recovers an ordinary convolution; $r=2$ doubles the field of view while touching the same number of weights. That is the whole trick that lets DeepLab see widely without going blind to detail. For a mower we do not need the heaviest model — a compact encoder–decoder with a MobileNet-class backbone runs in real time on an embedded GPU and is plenty for a five-class lawn ontology.

Fig. 1 — Encoder–decoder segmentation. The encoder downsamples to learn what is present; the decoder upsamples to recover where the boundaries lie. Skip connections splice fine encoder features into the decoder so edges land precisely.

04The labelling problem and class imbalance

A segmentation network is only as good as its labels, and pixel-accurate labels are expensive. Annotating one image means painting a class onto every pixel — minutes of careful work per frame, and a single lawn deployment generates millions of frames. Three realities shape how NLR builds its training set:

The long tail is what matters. The vast majority of pixels in any lawn image are obviously grass or obviously sky. The pixels that decide the robot's behaviour are the rare ones: the exact edge where lawn meets bed, the thin line of a path, a half-occluded toy. Random sampling buries these. We deliberately over-sample boundary regions and hard cases, and we mine failures from the field — frames where the live robot's prediction disagreed with a confident prior get flagged, labelled, and fed back in.
Class imbalance distorts training. If 90% of pixels are "lawn," a model can score 90% accuracy by predicting "lawn" everywhere and never learning the boundary at all. This is the central pathology of segmentation training, and it is why per-pixel accuracy is a near-useless headline metric (more on that below).
Synthetic and augmented data stretch the budget. Aggressive augmentation — photometric jitter to simulate sun and white-balance shifts, random shadows, simulated dew — multiplies the value of each hand-labelled frame. U-Net's original result was, famously, a demonstration that heavy augmentation lets a segmentation net learn from very few annotated images,2 a lesson we lean on hard.

05Losses: cross-entropy, IoU and Dice

The network outputs, at each pixel $i$, a vector of class scores (logits) $z_i$. The softmax turns those into a probability distribution over the $C$ classes,

$$ p_{i,c} \;=\; \frac{\exp\!\big(z_{i,c}\big)}{\displaystyle\sum_{k=1}^{C}\exp\!\big(z_{i,k}\big)}, $$

and the standard training objective is per-pixel cross-entropy against the one-hot ground-truth label $y_{i,c}$, averaged over the $N$ pixels:

$$ \mathcal{L}_{\mathrm{CE}} \;=\; -\,\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c}\,\log p_{i,c}. $$

Cross-entropy is smooth and easy to optimize, but it treats every pixel equally — and so it inherits the class-imbalance pathology directly. The fix is to optimize, or at least co-optimize, a metric that rewards overlap of the predicted and true regions rather than raw pixel counts. The natural one is Intersection-over-Union (the Jaccard index), and its close cousin the Dice coefficient. For a class with predicted soft mask $p$ and ground-truth mask $g$, a differentiable soft-Dice loss is

$$ \mathcal{L}_{\mathrm{Dice}} \;=\; 1 \;-\; \frac{2\sum_i p_i\,g_i + \varepsilon}{\sum_i p_i + \sum_i g_i + \varepsilon}, $$

where the small $\varepsilon$ keeps the expression well-behaved when a class is absent from the tile. Because Dice and IoU are ratios of overlap to union, a class that occupies 2% of the pixels contributes as much as one that occupies 80% — the boundary stops being drowned out. In practice we train with a weighted sum of cross-entropy (for stable gradients early) and a region-overlap loss (to fix imbalance), which is the recipe that consistently produces crisp, well-placed lawn edges.

Pseudocode — segmentation inference → BEV projection → boundary

# Per frame: image -> mowable probability -> metric ground map -> geofence
def boundary_from_frame(img, H, prior_map):
    # --- 1. SEMANTIC SEGMENTATION (image space) ---
    logits = segnet(img)                       # C x h x w class scores
    prob   = softmax(logits, axis=0)          # per-pixel class probs
    mow    = prob[LAWN]                         # P(pixel is mowable), in [0,1]

    # --- 2. INVERSE-PERSPECTIVE MAPPING (image -> ground plane) ---
    # H is the 3x3 ground-plane homography from calibration + pose
    bev = warp_homography(mow, H, out=GROUND_GRID)   # bird's-eye P(mowable)

    # --- 3. TEMPORAL FUSION (occupancy-style log-odds) ---
    prior_map += log_odds(bev) - log_odds(0.5)   # accumulate evidence
    p_mow      = sigmoid(prior_map)                  # fused P(mowable) per cell

    # --- 4. GEOFENCE (confident-mowable region, eroded for safety) ---
    mowable  = p_mow > TAU                            # e.g. TAU = 0.80
    mowable  = erode(mowable, margin=SAFETY_M)        # pull the fence inward
    fence    = extract_contour(largest_region(mowable))
    return fence, prior_map

Note step 3: a single frame is noisy, so we never act on one. Each frame's bird's-eye estimate is folded into a persistent ground-grid as evidence — the log-odds update is the same occupancy-grid bookkeeping a robot uses for obstacles — and the planner reads the fused, time-averaged probability. A shadow that fools one frame is outvoted by the dozen clean frames around it.

06From pixels to a metric ground map

A segmentation mask lives in image space — it tells you a pixel is lawn, but not where on the ground that pixel is. The planner needs metres, in a top-down frame. The bridge is a piece of projective geometry that holds whenever the scene of interest is (approximately) a plane: a homography.4

A pinhole camera maps a world point to a pixel through its intrinsics $K$, rotation $R$ and translation $t$. When all the points we care about lie on a single plane — the lawn, which is flat to a good approximation over the few metres in front of the robot — the general projection collapses into an invertible $3\times3$ map between the ground plane and the image plane. In homogeneous coordinates, a ground point $\tilde{X}=[X,\,Y,\,1]^\top$ and its pixel $\tilde{x}=[u,\,v,\,1]^\top$ are related by

$$ \tilde{x} \;\simeq\; H\,\tilde{X}, \qquad H = K\big[\,r_1 \;\; r_2 \;\; t\,\big], $$

where $r_1, r_2$ are the first two columns of $R$ (the third drops out because $Z=0$ on the plane) and $\simeq$ denotes equality up to scale. Because $H$ is invertible, we can run it backwards — inverse-perspective mapping (IPM) — warping the perspective mask into a metric bird's-eye view where every cell is a fixed patch of ground:

$$ \tilde{X} \;\simeq\; H^{-1}\,\tilde{x}. $$

We recover $H$ from a one-time camera calibration (which gives $K$) combined with the robot's live pitch and roll from the IMU and the camera's mounting height — the same pose estimate the localization stack already maintains. The output is a top-down occupancy/boundary map: each cell carries the probability that the corresponding square of real ground is mowable. The flat-world assumption is the catch — on a slope or over a hump the homography distorts range — so we cap IPM to the near field where the planar approximation holds and lean on stereo or learned depth where the ground is not flat.

Why bird's-eye, not image space

Planning, geofencing, and coverage are all metric, top-down problems — they care about metres on the ground, not pixels in a tilted camera. Projecting segmentation into a bird's-eye grid is what lets uncertainty accumulate over time in a stable frame, lets multiple cameras fuse into one map, and hands the planner a fence in the units it actually plans in.

07From segmented map to geofence

A probabilistic mowable-map is not yet a boundary the planner can obey — it is a field of probabilities. Turning it into a hard fence is a deliberate, conservative step:

Threshold on confidence, not on the mean. A cell joins the mowable region only when its fused probability clears a high bar (we use a threshold around $0.8$, not $0.5$). When in doubt, the ground is forbidden. This asymmetry is the entire safety posture: the cost of mowing a flowerbed vastly exceeds the cost of leaving a thin strip of grass uncut.
Erode for a safety margin. We shrink the confident region inward by a fixed margin — accounting for the robot's footprint, control error, and the blade's reach — so the geofence sits a hand's-width inside the true edge. The mower trims close on a later pass with tighter evidence, never by gambling on the first.
Extract a clean contour and persist it. The eroded region's outline becomes a polygon the planner treats as a hard constraint, identical in status to a buried wire. Because a mower revisits the same lawn constantly, this learned boundary is saved and reloaded: the robot refines a persistent geofence over many sessions rather than rediscovering the lawn from scratch each time, and a human can review and lock it.

The planner consumes the polygon exactly as it once consumed the wire signal — the rest of the autonomy stack does not know or care that the fence came from a neural network instead of a coil of copper. That clean interface is deliberate: it keeps the learned, fallible perception system firmly on one side of a contract, with a simple geometric guarantee on the other.

08Robustness: shadows, dew, low sun, seasons

The reason this is hard — the reason it is a research programme and not a weekend project — is that a lawn looks radically different hour to hour and month to month, while the boundary itself does not move. The adversaries are mostly photometric:

Shadows and dappled light. A tree's shadow draws a sharp, high-contrast edge across uniform grass — exactly the kind of edge a naive detector mistakes for a boundary. A network trained on shadowed and dappled scenes learns that the texture on both sides of a shadow is identical and the surface is unchanged; a colour threshold never can.
Dew and wetness. Morning dew changes grass's reflectance and adds specular glare; wet pavement darkens toward the colour of wet soil. We augment heavily with simulated wetness and collect dawn data deliberately.
Low sun and glare. A sun low on the horizon throws long shadows, washes the image with warm colour, and can flare the lens. High-dynamic-range capture and exposure-robust training help; so does the bird's-eye temporal fusion, which lets a few well-exposed frames carry a few blown-out ones.
Seasonal colour. Spring's saturated green, summer's drought-yellow, autumn's leaf litter: the colour of "lawn" is not constant, so we never let the model rely on colour alone. Training data spans seasons, and the texture and contextual cues a deep network learns are far more stable than hue.

None of these is solved by a cleverer threshold. They are solved by learning the invariances from data that contains them, and by never trusting a single frame — the two principles that run through the whole stack.

09Measuring whether it works

Per-pixel accuracy is a trap, for the reason given above: predict "lawn" everywhere on a mostly-lawn image and you score brilliantly while being useless. The field's standard region metric is mean Intersection-over-Union (mIoU) — the IoU of prediction and ground truth, averaged over classes so the rare ones count, popularized as the scoring rule of the PASCAL VOC segmentation benchmark.5 For a single class it is the overlap ratio

$$ \mathrm{IoU} \;=\; \frac{|\,P \cap G\,|}{|\,P \cup G\,|} \;=\; \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}, $$

with TP, FP, FN the per-pixel true positives, false positives and false negatives. But mIoU rewards getting the bulk of a region right and is nearly blind to whether the edge is crisp — and for a mower the edge is the entire product. So we also track a boundary F-score: extract the contour of prediction and ground truth, and compute precision and recall of boundary pixels that match within a small tolerance band, combined as their harmonic mean.6 A model can have excellent mIoU and a ragged, untrustworthy boundary; the boundary F-score is what catches it. As with localization, the only evaluation that counts is on real field logs — dawn dew, August glare, leaf-litter October — not on a clean afternoon test set.

10How this transfers to drones

Nothing in this pipeline is specific to a mower. "Segment the ground into permitted and forbidden regions, project the result into a metric top-down map, and hand the planner a geofence" is a general recipe for any robot that has to respect a spatial boundary it was not handed in advance.

↪ Transfers to drones & aerial robots

An agricultural drone faces the same problem from above: aerial semantic segmentation labels each pixel of a downward view as crop, soil, road, or water, and field-boundary delineation turns the crop region into a polygon for spraying or mapping — the precision-agriculture twin of finding the lawn edge. The geometry is even cleaner from the air: looking straight down at flat farmland, the ground-plane homography that warps a frame into a metric orthomap is near-exact, so IPM produces a faithful bird's-eye boundary directly. The same FCN/U-Net/DeepLab backbones, the same imbalance-aware Dice loss, the same mIoU and boundary-F evaluation carry over unchanged. And the polygon the drone produces is a geofence in the most literal sense — the no-fly / no-spray boundary its flight planner must respect. Learn to find the edge of a lawn and you have learned to find the edge of a field.

This is the throughline of NLR's perception research: the surface changes — grass, crop, snow — but "decide what the ground is, place it in metric space, and turn it into a fence" is one problem. NLR's answer is layered and conservative: a learned encoder–decoder for perception, classical homography for geometry, occupancy-style temporal fusion for stability, and an eroded, human-reviewable geofence for safety — every stage carrying its own uncertainty, so the planner always knows how much to trust the edge it is driving toward.

Sources & further reading

Long, J., Shelhamer, E., & Darrell, T. "Fully Convolutional Networks for Semantic Segmentation." IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3431–3440. The paper that established dense per-pixel prediction with encoder–decoder CNNs. arXiv:1411.4038
Ronneberger, O., Fischer, P., & Brox, T. "U-Net: Convolutional Networks for Biomedical Image Segmentation." MICCAI, 2015. Symmetric encoder–decoder with skip connections; the canonical small-data segmentation architecture. doi:10.1007/978-3-319-24574-4_28
Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. L. "DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs." IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4), 2018, pp. 834–848. doi:10.1109/TPAMI.2017.2699184
Hartley, R., & Zisserman, A. Multiple View Geometry in Computer Vision. 2nd ed., Cambridge University Press, 2004. The standard reference for planar homographies and the projective geometry behind inverse-perspective mapping (see the chapter on scene planes and homographies).
Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. "The PASCAL Visual Object Classes (VOC) Challenge." International Journal of Computer Vision, 88(2), 2010, pp. 303–338. Origin of the mean intersection-over-union segmentation metric in wide use today. doi:10.1007/s11263-009-0275-4
Csurka, G., Larlus, D., & Perronnin, F. "What is a Good Evaluation Measure for Semantic Segmentation?" British Machine Vision Conference (BMVC), 2013. Introduces the boundary-aware F-measure for contour quality, complementing region IoU.

← All research Next: covering every blade →