Automated Spike Removal in Sonar Datasets

Acoustic spikes in multibeam (MBES) and single-beam (SBES) sonar returns — isolated depth excursions, multipath reflections, and transducer ringing — corrupt the seafloor surface and propagate systematic error into slope, rugosity, dredge-volume, and hydrodynamic products. Manual culling is non-reproducible and does not scale to agency-grade survey volumes. This page sits inside Removing Bathymetric Artifacts and Noise and addresses one narrow task: how to automatically detect and cull vertical outliers in a raw sonar point cloud, deterministically and within a bounded memory envelope, without clipping legitimate benthic features such as pinnacles, wrecks, and scour marks. The pipeline below is a per-point preprocessing stage that runs before gridding, complementing the raster-level cleaning described in the parent section.

Why Sonar Spikes Defeat Naive Filters

A global standard-deviation filter fails on bathymetry for two structural reasons. First, depth is non-stationary: a single survey spans flat mud plains and steep reef walls, so a global mean and standard deviation are meaningless and a fixed mean ± 3σ band clips real slope returns while passing local spikes. Second, the standard deviation is itself inflated by the very spikes you are trying to remove — a handful of multipath returns at 400 m depth in a 12 m survey will drag σ upward and mask everything else.

The robust alternative is the Median Absolute Deviation (MAD), computed locally. The MAD has a breakdown point of 50% (half the data can be contaminated before the estimate fails), versus 0% for the standard deviation. Scaling by the constant 1.4826 makes the MAD a consistent estimator of σ for normally distributed data, so a familiar k-sigma threshold still applies. A naive global filter reproduces the failure directly:

import numpy as np

# Synthetic transect: gentle slope + one multipath spike at index 50
z = np.linspace(-12.0, -18.0, 100)
z[50] = -210.0  # multipath return

# Naive global sigma filter
mu, sigma = z.mean(), z.std()
flagged = np.abs(z - mu) > 3 * sigma
print("Global 3-sigma flags:", int(flagged.sum()))  # -> 1, but sigma is poisoned
print("Inflated sigma (m):", round(float(sigma), 2))  # -> ~19.5 m on a 6 m slope

The spike inflates σ to ~19.5 m on terrain that varies by only 6 m, so the 3σ band reaches about 58 m either side of the mean and a second, smaller spike (a 30 m excursion, say) would pass undetected. Vertical reference errors produce the same symptom: if two survey lines are reduced to different vertical datums, the step between them looks exactly like a run of spikes and triggers false positives, which is why the tidal datum transformation must be applied before any filtering.
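For comparison, here is a minimal sketch of the same test with a median and scaled-MAD band, using NumPy only. The second spike at index 20 is an assumption added here to show the masking effect; it is not part of the transect above.

```python
import numpy as np

# Same synthetic transect as above, plus a hypothetical second, smaller spike.
z = np.linspace(-12.0, -18.0, 100)
z[50] = -210.0  # multipath return
z[20] = -45.0   # assumed smaller spike, added for illustration only

# Global mean / sigma: the large spike inflates sigma and masks the small one.
sigma_flags = np.abs(z - z.mean()) > 3 * z.std()

# Median / scaled MAD: both statistics are barely moved by two bad samples.
med = np.median(z)
mad = 1.4826 * np.median(np.abs(z - med))
mad_flags = np.abs(z - med) > 3.5 * mad

print("3-sigma flags:", np.flatnonzero(sigma_flags).tolist())
print("MAD flags:    ", np.flatnonzero(mad_flags).tolist())
```

Only the MAD band catches both excursions; the global band catches the large one and lets the smaller one through. The transect is short enough that a single global MAD works here, but real surveys still need the per-cell version described below.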

Step-by-Step Fix With Production Code

Step 1 — Normalize horizontal and vertical references at ingestion

Sonar point clouds arrive as ASCII XYZ, LAS/LAZ, BAG, or NetCDF, with horizontal projections (UTM zones, state plane) and vertical datums (MLLW, MSL, NAVD88) that must be declared explicitly. Apply the vertical offset at ingestion, before any statistical operation, so a datum step never masquerades as a spike. The horizontal EPSG code is declared explicitly and carried forward as provenance, so positioning can be audited against the IHO S-44 hydrographic survey standards.

import dask.dataframe as dd
import numpy as np
import pandas as pd
from scipy.stats import median_abs_deviation
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(levelname)s | %(message)s",
)
log = logging.getLogger("spike_removal")


def apply_vertical_datum(df: pd.DataFrame, v_offset: float) -> pd.DataFrame:
    """Apply an explicit vertical datum correction to the depth column.

    v_offset is additive in metres (e.g. -0.85 to shift soundings to MLLW).
    """
    out = df.copy()
    out["z"] = out["z"] + v_offset
    return out
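To see why the offset has to come first, the sketch below mixes soundings from two lines that sit on different vertical references and runs a plain MAD test before and after the correction. The seabed values, the 0.85 m step, and the helper name `mad_flags` are all illustrative assumptions, not part of the pipeline.

```python
import numpy as np


def mad_flags(z: np.ndarray, k: float = 3.5) -> np.ndarray:
    """Flag samples further than k scaled MADs from the median."""
    med = np.median(z)
    mad = 1.4826 * np.median(np.abs(z - med))
    return np.abs(z - med) > k * mad


# Hypothetical flat seabed near -15 m with a few centimetres of deterministic noise.
i = np.arange(100)
seabed = -15.0 + 0.03 * np.sin(i)

# Assume every fifth sounding comes from a line still referenced 0.85 m
# above the target datum (the offset value is illustrative).
line_b = (i % 5 == 0)
mixed = seabed + 0.85 * line_b

# Filtering before the correction treats the whole of line B as spikes.
flagged_before = int(mad_flags(mixed).sum())

# Applying the per-line offset first removes the step.
corrected = mixed.copy()
corrected[line_b] += -0.85
flagged_after = int(mad_flags(corrected).sum())

print("flagged before correction:", flagged_before)
print("flagged after correction: ", flagged_after)
```

Before the correction every sounding from the uncorrected line is rejected as a spike; after it, none are. The filter cannot tell a datum step from noise, so the distinction has to be made upstream.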

Step 2 — Bin into spatial cells and compute robust statistics

Index every point into a metric grid cell, then compute the per-cell median and scaled MAD. Localizing the statistics is what lets the filter follow non-stationary terrain instead of fighting it.

def compute_local_mad_filter(
    partition: pd.DataFrame,
    cell_size: float = 5.0,
    k_threshold: float = 3.5,
) -> pd.DataFrame:
    """Grid-localized MAD filter for spike detection.

    Bins points into `cell_size`-metre cells, computes robust local statistics
    (median and scaled MAD), and removes points whose depth deviation exceeds
    `k_threshold x scaled_MAD` from the cell median. The MAD's normal scaling
    (1.4826) makes k_threshold behave like a k-sigma cut for Gaussian data.
    """
    if partition.empty:
        return partition

    partition = partition.copy()
    partition["gx"] = np.floor(partition["x"] / cell_size).astype(np.int32)
    partition["gy"] = np.floor(partition["y"] / cell_size).astype(np.int32)

    # Per-cell robust statistics; scale="normal" applies the 1.4826 factor.
    stats = (
        partition.groupby(["gx", "gy"])["z"]
        .agg(
            local_median="median",
            local_mad=lambda x: float(median_abs_deviation(x, scale="normal")),
        )
        .reset_index()
    )
    partition = partition.merge(stats, on=["gx", "gy"], how="left")

The function continues in Step 3 with the same partition frame; the two blocks form one compute_local_mad_filter body.
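One detail of the binning deserves a note: the cell index uses np.floor rather than an integer cast. A small sketch, assuming a local grid whose coordinates can be negative (the sample values are illustrative), shows the difference.

```python
import numpy as np

cell_size = 5.0
# Two points 1 m apart that straddle the x = 0 grid line.
x = np.array([-0.5, 0.5])

floor_cells = np.floor(x / cell_size).astype(np.int32)  # as in the function above
trunc_cells = (x / cell_size).astype(np.int32)          # truncation toward zero

print("floor:", floor_cells.tolist())
print("trunc:", trunc_cells.tolist())
```

Truncation merges the two cells on either side of zero into one double-width cell, which silently changes the neighbourhood the statistics are computed over. Flooring keeps every cell exactly `cell_size` wide regardless of sign.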

Step 3 — Flag and cull vertical outliers

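A zero MAD (a cell where at least half the soundings share one depth, common with centimetre-quantized exports) would give a zero-width band and cull every point that differs from the median, so clamp the scaled MAD to a small floor before thresholding. Then apply the MAD fence (the Hampel identifier; Tukey's fences use the interquartile range instead) and drop the flagged rows. Because the statistics are per cell, a genuine pinnacle that fills most of a cell shifts that cell's median and survives, while an isolated multipath return is rejected.

    # Cells where most soundings share one depth have MAD == 0. Clamp to a floor
    # so the band never collapses to zero width. The 0.05 m value is an assumed
    # sounding-noise floor, not a standard; tune it to the sensor.
    mad_floor_m = 0.05
    partition["local_mad"] = partition["local_mad"].clip(lower=mad_floor_m)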

    deviation = (partition["z"] - partition["local_median"]).abs()
    threshold = k_threshold * partition["local_mad"]
    partition["is_spike"] = deviation > threshold

    kept = partition.loc[~partition["is_spike"]].drop(
        columns=["gx", "gy", "local_median", "local_mad", "is_spike"]
    )
    return kept
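The pinnacle-versus-spike behaviour can be checked on a toy dataset. The sketch below is a compact variant that returns a boolean mask instead of dropping rows, which is convenient for inspection; `flag_spikes` and `scaled_mad` are hypothetical helper names and the two cells of data are invented for the illustration.

```python
import numpy as np
import pandas as pd


def scaled_mad(z: pd.Series) -> float:
    """Median absolute deviation scaled by 1.4826 (normal consistency)."""
    return 1.4826 * float(np.median(np.abs(z - z.median())))


def flag_spikes(df: pd.DataFrame, cell_size: float = 5.0,
                k_threshold: float = 3.5) -> pd.Series:
    """Return a boolean mask (True = spike) without dropping any rows."""
    gx = np.floor(df["x"] / cell_size).astype(np.int64)
    gy = np.floor(df["y"] / cell_size).astype(np.int64)
    grouped = df["z"].groupby([gx, gy])
    median = grouped.transform("median")
    mad = grouped.transform(scaled_mad)
    return (df["z"] - median).abs() > k_threshold * mad


# Cell (0, 0): flat seabed near -15 m plus one isolated -60 m return.
flat = pd.DataFrame({
    "x": np.linspace(0.2, 4.7, 10),
    "y": 1.0,
    "z": [-15.00, -14.99, -14.98, -14.97, -60.0,
          -14.96, -14.95, -14.94, -14.93, -14.92],
})
# Cell (1, 0): a pinnacle that fills the whole cell, about 6 m shallower.
pinnacle = pd.DataFrame({
    "x": np.linspace(5.2, 9.7, 10),
    "y": 1.0,
    "z": -9.0 + 0.01 * np.arange(10),
})
points = pd.concat([flat, pinnacle], ignore_index=True)
points["is_spike"] = flag_spikes(points)
print(points.loc[points["is_spike"], ["x", "z"]])
```

Only the -60 m return is flagged. The pinnacle cell is judged against its own median, so a 6 m rise relative to the neighbouring cell does not count as a deviation.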

Step 4 — Stream the filter out-of-core with overlap buffers

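Cell-localized statistics suffer edge effects when partitions are processed independently: dd.read_csv splits the file by byte range, not by location, so one cell's points can land in different partitions and be judged against partial statistics. Each partition therefore needs a spatial overlap buffer, or the input must be ordered so that every cell falls inside a single partition. The listing below is the streaming skeleton only: it applies the filter through Dask's map_partitions and does not build that buffer. The pipeline never loads the full cloud into RAM; it reads the source in blocks and writes Zstandard-compressed Parquet through PyArrow. Partition scheduling and memory tuning follow the Dask DataFrame documentation.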

def run_spike_removal_pipeline(
    input_path: str,
    output_path: str,
    epsg_horiz: int,
    v_offset: float = 0.0,
    cell_size: float = 5.0,
    k_threshold: float = 3.5,
    chunksize_mb: int = 256,
) -> None:
    """Orchestrate out-of-core spike removal with explicit CRS and Z alignment."""
    log.info("Initializing pipeline | input=%s | EPSG=%d", input_path, epsg_horiz)

    schema = {"x": "float64", "y": "float64", "z": "float64"}
    ddf = dd.read_csv(
        input_path,
        header=None,
        names=list(schema.keys()),
        dtype=schema,
        blocksize=f"{chunksize_mb}MB",
    )

    ddf = ddf.map_partitions(apply_vertical_datum, v_offset=v_offset)
    ddf_clean = ddf.map_partitions(
        compute_local_mad_filter,
        cell_size=cell_size,
        k_threshold=k_threshold,
    )

    # Carry CRS provenance forward as constant columns for downstream consumers.
    ddf_clean = ddf_clean.assign(
        crs_horizontal=f"EPSG:{epsg_horiz}",
        vertical_correction_m=v_offset,
    )

    log.info("Executing out-of-core write to %s", output_path)
    ddf_clean.to_parquet(
        output_path,
        engine="pyarrow",
        compression="zstd",
        write_index=False,
    )
    log.info("Pipeline complete | cleaned dataset at %s", output_path)


if __name__ == "__main__":
    run_spike_removal_pipeline(
        input_path="data/raw_survey_2024.xyz",
        output_path="data/cleaned_survey_2024.parquet",
        epsg_horiz=32618,
        v_offset=-0.85,
        cell_size=5.0,
        k_threshold=3.5,
    )
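One way to keep each cell's points together is to give every point a coarse tile key whose edges coincide with cell edges, then partition on that key. The pandas sketch below shows only the key assignment; `assign_tile` and `cells_per_tile` are hypothetical names, and regrouping the Dask partitions by this key (for example by sorting or shuffling on it) is an assumption left to the reader.

```python
import numpy as np
import pandas as pd


def assign_tile(df: pd.DataFrame, cell_size: float = 5.0,
                cells_per_tile: int = 10) -> pd.DataFrame:
    """Add grid-cell and tile indices; tile edges coincide with cell edges."""
    out = df.copy()
    out["gx"] = np.floor(out["x"] / cell_size).astype(np.int64)
    out["gy"] = np.floor(out["y"] / cell_size).astype(np.int64)
    # Floor division keeps negative cell indices in the correct tile.
    out["tile_x"] = out["gx"] // cells_per_tile
    out["tile_y"] = out["gy"] // cells_per_tile
    return out


pts = pd.DataFrame({
    "x": [4.9, 5.1, 49.9, 50.1],
    "y": [0.5, 0.5, 0.5, 0.5],
    "z": [-12.0, -12.1, -12.3, -12.2],
})
tiled = assign_tile(pts)
print(tiled[["gx", "tile_x"]])

# Every grid cell maps to exactly one tile, so per-tile statistics are complete.
tiles_per_cell = tiled.groupby(["gx", "gy"])[["tile_x", "tile_y"]].nunique()
print(bool((tiles_per_cell == 1).all().all()))
```

Because a tile is a whole number of cells, a filter that bins into fixed cells needs no halo once points are grouped by tile. A true overlap buffer only becomes necessary for moving-window statistics that look across cell boundaries.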

Verification and Acceptance Test

A filter that removes too much is as dangerous as one that removes nothing. After the run, assert that the spike-removal rate stays under an upper gate (removal is typically well under 1% for a clean survey and a few percent for a noisy one) and emit a JSON audit manifest so the result is reproducible. For the statistical machinery behind the MAD, see the SciPy statistical functions documentation.

import json
import dask.dataframe as dd


def verify_spike_removal(
    raw_path: str,
    cleaned_path: str,
    max_removal_fraction: float = 0.05,
) -> dict:
    """Acceptance gate: confirm the removal rate is within bounds; raise if not."""
    raw_n = int(dd.read_csv(raw_path, header=None).shape[0].compute())
    clean_n = int(dd.read_parquet(cleaned_path).shape[0].compute())
    removed = raw_n - clean_n
    fraction = removed / raw_n if raw_n else 0.0

    if fraction > max_removal_fraction:
        raise ValueError(
            f"Removal rate {fraction:.3%} exceeds gate {max_removal_fraction:.1%} "
            f"({removed} of {raw_n} points). Inspect k_threshold / cell_size."
        )

    audit = {
        "input_points": raw_n,
        "output_points": clean_n,
        "removed_points": removed,
        "removal_fraction": round(fraction, 6),
    }
    with open("spike_removal_audit.json", "w", encoding="utf-8") as fh:
        json.dump(audit, fh, indent=2)
    return audit


if __name__ == "__main__":
    print(verify_spike_removal(
        "data/raw_survey_2024.xyz",
        "data/cleaned_survey_2024.parquet",
    ))

A passing run prints an audit dict and writes spike_removal_audit.json; a removal rate above the gate raises immediately so a misconfigured k_threshold or cell_size cannot silently destroy a survey.
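The gate above checks only the upper bound. If a lower bound is also wanted, so that a run which removes nothing from a survey known to be noisy is caught too, a two-sided check might look like the sketch below. `check_removal_band` and `min_fraction` are hypothetical names, and the default values are placeholders rather than recommendations.

```python
def check_removal_band(raw_n: int, clean_n: int,
                       min_fraction: float = 0.0,
                       max_fraction: float = 0.05) -> float:
    """Return the removal fraction, raising if it falls outside the band."""
    if raw_n <= 0:
        raise ValueError("Raw dataset is empty; nothing to verify.")
    if clean_n > raw_n:
        raise ValueError(f"Cleaned set ({clean_n}) is larger than raw ({raw_n}).")
    fraction = (raw_n - clean_n) / raw_n
    if not (min_fraction <= fraction <= max_fraction):
        raise ValueError(
            f"Removal rate {fraction:.3%} outside "
            f"[{min_fraction:.3%}, {max_fraction:.3%}]."
        )
    return fraction


# Example: 3,500 of 1,000,000 points removed, with a small assumed lower bound.
print(check_removal_band(1_000_000, 996_500, min_fraction=0.0001))
```

The extra `clean_n > raw_n` guard catches a cleaned file that was written from a different source than the one being compared, which would otherwise show up as a negative removal rate and pass the upper gate.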

Edge Cases and Gotchas

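Sparse deep-water margins. A cell holding one or two returns can never flag anything (with two points, each sits exactly one unscaled MAD from the median), so a spike there passes untested. A cell with only three returns has a MAD set by its two closest points, so the third can be flagged even when it is valid. Skip cells below a minimum point count (e.g. 4) rather than thresholding them, or fall back to a wider cell_size in deep water where ping density drops.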
Vertical datum applied twice, or not at all. If the source delivers ellipsoidal heights and you also subtract a tidal offset, every depth shifts by the geoid separation. The MAD filter and the removal-rate gate are both blind to a uniform shift, so this error passes silently; only a mixed-datum merge, where some lines are corrected and others are not, shows up as a wild removal rate. Confirm the input datum before setting v_offset; this is exactly the failure that a proper tidal datum transformation prevents.
Manufacturer XYZ quirks. Some exporters write positive-down depths, others positive-up elevations (negative depths); a few embed a header row, separate columns with whitespace instead of commas, or use a comma decimal separator. The dd.read_csv call assumes headerless, comma-delimited, positive-up metres: adjust header, names, sep, decimal, and the sign convention to match the source. Running point cloud filtering for multibeam sonar (e.g. PDAL-based cleaning) upstream normalizes these conventions before this stage runs.
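The minimum-count guard for sparse cells can be sketched as follows. It assumes the frame already carries the gx and gy cell indices from Step 2; `flag_with_min_count` is a hypothetical helper and the three soundings are invented to represent a steep, sparsely sampled deep-water cell.

```python
import numpy as np
import pandas as pd


def flag_with_min_count(df: pd.DataFrame, k_threshold: float = 3.5,
                        min_points: int = 4) -> pd.Series:
    """MAD flags per (gx, gy) cell; cells with < min_points are never flagged."""
    grouped = df.groupby(["gx", "gy"])["z"]
    median = grouped.transform("median")
    mad = grouped.transform(
        lambda z: 1.4826 * float(np.median(np.abs(z - z.median())))
    )
    count = grouped.transform("count")
    is_spike = (df["z"] - median).abs() > k_threshold * mad
    # Suppress flags in cells too sparse for a meaningful MAD.
    return is_spike & (count >= min_points)


# One sparse deep-water cell with three valid soundings on a slope.
sparse = pd.DataFrame({
    "gx": [0, 0, 0],
    "gy": [0, 0, 0],
    "z": [-100.0, -100.2, -104.0],
})
unguarded = flag_with_min_count(sparse, min_points=1).tolist()
guarded = flag_with_min_count(sparse, min_points=4).tolist()
print("guard disabled:", unguarded)
print("guard enabled: ", guarded)
```

With the guard disabled the deepest sounding is rejected simply because the other two happen to be close together; with the guard enabled the cell is passed through untouched. Points in skipped cells are unverified rather than verified clean, so it is worth counting them in the audit.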

Related

Removing Bathymetric Artifacts and Noise: parent section covering raster-level spike, nadir-gap, and striping suppression
Using PDAL for Bathymetric Point Cloud Cleaning: upstream statistical outlier rejection before gridding
Applying MLLW to Coastal Survey Data: getting the vertical datum right before spike detection
DEM Interpolation Techniques for Seafloor Mapping: downstream consumer of the de-spiked point cloud
