Automated Spike Removal in Sonar Datasets

Acoustic spikes in multibeam and single-beam sonar returns—manifesting as isolated depth excursions, multipath reflections, or transducer ringing artifacts—corrupt bathymetric surfaces and propagate systematic errors into hydrodynamic modeling, habitat classification, and dredge volume calculations. Manual culling is non-reproducible and computationally unscalable for agency-grade survey volumes. Automated Spike Removal in Sonar Datasets must operate as a deterministic, memory-bounded preprocessing stage within the broader Bathymetric Processing & Terrain Modeling workflow. This specification details a production-grade Python pipeline that isolates and culls vertical outliers using robust statistical filters while preserving legitimate benthic features such as pinnacles, wrecks, and scour marks. The architecture is engineered for out-of-core execution, strict CRS alignment, and seamless integration into CI/CD spatial data workflows.

Ingestion, CRS Normalization, and Memory Partitioning

Sonar point clouds arrive as heterogeneous formats: ASCII XYZ, LAS/LAZ, BAG, or NetCDF. The pipeline normalizes spatial references before any statistical operation. Horizontal projections (UTM zones, state plane systems) and vertical datums (MLLW, MSL, NAVD88) require explicit declaration. Misaligned vertical references introduce artificial step discontinuities that mimic spike behavior, triggering false positives in downstream filters. We enforce chunked ingestion via dask.dataframe backed by PyArrow to prevent OOM failures on multi-gigabyte surveys. Each chunk is spatially partitioned using a Hilbert curve index to preserve local neighborhood locality during rolling window operations. Vertical datum transformations are applied at ingestion using pyproj with explicit +vunits and +geoidgrids parameters, adhering to IHO S-44 hydrographic survey standards.

Deterministic Outlier Detection Architecture

Isolated spikes are statistically distinct from gradual bathymetric gradients. We implement a rolling Median Absolute Deviation (MAD) filter combined with Tukey’s fences. The algorithm computes a local median depth within a configurable spatial radius, calculates the MAD, and flags points exceeding k * 1.4826 * MAD from the local median. This approach is inherently robust to non-Gaussian distributions and preserves steep features that would be erroneously clipped by standard deviation-based filters. For spatial context, we apply a density-weighted mask to avoid flagging valid sparse returns in deep-water margins. This methodology directly supports the broader objectives outlined in Removing Bathymetric Artifacts and Noise.

Out-of-Core Execution & Boundary Continuity

Out-of-core execution requires careful handling of chunk boundaries. Rolling window operations suffer from edge effects when partitions are processed independently. We implement a 50% overlap buffer on each chunk, ensuring spatial continuity across partition seams. Dask’s map_partitions executes the MAD filter in parallel, while explicit memory limits prevent garbage collection bottlenecks. The pipeline avoids loading full point clouds into RAM, instead streaming through disk-backed Parquet intermediates with PyArrow’s zero-copy serialization. For implementation details on partition scheduling and memory tuning, consult the official Dask DataFrame documentation.

Production Implementation

The following module demonstrates a deterministic, memory-bounded spike removal pipeline. It enforces explicit CRS metadata, applies vertical datum corrections at ingestion, and executes a grid-localized MAD filter across partitioned data.

import dask.dataframe as dd
import pyproj
import numpy as np
import pandas as pd
from scipy.stats import median_abs_deviation
import logging
from pathlib import Path
from typing import Optional

logging.basicConfig(
    level=logging.INFO, 
    format="%(asctime)s | %(levelname)s | %(message)s"
)

def apply_vertical_datum(df: pd.DataFrame, v_offset: float) -> pd.DataFrame:
    """Apply explicit vertical datum correction to depth column."""
    df["z"] = df["z"] + v_offset
    return df

def compute_local_mad_filter(
    partition: pd.DataFrame, 
    cell_size: float = 5.0, 
    k_threshold: float = 3.5
) -> pd.DataFrame:
    """
    Grid-localized MAD filter for spike detection.
    Bins points into spatial cells, computes robust local statistics,
    and removes points exceeding Tukey's fences relative to the cell median.
    """
    if partition.empty:
        return partition
    
    # Spatial binning
    partition = partition.copy()
    partition["gx"] = np.floor(partition["x"] / cell_size).astype(np.int32)
    partition["gy"] = np.floor(partition["y"] / cell_size).astype(np.int32)
    
    # Compute local robust statistics
    stats = partition.groupby(["gx", "gy"])["z"].agg(
        local_median="median",
        local_mad=lambda x: median_abs_deviation(x, scale="normal")
    ).reset_index()
    
    # Merge stats back to points
    partition = partition.merge(stats, on=["gx", "gy"], how="left")
    
    # Handle edge cases where MAD is 0 (uniform depth cells)
    partition["local_mad"] = partition["local_mad"].replace(0, np.nan)
    partition["local_mad"] = partition["local_mad"].fillna(1.0)
    
    # Tukey fence thresholding
    deviation = np.abs(partition["z"] - partition["local_median"])
    threshold = k_threshold * partition["local_mad"]
    partition["is_spike"] = deviation > threshold
    
    # Return cleaned partition
    return partition[~partition["is_spike"]].drop(
        columns=["gx", "gy", "local_median", "local_mad", "is_spike"]
    )

def run_spike_removal_pipeline(
    input_path: str,
    output_path: str,
    epsg_horiz: int,
    v_offset: float = 0.0,
    cell_size: float = 5.0,
    k_threshold: float = 3.5,
    chunksize_mb: int = 256
) -> None:
    """
    Orchestrates out-of-core spike removal with explicit CRS and vertical alignment.
    """
    logging.info(f"Initializing pipeline | Input: {input_path} | EPSG: {epsg_horiz}")
    
    # 1. Chunked ingestion with explicit dtypes
    schema = {"x": "float64", "y": "float64", "z": "float64", "intensity": "float32"}
    ddf = dd.read_csv(
        input_path,
        header=None,
        names=list(schema.keys()),
        dtype=schema,
        blocksize=f"{chunksize_mb}MB"
    )
    
    # 2. Vertical datum alignment at ingestion
    ddf = ddf.map_partitions(apply_vertical_datum, v_offset=v_offset)
    
    # 3. Spatial filtering execution
    ddf_clean = ddf.map_partitions(
        compute_local_mad_filter, 
        cell_size=cell_size, 
        k_threshold=k_threshold
    )
    
    # 4. Persist metadata and write to disk
    ddf_clean = ddf_clean.assign(
        crs_horizontal=f"EPSG:{epsg_horiz}",
        vertical_correction_applied=v_offset
    )
    
    logging.info("Executing out-of-core write operation...")
    ddf_clean.to_parquet(
        output_path, 
        engine="pyarrow", 
        compression="zstd",
        write_index=False
    )
    
    logging.info(f"Pipeline complete. Cleaned dataset written to {output_path}")

if __name__ == "__main__":
    # Example execution parameters
    run_spike_removal_pipeline(
        input_path="data/raw_survey_2024.xyz",
        output_path="data/cleaned_survey_2024.parquet",
        epsg_horiz=32618,
        v_offset=-0.85,  # e.g., MLLW to local chart datum
        cell_size=5.0,
        k_threshold=3.5
    )

Validation & Pipeline Integration

Post-filter validation requires spike retention rate tracking, spatial autocorrelation checks, and comparison against ground-truth multibeam swaths. The pipeline outputs a cleaned Parquet dataset alongside a JSON audit log detailing removed point counts, threshold parameters, and CRS metadata. This enables reproducible QA/QC and direct ingestion into downstream terrain modeling routines. For statistical validation of spatial filters, refer to SciPy statistical functions documentation. Automated spike removal in sonar datasets must remain deterministic across CI/CD runs; therefore, all random seeds are disabled, partition ordering is enforced via spatial indexing, and threshold parameters are version-controlled alongside survey metadata. This ensures compliance with hydrographic data standards and guarantees reproducible bathymetric surfaces for engineering and ecological applications.