Adaptive Frame Sampling for Real-Time Video Object Detection and Multi-Object Tracking on Edge Devices
Department of Computer Science & System Engineering, GITAM School of Computer Science & Engineering, GITAM (Deemed to be) University, Hyderabad, Telangana, 502329, India
Abstract
Real-time multi-object tracking on CPU-only edge devices is constrained by the high per-frame inference cost of deep neural network detectors. We present the Adaptive Frame Sampling System (AFSS), a training-free, architecture-agnostic framework that dynamically allocates computation by selecting one of three actions per frame: full YOLOv8 inference (FULL), phase-correlation feature warping (WARP), or result reuse (SKIP). The selection is governed by a lightweight scene-complexity estimator (<0.5 ms). An 803-parameter PolicyMLP trained via behavioural cloning replaces hand-tuned thresholds. On CPU-only hardware, AFSS achieves a 5-7x speedup over the full-inference baseline while reducing GFLOPs by 84.3% and incurring only a 2.6% MOTA degradation. Crucially, AFSS requires no retraining of the backbone detector and outperforms uniform frame-skipping on every accuracy metric at equivalent compute budgets.
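The per-frame choice among FULL, WARP, and SKIP can be illustrated with a minimal sketch. It is not the AFSS implementation: the complexity measure (mean absolute frame difference), the threshold values, and the function names scene_complexity, choose_action, and phase_correlation_shift are all assumptions made for illustration. The hand-tuned thresholds stand in for the kind of rule the PolicyMLP is said to replace, and the phase-correlation routine estimates only a global integer translation between two grayscale frames, which is simpler than warping detector features.

```python
import numpy as np

FULL, WARP, SKIP = "FULL", "WARP", "SKIP"

def scene_complexity(prev, curr):
    """Hypothetical complexity proxy: mean absolute difference of two
    grayscale frames with values in [0, 1]."""
    return float(np.mean(np.abs(curr - prev)))

def choose_action(complexity, low=0.02, high=0.10):
    """Hand-tuned threshold rule (illustrative values, not from the paper)."""
    if complexity >= high:
        return FULL   # large change: run the detector
    if complexity >= low:
        return WARP   # moderate change: motion-compensate cached results
    return SKIP       # near-static scene: reuse cached results

def phase_correlation_shift(prev, curr):
    """Estimate the integer (dy, dx) translation that maps prev onto curr."""
    f_prev = np.fft.fft2(prev)
    f_curr = np.fft.fft2(curr)
    cross = f_curr * np.conj(f_prev)
    cross /= np.abs(cross) + 1e-12          # keep phase only
    corr = np.fft.ifft2(cross).real         # peak marks the shift
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    h, w = prev.shape
    if dy > h // 2:                         # wrap to signed offsets
        dy -= h
    if dx > w // 2:
        dx -= w
    return int(dy), int(dx)

# Usage: a synthetic frame pair related by a known circular shift.
rng = np.random.default_rng(0)
frame_a = rng.random((64, 64))
frame_b = np.roll(frame_a, shift=(3, -5), axis=(0, 1))
action = choose_action(scene_complexity(frame_a, frame_b))
shift = phase_correlation_shift(frame_a, frame_b)
```

Under this sketch, a WARP step would translate the cached boxes by the estimated shift instead of rerunning the detector. A learned policy would replace choose_action with a small classifier over the same kind of scene features.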
Keywords
Graphical Abstract

Novelty Statement
This study presents the Adaptive Frame Sampling System (AFSS), a training-free, architecture-agnostic framework for real-time video object detection and multi-object tracking on CPU-only edge devices. The framework achieves 5-7x CPU speedup with only 2.0 pp.

