Papers
arxiv:2603.11273

Duration Aware Scheduling for ASR Serving Under Workload Drift

Published on Mar 11
· Submitted by
Niels Rogge
on Jun 19
Authors:
,
,

Abstract

Duration-aware scheduling policies improve ASR serving latency by leveraging audio length as a predictor for processing time, with SJF and HRRN algorithms showing significant median latency reductions while maintaining throughput.

Scheduling policies in large-scale Automatic Speech Recognition (ASR) serving pipelines play a key role in determining end-to-end (E2E) latency. Yet, widely used serving engines rely on first-come-first-served (FCFS) scheduling, which ignores variability in request duration and leads to head-of-line blocking under workload drift. We show that audio duration is an accurate proxy for job processing time in ASR models such as Whisper, and use this insight to enable duration-aware scheduling. We integrate two classical algorithms, Shortest Job First (SJF) and Highest Response Ratio Next (HRRN), into vLLM and evaluate them under realistic and drifted workloads. On LibriSpeech test-clean, compared to baseline, SJF reduces median E2E latency by up to 73% at high load, but increases 90th-percentile tail latency by up to 97% due to starvation of long requests. HRRN addresses this trade-off: it reduces median E2E latency by up to 28% while bounding tail-latency degradation to at most 24%. These gains persist under workload drift, with no throughput penalty and <0.1\,ms scheduling overhead per request.

Community

Paper submitter

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face checkout this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

Neat paper. It is interesting to see how much latency gets stuck behind head-of-line blocking in ASR pipelines just because we default to FCFS. Using audio duration as a proxy for job time seems like a clever, low-overhead way to handle workload drift without needing to overhaul the entire serving engine.

I am curious if you have thoughts on how HRRN would handle scenarios where the workload drift includes a sudden spike of very long audio files?

I made a podcast on it with ResearchPod, it makes it easy to get the key concepts on the go:
https://researchpod.app/episode/e7f89c08-dc26-4b25-a1f1-9028f3df1b62

Sign up or log in to comment

Get this paper in your agent:

hf papers read 2603.11273
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2603.11273 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2603.11273 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2603.11273 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.