Automated Backend Allocation for Multi-Model, On-Device AI Inference

DOI: 10.1145/3626793 ISSN: 2476-1249

Venkatraman Iyer, Sungho Lee, Semun Lee, Juitem Joonwoo Kim, Hyunjun Kim, Youngjae Shin

Computer Networks and Communications
Hardware and Architecture
Safety, Risk, Reliability and Quality
Computer Science (miscellaneous)

On-device Artificial Intelligence (AI) services such as face recognition, object tracking, and voice recognition are being deployed at a rapidly growing scale on embedded, memory-constrained hardware. These services typically delegate AI inference models to CPU and GPU computing backends for execution. While GPU delegation is common practice for achieving high-speed computation, it suffers from degraded throughput and completion times in multi-model scenarios, i.e., when several services execute concurrently. This paper introduces a solution that sustains performance in multi-model, on-device AI contexts by dynamically allocating a combination of CPU and GPU backends per model. The allocation is feedback-driven and guided by knowledge of model-specific, multi-objective Pareto fronts comprising inference latency and memory consumption. The primary contribution of this paper is a backend allocation algorithm that runs online per model and achieves a 25-100% improvement in throughput over static allocations as well as over load-balancing scheduler solutions targeting multi-model scenarios. Other noteworthy contributions include a novel Pareto front estimator that runs on-device and a software-based GPU profiler with a lightweight algorithm for detecting changing GPU workloads. Specifically, the Pareto front estimator outperforms the state-of-the-art algorithms NSGA-II and SPEA2 by 94% on Pareto coverage and by almost 2x on computational overhead.
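To make the notion of a model-specific Pareto front over inference latency and memory consumption concrete, the following is a minimal Python sketch. It is not the paper's estimator or allocation algorithm: it uses a brute-force dominance check, and the configuration names, the latency and memory figures, and the helper `pick_under_memory_budget` are hypothetical values and names introduced only for illustration.

```python
from typing import List, NamedTuple, Optional


class Config(NamedTuple):
    """One candidate backend allocation for a single model (hypothetical)."""
    name: str          # illustrative label for a CPU/GPU split
    latency_ms: float  # inference latency, lower is better
    memory_mb: float   # memory consumption, lower is better


def dominates(a: Config, b: Config) -> bool:
    """True if `a` is no worse than `b` on both objectives and better on one."""
    no_worse = a.latency_ms <= b.latency_ms and a.memory_mb <= b.memory_mb
    better = a.latency_ms < b.latency_ms or a.memory_mb < b.memory_mb
    return no_worse and better


def pareto_front(configs: List[Config]) -> List[Config]:
    """Keep only configurations that no other configuration dominates."""
    return [c for c in configs if not any(dominates(o, c) for o in configs)]


def pick_under_memory_budget(front: List[Config],
                             budget_mb: float) -> Optional[Config]:
    """Pick the lowest-latency front member that fits the memory budget."""
    feasible = [c for c in front if c.memory_mb <= budget_mb]
    return min(feasible, key=lambda c: c.latency_ms) if feasible else None


# Made-up measurements for one model; not taken from the paper.
CANDIDATES = [
    Config("cpu_only", 40.0, 120.0),
    Config("gpu_only", 12.0, 300.0),
    Config("cpu_gpu_split", 20.0, 180.0),
    Config("bad_split", 45.0, 310.0),  # worse on both objectives
]

FRONT = pareto_front(CANDIDATES)
CHOICE = pick_under_memory_budget(FRONT, budget_mb=200.0)

if __name__ == "__main__":
    print([c.name for c in FRONT])
    print(CHOICE.name if CHOICE else None)
```

With these made-up numbers, `bad_split` is dropped because every other configuration beats it on both objectives, and a 200 MB budget selects `cpu_gpu_split`. A feedback-driven allocator of the kind the abstract describes would presumably revisit such a choice as GPU load changes; that step is not modelled here.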
