Automated Backend Allocation for Multi-Model, On-Device AI Inference
Venkatraman Iyer, Sungho Lee, Semun Lee, Juitem Joonwoo Kim, Hyunjun Kim, Youngjae Shin- Computer Networks and Communications
- Hardware and Architecture
- Safety, Risk, Reliability and Quality
- Computer Science (miscellaneous)
On-Device Artificial Intelligence (AI) services such as face recognition, object tracking and voice recognition are rapidly scaling up deployments on embedded, memory-constrained hardware devices. These services typically delegate AI inference models for execution on CPU and GPU computing backends. While GPU delegation is a common practice to achieve high speed computation, the approach suffers from degraded throughput and completion times under multi-model scenarios, i.e. concurrently executing services. This paper introduces a solution to sustain performance in multi-model, on-device AI contexts by dynamically allocating a combination of CPU and GPU backends per model. The allocation is feedback-driven, and guided by a knowledge of model-specific, multi-objective pareto fronts comprising inference latency and memory consumption. Primary contribution of this paper is a backend allocation algorithm that runs online per model, and achieves 25-100% improvement in throughput over static allocations as well as load-balancing scheduler solutions targeting multi-model scenarios. Other noteworthy contributions include a novel pareto front estimator that runs on-device, and also a software-based GPU profiler with a lightweight algorithm to detect changing GPU workloads. Specifically, the pareto front estimator outperforms state of the art algorithms NSGA-II and SPEA2 by 94% on pareto coverage, and by almost 2x on computational overhead.