Regional Strategy Composition: A Hierarchical-Action Reinforcement Learning Framework for Dynamic Smart-Meter Association over 5G NR mMTC Networks
Muhammed Al-Ali, Esteban Inga, Juan Inga, Elias Yaacoub
Advanced Metering Infrastructure (AMI) over 5G New Radio (NR) massive machine-type communication (mMTC) networks requires efficient, adaptive communication mechanisms to deliver data reliably from large numbers of smart meters under dynamic traffic and channel conditions. In this work, we propose a framework in which each smart meter chooses, at runtime, whether to transmit directly to the base station (BS) or via a nearby Data Aggregation Point (DAP). The optimal choice changes over time and depends on DAP buffer occupancy, periodic congestion, channel quality, and packet deadline pressure. Formulating this as a per-meter binary decision yields a joint action space of size 2^N for N meters, which is intractable for reinforcement learning (RL). We reformulate the problem as regional strategy composition: the RL agent selects one parameterized association strategy for each DAP region from a small library of interpretable rules, and a deterministic mapping expands each regional choice into per-meter modes. This reduces the policy action space from 2^N to K^D, where D is the number of DAPs and K the number of strategies, while preserving meter-level control granularity. We evaluate Proximal Policy Optimization (PPO) and Deep Q-Network (DQN) controllers against eight meter-level baselines on a 5G NR-calibrated simulator with 1500 m, six DAPs, deadline-bounded delivery, stale channel-state information, and phase-offset congestion cycles. Across three traffic regimes and five random seeds, PPO improves packet delivery ratio (PDR) over the strongest heuristic by +0.63, +2.41, and +2.66 percentage points under baseline, high-load, and bursty-cycle conditions, respectively. All gains are statistically significant (paired t-test, p<0.001; Cohen’s d up to 5.12), and the advantage grows with traffic stress.
The results show that learned regional composition of classical heuristics outperforms any single fixed heuristic precisely when no individual rule is globally optimal.
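To make the regional strategy composition concrete, the following is a minimal Python sketch rather than the authors' implementation. The strategy library (four rules, so K = 4), the thresholds, the field names, and the flat-index decoding are all illustrative assumptions; only D = 6 DAPs and the idea of expanding one rule per region into per-meter modes come from the text.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

DIRECT, VIA_DAP = 0, 1  # per-meter transmission modes


@dataclass(frozen=True)
class Meter:
    meter_id: int
    region: int      # index of the DAP region the meter belongs to
    snr_db: float    # (possibly stale) channel estimate; hypothetical field


@dataclass(frozen=True)
class DapState:
    buffer_occupancy: float  # fraction of DAP buffer in use, 0..1


Strategy = Callable[[Meter, DapState], int]


# Hypothetical library of interpretable rules (K = 4 here).
def always_direct(m: Meter, d: DapState) -> int:
    return DIRECT


def always_dap(m: Meter, d: DapState) -> int:
    return VIA_DAP


def buffer_threshold(m: Meter, d: DapState) -> int:
    # Use the DAP only while its buffer has headroom (0.8 is an assumed value).
    return VIA_DAP if d.buffer_occupancy < 0.8 else DIRECT


def channel_threshold(m: Meter, d: DapState) -> int:
    # Go direct only when the link to the BS looks good (10 dB is assumed).
    return DIRECT if m.snr_db >= 10.0 else VIA_DAP


STRATEGIES: Sequence[Strategy] = (
    always_direct, always_dap, buffer_threshold, channel_threshold,
)


def expand(regional_action: Sequence[int],
           meters: Sequence[Meter],
           daps: Sequence[DapState]) -> Dict[int, int]:
    """Deterministically map one strategy index per region to per-meter modes."""
    return {
        m.meter_id: STRATEGIES[regional_action[m.region]](m, daps[m.region])
        for m in meters
    }


def decode_action(index: int, num_regions: int, num_strategies: int) -> List[int]:
    """Decode a flat index in [0, K^D) into D per-region strategy indices."""
    choice = []
    for _ in range(num_regions):
        index, strategy = divmod(index, num_strategies)
        choice.append(strategy)
    return choice


if __name__ == "__main__":
    D, K = 6, len(STRATEGIES)
    print("joint regional actions:", K ** D)  # 4^6 = 4096 under these assumptions
    meters = [Meter(0, 0, 15.0), Meter(1, 0, 5.0), Meter(2, 1, 15.0)]
    daps = [DapState(0.9), DapState(0.2)]
    print(expand([3, 2], meters, daps))
```

With these assumed values the agent faces 4^6 = 4096 joint actions regardless of how many meters are deployed, whereas the per-meter formulation doubles with every added meter. The flat-index decoder is one way to expose the joint choice as a single discrete action; a multi-discrete action head with one output per region would be an equally plausible design.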