A Reinforcement Learning Autopilot for Fixed-Wing UAVs with Windowed Violation Summaries and Bounded Reward Reweighting

DOI: 10.3390/drones10070489 ISSN: 2504-446X

Yan Kang, Tingwei Ji, Fangfang Xie, Chenglou Liu, Zihao Yuan

Gain-scheduled and cascaded proportional–integral–derivative (PID) autopilots are still common practical baselines for fixed-wing unmanned aerial vehicles (UAVs), but training one shared learned controller for heading, altitude, and true airspeed across several maneuvers remains difficult. We study this problem under a strict reach-then-hold benchmark in which all active channels must enter prescribed green bands and remain there for a terminal hold window. The proposed training recipe combines proximal policy optimization (PPO) with a tri-band maneuver-tracking reward and an outer bounded reward reweighting (BDR) step that updates the base reward weights from recent violation summaries under a Kullback–Leibler (KL) gate. In the JSBSim F-16 six-degree-of-freedom dynamics model, used here as a challenging surrogate benchmark for fixed-wing UAV autopilot learning, the learned controller transfers across a fixed five-lesson sequence, reaches strict success rates of 0.966 on turn and 0.921 on climb, and issues substantially smaller executed-command updates than the shared fixed-gain PID reference. Under the reported lesson sequence and step budget, fixed-weight PPO and a reweighting-only variant stall under the same envelopes, while speed remains the main bottleneck for both controllers. We further report exploratory long-horizon tracking, difficult-command stress checks, and an added command-filtered nonlinear dynamic-surface-control (CF-DSC) reference, all without retraining the learned policy. The CF-DSC results confirm that advanced non-reinforcement-learning (non-RL) controllers can be strong reference methods. Within this reported simulator setup, BDR should therefore be read as a practical and inspectable reward-scheduling heuristic for shared triad tracking rather than as a proof of superiority over all classical, nonlinear, or model-based controllers.
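The strict reach-then-hold criterion described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `strict_success`, the use of symmetric half-width bands, the interpretation of the terminal hold window as the last `hold_steps` samples of an episode, and all numeric values are assumptions made for illustration only.

```python
import numpy as np

def strict_success(errors, green_bands, hold_steps):
    """Return True if every channel stays inside its green band
    for the last `hold_steps` samples of the episode.

    errors      : (T, C) tracking errors per time step and channel
                  (e.g. heading, altitude, true airspeed); hypothetical layout.
    green_bands : (C,) half-widths of the green band per channel (assumed symmetric).
    hold_steps  : length of the terminal hold window in samples (assumed).
    """
    errors = np.asarray(errors, dtype=float)
    green_bands = np.asarray(green_bands, dtype=float)
    if hold_steps <= 0 or errors.shape[0] < hold_steps:
        return False  # episode too short to contain a full hold window
    # Boolean mask of shape (hold_steps, C): True where a channel is in band.
    in_band = np.abs(errors[-hold_steps:]) <= green_bands
    # Strict criterion: all channels, at every step of the window.
    return bool(in_band.all())

# Illustrative episode: 5 steps out of band, then 5 steps in band.
episode = np.array([[10.0, 50.0, 5.0]] * 5 + [[1.0, 5.0, 0.5]] * 5)
bands = np.array([2.0, 10.0, 1.0])  # made-up half-widths, not from the paper
held = strict_success(episode, bands, hold_steps=5)
```

Because the check takes a logical AND over all channels and all steps in the window, a single out-of-band sample on any one channel fails the episode, which is what makes the benchmark strict.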
