Improved D3QN Intelligent Vehicle Path Planning Guided by the Dynamic Window Approach
Jiahui Na, Wensheng Wang

To address the prevalent issues of slow convergence, low exploration efficiency, and large value estimation bias in traditional Deep Q-Networks for intelligent vehicle path planning, this paper proposes an improved Dueling Double Deep Q-Network (D3QN) path-planning method guided by the Dynamic Window Approach (DWA) heuristic. The Dueling Double DQN architecture decouples state value and action advantage representations, while the dual estimator of Double DQN mitigates Q-value overestimation. A Prioritized Experience Replay (PER) mechanism samples transitions non-uniformly based on Temporal Difference error with importance sampling correction, improving the reuse of critical samples and training stability. DWA evaluation criteria are transformed into dense heuristic reward signals, enabling the agent to receive continuous multi-dimensional guidance during exploration without executing online trajectory optimization. The environment augments the sparse navigation objective with a Chebyshev goal-progress term motivated by potential-based reward shaping theory together with auxiliary DWA-style channels. The policy-invariance property of potential-based shaping is referenced only for the goal term added to the sparse task reward rather than for the full composite training return. A continuous Ackermann steering kinematic model with a pure-pursuit path-tracking controller is adopted for deployment to ensure executable trajectories under non-holonomic constraints. The proposed method (DWA-D3QN) is systematically evaluated against sparse-reward D3QN, PBRS-guided D3QN, DQN, DDQN, Dueling DQN, APF-DQN, PPO, SAC, TD3, A*, and classical DWA in a grid map environment with static and dynamic obstacles. Results are reported with statistical significance over multiple random seeds.
Under complex difficulty, DWA-D3QN achieves a success rate of 94.1 ± 3.4% with a collision rate of 5.9 ± 3.4% over 15 seeds, representing improvements of 64.1 and 8.4 percentage points over the sparse-reward and PBRS-guided D3QN baselines, respectively. Ablation experiments reveal the differentiated contributions of clearance, heading, and velocity shaping terms: clearance awareness provides the strongest single contribution, heading alignment reinforces directional guidance, and velocity regularization refines trajectory quality under the joint constraints of the former two. The full composite reward achieves the lowest variance among all evaluated DRL methods, confirming enhanced training stability. Comparisons with PPO, SAC, and TD3 confirm the statistically significant advantages of the proposed framework (PPO: p=0.0010, SAC: p=0.0007, TD3: p=0.0024). ROS/Gazebo validation with an Ackermann-steered vehicle achieves a success rate of 96.0% with a collision rate of 4.0% over 50 trials, further confirming the applicability of the learned policy in continuous-state environments with realistic vehicle kinematics.
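As an illustration of three ingredients named above, the following minimal Python sketch shows a Chebyshev goal-progress shaping term, a dueling Q-value aggregation, and a Double DQN target. It is not the authors' implementation. The potential is assumed to be the negative Chebyshev distance to the goal, the dueling head is assumed to use the common mean-subtracted aggregation, and all function names, the grid-cell state representation, and the discount value are hypothetical choices made for this example.

```python
from typing import List, Sequence, Tuple

Cell = Tuple[int, int]  # hypothetical grid-cell state (row, col)


def chebyshev(a: Cell, b: Cell) -> int:
    """Chebyshev (L-infinity) distance between two grid cells."""
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))


def goal_shaping(s: Cell, s_next: Cell, goal: Cell, gamma: float = 0.99) -> float:
    """Potential-based shaping term F = gamma * phi(s') - phi(s).

    Assumption: phi(s) = -chebyshev(s, goal), so moving closer to the
    goal produces a positive shaping reward.
    """
    phi = -chebyshev(s, goal)
    phi_next = -chebyshev(s_next, goal)
    return gamma * phi_next - phi


def dueling_q(value: float, advantages: Sequence[float]) -> List[float]:
    """Combine a state value and per-action advantages into Q-values.

    Assumption: mean-subtracted aggregation, Q = V + (A - mean(A)).
    """
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]


def double_dqn_target(
    reward: float,
    done: bool,
    gamma: float,
    q_online_next: Sequence[float],
    q_target_next: Sequence[float],
) -> float:
    """Double DQN target: the online network selects the next action,
    the target network evaluates it."""
    if done:
        return reward
    best = max(range(len(q_online_next)), key=lambda i: q_online_next[i])
    return reward + gamma * q_target_next[best]
```

For example, `goal_shaping((0, 0), (1, 1), (5, 5), gamma=1.0)` rewards a diagonal step toward the goal, and `double_dqn_target` evaluates the action chosen by the online estimates with the target estimates rather than taking the target network's own maximum, which is the mechanism that reduces overestimation.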