A Comparative Study of Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO) on a Toy Environment
Abstract
Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO) are prominent reinforcement learning algorithms designed to improve the efficiency and stability of policy learning. This paper presents a comparative study of PPO and DPO applied to a simple multi-armed bandit toy environment. We implement both algorithms with comparable hyperparameters and evaluate their performance over multiple random seeds, measuring cumulative reward, convergence speed, and learning stability. Results indicate that DPO achieves higher average rewards and faster convergence than PPO in this setting. The analysis provides insight into the operational differences between the two algorithms, offering a foundation for future reinforcement learning research and applications.
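To make the described setup concrete, the sketch below shows what a multi-armed bandit toy environment, a PPO-style clipped update for a stateless softmax policy, and a bandit-adapted DPO loss could look like. This is not the authors' implementation; the Gaussian reward model, the softmax parameterization, the hyperparameters (clip range 0.2, learning rate 0.1, beta 0.1), and all function names are illustrative assumptions.

```python
# Minimal sketch (assumed, not the paper's code): K-armed Gaussian bandit,
# a PPO-style clipped-surrogate update for a stateless softmax policy, and a
# DPO loss over arm-level preference pairs (Rafailov et al., 2023).
import numpy as np

class GaussianBandit:
    """K-armed bandit with fixed Gaussian reward distributions (assumed reward model)."""
    def __init__(self, means, std=1.0, seed=0):
        self.means = np.asarray(means, dtype=float)
        self.std = std
        self.rng = np.random.default_rng(seed)

    def pull(self, arm):
        return self.rng.normal(self.means[arm], self.std)

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def ppo_update(theta, arms, rewards, old_probs, eps=0.2, lr=0.1):
    """One clipped-surrogate gradient-ascent step for a stateless softmax policy."""
    probs = softmax(theta)
    baseline = rewards.mean()                 # simple mean-reward baseline
    grad = np.zeros_like(theta)
    for a, r, p_old in zip(arms, rewards, old_probs):
        adv = r - baseline
        ratio = probs[a] / p_old
        # PPO's pessimistic min() selects the constant clipped term here,
        # so the gradient contribution is zero for this sample.
        if (ratio > 1 + eps and adv > 0) or (ratio < 1 - eps and adv < 0):
            continue
        # d pi(a) / d theta_k = pi(a) * (1[k == a] - pi(k))
        dpi = probs[a] * ((np.arange(len(theta)) == a) - probs)
        grad += (adv / p_old) * dpi
    return theta + lr * grad / len(arms)

def dpo_loss(theta, theta_ref, chosen, rejected, beta=0.1):
    """DPO loss on preference pairs, treating arms as responses and
    theta_ref as a frozen reference policy."""
    logp = np.log(softmax(theta))
    logp_ref = np.log(softmax(theta_ref))
    margin = beta * ((logp[chosen] - logp_ref[chosen])
                     - (logp[rejected] - logp_ref[rejected]))
    # -log sigmoid(margin), computed stably via logaddexp
    return np.logaddexp(0.0, -margin).mean()

if __name__ == "__main__":
    env = GaussianBandit(means=[0.1, 0.5, 0.9], seed=0)
    theta = np.zeros(3)
    rng = np.random.default_rng(1)
    for step in range(200):
        probs = softmax(theta)
        arms = rng.choice(3, size=32, p=probs)
        rewards = np.array([env.pull(a) for a in arms])
        theta = ppo_update(theta, arms, rewards, probs[arms])
    print("PPO policy:", np.round(softmax(theta), 3))  # should favour arm 2

    # Illustrative DPO loss on a few synthetic preference pairs (arm 2 preferred).
    chosen, rejected = np.array([2, 2, 2, 1]), np.array([0, 1, 0, 0])
    print("DPO loss:", round(dpo_loss(theta, np.zeros(3), chosen, rejected), 3))
```

In this sketch the PPO side optimizes sampled rewards directly, while the DPO side consumes pairwise preferences against a reference policy; a full DPO training loop would differentiate this loss with respect to theta, which is omitted here for brevity.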
Article Details

This work is licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). To view a copy of this license, visit https://creativecommons.org/licenses/by/4.0/
References
Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741, 2023.
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge, 1998.