A Comparative Study of Proximal Policy Optimization (PPO) and Direct Policy Optimization (DPO) on a Toy Environment

Saptarshi Mukherjee
Rohit Parashar
Aniket Joshi

Abstract

Proximal Policy Optimization (PPO) and Direct Policy Optimization (DPO) are prominent reinforcement learning algorithms designed to improve the efficiency and stability of policy learning. This paper presents a comparative study of PPO and DPO on a simple multi-armed bandit toy environment. We implement both algorithms with comparable hyperparameters and evaluate them over multiple random seeds, measuring cumulative reward, convergence speed, and learning stability. Results indicate that DPO achieves higher average rewards and converges faster than PPO in this setting. The analysis clarifies the operational differences between the two algorithms, providing a foundation for future reinforcement learning research and applications.
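
To make the setup concrete, the sketch below implements a minimal version of the comparison described in the abstract, in Python with NumPy. It is an illustrative assumption rather than the authors' code: the arm means (ARM_MEANS), the hyperparameters (LR, CLIP_EPS, BETA), and the helpers ppo_step and dpo_step are hypothetical names chosen for this sketch. ppo_step performs one clipped policy-gradient update in the style of Schulman et al., and dpo_step performs one preference-based update in the style of Rafailov et al., both on a softmax policy over bandit arms.

import numpy as np

# Hypothetical toy setup, not the paper's exact code: a 5-armed Gaussian
# bandit with a softmax policy over arms. All names and hyperparameters
# below are assumptions made for this sketch.
rng = np.random.default_rng(0)
ARM_MEANS = np.array([0.1, 0.3, 0.5, 0.7, 0.9])  # expected reward per arm
N_ARMS = len(ARM_MEANS)
LR, CLIP_EPS, BETA = 0.1, 0.2, 1.0

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def ppo_step(logits, old_logits, baseline):
    # One PPO-style step: act with the frozen "old" policy, then ascend the
    # clipped surrogate objective; the gradient is zero where clipping binds.
    old_probs = softmax(old_logits)
    arm = rng.choice(N_ARMS, p=old_probs)
    reward = rng.normal(ARM_MEANS[arm], 0.1)
    adv = reward - baseline                        # crude advantage estimate
    probs = softmax(logits)
    ratio = probs[arm] / old_probs[arm]
    if (adv >= 0 and ratio < 1 + CLIP_EPS) or (adv < 0 and ratio > 1 - CLIP_EPS):
        logits = logits + LR * ratio * adv * (np.eye(N_ARMS)[arm] - probs)
    return logits, reward

def dpo_step(logits, ref_logits):
    # One DPO-style step: pull two distinct arms, treat the higher-reward
    # pull as preferred, and descend the preference loss
    # -log sigmoid(beta * (log-ratio of winner - log-ratio of loser)).
    a, b = rng.choice(N_ARMS, size=2, replace=False)
    r_a = rng.normal(ARM_MEANS[a], 0.1)
    r_b = rng.normal(ARM_MEANS[b], 0.1)
    win, lose = (a, b) if r_a >= r_b else (b, a)
    logp = np.log(softmax(logits))
    logp_ref = np.log(softmax(ref_logits))
    margin = BETA * ((logp[win] - logp_ref[win]) - (logp[lose] - logp_ref[lose]))
    weight = 1.0 / (1.0 + np.exp(margin))          # sigmoid(-margin)
    grad = weight * BETA * (np.eye(N_ARMS)[win] - np.eye(N_ARMS)[lose])
    return logits + LR * grad, max(r_a, r_b)

ppo_logits, dpo_logits = np.zeros(N_ARMS), np.zeros(N_ARMS)
ref_logits = np.zeros(N_ARMS)      # frozen uniform reference policy for DPO
baseline = 0.0
for _ in range(2000):
    ppo_logits, r = ppo_step(ppo_logits, ppo_logits.copy(), baseline)
    baseline += 0.05 * (r - baseline)              # running reward baseline
    dpo_logits, _ = dpo_step(dpo_logits, ref_logits)
print("PPO policy:", np.round(softmax(ppo_logits), 3))
print("DPO policy:", np.round(softmax(dpo_logits), 3))

Run over multiple seeds, both updates should concentrate probability on the highest-mean arm; the cumulative reward and the speed of that concentration are the quantities the study compares.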




Article Details

How to Cite
Mukherjee, S., Parashar, R., & Joshi, A. (2025). A Comparative Study of Proximal Policy Optimization (PPO) and Direct Policy Optimization (DPO) on a Toy Environment. Special Interest Group on Artificial Intelligence Research, 1(1). Retrieved from https://sigair.org/index.php/journal/article/view/15

References

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741, 2023.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.