A Comparative Study of Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO) on a Toy Environment
Abstract
Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO) are prominent reinforcement learning algorithms designed to improve the efficiency and stability of policy learning. This paper presents a comparative study of PPO and DPO applied to a simple multi-armed bandit toy environment. We implement both algorithms with comparable hyperparameters and evaluate their performance over multiple random seeds, measuring cumulative reward, convergence speed, and learning stability. Results indicate that DPO achieves higher average rewards and faster convergence than PPO in this setting. The analysis provides insight into the operational differences between the two algorithms, offering a foundation for future reinforcement learning research and applications.
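To make the described setup concrete, the sketch below shows what a multi-armed bandit toy environment, a PPO-style clipped update for a stateless softmax policy, and a bandit-adapted DPO loss could look like. This is not the authors' implementation; the Gaussian reward model, the softmax parameterization, the hyperparameters (clip range 0.2, learning rate 0.1, beta 0.1), and all function names are illustrative assumptions.

```python
# Minimal sketch (assumed, not the paper's code): K-armed Gaussian bandit,
# a PPO-style clipped-surrogate update for a stateless softmax policy, and a
# DPO loss over arm-level preference pairs (Rafailov et al., 2023).
import numpy as np

class GaussianBandit:
    """K-armed bandit with fixed Gaussian reward distributions (assumed reward model)."""
    def __init__(self, means, std=1.0, seed=0):
        self.means = np.asarray(means, dtype=float)
        self.std = std
        self.rng = np.random.default_rng(seed)

    def pull(self, arm):
        return self.rng.normal(self.means[arm], self.std)

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def ppo_update(theta, arms, rewards, old_probs, eps=0.2, lr=0.1):
    """One clipped-surrogate gradient-ascent step for a stateless softmax policy."""
    probs = softmax(theta)
    baseline = rewards.mean()                 # simple mean-reward baseline
    grad = np.zeros_like(theta)
    for a, r, p_old in zip(arms, rewards, old_probs):
        adv = r - baseline
        ratio = probs[a] / p_old
        # PPO's pessimistic min() selects the constant clipped term here,
        # so the gradient contribution is zero for this sample.
        if (ratio > 1 + eps and adv > 0) or (ratio < 1 - eps and adv < 0):
            continue
        # d pi(a) / d theta_k = pi(a) * (1[k == a] - pi(k))
        dpi = probs[a] * ((np.arange(len(theta)) == a) - probs)
        grad += (adv / p_old) * dpi
    return theta + lr * grad / len(arms)

def dpo_loss(theta, theta_ref, chosen, rejected, beta=0.1):
    """DPO loss on preference pairs, treating arms as responses and
    theta_ref as a frozen reference policy."""
    logp = np.log(softmax(theta))
    logp_ref = np.log(softmax(theta_ref))
    margin = beta * ((logp[chosen] - logp_ref[chosen])
                     - (logp[rejected] - logp_ref[rejected]))
    # -log sigmoid(margin), computed stably via logaddexp
    return np.logaddexp(0.0, -margin).mean()

if __name__ == "__main__":
    env = GaussianBandit(means=[0.1, 0.5, 0.9], seed=0)
    theta = np.zeros(3)
    rng = np.random.default_rng(1)
    for step in range(200):
        probs = softmax(theta)
        arms = rng.choice(3, size=32, p=probs)
        rewards = np.array([env.pull(a) for a in arms])
        theta = ppo_update(theta, arms, rewards, probs[arms])
    print("PPO policy:", np.round(softmax(theta), 3))  # should favour arm 2

    # Illustrative DPO loss on a few synthetic preference pairs (arm 2 preferred).
    chosen, rejected = np.array([2, 2, 2, 1]), np.array([0, 1, 0, 0])
    print("DPO loss:", round(dpo_loss(theta, np.zeros(3), chosen, rejected), 3))
```

In this sketch the PPO side optimizes sampled rewards directly, while the DPO side consumes pairwise preferences against a reference policy; a full DPO training loop would differentiate this loss with respect to theta, which is omitted here for brevity.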
Article Details

This work is licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). To view a copy of this license, visit https://creativecommons.org/licenses/by/4.0/
References
Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741, 2023.
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge, 1998.