Robust Adversarial Training via Latent Perturbations

Sopida Chavalit
Nguyen Thi Lan
Aisha Ndlovu

Abstract

Adversarial training has become a cornerstone technique for enhancing the robustness of neural networks against input perturbations. In this paper, we introduce Robust Adversarial Training via Latent Perturbations (RAT-LP), which generates adversarial perturbations in the latent feature space rather than directly in the input space. We provide a theoretical robustness guarantee under Lipschitz continuity assumptions and empirically validate RAT-LP on a toy two-moons dataset. Our results demonstrate superior robustness to both input-space and latent-space adversarial attacks compared to standard training, highlighting the effectiveness of latent-space perturbations in capturing semantic representations for robust learning.
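The following is a minimal sketch, not the authors' released implementation, of latent-space adversarial training in PyTorch on scikit-learn's two-moons data. It assumes a simple encoder/classifier split and an FGSM-style sign-gradient perturbation applied to the latent features; the network sizes, perturbation budget, and training schedule are illustrative assumptions rather than values from the paper.

# Sketch of latent-space adversarial training (assumed setup, not the paper's code).
import torch
import torch.nn as nn
from sklearn.datasets import make_moons

# Toy two-moons dataset, as used in the paper's experiments.
X, y = make_moons(n_samples=1000, noise=0.1, random_state=0)
X = torch.tensor(X, dtype=torch.float32)
y = torch.tensor(y, dtype=torch.long)

# Split the model into an encoder (producing latent features) and a classifier head.
encoder = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 16), nn.ReLU())
classifier = nn.Linear(16, 2)
opt = torch.optim.Adam(list(encoder.parameters()) + list(classifier.parameters()), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
eps = 0.5  # latent perturbation budget (illustrative value)

for epoch in range(200):
    opt.zero_grad()
    z = encoder(X)

    # Craft an FGSM-style perturbation in the latent space: take the gradient of
    # the classification loss with respect to the latent features and step along its sign.
    z_det = z.detach().requires_grad_(True)
    adv_loss = loss_fn(classifier(z_det), y)
    grad = torch.autograd.grad(adv_loss, z_det)[0]
    z_pert = z + eps * grad.sign()

    # Train jointly on clean and latent-adversarial features.
    loss = loss_fn(classifier(z), y) + loss_fn(classifier(z_pert), y)
    loss.backward()
    opt.step()

Under these assumptions, robustness can then be probed at evaluation time by applying the same sign-gradient attack either to the raw inputs or to the latent features.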




Article Details

How to Cite
Chavalit, S., Lan, N. T., & Ndlovu, A. (2025). Robust Adversarial Training via Latent Perturbations. Special Interest Group on Artificial Intelligence Research, 2(1). Retrieved from https://sigair.org/index.php/journal/article/view/24
Section
Articles

References

Ricky TQ Chen, Xuechen Li, Roger B Grosse, and David K Duvenaud. Isolating sources of disentanglement in variational autoencoders. Advances in neural information processing systems, 31, 2018.

Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.

Tianle Gu, Kexin Huang, Zongqi Wang, Yixu Wang, Jie Li, Yuanqi Yao, Yang Yao, Yujiu Yang, Yan Teng, and Yingchun Wang. Probing the robustness of large language models safety to latent perturbations. arXiv preprint arXiv:2506.16078, 2025.

Matthias Hein and Maksym Andriushchenko. Formal guarantees on the robustness of a classifier against adversarial manipulation. Advances in neural information processing systems, 30, 2017.

Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In International conference on learning representations, 2017.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Bai Li, Changyou Chen, Wenlin Wang, and Lawrence Carin. Certified adversarial robustness with additive noise. Advances in neural information processing systems, 32, 2019.

Lei Ma, Fuyuan Zhang, Jiyuan Sun, Minhui Xue, Bo Li, Felix Juefei-Xu, Chao Xie, Li Li, Yang Liu, Jianjun Zhao, et al. Deepmutation: Mutation testing of deep learning systems. In 2018 IEEE 29th international symposium on software reliability engineering (ISSRE), pages 100–111. IEEE, 2018.

Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

Zhuang Qian, Shufei Zhang, Kaizhu Huang, Qiufeng Wang, Rui Zhang, and Xinping Yi. Improving model robustness with latent distribution locally and globally. arXiv preprint arXiv:2107.04401, 2021.

Aditi Raghunathan, Jacob Steinhardt, and Percy Liang. Certified defenses against adversarial examples. arXiv preprint arXiv:1801.09344, 2018.

Andrew Ross and Finale Doshi-Velez. Improving the adversarial robustness and interpretability of deep neural networks by regularizing their input gradients. In Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018.

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.

Florian Tramèr, Alexey Kurakin, Nicolas Papernot, Ian Goodfellow, Dan Boneh, and Patrick McDaniel. Ensemble adversarial training: Attacks and defenses. arXiv preprint arXiv:1705.07204, 2017.

Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric Xing, Laurent El Ghaoui, and Michael Jordan. Theoretically principled trade-off between robustness and accuracy. In International conference on machine learning, pages 7472–7482. PMLR, 2019.