SCIENCE CHINA Information Sciences, Volume 62, Issue 5: 052204 (2019) https://doi.org/10.1007/s11432-018-9602-1

Policy iteration based Q-learning for linear nonzero-sum quadratic differential games

  • Received: Jul 4, 2018
  • Accepted: Sep 5, 2018
  • Published: Apr 2, 2019



This work was supported by the National Natural Science Foundation of China (Grant No. 61203078) and the Key Project of the Shenzhen Robotics Research Center, NSFC (Grant No. U1613225).


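The model-based offline PI scheme of Algorithm 1 below admits a compact numerical sketch. The following is a minimal illustration rather than the paper's implementation: it assumes the standard structure of such schemes — coupled Lyapunov policy evaluation under the current gains, followed by a Kleinman-type gain update $K_i \leftarrow R_{ii}^{-1}B_i^{\rm T}P_i$ (the update step is not shown in this excerpt) — and the function name `pi_nzs_lq`, the weight-matrix convention `R[i][j]`, and the stopping tolerance are illustrative choices:

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

def pi_nzs_lq(A, B, Q, R, K0, iters=30, tol=1e-12):
    """Model-based offline policy iteration for an N-player nonzero-sum LQ game.

    A    : state matrix;  B[i] : input matrix of player i
    Q[i] : state weight of player i;  R[i][j] : weight player i places on player j's input
    K0   : initial stabilizing feedback gains, one per player
    Returns the value matrices P_1, ..., P_N and the converged gains K_1, ..., K_N.
    """
    N = len(B)
    K = [np.asarray(k, dtype=float) for k in K0]
    P = [None] * N
    for _ in range(iters):
        # Closed-loop matrix under the current set of gains
        Ac = A - sum(B[j] @ K[j] for j in range(N))
        for i in range(N):
            # Policy evaluation: Ac' P_i + P_i Ac + Q_i + sum_j K_j' R_ij K_j = 0
            M = Q[i] + sum(K[j].T @ R[i][j] @ K[j] for j in range(N))
            P[i] = solve_continuous_lyapunov(Ac.T, -M)
        # Policy improvement (Kleinman-type update): K_i <- R_ii^{-1} B_i' P_i
        K_new = [np.linalg.solve(R[i][i], B[i].T @ P[i]) for i in range(N)]
        if max(np.linalg.norm(K_new[i] - K[i]) for i in range(N)) < tol:
            K = K_new
            break
        K = K_new
    return P, K
```

On a stabilizable system with a stabilizing initial gain set, each sweep solves one Lyapunov equation per player and then refreshes all gains simultaneously; at a fixed point the gains satisfy the coupled algebraic Riccati equations of the game.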


    Algorithm 1 Model-based offline PI algorithm

    Step 1: (Initialization) Start with an initial set of stabilizing feedback gains $K_1^1,\ldots,K_N^1$.

    Step 2: (Policy evaluation) For a given set of stabilizing feedback gains $K_1^l,~\ldots,K_N^l$, solve for the positive definite matrices $P_1^l,~\ldots,P_N^l$ using the following Lyapunov equations: