SCIENTIA SINICA Informationis, Volume 50, Issue 3: 375-395 (2020) https://doi.org/10.1360/SSI-2019-0184

Feasibility of reinforcement learning for UAV-based target searching in a simulated communication-denied environment

More info
  • Received: Aug 27, 2019
  • Accepted: Oct 4, 2019
  • Published: Feb 27, 2020


Funded by

Science and Technology Innovation 2030 "New Generation Artificial Intelligence" Major Project (2018AAA0102302), 2018



[1] Tomic T, Schmid K, Lutz P, et al. Toward a fully autonomous UAV: research platform for indoor and outdoor urban search and rescue. IEEE Robot Automat Mag, 2012, 19: 46--56

[2] Doherty P, Rudol P. A UAV search and rescue scenario with human body detection and geolocalization. In: Proceedings of Australasian Joint Conference on Artificial Intelligence, 2007. 1--13

[3] Li C C, Zhang G S, Lei T J. Quick image-processing method of UAV without control points data in earthquake disaster area. Trans Nonferrous Met Soc China, 2011, 21: s523--s528

[4] Ryan A, Zennaro M, Howell A, et al. An overview of emerging results in cooperative UAV control. In: Proceedings of the 43rd IEEE Conference on Decision and Control (CDC), 2004. 602--607

[5] Chmaj G, Selvaraj H. Distributed processing applications for UAV/drones: a survey. In: Proceedings of Progress in Systems Engineering, 2015. 449--454

[6] Srinivasan S, Latchman H, Shea J, et al. Airborne traffic surveillance systems: video surveillance of highway traffic. In: Proceedings of the ACM 2nd International Workshop on Video Surveillance & Sensor Networks, 2004. 131--135

[7] O'Young S, Hubbard P. Raven: a maritime surveillance project using small UAV. In: Proceedings of the 2007 IEEE Conference on Emerging Technologies and Factory Automation (ETFA 2007), 2007. 904--907

[8] Xu X W, Lai J Z, Lv P, et al. A literature review on the research status and progress of cooperative navigation technology for multiple UAVs. Navigation Positioning and Timing, 2017, 4: 1--9

[9] Li L, Wang T, Hu Q L, et al. White force network in DARPA CODE program. Aerospace Electron Warfare, 2018, 34: 54--59

[10] Chen J. Aerodynamic Missile J, 2016, 1: 24--26

[11] Duan H B, Li P. Autonomous control for unmanned aerial vehicle swarms based on biological collective behaviors. Sci Tech Rev, 2017, 35: 17--25

[12] Wu Y J. Aerospace Electron Warfare, 2002, 3: 22--25

[13] Balamurugan G, Valarmathi J, Naidu V P S. Survey on UAV navigation in GPS-denied environments. In: Proceedings of International Conference on Signal Processing, 2017

[14] Cesare K, Skeele R, Yoo S-H, et al. Multi-UAV exploration with limited communication and battery. In: Proceedings of 2015 IEEE International Conference on Robotics and Automation (ICRA), 2015. 2230--2235

[15] Gupta L, Jain R, Vaszkun G. Survey of important issues in UAV communication networks. IEEE Commun Surv Tutorials, 2016, 18: 1123--1152

[16] Duan H B, Zhang D F, Fan Y M. From wolf pack intelligence to UAV swarm cooperative decision-making. Sci Sin-Inf, 2019, 49: 112--118

[17] Silver D, Hubert T, Schrittwieser J, et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. 2017. ArXiv

[18] Silver D, Schrittwieser J, Simonyan K, et al. Mastering the game of Go without human knowledge. Nature, 2017, 550: 354--359

[19] Vinyals O, Ewalds T, Bartunov S, et al. StarCraft II: a new challenge for reinforcement learning. 2017. ArXiv

[20] Wang Y S, Tong Y X, Long C, et al. Adaptive dynamic bipartite graph matching: a reinforcement learning approach. In: Proceedings of 2019 IEEE 35th International Conference on Data Engineering (ICDE), 2019. 1478--1489

[21] Beard W R. Multiple UAV cooperative search under collision avoidance and limited range communication constraints. In: Proceedings of IEEE Conference on Decision and Control, 2003

[22] Pham H X, La H M, Feil-Seifer D, et al. Cooperative and distributed reinforcement learning of drones for field coverage. 2018. ArXiv

[23] Zhang B C, Mao Z L, Liu W Q. Geometric reinforcement learning for path planning of UAVs. J Intell Robot Syst, 2015, 77: 391--409

[24] Zeng Y, Xu X. Path design for cellular-connected UAV with reinforcement learning. 2019. ArXiv

[25] Ghavamzadeh M, Mahadevan S, Makar R. Hierarchical multi-agent reinforcement learning. Auton Agent Multi-Agent Syst, 2006, 13: 197--229

[26] Sutton R S, Precup D, Singh S. Between MDPs and semi-MDPs: a framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 1999, 112: 181--211

[27] Kaelbling L P, Littman M L, Moore A W. Reinforcement learning: a survey. J Artif Intell Res, 1996, 4: 237--285

[28] Sutton R S, Barto A G. Reinforcement Learning: An Introduction. Cambridge: MIT Press, 1998

[29] Watkins C J C H, Dayan P. Q-learning. Mach Learn, 1992, 8: 279--292

[30] Mnih V, Kavukcuoglu K, Silver D, et al. Human-level control through deep reinforcement learning. Nature, 2015, 518: 529--533

[31] Schaul T, Quan J, Antonoglou I, et al. Prioritized experience replay. 2015. ArXiv

[32] Tieleman T, Hinton G. Lecture 6.5-rmsprop: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Netw Mach Learn, 2012, 4: 26--31

[33] Melo F S, Ribeiro M I. Q-learning with linear function approximation. In: Proceedings of International Conference on Computational Learning Theory, 2007. 308--322

[34] Rahimi A, Recht B. Weighted sums of random kitchen sinks: replacing minimization with randomization in learning. In: Proceedings of Advances in Neural Information Processing Systems, 2009. 1313--1320

[35] Mnih V, Badia A P, Mirza M, et al. Asynchronous methods for deep reinforcement learning. In: Proceedings of International Conference on Machine Learning, 2016. 1928--1937

[36] Williams R J, Peng J. Function optimization using connectionist reinforcement learning algorithms. Connection Sci, 1991, 3: 241--268

[37] Schulman J, Wolski F, Dhariwal P, et al. Proximal policy optimization algorithms. 2017. ArXiv

[38] Abbeel P, Schulman J. Deep reinforcement learning through policy optimization. Tutorial at Neural Information Processing Systems Conference, 2016

[39] Haarnoja T, Zhou A, Abbeel P, et al. Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. 2018. ArXiv

[40] Grondman I, Busoniu L, Lopes G A D. A survey of actor-critic reinforcement learning: standard and natural policy gradients. IEEE Trans Syst Man Cybern C, 2012, 42: 1291--1307

[41] Chen Y, Zhang H, Xu M. The coverage problem in UAV network: a survey. In: Proceedings of the 5th International Conference on Computing, Communications and Networking Technologies (ICCCNT), 2014. 1--5

[42] Araujo J, Sujit P, Sousa J B. Multiple UAV area decomposition and coverage. In: Proceedings of 2013 IEEE Symposium on Computational Intelligence for Security and Defense Applications (CISDA), 2013. 30--37

[43] Busoniu L, Babuska R, De Schutter B. A comprehensive survey of multiagent reinforcement learning. IEEE Trans Syst Man Cybern C, 2008, 38: 156--172

[44] Fasano G, Accardo D, Moccia A, et al. Multisensor based fully autonomous non-cooperative collision avoidance system for UAVs. J Aerospace Computing Information Communication, 2008, 5: 338--360. https://doi.org/10.2514/6.2007-2847

[45] La H M, Lim R, Sheng W. Multirobot cooperative learning for predator avoidance. IEEE Trans Contr Syst Technol, 2015, 23: 52--63

[46] Li J Q, Deng G Q, Luo C W. A hybrid path planning method in unmanned air/ground vehicle (UAV/UGV) cooperative systems. IEEE Trans Veh Technol, 2016, 65: 9585--9596

  • Figure 1

    (Color online) Simulation environment. (a) Overview of the battlefield simulation environment; (b) partial enlargement

  • Figure 2

    (Color online) RQ1 results. (a) Mean number of targets acquired; (b) mission complete rate; (c) mean mission complete time

  • Figure 3

    RQ3 results. (a) Mission complete rate; (b) mean mission complete time; (c) mean number of targets acquired

  • Table 1   Reward settings
    Environmental feedback Reward
    Finding a target 1000
    Destroying an enemy drone 125
    Losing a friendly drone -125
    Moving one step -1
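The reward scheme in Table 1 can be sketched as a simple per-step reward function; the following is a minimal illustration only, with function and argument names that are hypothetical rather than taken from the paper:

```python
def step_reward(targets_found: int, enemies_destroyed: int,
                own_destroyed: int, moved: bool = True) -> float:
    """Per-step reward implied by Table 1 (illustrative sketch)."""
    r = 0.0
    r += 1000 * targets_found      # finding a target: +1000 each
    r += 125 * enemies_destroyed   # destroying an enemy drone: +125 each
    r -= 125 * own_destroyed       # a friendly drone destroyed: -125 each
    r -= 1 if moved else 0         # per-step movement cost: -1
    return r
```

Note the large gap between the target reward (+1000) and the per-step cost (-1), which makes target acquisition dominate the return while still penalizing long trajectories.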
  • Table 2   RQ1 experiment settings
    Number Algorithm (red) Algorithm (blue)
    1 Random walk Random walk
    2 DQN Random walk
    3 L-QL Random walk
    4 A3C Random walk
    5 DPPO Random walk
  • Table 3   RQ2 experiment settings
    Number Algorithm (red) Algorithm (blue)
    1 DQN L-QL
    2 DQN DPPO
    3 L-QL DPPO
  • Table 4   Score settings
    Mission result Score
    One target acquired 1
    Both targets acquired 2
    No target acquired 0
    Each drone destroyed -0.1
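The episode score implied by Table 4 can likewise be written as a small function; this is an illustrative sketch, and the function name is an assumption, not the paper's code:

```python
def episode_score(targets_acquired: int, drones_lost: int) -> float:
    """Episode score per Table 4: one point per target acquired
    (0, 1, or 2), minus 0.1 for each friendly drone destroyed."""
    assert 0 <= targets_acquired <= 2, "at most two targets exist"
    return targets_acquired - 0.1 * drones_lost
```

For example, acquiring both targets while losing three drones would score 2 - 0.3 = 1.7.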
  • Table 5   Victory rate (VR) (%)
    DQN 72.2 90.1
    L-QL 14.9 42.7
    DPPO 5.1 28.4
  • Table 6   Mission complete rate (MCR) (%)
    DQN 49.1 72.0
    L-QL 4.6 20.4
    DPPO 0.7 2.6
  • Table 7   Mean mission complete time steps ($\overline{\mathrm{MCT}}$)
    DQN 358.56 297.79
    L-QL 484.61 613.55
    DPPO 905.14 683.58