{"title": "Diversity-Driven Exploration Strategy for Deep Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 10489, "page_last": 10500, "abstract": "Efficient exploration remains a challenging research problem in reinforcement learning, especially when an environment contains large state spaces, deceptive local optima, or sparse rewards.\nTo tackle this problem, we present a diversity-driven approach for exploration, which can be easily combined with both off- and on-policy reinforcement learning algorithms. We show that by simply adding a distance measure to the loss function, the proposed methodology significantly enhances an agent's exploratory behaviors, and thus preventing the policy from being trapped in local optima. We further propose an adaptive scaling method for stabilizing the learning process. We demonstrate the effectiveness of our method in huge 2D gridworlds and a variety of benchmark environments, including Atari 2600 and MuJoCo. Experimental results show that our method outperforms baseline approaches in most tasks in terms of mean scores and exploration efficiency.", "full_text": "Diversity-Driven Exploration Strategy for Deep\n\nReinforcement Learning\n\nZhang-Wei Hong, Tzu-Yun Shann, Shih-Yang Su, Yi-Hsiang Chang, Tsu-Jui Fu,\n\nand Chun-Yi Lee\n\nDepartment of Computer Science, National Tsing Hua University\n\n{williamd4112,arielshann,at7788546,shawn420,rayfu1996ozig,cylee}\n\n@gapp.nthu.edu.tw\n\nAbstract\n\nEf\ufb01cient exploration remains a challenging research problem in reinforcement\nlearning, especially when an environment contains large state spaces, deceptive\nor sparse rewards. To tackle this problem, we present a diversity-driven approach\nfor exploration, which can be easily combined with both off- and on-policy rein-\nforcement learning algorithms. 
We show that by simply adding a distance-measure regularization to the loss function, the proposed methodology significantly enhances an agent's exploratory behavior, and thus prevents the policy from being trapped in local optima. We further propose an adaptive scaling strategy to enhance the performance. We demonstrate the effectiveness of our method in huge 2D gridworlds and a variety of benchmark environments, including Atari 2600 and MuJoCo. Experimental results validate that our method outperforms baseline approaches in most tasks in terms of mean scores and exploration efficiency.

1 Introduction

In recent years, deep reinforcement learning (DRL) has attracted attention in a variety of application domains, such as game playing [1, 2] and robot navigation [3]. However, exploration remains a major challenge for environments with large state spaces, deceptive local optima, or sparse reward signals. In an environment with deceptive rewards, an agent can be trapped in a local optimum and never discover alternate strategies that find larger payoffs. For example, in HalfCheetah of MuJoCo [4], the agent quickly learns to flip onto its back and then "wiggles" its way forward, which is a sub-optimal policy [5]. In addition, environments with only sparse rewards provide few training signals, making it hard for agents to discover feasible policies. A common approach to exploration is to adopt simple heuristic methods, such as ε-greedy [1, 6] or entropy regularization [7]. However, such strategies are unlikely to yield satisfactory results in tasks with deceptive or sparse rewards [8, 9]. A more sophisticated line of methods provides agents with bonus rewards whenever they visit a novel state. For example, [10] uses information gain as a measure of state novelty. In [11], a counting table is used to estimate the novelty of a visited state.
A neural density model [12] has also been utilized to measure bonus rewards for agents. In [13, 14], the novelty of a state is estimated from the prediction errors of their system dynamics models. These methods, however, often require statistical or predictive models to evaluate the novelty of a state, and therefore increase the complexity of the training procedure.

To deal with this issue of complexity, a few researchers attempt to embrace the idea of random perturbation from evolutionary algorithms [5, 8]. By adding random noise to the parameter space, their methods allow RL agents to perform exploration more consistently without introducing extra computational costs. Despite their simplicity, these methods are less efficient in large state spaces, as random noise changes the behavioral patterns of agents in an unpredictable fashion [5, 8]. In [15], the authors propose to address the problem by training multiple value functions with bootstrap sub-samples. However, this requires extra model parameters, causing additional computational overhead.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

In this paper, we present a diversity-driven exploration strategy, a methodology that encourages a DRL agent to attempt policies different from its prior policies. We propose to use a distance measure to modify the loss function, in order to tackle the problems of large state spaces and of deceptiveness and sparsity in reward signals. The distance measure evaluates the novelty between the current policy and a set of prior policies. Our method draws inspiration from novelty search [16-18], which promotes population-based exploration by encouraging novel behaviors. Our method differs from it in several aspects. First, we cast the concept of novelty search from evolution strategies into DRL frameworks. Second, we train a single agent instead of a population.
Third, novelty search ignores rewards altogether, while ours optimizes the policy using both the reward signals and the distance measure. A work parallel to ours similarly employs novelty search for exploration [19]; however, their method still lies in the realm of genetic algorithms. We demonstrate that our methodology is complementary and easily applicable to most off- and on-policy DRL algorithms. We further propose an adaptive scaling strategy, which dynamically scales the effect of the distance measure to enhance the overall performance. The adaptive scaling strategy consists of two methods: a distance-based method and a performance-based method. The former adjusts the scaling factor based on the distance measure, while the latter scales it according to the performance of the prior policies.

To validate the effectiveness of the proposed diversity-driven exploration strategy, we first demonstrate that our method does lead to better exploratory behaviors in 2D gridworlds with deceptive or sparse rewards. We compare our method against a number of contemporary exploration approaches (i.e., the baselines). Our experimental results show that while most baseline agents employing the contemporary exploration approaches are easily trapped by deceptive rewards or fail in the sparse reward settings, the agents employing our exploration strategy are able to overcome the aforementioned challenges, and learn effective policies even when the reward signals are sparse or when the environments contain deceptive rewards. We have further evaluated our method in a variety of benchmark environments, including Atari 2600 [20] and MuJoCo [4]. We performed various experiments to demonstrate the benefits of the diversity-driven exploration strategy.
We show that the proposed methodology is superior to, or comparable with, the baselines in terms of mean scores and learning time. Moreover, we provide a comprehensive ablative analysis of the proposed adaptive scaling strategy, and investigate the impact of different scaling methods on the learning curves of the DRL agents. The main contributions of this paper are summarized as follows:

• A simple, effective, and efficient exploration strategy applicable to most off- and on-policy DRL algorithms.

• A promising way to deal with large state spaces, deceptive rewards, and sparse reward settings.

• A loss function designed to encourage exploration through the use of a distance measure between the current policy and a limited set of the most recent policies.

• An adaptive scaling strategy consisting of two scaling methods for the distance measure, which enhances the overall performance.

• A comprehensive comparison between the proposed methodology and a number of contemporary approaches, evaluated on three different environments.

The remainder of this paper is organized as follows. Section 2 provides background material. Section 3 walks through the proposed exploration strategy in detail. Section 4 presents the experimental setup and results. Section 5 discusses the related work. Section 6 concludes this paper.

2 Background

In this section, we review the concept of RL, and the off- and on-policy methods used in this paper.

2.1 Reinforcement Learning

RL is a method to train an agent to interact with an environment $E$. An RL agent observes a state $s$ from the state space $S$ of $E$, takes an action $a$ from the action space $A$ according to its policy $\pi(a|s)$, and receives a reward $r(s, a)$. $E$ then transitions to a new state $s'$.
The agent's objective is to maximize its discounted accumulated rewards $G_t = \sum_{\tau=t}^{T} \gamma^{\tau-t} r(s_\tau, a_\tau)$, where $t$ is the current timestep, $\gamma \in (0, 1]$ the discount factor, and $T$ the horizon. The action-value function (i.e., Q-function) of a given policy $\pi$ is defined as the expected return starting from a state-action pair $(s, a)$, expressed as $Q(s, a) = \mathbb{E}\big[G_t \mid s_t = s, a_t = a, \pi\big]$.

2.2 Off-Policy Methods

Off-policy methods decouple the behavior and target policies, enabling an agent to learn using samples collected by arbitrary policies or from an experience replay [1]. We briefly review two representative off-policy methods, namely Deep Q-Network (DQN) [1] and Deep Deterministic Policy Gradient (DDPG) [21].

DQN. DQN is a deep neural network (DNN) parameterized by $\theta$ for approximating the optimal Q-function. For exploration, it follows an $\epsilon$-greedy strategy described in [1]. The network is trained with samples drawn from an experience replay $Z$, and is updated according to the loss function $L_{DQN}$ expressed as:

$$L_{DQN} = \mathbb{E}_{s,a,r,s' \sim U(Z)}\big[(y - Q(s, a, \theta))^2\big], \quad (1)$$

where $y = r(s, a) + \gamma \max_{a'} Q(s', a', \theta^-)$, $U(Z)$ is a uniform distribution over $Z$, and $\theta^-$ the parameters of the target network. $\theta^-$ is updated by $\theta$ at predefined intervals.

DDPG. DDPG is an actor-critic approach based on the deterministic policy gradient (DPG) algorithm [22] that learns policies over continuous action spaces. The critic estimates the Q-function similarly to DQN, with a minor modification to $y$, expressed as $y = r(s, a) + \gamma Q(s', \pi(s'))$. The actor is trained to maximize the critic's estimated Q-values, with a loss function $L_{actor}$ given by:

$$L_{actor} = -\mathbb{E}_{s \sim Z}\big[Q(s, \pi(s))\big], \quad (2)$$

where $s$ is sampled from $Z$.
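As a concrete illustration, the one-step targets and squared-error loss of Eq. (1) can be sketched with NumPy arrays standing in for the Q-network's outputs. This is a minimal sketch; the array shapes, function names, and terminal-state handling are our own assumptions, not part of the paper:

```python
import numpy as np

def dqn_targets(rewards, next_q, gamma=0.99, terminal=None):
    # y = r(s, a) + gamma * max_a' Q(s', a', theta^-), computed per transition.
    y = rewards + gamma * next_q.max(axis=1)
    if terminal is not None:
        # No bootstrapping past the end of an episode (a common convention,
        # not spelled out in Eq. (1) itself).
        y = np.where(terminal, rewards, y)
    return y

def dqn_loss(q, actions, y):
    # Mean squared error between Q(s, a, theta) and the fixed targets y.
    chosen = q[np.arange(len(actions)), actions]
    return np.mean((y - chosen) ** 2)

# A toy batch of two transitions with two actions.
q = np.array([[1.0, 2.0], [0.5, 0.0]])        # Q(s, a, theta)
next_q = np.array([[0.0, 1.0], [2.0, 2.0]])   # Q(s', a', theta^-)
rewards = np.array([1.0, 0.0])
y = dqn_targets(rewards, next_q, gamma=0.9)   # [1.9, 1.8]
loss = dqn_loss(q, np.array([1, 0]), y)       # mean of (1.9-2.0)^2 and (1.8-0.5)^2
```

The analogous actor objective for DDPG (Eq. (2)) would simply negate the critic's mean Q-value over sampled states.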
DDPG uses a stochastic policy $\hat{\pi}(s) = \pi(s) + \mathcal{N}$ for exploration, where $\mathcal{N}$ is a noise process. $\mathcal{N}$ can be either normally distributed or generated by the Ornstein-Uhlenbeck (OU) process [21].

2.3 On-Policy Methods

On-policy methods update their value functions based on the samples generated by the current policy. We review a state-of-the-art on-policy method called Advantage Actor-Critic (A2C) [7, 23], which is evaluated and compared to the proposed methodology in Section 4. A2C is a synchronous variant of Asynchronous Advantage Actor-Critic (A3C) [7], which trains agents in parallel on multiple instances of the environment. A2C offers better utilization of GPUs than A3C. Similarly, the critic estimates the value function $V(s)$. The actor optimizes $\pi(a|s, \theta)$ by minimizing the following loss function:

$$L_{actor} = -\mathbb{E}_{s,a \sim \pi}\big[G_t - V(s) + \beta H(\pi(\cdot|s, \theta))\big], \quad (3)$$

where $\beta$ is a hyperparameter controlling the strength of the entropy $H(\pi(\cdot|s, \theta))$. A2C utilizes $H(\pi(\cdot|s, \theta))$ to encourage exploratory behaviors, as well as to prevent agents from converging prematurely to sub-optimal policies.

3 Diversity-Driven Exploration Strategy

The main objective of the proposed diversity-driven exploration strategy is to encourage a DRL agent to explore different behaviors during the training phase. Diversity-driven exploration is an effective way to motivate an agent to examine a richer set of states, as well as to provide it with a way to escape from sub-optimal policies. This is achieved by modifying the loss function into $L_D$ as follows:

$$L_D = L - \mathbb{E}_{\pi' \in \Pi'}\big[\alpha D(\pi, \pi')\big], \quad (4)$$

where $L$ indicates the loss function of any arbitrary DRL algorithm, $\pi$ is the current policy, $\pi'$ is a policy sampled from a limited set of the most recent policies $\Pi'$, $D$ is a distance measure between $\pi$ and $\pi'$, and $\alpha$ is a scaling factor for $D$. The second term in Eq. (4) encourages an agent to update $\pi$ with gradients towards directions such that $\pi$ diverges from the samples in $\Pi'$. Eq. (4) provides several favorable properties. First, it drives an agent to proactively attempt new policies, increasing the opportunities to visit novel states even in the absence of reward signals from $E$. This property is especially useful in sparse reward settings, where the reward is zero for most of the states in $S$. Second, the distance measure $D$ motivates exploration by modifying an agent's current policy $\pi$, instead of altering its behavior randomly. Third, it allows an agent to perform either greedy or stochastic policies while exploring effectively in the training phase. For greedy policies, since $D$ requires an agent to adjust $\pi$ after each update, the greedy action for a state may change accordingly, potentially directing the agent to explore unseen states. This property also ensures that the agent acts consistently in the states it is familiar with, as $\pi$ and $\pi'$ yield the same outcomes for those states. These three properties allow a DRL agent to explore an environment in a systematic and consistent manner. The choice of $D$ can be the KL-divergence, the L2-norm, or the mean squared error (MSE). The interested reader is referred to our supplementary material for more details on how the distance measure is selected. In the subsequent sections, we explain how diversity-driven exploration can be combined with off-policy and on-policy methods, for both discrete and continuous control problems.

3.1 Implementation on Off-Policy Methods

Most off-policy DRL algorithms [1, 21] adopt the experience replay mechanism to stabilize the learning process. We show that diversity-driven exploration can be readily applied to existing algorithms for both discrete (DQN) and continuous (DDPG) control tasks, with only a few modifications to the experience replay buffer $Z$.

Div-DQN. We make the following changes to the DQN algorithm. First, we additionally store the past Q-values (denoted as $Q'(s, a)$) in $Z$. Second, for the sake of defining a proper distance measure, we use a probabilistic formulation as in [5] by applying the softmax function over the predicted Q-values. We therefore define $\pi(a|s) = \exp(Q(s, a)) / \sum_{a' \in A} \exp(Q(s, a'))$. $\pi'(a|s)$ is defined similarly, but uses $Q'(s, a)$ instead. We adopt the KL-divergence as the distance measure, denoted as $D_{KL}$. Eq. (4) is rewritten as:

$$L_D = L - \mathbb{E}_{\hat{Q}(s,a) \sim U(Z)}\big[\alpha D_{KL}(\pi'(\cdot|s) \,\|\, \pi(\cdot|s))\big], \quad (5)$$

where $\alpha$ can be either a predefined value, or determined by the adaptive methods in Section 3.3.

Div-DDPG. For diversity-driven DDPG, we use the actions stored in $Z$, and modify Eq. (4) as:

$$L_D = L - \mathbb{E}_{s,a' \sim U(Z)}\big[\alpha D(\pi(s), a')\big], \quad (6)$$

where $a'$ is a prior action sampled from $Z$. The distance measure $D$ here is simply an MSE function between $\pi(s)$ and $a'$. The value of $\alpha$ is determined in a similar fashion as that of DQN. Please note that we do not use the stochastic exploration policy mentioned in Section 2.2, since $D$ alone is sufficient for improving the exploratory behavior.

3.2 Implementation on On-Policy Methods

Apart from off-policy methods, we explain how our methodology can be applied to on-policy methods.

Div-A2C. As A2C has no experience replay, we maintain the $n$ most recent policies for calculating the distance measure $D$. In general cases, $n = 5$ is sufficient to yield satisfactory performance.
The loss function $L_D$ is thus expressed as:

$$L_D = L - \mathbb{E}_{s \sim \tau}\big[\mathbb{E}_{\pi' \sim \Pi'}[\alpha_{\pi'} D_{KL}(\pi'(\cdot|s) \,\|\, \pi(\cdot|s))]\big], \quad (7)$$

where $\tau$ represents the batch of on-policy data, $\Pi'$ is the set of prior policies, and $\pi'$ is a prior policy. The KL-divergence is used as the distance measure. We describe in detail a performance-based method to scale $\alpha_{\pi'}$ for each individual prior policy $\pi'$ in Section 3.3.

3.3 Adaptive Scaling Strategy

Although $\alpha$ can be linearly annealed over time, we find this solution less than ideal in some cases. We demonstrate these cases in the experimental results provided in Section 4. To update $\alpha$ in a way that leads to better overall performance, we consider two adaptive scaling methods: the distance-based and the performance-based methods. In our experiments, the off-policy algorithms use only the distance-based method, while the on-policy algorithm uses both methods for scaling $\alpha$.

Distance-based. Similar to [5], we relate $\alpha$ to the distance measure $D$. We adaptively increase or decrease the value of $\alpha$ depending on whether $D$ is below or above a certain threshold $\delta$. The simple approach we use to update $\alpha$ at each training iteration is defined as:

$$\alpha := \begin{cases} 1.01\,\alpha, & \text{if } \mathbb{E}\big[D(\pi, \pi')\big] \le \delta \\ 0.99\,\alpha, & \text{otherwise} \end{cases} \quad (8)$$

Please note that different values of $\delta$ are applied to different methods in our experiments. The method for determining the value of $\delta$ is described in the supplementary material.

Performance-based. While the distance-based scaling method is straightforward and effective, it alone does not lead to the same performance for on-policy algorithms.
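Before turning to the performance-based variant, the KL distance term used by Div-DQN and Div-A2C (Eqs. (5) and (7)) and the distance-based update of Eq. (8) can be sketched as follows. This is a minimal NumPy sketch with our own function names; the softmax policy over Q-values follows the text above:

```python
import numpy as np

def softmax(q):
    z = q - q.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mean_kl(p_old, p_new, eps=1e-8):
    # D_KL(pi'(.|s) || pi(.|s)), averaged over a batch of states.
    return np.mean(np.sum(p_old * (np.log(p_old + eps) - np.log(p_new + eps)), axis=-1))

def diversity_loss(base_loss, q_new, q_old, alpha):
    # L_D = L - alpha * E[D_KL(pi' || pi)], with pi = softmax(Q), pi' = softmax(Q').
    d = mean_kl(softmax(q_old), softmax(q_new))
    return base_loss - alpha * d, d

def update_alpha(alpha, mean_distance, delta):
    # Eq. (8): grow alpha while the policy stays too close to its predecessors.
    return 1.01 * alpha if mean_distance <= delta else 0.99 * alpha

# Toy batch: the current Q-values have drifted away from the stored ones,
# so the diversity term rewards (subtracts from) the base loss.
q_new = np.array([[1.0, 2.0, 0.5]])
q_old = np.array([[1.0, 1.0, 1.0]])
loss, d = diversity_loss(base_loss=0.0, q_new=q_new, q_old=q_old, alpha=0.1)
alpha = update_alpha(0.1, d, delta=0.05)
```

In an actual agent, the gradient of `loss` with respect to the current network's parameters would be taken by an autodiff framework; the sketch only shows the scalar computation.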
The rationale behind this is that we only use the five most recent policies ($n = 5$) to compute $L_D$, which often results in high variance and instability during the training phase. Off-policy algorithms do not suffer from this issue, as they can utilize experience replay to provide a sufficiently large set of past policies. Therefore, we propose to further adjust the value of $\alpha$ for on-policy algorithms according to the performance of past policies, so as to stabilize the learning process. We define $\alpha_{\pi'}$ by either one of the following two strategies:

$$\alpha_{\pi'} := -\Big(2\,\frac{P(\pi') - P_{min}}{P_{max} - P_{min}} - 1\Big) \ \text{(Proactive)}; \qquad \alpha_{\pi'} := 1.0 - \frac{P(\pi') - P_{min}}{P_{max} - P_{min}} \ \text{(Reactive)}, \quad (9)$$

where $P(\pi')$ denotes the average performance of $\pi'$ over five episodes, and $P_{min}$ and $P_{max}$ represent the minimum and maximum performance attained by the set of past policies $\Pi'$. Note that $\alpha_{\pi'}$ falls in the interval $[-1, 1]$ for the proactive strategy, and $[0, 1]$ for the reactive one. The proactive strategy incentivizes the current policy $\pi$ to converge towards the high-performing policies in $\Pi'$, while keeping away from the poor ones. On the other hand, the reactive strategy only motivates $\pi$ to stay away from the underperforming policies. We provide a comprehensive ablative analysis of these strategies in Section 4. Note that we apply both Eq. (8) and Eq. (9) to the on-policy methods in our experiments.

3.4 Clipping of Distance Measure

In some cases, the values of $D$ become extraordinarily high, causing instability in the training phase and thus degrading the performance. To stabilize the learning process, we clip $D$ to be between $-c$ and $c$, where $c$ is a predefined constant.
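Assuming each stored policy's average performance $P(\pi')$ is tracked alongside it, the two scaling rules in Eq. (9) and the clipping step above reduce to a few lines (the function names are ours):

```python
import numpy as np

def alpha_proactive(p, p_min, p_max):
    # Eq. (9), proactive: in [-1, 1]; +1 for the worst past policy (push away),
    # -1 for the best one (a negative alpha pulls pi toward it).
    return -(2.0 * (p - p_min) / (p_max - p_min) - 1.0)

def alpha_reactive(p, p_min, p_max):
    # Eq. (9), reactive: in [0, 1]; only pushes away from underperforming policies.
    return 1.0 - (p - p_min) / (p_max - p_min)

def clip_distance(d, c):
    # Section 3.4: clip the distance measure to [-c, c] for stability.
    return np.clip(d, -c, c)

# Performance of the five most recent policies, each averaged over five episodes.
perfs = np.array([10.0, 30.0, 50.0, 20.0, 40.0])
p_min, p_max = perfs.min(), perfs.max()
pro = [alpha_proactive(p, p_min, p_max) for p in perfs]  # [1.0, 0.0, -1.0, 0.5, -0.5]
rea = [alpha_reactive(p, p_min, p_max) for p in perfs]   # [1.0, 0.5, 0.0, 0.75, 0.25]
```

The signs make the intent of the two strategies explicit: under the proactive rule, the diversity term in Eq. (7) attracts $\pi$ to the best past policies and repels it from the worst, while the reactive rule only repels.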
Please refer to our supplementary material for more details.\n\n4 Experiments\n\nIn this section, we present our experimental results and discuss their implications. We \ufb01rst provide\nan overview of the experimental setup and the environments we used to evaluate our models in\nSection 4.1. In Sections 4.2\u223c4.4, we report results in three different environments, respectively. We\nfurther provide an ablative analysis for the proposed methodology in the supplementary material.\n\n4.1 Experimental Setup\n\n4.1.1 Environments\nGridworld. To provide an illustrative example of our method\u2019s effectiveness, we create a huge 2D\ngridworld (Fig. 1) with two different settings: (1) sparse reward setting, and (2) deceptive reward\nsetting. In both settings, the agent starts from the top-left corner of the map, with an objective to reach\nthe bottom-right corner to obtain a reward of 1. At each timestep, the agent observes its absolute\ncoordinate, and chooses from four possible actions: move north, move south, move west, and move\neast. An episode terminates immediately after a reward is collected. In the deceptive reward setting\nillustrated in Fig. 1 (a), the central area of the map is scattered with small rewards of 0.001 to distract\nthe agent from \ufb01nding the highest reward in the bottom-right corner. On the other hand, in the sparse\nreward setting depicted in Fig. 1 (b), there is only a single reward located at the bottom-right corner.\n\nAtari 2600. For discrete control tasks, we perform experiments in the Arcade Learning Environment\n(ALE) [20]. We select eight games varying in their dif\ufb01culty of exploration according to [24]. 
In each game, the agent receives $84 \times 84 \times 4$ stacked grayscale images as inputs, as described in [1].

Table 1: Evaluation results of the gridworld experiments (average reward over the last ten episodes).

                 Deceptive Reward                 Sparse Reward
                 50 × 50  100 × 100  200 × 200    50 × 50  100 × 100  200 × 200
  Vanilla-DQN    0.010    0.010      0.010        0.300    0.000      0.100
  Noisy-DQN      0.009    0.010      0.004        0.400    0.000      0.200
  Div-DQN        0.202    0.208      0.604        1.000    1.000      1.000

MuJoCo. For continuous control tasks, we conduct experiments in environments built on the MuJoCo physics engine [4]. We select a number of robotic control tasks to evaluate the performance of the proposed methodology and the baseline methods. In each task, the agent takes as input a vector of physical states, and generates a vector of action values to manipulate the robots in the environment.

4.1.2 Baseline Methods

The baseline methods (or simply "baselines") adopted for comparison vary across the different environments. For discrete control tasks, we select vanilla DQN [1], vanilla A2C [7], as well as their noisy net [8] and curiosity-driven [14] variants (denoted by Noisy-DQN/A2C and Curiosity-DQN/A2C, respectively) as the baselines. For continuous control tasks, vanilla DDPG [21] and its parameter noise [5] variant (referred to as parameter noise DDPG) are taken as the baselines for comparison. All of these baselines are implemented based on OpenAI Baselines1. For each method, we adopt the setting that yields the highest overall performance during hyperparameter search. Our hyperparameter settings are provided in the supplementary material. Please note that for DQN and A2C, we choose their noisy net variants for comparison instead of the parameter noise ones, as the former variants lead to relatively better performance.
On the other hand, we select parameter noise DDPG as our baseline rather than the noisy net variant, as the authors of [8] do not provide an implementation of the latter.

4.2 Exploration in Huge Gridworld

In this experiment, we evaluate the different methods in 2D gridworlds of three different sizes: 50 × 50, 100 × 100, and 200 × 200. We consider only vanilla DQN and Noisy-DQN as baselines in this experiment. We report the performance of each method in terms of its average reward in Table 1, where the average rewards are evaluated over the last ten episodes. We plot the state-visitation counts of all methods (Fig. 2) on the 200 × 200 gridworlds, in order to illustrate how the agents explore the state space.

Deceptive reward. Fig. 1 (a) illustrates the deceptive gridworld. As shown in Table 1, Div-DQN outperforms both vanilla DQN and Noisy-DQN in this setting. From the state-visitation counts (Fig. 2 (a)(b)(c)), it can be observed that the baseline methods are easily trapped in the area near the deceptive rewards, and never visit the optimal reward in the bottom-right corner. On the other hand, it can be seen from Fig. 2 (c) that Div-DQN is able to escape from the area of deceptive rewards, explores all four sides of the gridworld, and successfully discovers the optimal reward of one.

Sparse reward. Fig. 1 (b) illustrates the sparse gridworld. From Table 1, we notice that the average rewards of Noisy-DQN increase slightly as compared to those in the deceptive reward setting. However, as the size of the gridworld increases, it fails to find the location of the reward. In contrast, Div-DQN consistently achieves a mean reward of 1.0 for all gridworld sizes. In addition, from Fig. 2 (d), it can be seen that DQN spends most of its time wandering around the same route. Thus, its search range covers only a small proportion of the state space. Noisy-DQN explores a much broader area of the state space. However, the bright colors in Fig.
2 (e) indicate that Noisy-DQN wastes a significant amount of time visiting explored states. On the other hand, Div-DQN is the only method that is capable of exploring the gridworld uniformly and systematically, as illustrated in Fig. 2 (f). These results validate that our methodology is superior in exploring large state spaces with sparse rewards. Based on the above results, we conclude that the proposed diversity-driven exploration strategy does offer advantages that are favorable in exploring large gridworlds with deceptive or sparse rewards.

1 https://github.com/openai/baselines

Figure 1: Gridworlds.

Figure 2: State-visitation counts of the gridworlds.

Figure 3: Comparison of learning curves for different DQN variants in Atari 2600.

4.3 Performance Comparison in Atari 2600

In addition to the gridworld environments, we evaluate our methodology in a more challenging set of environments, Atari 2600. We provide an in-depth analysis of the empirical results for a few games selected from both the hard and easy exploration categories, according to the taxonomy in [24]. Each agent is trained with 40M frames, and the performance is evaluated over three random seeds. In Figs. 3 and 4, we plot the in-training median scores, along with the interquartile range. Div-DQN (Linear) and Div-DQN (Dis) correspond to our diversity-driven DQN with the linear decay and the distance-based methods for scaling α, respectively. Div-A2C (Pro) and Div-A2C (Rea) correspond to our diversity-driven A2C with the proactive and the reactive methods in Sec. 3.3, respectively.

4.3.1 Hard Exploration Games

Fig. 3 plots the learning curves of all of the models in the training phase.
It can be seen that our methods demonstrate superior or comparable performance to the baseline methods in all games. In particular, we observe that our strategy helps an agent explore a wider area more efficiently than the other baselines, which is especially useful when the state spaces become sufficiently large. For example, in Freeway, we notice that our agents are able to quickly discover the only reward at the other side of the road, while the other methods remain at the starting position. This observation is consistent with the learning curves illustrated in Figs. 3 (a) and 4 (a), where Div-DQN and Div-A2C learn considerably faster and better than the baseline methods. In addition to the direct benefits of efficient and thorough exploration mentioned above, we also observe that the exploratory behaviors induced by our methods are more systematic, motivating our agents to explore unvisited states. As shown in Fig. 3 (b), Div-DQN is the only one that learns a successful policy in Venture. On the contrary, as the vanilla DQN, A2C, Noisy-DQN/A2C, and Curiosity-DQN/A2C agents explore in a random way, they often bump into monsters they have previously seen and are quickly killed. It should be noted that although Div-A2C does not demonstrate a substantial increase in performance, it can be seen in Table S1 that it does end up with higher rewards than the baselines. We also observe that our method helps an agent ignore known small (deceptive) rewards, and discover alternative ways to obtain the optimal reward. For instance, in Alien, our agents learn to collect rewards while avoiding aliens by detouring, while the baselines focus on the immediate rewards in front of them without taking aliens into consideration.
This enables our agents to obtain higher average rewards than the baselines. In summary, our methods bring an improvement in terms of scores and training efficiency for hard exploration games except Montezuma's Revenge, which is notorious for its complexity.

Figure 4: Comparison of learning curves for different A2C variants in Atari 2600.

Figure 5: Comparison of learning curves for different DDPG variants in MuJoCo.

4.3.2 Easy Exploration Games

We show that the proposed diversity-driven exploration strategy can improve the training efficiency for easy exploration games as well. From the learning curves of Enduro and BankHeist presented in Figs. 3 and 4, it can be seen that our Div-DQN and Div-A2C agents learn significantly faster than the baselines in both games. They also show performance superior to the baselines for most of the training time.

4.4 Performance Comparison in MuJoCo Environments

To evaluate the proposed methods in continuous control domains, we conduct experiments in MuJoCo environments with (1) deceptive rewards, (2) large state spaces, and (3) sparse rewards, and similarly plot the in-training median scores in Fig. 5. In Fig. 5, DDPG (Div-Linear) and DDPG (Div-Dis) represent Div-DDPG with linear decay and distance-based scaling, respectively. DDPG (OU Noise) and DDPG (Param Noise) stand for vanilla DDPG and parameter noise DDPG, respectively. We investigate and discuss how the diversity-driven loss influences the agents' behavior in these environments, and demonstrate that our methodology does lead to superior exploration efficiency and performance.

4.4.1 Environments with Deceptive Rewards

Figs. 5 (a) and (b) plot the learning curves of Div-DDPG and vanilla DDPG in environments with deceptive rewards. In both cases, it is observed that our agents learn considerably faster, and end up with higher average rewards than the baselines.
While vanilla DDPG often converges to suboptimal policies, Div-DDPG is able to escape from local optima and find better strategies with larger payoffs. For example, in Humanoid, the baseline agent learns to lean forward for rewards, but at an angle that makes it easily fall down. Although our agent initially behaves in a similar way, it later discovers an alternate policy, and successfully walks forward for a much longer period without falling over. Similarly, in HalfCheetah, the Div-DDPG agent acts in the same way as the vanilla agent initially, but it ultimately learns to balance itself and move forward swiftly. These results indicate that our method does help agents explore more states, increasing their chances of escaping from suboptimal policies.

4.4.2 Environments with Large State Spaces

Figs. 5 (c), (d), and (e) plot the learning curves of Div-DDPG and DDPG in environments with large state spaces. It can be seen that our methods learn significantly faster than the baseline methods in Pusher, Thrower, and Striker. In these environments, agents have to manipulate a robotic arm to push, throw, or hit a ball to goal areas, respectively. Even though these environments provide well-defined reward functions, it is still challenging for agents to learn feasible policies, as they have to explore the enormous state spaces. We observe that the baseline methods move the arms aimlessly, and rarely reach the goals. Moreover, the random perturbation in their behaviors makes it harder for them to push/throw/hit the ball to the goal.
In contrast, without the interference of action noise during training, our methods quickly learn to manipulate the arm correctly, and hence achieve higher rewards.

4.4.3 Environments with Sparse Rewards

We redefine the reward functions of Reacher and HalfCheetah to create the SparseReacher and SparseHalfCheetah environments, and use them to investigate the impact of sparse rewards on the performance of Div-DDPG, parameter noise DDPG, and vanilla DDPG. In SparseReacher, a reward of +1 is granted only when the distance between the actuator and the target falls below a small threshold. In SparseHalfCheetah, the agent is rewarded with +1 only when it moves forward beyond a distance threshold. In Fig. 5 (f), we report the performance of each method in SparseReacher. While all of the methods are able to succeed in this environment, Div-DDPG learns faster and achieves higher average rewards. In Fig. 5 (g), we show the performance of each method in SparseHalfCheetah with the distance threshold set to 15.0. It can be observed that Div-DDPG is the only method able to acquire a stable policy in this setting. In contrast, vanilla DDPG and parameter noise DDPG rarely exceed the distance threshold, and receive no reward most of the time. From these results, we conclude that our method equips agents with the ability to explore efficiently in continuous control environments, and achieves promising results in various challenging settings.

5 Related Work

As our diversity-driven exploration strategy relates to several prior works in RL, this section provides a comparison with these works in terms of objectives and implementations.

Entropy regularization for RL. This line of work attempts to alleviate the premature convergence problem in policy search by regularizing the learning process with information-theoretic entropy constraints.
In [25], the authors address this problem by constraining the relative entropy between old and new state-action distributions. Similarly, a few recent works [26, 27] propose to alleviate it by bounding the KL-divergence between prior and current policies. In terms of objectives, our method aims to mitigate the insufficient exploration problem in deceptive and sparse reward settings, rather than the premature convergence problem in policy learning. Regarding implementations, both the above works and ours impose entropy-related constraints during learning. However, our method encourages exploration by maximizing the distance measure between the current and prior policies, instead of restraining their state-action distributions or KL-divergence.

Maximum entropy principle for RL. This series of works aims to improve the performance of exploration under uncertain dynamics by optimizing a maximum entropy objective. In [28, 29], the authors construct this objective function by augmenting the reward function with the policy entropy of visited states. Maximizing the expected entropy and rewards jointly encourages an RL agent to act optimally while retaining the stochasticity of its policy, which in turn enhances exploration under uncertain dynamics. With respect to objectives, the works in [28, 29] and our approach all intend to improve the performance and efficiency of exploration. However, our work focuses more on environments with deceptive and sparse rewards, rather than those with uncertain dynamics. In terms of implementations, our method maximizes the distance measure between old and new policies, instead of maximizing the expected entropy of the agent's policy.

To conclude, our method differs from the previous works in several fundamental aspects. It enhances the efficiency of exploration with a novel loss term.
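The shape of this loss term can be illustrated with a minimal sketch. Note that the choice of KL divergence as the distance measure, the fixed scaling factor alpha, and the function names below are illustrative assumptions for a discrete-action policy, not the exact implementation used in our experiments (which employ adaptive scaling):

```python
import math

def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) between two discrete action distributions.

    eps guards against log(0); an illustrative choice of distance measure.
    """
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def diversity_augmented_loss(rl_loss, current_probs, prior_probs_list, alpha=0.1):
    """Augment a standard RL loss with a diversity term.

    Subtracts the scaled expected distance between the current policy's
    action distribution and those of recent prior policies (evaluated on
    the same states), so that minimizing the total loss also pushes the
    agent away from its past behaviors. alpha is a hypothetical fixed
    scaling factor standing in for the paper's adaptive scaling.
    """
    if not prior_probs_list:
        return rl_loss
    expected_distance = sum(
        kl_divergence(current_probs, prior) for prior in prior_probs_list
    ) / len(prior_probs_list)
    return rl_loss - alpha * expected_distance
```

A policy whose action distribution differs from its recent predecessors receives a lower (more favorable) augmented loss than one identical to them, which is precisely the exploratory pressure the loss term is meant to exert.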
To the best of our knowledge, our work is the first to encourage exploration by maximizing the distance measure between current and prior policies.

6 Conclusion

In this paper, we presented a diversity-driven exploration strategy that can be effectively combined with current RL algorithms. We proposed to promote exploration by encouraging an agent to engage in behaviors different from its previous ones, and showed that this can be easily achieved by adding a distance measure term to the loss function. We performed experiments in various benchmark environments and demonstrated that our method leads to superior performance in most of the settings. Moreover, we verified that the proposed approach can deal with sparsity and deceptiveness in the reward function, and can explore large state spaces efficiently. Finally, we analyzed the adaptive scaling methods, and validated that they do improve the performance.

Acknowledgment

The authors would like to thank the Ministry of Science and Technology (MOST) in Taiwan and MediaTek Inc. for their funding support, and NVIDIA Corporation and NVAITC for their support of GPUs.

References

[1] V. Mnih et al. Human-level control through deep reinforcement learning. Nature, vol. 518, no. 7540, pp. 529–533, February 2015.

[2] D. Silver et al. Mastering the game of Go with deep neural networks and tree search. Nature, vol. 529, no. 7587, pp. 484–489, January 2016.

[3] M. Zhang et al. Learning deep neural network policies with continuous memory states. In Proc. Int. Conf. Robotics and Automation (ICRA), pages 520–527, May 2016.

[4] E. Todorov, T. Erez, and Y. Tassa. MuJoCo: A physics engine for model-based control. In Proc. Int. Conf. Intelligent Robots and Systems (IROS), pages 5026–5033, December 2012.

[5] M. Plappert et al. Parameter space noise for exploration. In Proc. Int. Conf. Learning Representations (ICLR), May 2018.

[6] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT Press, 1998.

[7] V. Mnih et al. Asynchronous methods for deep reinforcement learning. In Proc. Int. Conf. Machine Learning (ICML), pages 1928–1937, June 2016.

[8] M. Fortunato et al. Noisy networks for exploration. In Proc. Int. Conf. Learning Representations (ICLR), May 2018.

[9] I. Osband, D. Russo, Z. Wen, and B. Van Roy. Deep exploration via randomized value functions. arXiv:1703.07608, March 2017.

[10] R. Houthooft et al. VIME: Variational information maximizing exploration. In Advances in Neural Information Processing Systems (NIPS), pages 1109–1117, December 2016.

[11] H. Tang et al. #Exploration: A study of count-based exploration for deep reinforcement learning. In Advances in Neural Information Processing Systems (NIPS), pages 2750–2759, December 2017.

[12] A. van den Oord et al. Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems (NIPS), pages 4790–4798, December 2016.

[13] B. C. Stadie, S. Levine, and P. Abbeel. Incentivizing exploration in reinforcement learning with deep predictive models. arXiv:1507.00814, November 2015.

[14] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell. Curiosity-driven exploration by self-supervised prediction. In Proc. Int. Conf. Machine Learning (ICML), pages 2778–2787, August 2017.

[15] I. Osband, C. Blundell, A. Pritzel, and B. Van Roy. Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems (NIPS), pages 4026–4034, December 2016.

[16] J. Lehman and K. O. Stanley. Abandoning objectives: Evolution through the search for novelty alone. Evolutionary Computation, 19(2):189–223, May 2011.

[17] J. Lehman and K. O. Stanley. Evolving a diversity of virtual creatures through novelty search and local competition. In Proc. Conf. Genetic and Evolutionary Computation, pages 211–218, July 2011.

[18] J. Lehman and K. O. Stanley. Novelty search and the problem with objectives. Genetic Programming Theory and Practice IX, pages 37–56, October 2011.

[19] E. Conti et al. Improving exploration in evolution strategies for deep reinforcement learning via a population of novelty-seeking agents. arXiv:1712.06560, December 2017.

[20] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research (JAIR), 47:253–279, May 2013.

[21] T. P. Lillicrap et al. Continuous control with deep reinforcement learning. arXiv:1509.02971, February 2016.

[22] D. Silver et al. Deterministic policy gradient algorithms. In Proc. Int. Conf. Machine Learning (ICML), pages 387–395, June 2014.

[23] Y. Wu et al. Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation. In Advances in Neural Information Processing Systems (NIPS), pages 5285–5294, December 2017.

[24] M. Bellemare et al. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems (NIPS), pages 1471–1479, December 2016.

[25] J. Peters, K. Mülling, and Y. Altun. Relative entropy policy search. In Proc. AAAI Conf. Artificial Intelligence (AAAI), pages 1607–1612, 2010.

[26] J. Schulman et al. Trust region policy optimization. In Proc. Int. Conf. Machine Learning (ICML), pages 1889–1897, July 2015.

[27] J. Schulman et al. Proximal policy optimization algorithms. arXiv:1707.06347, August 2017.

[28] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv:1801.01290, January 2018.

[29] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proc. Int. Conf. Machine Learning (ICML), pages 1861–1870, July 2018.