Applying Reinforcement Learning to the Traveling Salesman Problem

Author: Victor De Lima, Georgetown University

Published: April 27, 2023

Abstract
The Traveling Salesman Problem (TSP) is a well-studied optimization problem with a wide variety of applications and approaches to solving it. The methods and tools of Reinforcement Learning (RL) offer a distinctive strategy for approaching the TSP due to the ease with which its reward structure can be modified. Using data obtained from Skyscanner, this study examines how Q-learning, an RL technique, can optimize a travel route through 42 Asian cities. We first provide a theoretical background into the tools discussed, followed by an exposition of the methodology and experiments. The study’s results show that Q-learning is a very effective method for solving the TSP. We also discuss the model’s limitations and how it can be extended in future work.

All code used in this report is publicly available on GitHub.

Figure 1: Picture of a robot travel agent generated with Stable Diffusion [1] using the prompt "a friendly-looking humanoid robot travel agent greeting you on its work desk and a map hanging in the back wall".

1 Introduction

Anyone who has planned a trip involving several cities has undoubtedly wondered how best to visit each location while minimizing the cost of the trip. Each location involves several considerations, such as distance, cost, and the importance of visiting specific places. The question can become overwhelming once we realize that we would have to consider each city as both a potential starting point and a potential next stop from any other city. We can analyze such a scenario through the framework of the Traveling Salesman Problem (TSP). Karl Menger, one of the first to study the problem mathematically, defines the TSP as “the task to find, for finitely many points whose pairwise distances are known, the shortest route connecting the points” [2]. In other words, the goal of the TSP is to find the shortest possible route that visits every point once and then returns to the origin. The TSP is an NP-hard combinatorial optimization problem: it is simple to solve by brute force for five stops by trying all combinations (5! = 120 permutations) but already intractable for 50 (50! permutations).

The TSP is a well-known and widely studied problem in mathematics and computer science. Beyond travel, the TSP has applications in many fields, including manufacturing, transportation, and engineering. For example, some applications that successfully employ the TSP include overhauling gas turbine engines [3], X-ray crystallography [4], and computer wiring [5]. Although researchers have studied the TSP extensively and developed several discrete optimization techniques to solve it, Reinforcement Learning (RL) offers many advantages compared to classical optimization methods. Most importantly, RL offers a framework in which it is relatively simple to change the reward structure to adjust the optimization problem under changing circumstances.

In this paper, we approach the TSP from an RL perspective and utilize Q-learning to optimize a 42-city travel route using real prices obtained from Skyscanner. First, we discuss the theoretical background that motivates the experiment. Then, we explore the methodology and data on which we built the analysis. Next, we review the results, which show that Q-learning is a very effective method for solving the TSP. Lastly, we discuss the model’s limitations and the ways to extend the model in future research.

3 Methodology

3.1 Problem Formulation

In this paper, we employ the asymmetric case of the TSP since the costs associated with traveling from airport A to airport B are not the same as the costs of traveling from B to A, given differing fare rates. Furthermore, itineraries might include varying stopovers, which leads to travel distances between the points that are not symmetrical in the model. We formulate the TSP as a set of locations \(V=\left\{v_1, \ldots, v_n\right\}\), a set of edges \(A=\{(r, s): r, s \in V\}\), and a set of costs \(d_{rs}\), where each cost is associated with an edge \((r, s) \in A\). Since this is the asymmetric case, we allow \(d_{rs} \neq d_{sr}\) for any edge \((r, s)\) [7].
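
To make the formulation concrete, the toy sketch below (illustrative only; the cost values and city count are made up and are not part of the study's code) encodes a small asymmetric instance as a cost matrix and solves it by brute force. The factorial number of tours enumerated by the loop is precisely what rules this approach out for the 42-city problem studied here.

```python
# Toy illustration of the asymmetric formulation above (not the study's code):
# a 4-city instance where d[r, s] != d[s, r], solved by brute force.
from itertools import permutations

import numpy as np

# Asymmetric cost matrix: entry [r, s] is the cost of traveling from r to s.
d = np.array([
    [0, 10, 15, 20],
    [12, 0, 35, 25],
    [19, 30, 0, 30],
    [22, 28, 32, 0],
])

n = len(d)
best_tour, best_cost = None, float("inf")
for perm in permutations(range(1, n)):       # fix city 0 as the origin
    tour = (0, *perm, 0)                     # visit every city once, return home
    cost = sum(d[a, b] for a, b in zip(tour, tour[1:]))
    if cost < best_cost:
        best_tour, best_cost = tour, cost

print(best_tour, best_cost)                  # enumerates (n - 1)! tours; hopeless at n = 42
```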

In the Q-learning framework, we aim to learn which action to take in each location (the state). To this end, the model learns the value associated with each transition from one location to another (referred to as Q-values) by sampling the environment and repeatedly applying the update rule for the Q-learning algorithm:

\[ Q(s, a) = Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right) \tag{1}\]

where \(Q(s, a)\) is the current estimate of the value of taking action \(a\) in state \(s\), \(r\) is the reward (cost) from taking an action, \(\alpha\) is a learning rate hyperparameter, and \(\gamma\) is a discount rate hyperparameter. The \(\prime\) symbol denotes the next state or action. By applying Equation 1, the model can update the existing estimates of \(Q(s, a)\). Then, the agent takes the action with the highest Q-value at every step [14].
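
As a rough sketch of how Equation 1 maps to code, the snippet below applies one update for an observed transition. The array layout, function name, and hyperparameter values are assumptions made for illustration rather than the study's actual settings.

```python
# Minimal sketch of one application of Equation 1 (illustrative; hyperparameter
# values are assumed, not those used in the study).
import numpy as np

ALPHA, GAMMA = 0.1, 0.95   # learning rate and discount rate (assumed values)

def q_update(Q, s, a, r, s_next, allowed_next):
    """Update Q[s, a] in place for one observed transition (s, a, r, s_next).

    `Q` is an n x n array of Q-values and `allowed_next` lists the actions
    still available from `s_next` (e.g., cities not yet visited).
    """
    if allowed_next:
        target = r + GAMMA * Q[s_next, allowed_next].max()
    else:
        target = r             # terminal step: no remaining actions to bootstrap from
    Q[s, a] += ALPHA * (target - Q[s, a])
```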

3.2 Data

We obtained the initial list of target cities and coordinate data from [15]. Pyongyang, Sanaa, Rangoon, Damascus, Kabul, and Thimphu were removed from the list due to limited flight availability, resulting in 42 cities for the study. Figure 2 shows the airport locations in each city.

Figure 2: Interactive Folium world map centered over Asia showing the cities of interest, with a base tile layer by [16] and administrative boundaries from [17]. Note: VPNs may prevent correct map rendering.

We collected these cities’ flight prices, durations, and stopover data using the Skyscanner API [18]. We made several simplifying assumptions in the data collection. We only obtained data for a single, arbitrary date: 22 May 2023. Also, the data does not include the layover time between segments, only the in-flight time.¹ Lastly, the data collected corresponds only to the economy cabin class. Relaxation of these assumptions can be explored in future work.

Given the 42 airports, we constructed three transition matrices with 1,722 non-zero entries each, corresponding to flight price, duration, and stopover information. The entries correspond to each of the \(42 \times 42\) transitions minus the diagonal, which would correspond to traveling from an airport to itself (\(42 \times 42 - 42 = 1{,}722\)). The search returned 38,236 segments, which were narrowed down to 15,396 itineraries and then to the 1,722 ‘best’ flights. To choose the ‘best’ flight, we used the heuristic of valuing each flight hour at $20 and every stopover at $50, resulting in:

\[ \text{best price} = \text{fare} + \text{flight time} \times 20 + \text{stops} \times 50 \tag{2}\]
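
A minimal sketch of this heuristic follows, with hypothetical field names for the fare, in-flight hours, and stop count (the study's actual data schema may differ).

```python
# Sketch of the 'best' flight heuristic in Equation 2 (field names are
# illustrative assumptions, not the study's schema).
HOURLY_VALUE = 20   # each in-flight hour valued at $20
STOP_PENALTY = 50   # each stopover valued at $50

def best_price(fare_usd, flight_hours, stops):
    """Combine fare, flight time, and stopovers into a single comparable score."""
    return fare_usd + flight_hours * HOURLY_VALUE + stops * STOP_PENALTY

# Picking the 'best' of several candidate itineraries for one airport pair:
candidates = [
    {"fare": 310.0, "hours": 6.5, "stops": 1},   # score: 310 + 130 + 50 = 490
    {"fare": 255.0, "hours": 11.0, "stops": 2},  # score: 255 + 220 + 100 = 575
]
best = min(candidates, key=lambda c: best_price(c["fare"], c["hours"], c["stops"]))
```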

Figure 3 shows color-coded prices between airports.

4 Experiments and Results

We implemented the model in Python using custom classes and functions. The model was based on the implementation in [19], particularly its method for tracking previously visited locations and preventing the agent from revisiting them. The model constructed a \(42 \times 42\) matrix of Q-values and employed an ε-greedy strategy with an epsilon decay rate of 0.999. Finally, we ran the model for 200 episodes of Q-learning, with the results shown in Figure 4.
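
The sketch below outlines what such a training loop could look like, using a random stand-in for the cost matrix and assumed learning-rate and discount values; it is illustrative only and is not the study's implementation, which follows [19].

```python
# Illustrative Q-learning loop with epsilon decay and visited-city masking
# (random stand-in costs; ALPHA and GAMMA are assumed, not the study's values).
import numpy as np

rng = np.random.default_rng(0)
N_CITIES, EPISODES = 42, 200
ALPHA, GAMMA = 0.1, 0.95
eps, EPS_DECAY = 1.0, 0.999

cost = rng.uniform(50, 500, size=(N_CITIES, N_CITIES))   # stand-in for the price matrix
Q = np.zeros((N_CITIES, N_CITIES))

for episode in range(EPISODES):
    s = 0                                      # start each tour at city 0
    unvisited = set(range(1, N_CITIES))
    while unvisited:
        allowed = list(unvisited)
        if rng.random() < eps:                 # explore: random unvisited city
            a = int(rng.choice(allowed))
        else:                                  # exploit: best Q-value among unvisited
            a = allowed[int(np.argmax(Q[s, allowed]))]
        r = -cost[s, a]                        # negative cost as the reward
        remaining = [c for c in allowed if c != a]
        target = r + (GAMMA * Q[a, remaining].max() if remaining else 0.0)
        Q[s, a] += ALPHA * (target - Q[s, a])
        unvisited.remove(a)
        s = a                                  # (return leg to the origin omitted for brevity)
    eps *= EPS_DECAY                           # decay exploration rate each episode
```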

Figure 4: Experiment results. (a) Flight score; (b) fare expense; (c) flight time; (d) flight stops.

The results show several important improvements over the initial random policy, including:

  • Decreasing overall flight expenditure from around $18,000 to $8,000
  • Decreasing overall flight time from around 20,000 minutes (333.3 hours, or 7.93 hours per flight on average) to 9,000 minutes (133.3 hours, or 3.17 hours per flight on average)
  • Decreasing the total number of stopovers from around 100 to 60.

Furthermore, Figure 5 shows the chosen trajectories for the first and last episodes, allowing us to observe how the choices changed between the untrained and trained models. It is challenging to discern the differences between these figures by mere observation, underscoring the model’s ability to uncover mathematical advantages that would be difficult to infer by visual inspection alone.

Figure 5: Trajectories of the first and last episodes.

5 Conclusions

This study began by explaining the TSP and motivating the use of RL as a method for solving it. Then we reviewed the TSP literature and the wide variety of existing solutions, including heuristic, approximate, and RL methods. We then formalized the problem and explained how the Q-learning process works. Next, we covered how the data for the experiment was collected and the assumptions made, and we visualized the search space. Lastly, we discussed the results from the experiment, which show that Q-learning is a very effective method for solving the TSP, achieving significant gains in cost reduction and time minimization for the desired route through the Asian cities.

Future work may include the relaxation of several assumptions made in this implementation. For example, future implementations may include additional data to capture varying prices across the expected duration of the trip rather than utilizing prices from a single date. Also, more complicated choices of ‘best’ flight may be considered to explore their impact on the model.

References

[1]
Stability AI. Stable Diffusion 2-1. Hugging Face. 2023 [accessed 2023 Apr 25]. https://huggingface.co/spaces/stabilityai/stable-diffusion
[2]
Schrijver A. On the History of Combinatorial Optimization (Till 1960). In: Handbooks in Operations Research and Management Science. Vol. 12. Elsevier; 2005. p. 1–68. https://linkinghub.elsevier.com/retrieve/pii/S0927050705120015. doi:10.1016/S0927-0507(05)12001-5
[3]
Plante RD, Lowe TJ, Chandrasekaran R. The Product Matrix Traveling Salesman Problem: An Application and Solution Heuristic. Operations Research. 1987 [accessed 2023 Apr 30];35(5):772–783. https://www.jstor.org/stable/171228
[4]
Bland RG, Shallcross DF. Large travelling salesman problems arising from experiments in X-ray crystallography: A preliminary report on computation. Operations Research Letters. 1989 [accessed 2023 Apr 30];8(3):125–128. https://linkinghub.elsevier.com/retrieve/pii/0167637789900370. doi:10.1016/0167-6377(89)90037-0
[5]
Lenstra JK, Kan AHGR. Some Simple Applications of the Travelling Salesman Problem. Journal of the Operational Research Society. 1975 [accessed 2023 Apr 30];26(4):717–733. https://www.tandfonline.com/doi/full/10.1057/jors.1975.151. doi:10.1057/jors.1975.151
[6]
Karp RM. Reducibility among Combinatorial Problems. In: Miller RE, Thatcher JW, Bohlinger JD, editors. Complexity of Computer Computations. Boston, MA: Springer US; 1972. p. 85–103. http://link.springer.com/10.1007/978-1-4684-2001-2_9. doi:10.1007/978-1-4684-2001-2_9
[7]
Matai R, Singh S, Lal M. Traveling Salesman Problem: An Overview of Applications, Formulations, and Solution Approaches. In: Davendra D, editor. Traveling Salesman Problem, Theory and Applications. InTech; 2010. http://www.intechopen.com/books/traveling-salesman-problem-theory-and-applications/traveling-salesman-problem-an-overview-of-applications-formulations-and-solution-approaches. doi:10.5772/12909
[8]
Little JDC, Murty KG, Sweeney DW, Karel C. An Algorithm for the Traveling Salesman Problem. Operations Research. 1963 [accessed 2023 May 1];11(6):972–989. https://pubsonline.informs.org/doi/10.1287/opre.11.6.972. doi:10.1287/opre.11.6.972
[9]
Johnson DS, McGeoch LA. The traveling salesman problem: A case study in local optimization. In: Local Search in Combinatorial Optimization. John Wiley; Sons; 1997.
[10]
Kirkpatrick S, Gelatt CD, Vecchi MP. Optimization by Simulated Annealing. Science. 1983 [accessed 2023 May 1];220(4598):671–680. https://www.science.org/doi/10.1126/science.220.4598.671. doi:10.1126/science.220.4598.671
[11]
Brady RM. Optimization strategies gleaned from biological evolution. Nature. 1985 [accessed 2023 May 1];317(6040):804–806. http://www.nature.com/articles/317804a0. doi:10.1038/317804a0
[12]
Ottoni ALC, Nepomuceno EG, Oliveira MSD, Oliveira DCRD. Reinforcement learning for the traveling salesman problem with refueling. Complex & Intelligent Systems. 2022 [accessed 2023 May 1];8(3):2001–2015. https://link.springer.com/10.1007/s40747-021-00444-4. doi:10.1007/s40747-021-00444-4
[13]
Bogyrbayeva A, Yoon T, Ko H, Lim S, Yun H, Kwon C. A deep reinforcement learning approach for solving the Traveling Salesman Problem with Drone. Transportation Research Part C: Emerging Technologies. 2023 [accessed 2023 May 1];148:103981. https://linkinghub.elsevier.com/retrieve/pii/S0968090X22003941. doi:10.1016/j.trc.2022.103981
[14]
Bilgin E. Mastering Reinforcement Learning with Python. Packt Publishing; 2020.
[15]
FlagPictures.com. List of Countries Capital Cities. FlagPictures. 2023 [accessed 2023 Apr 25]. https://www.flagpictures.com/world-capital-cities/
[16]
National Geographic, Esri, Garmin, HERE, UNEP-WCMC, USGS, NASA, ESA, METI, NRCAN, et al. Esri.NatGeoWorldMap. Leaflet-providers preview. 2023 [accessed 2023 Apr 25]. http://leaflet-extras.github.io/leaflet-providers/preview/#filter=Esri.NatGeoWorldMap
[17]
Opendatasoft. World Administrative Boundaries - Countries and Territories. Opendatasoft. 2022 [accessed 2023 Apr 26]. https://public.opendatasoft.com/explore/dataset/world-administrative-boundaries/information/
[18]
Skyscanner. API Developer Documentation. Skyscanner Travel APIs. 2023 [accessed 2023 Apr 26]. https://developers.skyscanner.net/docs/intro
[19]
Alves Da Costa T. Delivery optimization with Reinforcement Learning. GitHub. 2019 [accessed 2023 Apr 25]. https://github.com/TheoLvs/reinforcement-learning

Footnotes

  1. For clarification, we cannot take the difference between the first segment departure and last segment arrival to calculate the total itinerary time because flight data is provided in local time, which would not account for time zone changes.