Reinforcement Learning in the Warehousing Industry
A reflective assessment of the practical applications of Reinforcement Learning in the Warehousing and Logistics industry

Introduction
Artificial Intelligence and Machine Learning are advancing at an ever-increasing rate. Reinforcement Learning (RL) is one area of Machine Learning which is proving to be incredibly promising for the future of business efficiency and optimisation. Within the Warehousing and Logistics industry there are some unique challenges, some of which can be addressed and improved with the application of Reinforcement Learning. One example is the Picking and Putaway strategies implemented within modern Warehouse Management Systems. If a Reinforcement Learning algorithm were developed to address this scenario, businesses would benefit through improved efficiency and profitability. However, Reinforcement Learning has some nuanced difficulties which will need to be handled when scaling a solution like this to a production-ready environment.
What Is Reinforcement Learning?

Reinforcement Learning is a form of Machine Learning which enables an algorithm to make decisions and take actions within a given environment. The model learns to make appropriate decisions through repeated trial and error, aided by some intelligent computational optimisations. It requires three things: an environment (where the decisions take place), an agent (which makes decisions within the environment), and feedback (which either rewards or penalises the agent for each decision). Some practical applications of Reinforcement Learning include playing electronic games (for example, the suite of Atari games) and board games (such as Backgammon and Chess). Reinforcement Learning crossed a major milestone in 2016, when AlphaGo beat a World Champion at the game of Go. This is impressive because the number of possible board positions in Go is often said to exceed the number of atoms in the observable universe. The opportunities for this form of machine learning are practically endless.
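To make the agent/environment/feedback loop concrete, here is a minimal tabular Q-learning sketch on a hypothetical five-state corridor. The corridor, reward scheme, and hyper-parameters are all illustrative assumptions, not drawn from any of the systems discussed in this article:

```python
import random

random.seed(0)

# Environment: a 5-state corridor. The agent starts at state 0 and receives
# a reward of +1 only when it reaches the final state.
N_STATES = 5
ACTIONS = [-1, +1]                 # step left or step right
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2

q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

for episode in range(500):
    state = 0
    while state != N_STATES - 1:
        # Trial and error: explore occasionally, otherwise exploit what is known.
        if random.random() < EPSILON:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: q[(state, a)])
        next_state = min(max(state + action, 0), N_STATES - 1)
        reward = 1.0 if next_state == N_STATES - 1 else 0.0
        # Feedback: nudge the value estimate toward reward + discounted future value.
        best_next = max(q[(next_state, a)] for a in ACTIONS)
        q[(state, action)] += ALPHA * (reward + GAMMA * best_next - q[(state, action)])
        state = next_state

# After training, the greedy policy steps right (+1) from every non-terminal state.
policy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)}
print(policy)
```

The update rule is the standard Q-learning rule: the value of a state-action pair is pulled toward the observed reward plus the discounted value of the best next action, which is exactly the trial-and-error feedback loop described above.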
What is Currently Happening in the Logistics Industry?
Some of the Warehouse Management Systems (WMS) available today include Infor, Manhattan, and SAP. While these systems are excellent at performing warehouse tasks (like putaway, picking, replenishments, etc.), are good at managing inventory, and can be very user-friendly, the systems themselves are not very intelligent. They are heavily reliant on some very complicated configurations, and contain a large number of screens with a large array of settings and options. While this was intended to make the system flexible and adaptable, the end result is that the system is quite complicated and cumbersome.
Necessarily, the operation of these systems places a heavy reliance on application experts within the business who understand the nuances of the system, who can troubleshoot issues, and who know how to set up complicated processes. This is fine, up until the point where those experts leave the business for whatever reason, at which point the company is forced to re-hire highly skilled experts or rely on expensive external consultation.
It is entirely possible to configure the WMS with different strategies and processes, which can then automate different aspects of the warehouse operation. However, these are effectively extremely complex, multi-level if-then processes, and are heavily reliant on strict business rules being implemented. Furthermore, once they are set, these rules and strategies do not change. That is fine if they don't need to change; however, businesses continually need to refine their operations to increase efficiency and decrease operational costs. Having these complicated automation strategies in place can therefore cause down-stream inefficiencies and issues, resulting from the decaying optimisation of the original settings. These issues are exacerbated when the system experts have departed the business and the configured warehouse strategies can no longer be troubleshot.
How Can Reinforcement Learning Help?
Just like a pixel screen in an Atari game, or a checkerboard in a game of Chess, a warehouse can be conceptualised as a pixelated environment which can then be interpreted by a Reinforcement Learning algorithm. The algorithm would be intended to place the inventory within the warehouse in the most efficient manner, so as to reduce the amount of movement and re-work necessary for each stock item. As a result, the business would not need to rely on complicated and inflexible automation strategies; instead, the Reinforcement Learning algorithm would learn the most efficient strategy and would continually optimise as time goes on. The system experts could then focus on more important business optimisation tasks, and less time would be spent on operational re-work.
Is There A Specific Example?
Yes. First, let’s create our own virtual warehouse. Refer to the below example of a virtual warehouse.

This warehouse has the following attributes:
- Warehouse area of 20x10 squares
- 1 Put Desk
- 1 Pick Desk
- 42 Rack locations (set up in three aisles of 14, with a gap in the middle)
- Three different Stock Keeping Units (SKUs): SKU_A, SKU_B, and SKU_C
- Each location can only fit one SKU
- The shelves are in fixed positions, and the employees move around the warehouse
- There is no zone segmentation (i.e. any SKU can be placed in any location)
Warehouse actions:
- Goods are received into the Put Desk
- From the Put Desk, goods are put away to the Rack
- From the Racks, goods are picked and placed on the Pick Desk
- Goods are shipped from the Pick Desk
Now, having defined this, we have our environment which the Reinforcement Learning algorithm can use to operate within. The next step is to define the actions that can be made and the associated rewards.
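As a sketch of what that environment might look like in code, the 20x10 floor can be held as a simple grid of cell codes. The cell codes and the specific desk and rack coordinates below are illustrative assumptions; only the overall dimensions and the count of 42 rack locations come from the definition above:

```python
# Cell codes (arbitrary): 0 = open floor, 1 = rack location, 2 = Put Desk, 3 = Pick Desk.
WIDTH, HEIGHT = 20, 10
grid = [[0] * WIDTH for _ in range(HEIGHT)]

# Three aisles of 14 rack locations each, with a gap in the middle of each aisle.
for row in (2, 4, 6):
    for col in list(range(3, 10)) + list(range(11, 18)):
        grid[row][col] = 1

grid[0][0] = 2       # Put Desk (illustrative position)
grid[9][19] = 3      # Pick Desk (illustrative position)

rack_count = sum(cell == 1 for row in grid for cell in row)
print(rack_count)    # should match the 42 rack locations defined above
```

This flat grid is exactly the kind of "pixelated" state representation a Reinforcement Learning algorithm can consume, just as it would consume the pixel screen of an Atari game.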
Let’s define the Agents. There are two independent Agents:
Putaway Agent:
- Can put away to any location.
- Can only carry one SKU at a time.
Pick Agent:
- Can only pick from the closest location (i.e. will not skip a closer SKU to pick one from a location that is further away).
- Can only carry one SKU at a time.
Now, let us define the action. Considering that the Pick Agent is constrained, and can’t make any choices of its own, the optimisation decision is in the hands of the Putaway Agent. It can decide to place any SKU in any location. It is this decision that will be optimised.
The penalty used for optimisation is the number of steps performed by the Pick Agent across all of its pick processes. In the given example, for the three SKUs that were picked, the total was 21 steps. Notice that the 'decision' is in the hands of the Putaway Agent, but the 'penalty' is in the hands of the Pick Agent. This means the agents must work together to reduce the overall number of steps and therefore improve the efficiency of the warehouse.
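A sketch of how that step-count penalty might be computed, assuming steps are Manhattan distances on the grid and a round trip per pick. The Pick Desk position and the SKU locations below are illustrative assumptions, so the total here will not match the 21-step worked example above:

```python
def manhattan(a, b):
    """Steps between two (row, col) grid cells, moving one square at a time."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

PICK_DESK = (9, 19)   # illustrative position

def pick_penalty(sku_locations, orders):
    """Total steps the Pick Agent walks: desk -> location -> desk per order line."""
    steps = 0
    for sku in orders:
        loc = sku_locations[sku]
        steps += 2 * manhattan(PICK_DESK, loc)   # round trip
    return steps

# Fast SKUs stored near the desk cost fewer steps than slow SKUs stored far away.
locations = {"SKU_A": (8, 17), "SKU_B": (8, 15), "SKU_C": (2, 3)}
print(pick_penalty(locations, ["SKU_A", "SKU_B", "SKU_C"]))
```

Because the Putaway Agent chooses `locations` but the Pick Agent incurs the steps, minimising this function is exactly the cooperative objective described above.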
However, there are some unknowns that the algorithm must learn, the primary one being the 'speed' of the SKUs. Speed is generally understood as how often a particular item is shipped. Take, for example, that SKU_A is an incredibly popular item (let's say, an iPhone), SKU_B is also very popular (let's say, a Galaxy phone), but SKU_C is not as popular (let's say, a Nokia phone). People will still buy SKU_C, but not as often as SKU_A and SKU_B. Therefore, we will say that SKU_A and SKU_B are 'faster' than SKU_C.
As a result, the Reinforcement Learning algorithm will need to learn the speed of these SKUs over time, so that it can place the fast SKUs at the front and the slow SKUs at the back, keeping the net number of steps taken by the Pick Agent at a minimum.
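One simple way the algorithm could learn SKU speed online is an exponential moving average over the pick stream: every pick nudges the picked SKU's estimated rate up and everyone else's down. The decay rate and the simulated pick stream below are illustrative assumptions:

```python
DECAY = 0.1   # illustrative learning rate for the moving average

def update_speeds(speeds, picked_sku, all_skus):
    """Exponential moving average of how often each SKU appears in the pick stream."""
    for sku in all_skus:
        observed = 1.0 if sku == picked_sku else 0.0
        speeds[sku] = (1 - DECAY) * speeds[sku] + DECAY * observed
    return speeds

skus = ["SKU_A", "SKU_B", "SKU_C"]
speeds = {s: 0.0 for s in skus}

# Simulated demand: SKU_A and SKU_B ship often; SKU_C rarely.
pick_stream = ["SKU_A", "SKU_B", "SKU_A", "SKU_B", "SKU_C", "SKU_A", "SKU_B", "SKU_A"]
for picked in pick_stream:
    update_speeds(speeds, picked, skus)

# Rank fastest-first, so the fastest SKUs can be assigned the closest racks.
ranked = sorted(skus, key=speeds.get, reverse=True)
print(ranked)
```

The ranking, not the raw rates, is what matters: the Putaway Agent would place the highest-ranked SKUs in the locations closest to the Pick Desk.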
Can This Be Implemented Right Now?
While conceptually this model appears to be logical and efficient, in reality the challenge is not that simple. There are a number of nuances within a warehouse which make this solution difficult. These include:
- Warehouses are 3D, with hundreds or even thousands of possible locations;
- There could be thousands or even hundreds of thousands of SKUs in a warehouse;
- Each location can house multiple SKUs;
- The SKUs can be of different dimensions;
- The SKUs can have different attributes (e.g. dangerous goods, differing weights, restricted substances, sensitivity to temperature, etc.);
- There could be different zoning configurations, or walls, within warehouses which would need to be navigated;
- Sometimes the SKUs need to be moved between locations before they are Picked;
- There is material-handling equipment (e.g. forklifts), and there are people in amongst the shelving, which can prove to be dangerous.
Therefore, for a Reinforcement Learning algorithm to be production-ready, and thereby profitable for a Logistics company, it will need to handle these sorts of caveats. Conceptually, including these caveats in a Reinforcement Learning algorithm would blow out the complexity by hundreds or even thousands of times. By equivalence, the algorithm that won the game of Go (AlphaGo, and its successor AlphaGo Zero) would need to be hundreds of thousands of times more capable than it currently is to handle a standard Warehouse in operation today. This is the curse of dimensionality in full effect.
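The blow-out can be illustrated with a back-of-envelope count: the number of ways to assign k distinct SKUs to n single-SKU locations is the falling factorial n * (n-1) * ... * (n-k+1). The "realistic" warehouse figures below are illustrative assumptions, and this counts placements only, ignoring every other caveat in the list above:

```python
import math

def placements(n_locations, n_skus):
    """Ways to assign n_skus distinct SKUs to n_locations single-SKU slots."""
    return math.perm(n_locations, n_skus)

toy = placements(42, 3)                 # the 20x10 virtual warehouse above
realistic = placements(10_000, 1_000)   # a modestly sized real warehouse (assumed figures)

print(f"toy warehouse: {toy:,} placements")
print(f"realistic warehouse: roughly 10^{len(str(realistic)) - 1} placements")
```

Even this crude count jumps from tens of thousands of states to a number with thousands of digits, which is why naively scaling the toy model does not work.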

Are There Any Examples of This In Operation Today?
Yes. There are examples both from within business and within academia of some great work in this area already.
Within Business
Companies such as Amazon have dramatically changed the way warehouses are set up today by creating a Goods-To-People model, instead of the traditional People-To-Goods model (which is used in the above example). Amazon Robotics uses much smaller shelves to stock goods, and little robots which zip around the warehouse picking up and putting down the shelves as necessary.

It is based on the premise of 'organised chaos', as shown in these two video clips: The Robots Making Amazon Even Faster and A Day in the Life of a Kiva Robot. Basically, the system works by allowing any item to be placed on any shelf, and any shelf to have any configuration. The system in the background is then optimised to pick the shelf holding the largest number of goods needed by the workers, and 'fetches' that shelf to bring the goods to the person.
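The shelf-selection idea just described can be sketched as a greedy choice of the shelf covering the most currently-ordered items. The shelf contents and orders below are illustrative assumptions, not Amazon's actual algorithm:

```python
from collections import Counter

def best_shelf(shelves, open_orders):
    """Greedy pick: the shelf whose contents cover the most units on order."""
    demand = Counter(open_orders)
    def coverage(shelf_skus):
        held = Counter(shelf_skus)
        return sum(min(count, demand[sku]) for sku, count in held.items())
    return max(shelves, key=lambda name: coverage(shelves[name]))

# Hypothetical shelf contents and open order lines.
shelves = {
    "shelf_1": ["book", "mug"],
    "shelf_2": ["book", "phone", "mug"],
    "shelf_3": ["cable"],
}
orders = ["book", "mug", "phone"]
print(best_shelf(shelves, orders))
```

A real system would also weigh travel distance and robot congestion, but the core decision, rank shelves by how much open demand they satisfy, is the same.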
Really, it’s quite intelligent. However, it is based on two very key assumptions:
- The Pick quantities will only ever be very small (eg. 1 or 2 items needed to be picked from each shelf); and
- The physical warehouse is able to be completely re-designed and all fixed shelves replaced with movable shelves.
The limitation of the first assumption means that you cannot cater to bulk-buying customers. Think of a Hospital buying two pallets of face-masks, versus a single consumer buying just one book. They are two different types of customers, and the Amazon solution can only cater for one type. Furthermore, the limitation of the second assumption means that you cannot have a split semi-automated warehouse; it must be entirely goods-to-person, or entirely person-to-goods.
Due to this, there are very few warehouses in the world that use this type of automation. It has not become widespread, and may not for some time yet.
Within Academia
The overlap between Reinforcement Learning and Warehousing Operations is still an ongoing area of research. There is a strong nexus with the research around decision support systems, and as such there is a focus on product allocation methods and order-pick routing. Three such examples include:
- A Reinforcement Learning Approach for a Decision Support System for Logistics Networks (Rabe & Dross, 2015)
- Reinforcement Learning Approach to Product Allocation (Andra, 2010)
- How to Apply Reinforcement Learning to Order-Pick Routing in Warehouses (Rutten, n.d.)
The appetite for Reinforcement Learning applications within a business environment is increasing, and therefore the momentum of research into this area will also increase. Over the coming years, there will be more innovation around this topic, and around how to address the curse of dimensionality.
Conclusion
Reinforcement Learning is one area of Machine Learning and Artificial Intelligence that will one day prove to be very beneficial and profitable to businesses. The history of Reinforcement Learning is very promising, especially considering its wins in board games (such as Chess and Go) and in electronic gaming (such as Atari). When applied to the Warehousing and Logistics industries, the possibilities are endless, particularly considering some of the nuances and complexities currently faced by operators of current Warehouse Management Systems.
A Reinforcement Learning algorithm can prove to be beneficial in this area by automatically scaling and optimising to handle business needs, and the business would not need to rely on complicated and cumbersome automation strategies.
However, in order to achieve an operational-ready product, there are a number of aspects of warehouse management which will need to be addressed. These complexities are currently prohibiting the immediate adoption of Reinforcement Learning algorithms in the industry… but not for long!