Search N Rescue (Reinforcement Learning in Unity)
Skills Used: Deep Learning, Reinforcement Learning, Unity, Unity Reinforcement Learning Framework, PPO, C#
Problem/Environment
The goal of this project was to train one or more agents to find one or more targets. Each agent can move forward, move backward, and rotate in either direction. The agent and target are contained on a square platform with walls on each side. At the beginning of each training episode the target and the agent are placed randomly so the agent cannot learn a fixed route. I made the task increasingly complex with the following scenarios (a minimal agent sketch follows the list):
Single Agent, Single Target (SAST)
Multiple Agent, Single Target (MAST)
Adversarial Agent vs Seeker Agent (AvS)
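For reference, the sketch below shows how the episode reset and movement actions map onto the Agent class in Unity's ML-Agents package. It is a minimal, hypothetical version of the search agent, assuming a recent ML-Agents release (the ActionBuffers API); the field names, speeds, and platform bounds are illustrative rather than the exact values used in the project.

```csharp
using Unity.MLAgents;
using Unity.MLAgents.Actuators;
using UnityEngine;

public class SearchAgent : Agent
{
    public Transform target;        // reference to the target object (assumed scene setup)
    public float moveSpeed = 2f;    // illustrative values
    public float turnSpeed = 120f;

    public override void OnEpisodeBegin()
    {
        // Randomize agent and target positions each episode so no fixed route can be memorized.
        transform.localPosition = RandomPointOnPlatform();
        transform.localRotation = Quaternion.Euler(0f, Random.Range(0f, 360f), 0f);
        target.localPosition = RandomPointOnPlatform();
    }

    public override void OnActionReceived(ActionBuffers actions)
    {
        // Branch 0: 0 = hold, 1 = move forward, 2 = move backward
        // Branch 1: 0 = hold, 1 = rotate left, 2 = rotate right
        int move = actions.DiscreteActions[0];
        int turn = actions.DiscreteActions[1];

        if (move == 1) transform.Translate(Vector3.forward * moveSpeed * Time.deltaTime);
        if (move == 2) transform.Translate(Vector3.back * moveSpeed * Time.deltaTime);
        if (turn == 1) transform.Rotate(Vector3.up, -turnSpeed * Time.deltaTime);
        if (turn == 2) transform.Rotate(Vector3.up, turnSpeed * Time.deltaTime);
    }

    private Vector3 RandomPointOnPlatform()
    {
        // Assumes a square platform roughly 10 x 10 units, centered on the training area's origin.
        return new Vector3(Random.Range(-4.5f, 4.5f), 0.5f, Random.Range(-4.5f, 4.5f));
    }
}
```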
Approach (SAST/MAST)
I used the PPO algorithm because it is the framework's default trainer and it supports the self-play setup needed for the adversarial scenario. I decided to start from an untrained network for each task instead of building on the previous network; training times were relatively short and I wanted to see what unique behaviors emerged in each case. I used 12 training areas in one scene, which let the trainer collect experience from 12 copies of the environment in parallel and cut wall-clock training time roughly proportionally. The general parameters for each task are outlined below:
SAST
Training Time: 2M steps, 5 hr 1 min
Step Penalty: 0.002
Total Reward: 1.0
MAST
Training Time: 2M steps, 5 hr 11 min
Penalties: 0.002 per step, 0.005 per wall collision, 0.01 per agent collision
Total Reward: 1.0
In the MAST scenario I added a penalty for colliding with a wall because, in a real-life deployment, we would not want to risk damaging the hardware. The penalty for colliding with another agent exists for the same reason, but also because two colliding agents are most likely searching the same area, which makes the overall search less efficient. A sketch of this reward shaping is shown below.
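As a rough illustration, the collision penalties above can be applied with ML-Agents' AddReward calls. The tag names and the OnCollisionEnter hookup are assumptions about the scene setup, but the numeric values match the parameters listed above.

```csharp
using Unity.MLAgents;
using Unity.MLAgents.Actuators;
using UnityEngine;

// Reward shaping for the MAST scenario (values from the parameters above).
// Assumes walls, other agents, and the target are tagged "Wall", "Agent", and "Target".
public class MastRewardShaping : Agent
{
    public override void OnActionReceived(ActionBuffers actions)
    {
        AddReward(-0.002f); // per-step penalty; movement handling omitted (see the earlier sketch)
    }

    private void OnCollisionEnter(Collision collision)
    {
        if (collision.collider.CompareTag("Wall"))
        {
            AddReward(-0.005f); // discourage hitting walls (hardware damage in a real deployment)
        }
        else if (collision.collider.CompareTag("Agent"))
        {
            AddReward(-0.01f);  // overlapping agents are likely double-searching the same area
        }
        else if (collision.collider.CompareTag("Target"))
        {
            AddReward(1.0f);    // target found
            EndEpisode();
        }
    }
}
```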
Approach (AvS)
Training Time (steps/time):
Hider - 4.6M steps/15hr 24min
Seeker - 4.8M steps/16hr 37min
Rewards:
Hider: 5.0 for surviving until the maximum number of environment steps
Seeker: 3.0 for finding the target
Penalties:
Hider: 3.0 when the seeker collides with the target
Seeker: 0.001 per step, plus 3.0 for failing to find the target within the allotted time
Choosing rewards and penalties for the adversarial task was a unique challenge. I initially gave the hider a small reward at every time step, but that seemed to make the hider too complacent and did not force it to act. I also increased the hider's penalty for when the seeker finds the target. This task required much more training than the other two because of the complexity of the interactions.
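One possible way to wire up these opposing signals is sketched below. It assumes the hider and seeker are separate Agent behaviours in the same training area and that the area notifies both agents on timeout; the class and method names are illustrative, not the project's actual code.

```csharp
using Unity.MLAgents;
using Unity.MLAgents.Actuators;
using UnityEngine;

// Illustrative reward bookkeeping for the adversarial (AvS) scenario.
public class HiderAgent : Agent
{
    // Called by the training area when the episode times out with the target still hidden.
    public void OnTargetSurvived()
    {
        AddReward(5.0f);   // hider succeeds if the seeker runs out of time
        EndEpisode();
    }

    // Called when the seeker touches the target.
    public void OnTargetFound()
    {
        AddReward(-3.0f);  // hider is penalized when the target is found
        EndEpisode();
    }
}

public class SeekerAgent : Agent
{
    public HiderAgent hider;

    public override void OnActionReceived(ActionBuffers actions)
    {
        AddReward(-0.001f); // small step penalty pushes the seeker to search actively
    }

    private void OnCollisionEnter(Collision collision)
    {
        if (collision.collider.CompareTag("Target"))
        {
            AddReward(3.0f);          // seeker reward for finding the target
            hider.OnTargetFound();
            EndEpisode();
        }
    }

    // Called by the training area when the step limit is reached without finding the target.
    public void OnTimeout()
    {
        AddReward(-3.0f);             // seeker penalty for failing within the time limit
        hider.OnTargetSurvived();
        EndEpisode();
    }
}
```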
Results/Discussion
In the three graphs on the left, orange represents the SAST results and blue represents the MAST results.
The first graph shows average episode length. Because the agents receive a penalty at every step, a shorter episode yields a larger total reward. To some extent the MAST agents learned to split up and search the environment more efficiently than a single agent could, as shown by the lower overall episode lengths they achieved. One interesting detail from the video is that the agents never move backwards. They are always moving forwards, which makes sense: since the agents can only see in front of them, moving forward maximizes the time spent observing new terrain.
The next two graphs show cumulative reward versus training steps. The MAST agents reached a slightly higher average reward than the SAST agent (0.70 vs. 0.65).
The AvS task required much more training and tuning. In the graphs above, red is the hider and blue is the seeker. The seeker had an initial advantage (due to its step penalty), but the hider soon figured out how to push the block. The two traded the advantage back and forth and reached a stalemate for as long as I was willing to train. The hider only ever developed the strategy of moving the block once and did nothing after that initial action. Additional tuning and training could help either the seeker or the hider develop a winning strategy.