Painting Agents

Published December 19, 2018

This project uses reinforcement learning to train an agent to paint a picture. I wanted to see a particular effect – namely having a number of agents working autonomously (i.e. controlled by separate ‘brains’) and cooperatively to complete a task.

What is Reinforcement Learning?

Reinforcement learning (RL) is a method of training a machine learning model. It uses a reward structure, rather than a set of known (labelled) data, to learn how to complete a task. If the model performs as desired, it is given a reward (‘good model!’). If not, it is reprimanded (‘bad model!’). Because RL allows training without a dataset, it is often used by roboticists to train real-world robots (or simulations of real-world robots) for complex tasks for which no datasets exist.

For me, working with RL in a physics environment is particularly fun because the agents’ actions appear as behaviors. Once an agent has been given senses (inputs) and desires (a reward structure), it seems to take on a simple personality. It becomes possible to recognize when, for instance, an agent’s risk-averse tendencies limit its ability to complete a task. I can then adjust the reward structure accordingly, until I have an agent with a no-rules attitude, a denim jacket and a set of aviators.


This project uses the ML-Agents framework recently released in beta by Unity. I chose it because I feel comfortable working in Unity and the framework is well supported and documented. It uses TensorFlow to train a model using PPO (Proximal Policy Optimization). The important thing for my project was that ML-Agents abstracted away many of the PPO configuration parameters (timeHorizon, hiddenUnits, lambda, etc.), which I had neither the time nor the energy to learn, and allowed me to focus on the reward structure.


First (that is to say, after many, many false starts…) I created a simple environment in which the agents could learn. Within this 3D environment is a grid of X by X white squares – the canvas on which the agents ‘paint.’ Next to this canvas is a reference image with the same number of squares (pixels, basically), each either black or white.


At each step of the simulation, each agent moves one square up, down, left or right. As the agents move around the canvas, any square on which they land turns black.
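The movement rule can be sketched in a few lines of Python. This is an illustration only, not the project's Unity code: the names `make_canvas`, `step` and `MOVES` are hypothetical, and two details are assumptions the post does not state, namely that a move off the edge of the canvas leaves the agent where it is and that the starting square is not painted.

```python
# Illustrative sketch; all names here are hypothetical, not from the Unity project.
WHITE, BLACK = 0, 1

# Each action shifts the agent by one square (row delta, column delta).
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def make_canvas(size):
    """Return a size-by-size canvas of white squares."""
    return [[WHITE] * size for _ in range(size)]


def step(canvas, pos, action):
    """Move the agent one square and paint the square it lands on black.

    Assumption: a move that would leave the canvas is ignored and the
    agent stays where it is.
    """
    size = len(canvas)
    dr, dc = MOVES[action]
    r, c = pos[0] + dr, pos[1] + dc
    if not (0 <= r < size and 0 <= c < size):
        return pos  # blocked by the edge of the canvas
    canvas[r][c] = BLACK
    return (r, c)


canvas = make_canvas(4)
pos = (0, 0)
for action in ["right", "right", "down"]:
    pos = step(canvas, pos, action)
```

After these three moves the agent sits at row 1, column 2, and the three squares it landed on are black.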

I tried two sets of agent sensors. First I used a set of overhead cameras, one each for the ‘canvas’ and the reference image. I felt that this “God’s eye view” of the system as a whole might allow the agent to learn the desired behavior (paint the reference image on the canvas) with the least amount of my meddling in its affairs (curating which data the agent sees). There was also something elegant about letting reinforcement learning run its course and seeing what played out. This was the ‘throw everything at the wall and see what sticks’ approach.

At a certain point, I began to realize that some data curation might produce better results, be more flexible in terms of allowing different-sized canvases and more closely replicate what a real-world robot with sensors might ‘see.’ So, I gave the agent four sensors which detected whether the squares immediately above, below, to the left and to the right of it were black or white.
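A rough Python sketch of such a four-sensor observation follows. It is not the project's Unity code: `sense_neighbors` and `OFF_CANVAS` are hypothetical names, and reporting a separate value for squares beyond the edge is an assumption. The post also does not say which grid the sensors read, so the function accepts any grid.

```python
# Illustrative sketch; all names here are hypothetical, not from the Unity project.
WHITE, BLACK = 0, 1
OFF_CANVAS = -1  # assumption: a reading for neighbors beyond the edge


def sense_neighbors(grid, pos):
    """Return the colors of the squares above, below, left and right of pos."""
    size = len(grid)
    r, c = pos
    readings = []
    for dr, dc in [(-1, 0), (1, 0), (0, -1), (0, 1)]:  # up, down, left, right
        nr, nc = r + dr, c + dc
        if 0 <= nr < size and 0 <= nc < size:
            readings.append(grid[nr][nc])
        else:
            readings.append(OFF_CANVAS)
    return readings


grid = [
    [WHITE, BLACK, WHITE],
    [WHITE, WHITE, BLACK],
    [WHITE, WHITE, WHITE],
]
observation = sense_neighbors(grid, (1, 1))
```

Because the observation is always four values regardless of grid size, the same trained agent could in principle be dropped onto a larger canvas, which is the flexibility mentioned above.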

Reward structure:

I tried many different variations of the reward structure before settling on the simplest:

  • if the agent does not move, give a small negative reward (-0.5)

  • if the agent moves to a black square, give no reward

  • if the agent moves to a white square which should be black (according to reference image), give a positive reward (+1.0)

  • if the agent moves to a white square which should be white (according to reference image), give a negative reward (-1.0)
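Written out as a function, the four rules above look roughly like this. It is a Python sketch rather than the project's Unity code: the function name and its arguments are hypothetical, and it assumes the canvas color is read before the landing square is painted.

```python
# Illustrative sketch; names are hypothetical, not from the Unity project.
WHITE, BLACK = 0, 1


def reward(moved, canvas_color, reference_color):
    """Reward for one step.

    canvas_color is the color of the landing square *before* it is painted;
    reference_color is the color of the same square in the reference image.
    """
    if not moved:
        return -0.5  # standing still is mildly discouraged
    if canvas_color == BLACK:
        return 0.0   # already painted: nothing gained, nothing lost
    if reference_color == BLACK:
        return 1.0   # painted a square that should be black
    return -1.0      # painted a square that should have stayed white
```

For example, `reward(True, WHITE, BLACK)` gives 1.0, while `reward(True, WHITE, WHITE)` gives -1.0.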


Each of the behaviors below was the result of a different reward structure, a different agent sensor setup or the occasional error in my code. The interesting thing to note is that each of these agents has a recognizable behavior and can be debugged by observing and trying to understand that behavior.

This agent is shy, and likes to spend time in the corner. I added a negative reward for staying in the same spot after seeing this behavior:

This agent tends to avoid negative rewards by staying on those squares it has already visited. I briefly added a negative reward for re-visiting squares after this:

This agent goes to the top right corner. This was the result of an error in my code:

This agent seems to be replicating the desired image:

Let’s give it some friends and a somewhat more difficult picture. Note that this gif is sped up (more in the ‘Next Steps’ section):

Next Steps:

I only realized after I added multiple agents and scaled up the canvas size that this was built in an extremely inefficient way. So the first step to expanding this project is to rebuild it in a more performant manner.

That done, I would like to…

  • allow greyscale (not just black and white),
  • include red, green and blue agents to allow painting of full color images.