How Machines Learn

In recent years it has become apparent that one of the most important components for the future of artificial intelligence (AI) is the ability of a machine to learn from its interactions with the surrounding world. There are only so many behavioural patterns you can directly program into a computer, so it is essential that such behaviours be discovered automatically based on the computer’s environment. Indeed, learning is a fundamental part of human intelligence, and there’s no reason to believe that it is not equally important for the creation of AI agents.

In order to understand the current trends in AI, it is vital that we ask the following question: what does it really mean for a machine to learn about the world?

Our conception of learning is inextricably tied to human learning. A person can learn how to play piano, or learn the capital of a country; they can learn by perfecting skills or behaviours, or by acquiring new knowledge. While discovering the precise mechanisms of how we learn is an open scientific problem, we have an innate understanding of what learning is, at least in the context of living creatures.

Yet much of our intuition fails when we move into the domain of machines that learn. Many skills that are easy for humans to learn pose an almost insurmountable challenge to even the most sophisticated modern AIs. For example, humans are able to explore an environment they have never previously seen, while most AIs would thrash about hopelessly. Observing the hype surrounding AI, one could easily assume that learning machines are equal parts science and magic, and much too complicated for the average person. This isn’t true! Many principles of machine learning are straightforward and can be illustrated with a few well-chosen examples. It is important for everyone to understand these principles so we can make informed decisions about our future, given that AI will almost certainly be relevant to the fate of our species.

When we speak of learning in machines, we are referring to the process by which a machine automatically discovers functions. A function is simply a way of mapping some inputs you receive to a set of outputs. For example, a teacher could define a simple function for computing the grade of their students by taking the average of their marks on each test. We often use such simple functions in everyday life without thinking twice.
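The teacher's grading function described above can be written out in a few lines. This is only a sketch: the name final_grade and the sample marks are made up for illustration.

```python
def final_grade(test_marks):
    """Map a list of test marks (the inputs) to a single grade (the output)."""
    if not test_marks:
        raise ValueError("need at least one test mark")
    return sum(test_marks) / len(test_marks)


# Hypothetical marks for one student on three tests.
print(final_grade([80, 90, 70]))  # 80.0
```

Nothing is learned here: a human wrote the rule by hand, which is exactly what becomes impractical for harder tasks.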

However, there are many tasks for which the underlying functions are extremely complicated and cannot be specified in advance. For instance, trying to specify a function by hand that takes as input a list of pixels from an image, and outputs the name of the object in the image, is futile. Instead, we often have example inputs and outputs for our desired function; for object recognition, we have large datasets of images labelled with the primary object in the image. We would like to be able to discover the functions automatically using only these input and output examples. This is the paradigm of supervised learning, and it is one of the most important (although not the only) domains of machine learning.[ii]

When we speak of a machine that is learning, we usually mean that the machine is executing some process in order to find a function that correctly predicts outputs given some inputs. What is lacking from this description, however, is an explanation of how this is done. What exactly is this process that the machine is using to find suitable functions?

Picture a large machine, standing vertically against the wall of a room, flat except for the protrusions of thousands of circular knobs. On the left is a terminal where you can provide the electronic inputs to the machine, and on the right is a terminal where you observe the outputs. The knobs are called parameters, and depending on their configuration the machine will produce different outputs given the inputs. In other words, the machine computes a different function for each setting of the knobs. While there are many processes for a machine to discover the function you want, most of them rely on the same fundamental components, and we detail here a process for learning such a knob-machine that is applicable to a wide range of problems.[iii] Of course, in real life the machine and the knobs exist virtually inside a computer.

One can imagine a learning machine as a physical computer with many knobs that are tuned automatically. Each setting of the knobs corresponds to a different function that maps inputs to outputs. This particular image is a depiction of the Enigma machine. (source: quintiq.com)

First, we need a way of associating a certain setting of knobs on the machine to a particular function. This is called the parameterization of the model, or the way the parameters impact the output of the model. For example, a neural network is a particular way of assigning parameters to candidate functions. The model parameterization is very important, as we would like (at least) one setting of the knobs on the machine to correspond to the function we want to learn.
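To make the idea of a parameterization concrete, here is a minimal sketch of a knob-machine. It assumes, purely for illustration, that the machine multiplies each input by one knob and adds up the results; real models such as neural networks use richer parameterizations. The name machine and the numbers are hypothetical.

```python
def machine(knobs, inputs):
    """Compute the machine's output for one setting of the knobs.

    Assumed parameterization: a weighted sum, with one knob per input.
    """
    if len(knobs) != len(inputs):
        raise ValueError("this sketch assumes one knob per input")
    return sum(k * x for k, x in zip(knobs, inputs))


inputs = [1.0, 2.0, 3.0]

# Same inputs, two different knob settings, two different functions.
print(machine([1.0, 0.0, 0.0], inputs))  # 1.0
print(machine([0.5, 0.5, 1.0], inputs))  # 4.5
```

Turning the knobs changes which function the machine computes, while the code of the machine itself stays the same.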

Next, we need a way to determine how well we are doing at setting the knobs of the machine. In other words, we need to know how well our current function is predicting the outputs from our data. If we don’t know how well we are doing, it’s impossible to tell when we should stop tweaking the knobs. The measure we use for this is called a loss function, or simply the loss. There are many examples of loss functions in machine learning, and they depend on the task being performed. The loss function tells us how large our mistakes are, so of course we would like to minimize it.
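As one example of a loss function, the sketch below computes the average absolute difference between predictions and the true outputs. This is just one common choice among many, and the numbers are invented for illustration.

```python
def mean_absolute_error(predictions, targets):
    """Average size of the mistakes: 0 means every prediction is exact."""
    if not targets or len(predictions) != len(targets):
        raise ValueError("need equally many predictions and targets")
    total = sum(abs(p - t) for p, t in zip(predictions, targets))
    return total / len(targets)


# Hypothetical predictions compared with the true values.
# The individual errors are 2, 0 and 5, so the loss is 7 / 3.
print(mean_absolute_error([70, 80, 90], [72, 80, 85]))
```

A lower value means the current knob setting is doing better on the data.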

Finally, we need a way to decide how to tweak the knobs of the machine in order to improve our function and minimize our loss. This is often called an optimization algorithm. An algorithm is just a technical name for a series of instructions to reach a final product, like a recipe, so an optimization algorithm is simply a recipe for optimizing the position of the knobs on your machine. The optimization algorithm asks: given the current input to my machine, how can I best tweak the knobs so that the output of my machine is closer to the real output from my dataset? A notable example is gradient descent, which is used together with the backpropagation algorithm to train neural networks.
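One widely used optimization recipe is gradient descent: measure which way the loss slopes at the current knob position, then move the knob a little downhill. The sketch below applies it to a single knob and a made-up loss whose minimum sits at 3. It estimates the slope numerically by nudging the knob, which is a simplification; neural networks obtain the same information far more efficiently through backpropagation. All names and constants here are assumptions chosen for illustration.

```python
def numerical_gradient(loss, knob, eps=1e-6):
    """Estimate how the loss changes when the knob is nudged slightly."""
    return (loss(knob + eps) - loss(knob - eps)) / (2 * eps)


def gradient_descent(loss, knob, learning_rate=0.1, steps=100):
    """Repeatedly move the knob a small amount in the downhill direction."""
    for _ in range(steps):
        knob -= learning_rate * numerical_gradient(loss, knob)
    return knob


def toy_loss(knob):
    """Hypothetical loss, smallest when the knob is exactly 3."""
    return (knob - 3.0) ** 2


best = gradient_descent(toy_loss, knob=0.0)
print(round(best, 3))  # 3.0
```

The learning rate controls how far the knob moves at each step: too small and learning is slow, too large and the knob can overshoot the minimum.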

So a machine with a particular parameterization learns about the world by running an optimization algorithm to minimize a loss function. While each of these components is specified in advance by some human designer, once the optimization algorithm begins, the machine works completely autonomously. Indeed, the process of a machine that is learning is simply the execution of the optimization algorithm. Together, these three components are often referred to as the learning algorithm: they govern the entirety of the behaviour of the machine.

To better understand the learning process in a machine, let’s consider a simplified example: a machine with a single knob. Our data consists of pairs of numbers: the grade of each student in a math class, and their grade in a physics class. Our goal is to discover a function that can predict the physics grade of a student (the output) given their grade in math (the input). We will assume that this relationship can be predicted using a simple line, and that the slope of this line is controlled by the machine’s knob. This is our model parameterization, which must always be specified in advance.[iv]

The data: each point is a student, with their math grade on the horizontal axis and their physics grade on the vertical axis.

Next, we will take as our loss function the average distance between our prediction line and the data points – the further our line is from the points, the worse our function is doing. Finally, we will use a simple optimization algorithm that moves the knob (i.e. changes the slope of the line) slightly in one direction, and continues in that direction if the loss function gets smaller. The algorithm will switch directions if the loss function becomes bigger. This is a simplified form of linear regression, a basic technique in machine learning.
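The three ingredients just described can be sketched in code. The grades below are invented for illustration, the line is assumed to pass through the origin (so the slope is the only knob), and "distance" is taken to mean the vertical gap between a point and the line. The function names and the step size of 0.01 are likewise assumptions.

```python
# Hypothetical (math grade, physics grade) pairs, invented for illustration.
DATA = [(60, 63), (70, 71), (80, 84), (90, 92)]


def predict(slope, math_grade):
    """Model parameterization: a line through the origin with one knob."""
    return slope * math_grade


def loss(slope, data):
    """Average vertical distance between the prediction line and the points."""
    return sum(abs(predict(slope, m) - p) for m, p in data) / len(data)


def step(slope, direction, step_size, data):
    """Try moving the knob once; keep the move only if the loss shrinks."""
    candidate = slope + direction * step_size
    if loss(candidate, data) < loss(slope, data):
        return candidate, direction  # improvement: keep going this way
    return slope, -direction         # loss grew: stay put and turn around


# Start with a slope that is too steep and first try increasing it.
slope, direction = step(1.5, +1, 0.01, DATA)
print(slope, direction)  # 1.5 -1
```

Because the first move made the loss worse, the knob stays where it was and the direction flips, so the next step will try a smaller slope.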

A step-by-step guide to our learning algorithm.

Now that we have defined our learning algorithm in its entirety, let’s see what happens when we apply it to this dataset. First, we will randomly set the value of our slope; let’s say it is too steep to fit the data, like in the figure below.

We have set the initial slope of our blue prediction line randomly, and it is too steep to fit our data. Our loss represents how far each point is from the prediction line, as illustrated in green.

We can now let our optimization algorithm do the work. If the algorithm randomly decides to increase the slope, it will see that our prediction line is further away from the data points we are trying to predict. In other words, our loss went up.

After moving our prediction line in the wrong direction, the average distance from our line to the points (i.e. our loss) goes up.

Recognizing that our loss went up, the optimization algorithm changes directions and decreases the slope. It will continue to decrease the slope until decreasing it any further will increase the loss again. At this point we can stop our algorithm, as it has converged on the final solution.
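The whole run, from a slope that is too steep down to convergence, can be sketched as a short loop. As before, the data are invented, the line is assumed to pass through the origin, and the stopping rule (stop once neither direction lowers the loss) is one reasonable reading of the description above rather than the only possible one.

```python
# Hypothetical (math grade, physics grade) pairs, invented for illustration.
DATA = [(60, 63), (70, 71), (80, 84), (90, 92)]


def loss(slope, data):
    """Average vertical distance between the line y = slope * x and the points."""
    return sum(abs(slope * m - p) for m, p in data) / len(data)


def fit_slope(data, slope=1.5, step_size=0.01, max_steps=10_000):
    """Turn the single knob until no small move in either direction helps."""
    direction = +1  # the first guess happens to be "increase the slope"
    failed_tries = 0
    for _ in range(max_steps):
        candidate = slope + direction * step_size
        if loss(candidate, data) < loss(slope, data):
            slope = candidate          # loss went down: keep the move
            failed_tries = 0
        else:
            direction = -direction     # loss went up: switch directions
            failed_tries += 1
            if failed_tries == 2:      # both directions are worse: converged
                break
    return slope


slope = fit_slope(DATA)
print(round(slope, 2), round(loss(slope, DATA), 2))  # 1.03 1.15
```

With these made-up grades the search settles at a slope of about 1.03, and the remaining loss is above zero because no single line passes through every point.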

Convergence of the algorithm: we have found the best value of our parameter to explain the data. Note that the loss is above zero, since the points aren’t perfectly predicted.

Voila! Our machine learning algorithm has discovered a relationship between grades in math and in physics – in particular, it has discovered the following simple function: if you take a student’s grade in math, and multiply it by 1.03, you will approximately calculate their grade in physics. This relationship won’t hold exactly for all students, since there are many factors that we aren’t taking into account (a student’s work ethic, grades in other classes, and random variations), but it is the best relationship we can find given the data we have and the parameterization of our model.[v] To make sure this relationship holds in general, we will have to test it on data examples that our algorithm didn’t use in learning,[vi] but from examining the data we can be pretty confident that the learned relationship makes sense.
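Checking the learned relationship on unseen data might look like the sketch below. The slope of 1.03 is the value quoted above; the three held-out students are invented for illustration, and average_error is a hypothetical helper that reuses the same average-distance measure as the loss.

```python
LEARNED_SLOPE = 1.03  # the value found by the learning algorithm in the text

# Hypothetical students that were NOT used during learning.
HELD_OUT = [(65, 66), (75, 79), (85, 86)]


def average_error(slope, data):
    """Average gap between predicted and actual physics grades."""
    return sum(abs(slope * m - p) for m, p in data) / len(data)


test_error = average_error(LEARNED_SLOPE, HELD_OUT)
print(f"average error on unseen students: {test_error:.2f} grade points")
```

If the error on unseen students is similar to the error on the students used for learning, that is evidence the relationship holds in general and was not just memorized from the original data.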

Of course, this is an extremely simple example where machine learning is not really useful: you could easily solve for the parameter using a direct mathematical formula. The power of machine learning comes when we have thousands of inputs and outputs related by complex non-linear functions, with millions of data points. Machine learning algorithms can reveal patterns in large datasets that are undetectable by humans, and can perform human tasks, such as object recognition or spam filtering, at a scale beyond what was previously possible. Yet the simple learning algorithm presented above has the same fundamental ingredients as many of the state-of-the-art methods built by AI experts today.
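As an illustration of such a direct formula, the sketch below assumes a slightly different loss from the one used above: the average squared distance rather than the average distance. Under that assumption, and for a line through the origin, the best slope is the sum of (math grade × physics grade) divided by the sum of (math grade squared). The data are invented for illustration.

```python
# Hypothetical (math grade, physics grade) pairs, invented for illustration.
DATA = [(60, 63), (70, 71), (80, 84), (90, 92)]


def least_squares_slope(data):
    """Best slope for y = slope * x under a squared-distance loss (assumed)."""
    numerator = sum(m * p for m, p in data)
    denominator = sum(m * m for m, _ in data)
    if denominator == 0:
        raise ValueError("all inputs are zero, so the slope is undetermined")
    return numerator / denominator


slope = least_squares_slope(DATA)
print(round(slope, 3))  # 1.033
```

No knob-turning is needed here: a single calculation gives the answer, which is why the step-by-step search only becomes worthwhile for models too complex for a formula like this.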

Of course there are exceptions. AI isn’t just about making predictions about the world – it also requires taking actions in the world, and learning from your surroundings even when no labels are available. These domains of reinforcement learning and unsupervised learning, respectively, are crucial to building an artificial intelligence that can adapt to the world around it, and begin to learn like a human. But irrespective of the domain or the specific techniques involved, the main idea is clear: the basics of machine learning can be understood by anybody.

[i] Note that when we speak of machines, we are referring primarily to computers rather than robots. One can think of the learning part of a machine as the ‘brain’, which we can consider independently from the ‘body’ that it is placed in. Thus, we will mostly refer to ‘machines’, ‘computers’ and ‘AIs’ interchangeably.

[ii] Other domains of machine learning include unsupervised learning and reinforcement learning, which we will cover in future articles.

[iii] Some supervised machine learning models, such as non-parametric models, do not operate under exactly the framework described here.

[iv] Although there are some models whose capacity can grow while they are trained.

[v] Optimization algorithms are not in general guaranteed to converge to the ‘best solution’, but this is the case for our simple example.

[vi] This principle, called generalization, is crucial in machine learning. If you have a model parameterization that is too powerful for the relationship you are trying to discover, your algorithm can learn to fit too closely to the data points, as it ignores the effect of unobserved variables and random noise. This is called overfitting, and will be discussed in more detail in future articles.

Author: RYAN LOWE. Ryan Lowe is a PhD student studying machine learning at McGill University.
