
Gradient Descent

December 11, 2016

For a long time I’ve wanted to be able to calculate trend lines for data sets, so I was consistently disappointed in school when professors avoided the actual computations of good parameters and turned the work over to a software package. Recently I had some data needing a trend line, so I took a stab at fitting it myself.

If you were to find a trend line manually by trial and error, you’d tinker with the slope and y-intercept, shifting values up and down while monitoring how well the resulting trend line matched the data. In more mathematical terms, you’d explore the space of all parameters trying to minimize the difference between the trend line and the data.

This process is a natural candidate for automation. We’ll use a technique mathematicians call “gradient descent” and the rest of us would call “running downhill”.

We’ll need three preliminaries: a data set, a function to parameterize, and a way to measure how well a trend line matches the data.

Here’s a made-up data set:


(def data [[1955 200] [1960 277] [1963 250] [1968 302] [1975 388] [1980 412] [1984 451]
           [1990 448] [1996 490] [2000 480] [2008 612] [2011 618] [2013 598] [2016 648]])

(show data)

In this post we’re only going to work with straight trend lines, which have two parameters: a slope and a y-intercept. We’ll create a function that takes these two parameters and returns a linear function of its argument.


(defn linear [intercept slope]
  #(+ intercept (* slope %)))

A common way to measure how well a trend line fits some data is the mean squared error. It takes the difference between each data point and what the trend line predicts it should be, squares each difference, and averages the results. Squaring keeps every error positive and makes the resulting error surface differentiable everywhere, which will be important: gradients are just multi-dimensional derivatives.

Here’s the definition:


(defn mean-squared-error [f data]
  (/ (transduce
       (map (fn [[x y]]
             (js/Math.pow (- (f x) y) 2)))
       + data)
     (count data)))
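As a quick sanity check (my own, not part of the original post), we can ask how badly the dullest possible guess fits: a flat line through zero. Since that line predicts 0 everywhere, the mean squared error is just the mean of the squared y values, somewhere around 214,000 for this data set.


;; a flat line through zero; the error is the mean of the squared y values
(mean-squared-error (linear 0 0) data)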

We want to find the lowest possible values for this function given our data. If you imagine this function as representing a surface over our two-dimensional parameter space, we’re looking for the bowls where lakes would form. At the bottom of such a bowl, nudging the parameters in any direction increases the error.

We’ll pick a starting point and make our way downhill. Rather than feeling our way in the dark by randomly sampling neighboring points, we’re going to calculate the gradient at the point. The gradient is a vector pointing in the steepest direction uphill, so once we have it, we’ll go the opposite direction.

Here’s a definition of gradient that calculates slopes numerically:


(defn gradient [f params]
  (let [params (vec params)
        step 0.001]
    (map-indexed
      (fn [i x]
        (let [p- (apply f (update params i - step))
              p (apply f params)
              p+ (apply f (update params i + step))]
          (/ (+ (/ (- p p-) step)
                (/ (- p+ p) step))
             2)))
      params)))
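Before pointing this at our error surface, it’s worth a quick check (mine, not from the original post) that the numerical gradient behaves on a function whose gradient we know analytically. For f(x, y) = x² + y², the gradient at (1, 2) is (2, 4), and averaging the forward and backward differences should reproduce it almost exactly.


;; should come out very close to (2 4)
(gradient (fn [x y] (+ (* x x) (* y y))) [1 2])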

We want the gradient of the mean squared error between a trend line and our data, taken at a specific point.


(gradient #(mean-squared-error (apply linear %&) data) [0 0])

These numbers are quite big, likely much bigger than our eventual slope and y-intercept. We wouldn’t want to pick the next set of parameters by simply adding this vector to our current parameters. It represents the right direction to go, but not the right distance. (The gradient’s enormous magnitude is a result of our initial guess having a very large error; it has nothing to do with how far we actually want to step in parameter space.)

Let’s normalize the gradient to a unit vector.


(defn normalize [v]
  (let [len (js/Math.sqrt (transduce (map #(js/Math.pow % 2)) + v))]
    (map #(/ % len) v)))

(normalize (gradient #(mean-squared-error (apply linear %&) data) [0 0]))

That’s more intelligible. This vector tells us that the mean squared error increases fastest if we decrease the slope of the line. To go the opposite direction, we’ll negate the vector and pick a new point in that direction.


(defn step [f params data scale]
  (->> params
       (gradient #(mean-squared-error (apply f %&) data))
       (normalize)
       (map -)
       (map #(* scale %))
       (map + params)))

(step linear [0 0] data 1)

The scale parameter lets us control whether we take big steps in parameter space or small steps.
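A side effect of normalizing first (a small check of my own, not from the original post) is that the step length no longer depends on how steep the surface happens to be: every step covers exactly scale units of distance in parameter space.


(let [p0 [0 0]
      p1 (step linear p0 data 0.5)]
  ;; Euclidean distance between the old and new parameters; should be 0.5
  (js/Math.sqrt (transduce (map #(js/Math.pow % 2)) + (map - p1 p0))))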

Let’s use iterate to trace our progress step by step downhill.


(show (map-indexed (fn [i params]
                     [i (js/Math.round (mean-squared-error (apply linear params) data))])
                   (take 15 (iterate #(step linear % data 1) [0 0]))))

It looks like we’re stepping back and forth over a local minimum. Let’s try smaller steps.


(show (map-indexed (fn [i params]
                     [i (js/Math.round (mean-squared-error (apply linear params) data))])
                   (take 50 (iterate #(step linear % data 0.01) [0 0]))))

With smaller steps we’ve descended much further, reducing the error to less than a tenth of what it was above.

Let’s see if we can improve the error at all by making a more intelligent guess for our starting parameters, say, a slope of 5 and a y-intercept of -14,000.


(show (map-indexed (fn [i params]
                     [i (js/Math.round (mean-squared-error (apply linear params) data))])
                   (take 1000 (iterate #(step linear % data 0.01) [-14000 5]))))

There’s good news and bad news. The good news is we’ve reduced the error significantly. The bad news is that we only got such a good result after doing an initial analysis of the data. Ideally, gradient descent would find good parameters regardless of our starting point.

Let’s compare trend lines. Here’s the trend line for a starting guess of 0 slope and 0 y-intercept:


(let [params (last (take 500 (iterate #(step linear % data 0.01) [0 0])))]
  (show data (apply linear params))
  params)

Not much of a trend line. How about the trend line produced by our educated guess?


(let [params (last (take 500 (iterate #(step linear % data 0.01) [-14000 5])))]
  (show data (apply linear params))
  params)

Obviously better, but it’s hard to tell whether this line is really the line of best fit or simply a product of our good guess.

Let’s see if we can figure out why this is happening. Take another look at our initial, normalized gradient.


(normalize (gradient #(mean-squared-error (apply linear %&) data) [0 0]))

The error surface is steepest in the direction of the slope parameter, meaning we’ll get the biggest change in error by changing slope rather than y-intercept. Let’s plot the components of the gradient against each other to see how they change as we walk downhill.


(let [params (take 500 (iterate #(step linear % data 0.01) [0 0]))]
  (show (map (fn [params]
               (normalize (gradient #(mean-squared-error (apply linear %&) data) params)))
             params)))

This seemingly uninteresting graph illustrates that the slope component of the normalized gradient varies between -1 and 1, while the y-intercept component never strays far from 0. At each step, we can make a bigger change to the error by altering the slope rather than the y-intercept.

To give it a visual metaphor: it’s as if we’re walking in a valley with extremely steep sides. Somewhere down the valley is a lake (a local minimum), but so long as we walk strictly downhill, we tend to move closer and closer to the middle of the valley instead of going down its length.

It’s possible that if we got very, very close to the middle of the valley, it would flatten out enough to send us down its length, finally making the large change the y-intercept needs. Perhaps this is what Wikipedia means when it says,

For poorly conditioned convex problems, gradient descent increasingly ‘zigzags’ as the gradients point nearly orthogonally to the shortest direction to a minimum point.

We certainly do have gradients pointing nearly perpendicular to the direction we need to go. In Wikipedia’s words, we have a “poorly conditioned convex problem”. Perhaps we could recondition the problem somehow. Can we improve our results by changing the problem?

The answer turns out to be yes. And we don’t have to change the problem much.

While any non-vertical line in two dimensions can be described solely by a slope and a y-intercept, we can also introduce a third parameter that slides the graph horizontally. Rather than dropping the y-intercept all the way down to -14,000, we can shift a line with a greater y-intercept to the right.

Here’s such a three-parameter version of linear:


(defn linear2 [x-offset intercept slope]
  #(+ intercept (* slope (- % x-offset))))

Let’s see how the error changes from an uneducated starting point:


(show (map-indexed (fn [i params]
                     [i (js/Math.round (mean-squared-error (apply linear2 params) data))])
                   (take 100 (iterate #(step linear2 % data 0.01) [0 0 0]))))

That’s exactly how well we did before when starting from zeros.

Rather than making an educated guess, let’s just make a better guess. When we specify a line using slope/intercept form, we’re simply saying that a line extends at some angle from a point on the y axis (the y-intercept). Adding a parameter that slides the line left or right allows us to describe a line as extending from any point in the x-y plane.
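To make the redundancy concrete, here’s a small equivalence check (my numbers, not from the post): a line anchored at the point (1955, 200) with slope 5 is the same line as one in slope/intercept form with a y-intercept of 200 - (5 × 1955) = -9575.


;; both should evaluate to 225 at x = 1960
[((linear2 1955 200 5) 1960)
 ((linear -9575 5) 1960)]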

Let’s improve our initial guess without doing any analysis of the actual data. We’ll simply state that the line extends from our first data point. We’ll leave our guess of zero slope as it is.


(show (map-indexed (fn [i params]
                     [i (js/Math.round (mean-squared-error (apply linear2 params) data))])
                   (take 1000 (iterate #(step linear2 % data 0.01) [1955 200 0]))))

We’ve managed to reduce the error even further. Let’s see the trend line.


(let [params (last (take 5000 (iterate #(step linear2 % data 1) [1955 200 0])))]
  (show data (apply linear2 params))
  params)

It’s hard to feel particularly victorious after all this. We’ve avoided having to do analysis of the actual data by using a linear function of three parameters rather than two, but it’s still the slope that has changed most between our initial guess and the final fitting; we’re still just sliding down the valley wall. Could there be a better fit somewhere down the valley, attained by making bigger changes to the x and y offsets? Probably not, but my confidence has been shaken.

It’s time to try a different algorithm: simulated annealing.