Simulated Annealing
December 15, 2016

After gradient descent was unsatisfactory for fitting a trend line, I turned to simulated annealing. Simulated annealing is a more sophisticated algorithm, though still relatively straightforward to implement. Compared to gradient descent, however, it’s much less intuitive. I’m indebted to Katrina Geltman for her excellent writeup and code examples.

We’ll need three things from the previous post: our data, our mean squared error function, and a linear function.

(def data [[1955 200] [1960 277] [1963 250] [1968 302] [1975 388] [1980 412] [1984 451]
           [1990 448] [1996 490] [2000 480] [2008 612] [2011 618] [2013 598] [2016 648]])

(show data)

We’ll use the same mean squared error function, without change.

(defn mean-squared-error [f data]
  (/ (transduce
       (map (fn [[x y]]
             (js/Math.pow (- (f x) y) 2)))
       + data)
     (count data)))
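As a quick sanity check on toy data (these points are made up for illustration): the identity function misses [1 2] by 1 and hits [2 2] exactly, so the mean squared error is (1² + 0²) / 2:

```clojure
(mean-squared-error identity [[1 2] [2 2]])
;; => 0.5
```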

For our linear function, we’ll use the second one from the last post, the three-parameter version.

(defn linear [x-offset intercept slope]
  #(+ intercept (* slope (- % x-offset))))
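As a reminder of how this version behaves, a line anchored at x = 1955 with an intercept of 200 and a slope of 8 rises by 8 for each unit past the anchor (the parameter values here are just for illustration):

```clojure
;; 200 + 8 × (1965 − 1955)
((linear 1955 200 8) 1965)
;; => 280
```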

Simulated annealing takes its name from actual annealing, in which metals are heated to the point that their atoms are free to move. As the metal cools, the atoms are able to find lower-energy arrangements that weren’t possible before the metal was heated. Translating to the digital domain, simulated annealing makes heavy use of randomness to find a better arrangement of function parameters.

The deeply counter-intuitive aspect to simulated annealing is that, as it hunts for better parameters, it occasionally accepts worse ones. This randomized walk through parameter space is an attempt to break away from merely local minima and find the global minimum. The algorithm starts off hot, frequently accepting bad parameters in place of good ones, and as it progresses, the algorithm cools, accepting fewer and fewer bad parameters, until it eventually becomes cold and accepts only improvements on its current position.

We’ll use a variable to track the algorithm’s temperature, starting with 1 (hot) and decreasing almost to 0 (cold). Whether worse parameters are accepted will depend on the current temperature. Here’s Katrina’s choice of an acceptance function that accepts poorly performing parameters less and less often as the temperature drops.

(defn accept? [prev-error new-error temp]
  (let [k (js/Math.exp (/ (- prev-error new-error) temp))]
    (< (rand) k)))
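To get a feel for the acceptance function: an improvement makes the exponent positive, so k exceeds 1 and (rand) always falls below it. A worse error is accepted with probability e^((prev − new) / temp), which shrinks as the temperature drops. (The error values below are made up for illustration.)

```clojure
;; An improvement is always accepted: k = e^10, far above any (rand).
(accept? 100 90 1)

;; A slightly worse error at full heat: k = e^-0.5 ≈ 0.61,
;; so it's accepted roughly 61% of the time.
(accept? 100 100.5 1)

;; The same worse error when nearly cold: k = e^-5 ≈ 0.007,
;; so it's almost never accepted.
(accept? 100 100.5 0.1)
```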

We’ll also need a way to pick new parameters based on our current position. There’s no canonical way to come up with new parameters; a lot depends on the particulars of the problem. In Anojh Gnanachandran’s post about using simulated annealing to solve Einstein’s Riddle, new parameters are generated by swapping an attribute between two houses. In our case, we want to find parameters in the vicinity of the current parameters.

(defn permute [params temp]
  (let [i (rand-int (count params))
        x (nth params i)
        shift (* (if (zero? x) 1 x)
                 temp 4 (- (rand) 0.5))]
    (assoc (vec params) i
           (+ x shift))))

(permute [5 100] 1)

We start by choosing a random parameter to change, then scaling it up or down. When the temperature is high, near 1, parameters can change by as much as twice their own value. This approach means that whether the parameter is ten figures or only a fraction, it can always change by some meaningful amount. It can also reach past zero to flip its sign. As the temperature drops, new parameters are chosen closer and closer to the current value.

permute does handle one special case: a parameter that is currently zero. Because a new parameter is chosen as a proportion of the current value, a parameter at zero would never change unless we substitute a non-zero base for computing the shift.
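We can see the guard at work by permuting a parameter vector of zeros; without substituting 1 as the base, the shift would always be zero and the parameters could never escape:

```clojure
;; Both parameters are zero, yet one of them still moves:
;; shift = 1 × temp × 4 × (rand − 0.5), anywhere in (−2, 2) at full heat.
(permute [0 0] 1)
```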

We’ll follow Katrina’s lead and try several sets of parameters at each temperature. For each set of parameters, we’ll ask our accept? function if it should replace the current choice. If so, it’ll become the starting point for the next set of parameters to try.

Let’s define a function to walk parameter sets at a given temperature:

(defn walk [params temp cost steps]
  (loop [params params
         prev-error (cost params)
         step 0]
    (if (< step steps)
      (let [params' (permute params temp)
            new-error (cost params')]
        (if (accept? prev-error new-error temp)
          (recur params' new-error (inc step))
          (recur params prev-error (inc step))))
      [params prev-error])))

(walk [1955 5 100] 1 #(mean-squared-error (apply linear %) data) 10)

We now have everything we need to run our simulated annealer. We’ll start with a temperature of 1, cooling until we hit a sufficiently low temperature to quit.

(defn anneal [f params data]
  (let [cost #(mean-squared-error (apply f %) data)]
    (loop [params params
           prev-error (cost params)
           temp 1]
      (if (< 0.00001 temp)
        (let [[params new-error] (walk params temp cost 200)]
          (recur params new-error (* 0.9 temp)))
        params))))
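It’s worth noting how much work this schedule does. Multiplying by 0.9 each pass, the temperature takes 110 steps to fall below the 0.00001 cutoff, and each step tries 200 parameter sets, so a single run evaluates 22,000 candidates:

```clojure
;; Count cooling steps: temp starts at 1 and is scaled by 0.9
;; until it drops below the 0.00001 cutoff.
(count (take-while #(< 0.00001 %) (iterate #(* 0.9 %) 1)))
;; => 110
```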

Let’s check the trend line when we give it the same parameters we gave at the end of the last post: the first point in the data set and a slope of zero.

(let [params (anneal linear [1955 200 0] data)]
  (show data (apply linear params)))

It works, but we gave it the same well-chosen initial parameters we had to give to gradient descent to get a good answer. Let’s stray a bit from those good seed parameters and see how it fares:

(let [params (anneal linear [2000 100 0] data)]
  (show data (apply linear params)))

How far afield can we go?

(let [params (anneal linear [100 1 0] data)]
  (show data (apply linear params)))

Seemingly quite far.

There is of course a caveat. Simulated annealing is random, so if you re-evaluate the code above (Ctrl+Enter) over and over, you’ll see some not-so-good fits. They seem to be less common the better our starting parameters are.

As with gradient descent, the algorithm performs better using a linear function of three parameters instead of two. Does adding more parameters always help? Would a linear function of four parameters improve the odds of finding a good fit? I doubt it, but I haven’t checked. I think the reason it helped in this case is that, even though we’re searching a three-dimensional space instead of a two-dimensional plane, the range of values for any given parameter doesn’t have to vary as much. Adding a parameter to the function turned a very broad search range into a smaller range but in more dimensions. Both simulated annealing and gradient descent may be more suited to exploring higher-dimensional parameter spaces than they are to searching the breadth and depth of lower-dimensional ones.

One area for further exploration is whether these algorithms perform better with more data. I’ve used only fourteen data points, and I wonder if having more would flatten the mean squared error surface we’re walking and make it easier to find good fits.

For the moment, though, I’m satisfied. I have a method for finding trend lines given a data set. It may require educated initial guesses, and it may not come up with great fits 100% of the time, but it’s good enough for my purposes.