Mental Math with Stack-Based Programming February 16, 2017

I’ve dabbled in mental math algorithms since I was a kid but never really got an intuitive feel for it. As I’ve done more programming, I’ve had a couple of insights that have helped me understand how to create and express mental math algorithms.

Computers algorithms make trade offs between memory and speed. You can improve the speed of an algorithm with strategic caching, that is, using more memory. If you need to use less memory, expect the algorithm to take longer. The same applies to non-computer algorithms. Elementary math classes taught you paper-based algorithms for addition, subtraction, multiplication, long division, and, if you were lucky, extracting square roots. Paper is memory. You can work out the answer faster by jotting digits here and there, carrying, borrowing, writing down intermediate calculations. It’s strategic caching. Paper memory is cheap, use as much as you need.

But in mental math, working memory isn’t cheap. It’s notoriously limited, typically seven to ten digits. Mental math isn’t about doing paper algorithms in our heads, it’s about finding whole new algorithms that work with limited working memory.

We can reduce the burden on working memory by using more complex algorithms. Elementary school algorithms are designed for simplicity so kids can do them, over and over, regardless of how much working memory they need. Mental math needs algorithms that use extra steps in order to reduce how many free variables we have to track. Memorizing the more complex algorithm shifts the burden from working memory to long-term memory.

Traditional math notation is optimized for symbolic manipulation, not working out algorithms. We don’t even use it for working arithmetic problems, preferring specialized notations that vary from addition to multiplication to division. Traditional math notation makes it difficult to see what numbers you’ll need in working memory and what parts of the algorithm stay the same regardless of the specific numbers. We need a new notation for manipulating algorithms, and computer science has one to offer: stack-based programming.

Stack-based programming is sometimes called “postfix notation”. In its simplest form, there’s a stack of data and a stack of programming instructions. The programming instructions are applied one at a time, manipulating the data stack.

I’ve found stack-based programming to be a useful way of working out mental math algorithms. There’s a natural division between the numbers you’re holding in working memory (the data stack) and the algorithm that stays the same between problems (the instruction stack).

To illustrate, I’m going to create a simple stack-based DSL within Clojure. I’ll use reduce and reductions to do the heavy lifting, where the reducer (apply-to-stack) is a function that takes the current data stack and a function, and returns a new data stack. It’ll start with the initial data stack and iterate over a collection of operations, which are just Clojure functions.

(defn apply-to-stack [data op]
  (if (fn? op)
    (op data)
    (conj data op)))

(defn arity [n f data]
  (conj (vec (drop-last n data))
        (apply f (take-last n data))))

(def add #(arity 2 + %))

(reductions apply-to-stack [1 2] [add])

apply-to-stack takes the current data stack and an operation. If the operation is a function, it passes the current data stack to the function and returns the result. If the operation is not a function, it pushes the operation (whatever it is) onto the stack.

I’ve also defined arity, which lets us easily define stack operations for existing Clojure functions, such as +. Unlike Clojure’s variable-argument functions, stack-based operations tend to take a fixed number of arguments. Above, I define add to take the top two arguments from the data stack, add them, and push the result back onto the data stack.

reductions shows the data stack at each step. The result is the top of the last stack, in this case, 3.

Here’s an example I’ve used when driving: converting miles per hour to feet per second.

(def multiply #(arity 2 * %))

(reductions apply-to-stack [70] [1.467 multiply])

So far, this is just postfix notation and doesn’t particularly help with doing the calculation in your head. Where stack-based programming shines, and part of the reason it’s the basis for the Java and Python virtual machines is that it’s simple to rewrite the instruction stack and know that the end result will be the same. For a VM, that means optimizing by reducing the total number of instructions. For mental math, it usually means more, simpler instructions.

In the above example, it’s tough to mentally multiply by 1.467. Let’s break it up into 1.5 - 0.033. You might express the instruction stack transformation as xy z -.

(def subtract #(arity 2 - %))

(reductions apply-to-stack [70] [1.5 0.033 subtract multiply])

To rewrite this, we need to turn multiplication of a difference into the difference of two multiplications, what mathematicians call distributing subtraction over multiplication. This isn’t hard in stack-based programming, but we do need two new operations: dup and dip. dup copies what’s on top of the stack, so we can multiply it by two different numbers. dip is more complex. It assumes a list of operations is on the top of the stack. It stores the list, pops it off, then pops off the next value in the stack and stores it too. It then applies the list of operations to the remaining stack, then puts the value it earlier popped off back onto the top of the stack.

Distribution of multiplication over subtraction has this transformation signature: x y - *dup [x *] dip y * -. Defining dup and dip and applying the transformation gives us

(defn dup [data]
  (conj data (peek data)))

(defn dip [data]
  (let [instructions (-> data peek)
        top (-> data pop peek)
        data (-> data pop pop)]
    (conj (reduce apply-to-stack data instructions) top)))

(reductions apply-to-stack [70] [dup [1.5 multiply] dip 0.033 multiply subtract])

Multiplying by 1.5 is just adding half the number to itself (1.5 *dup 2 / +). Multiplying by 0.033 is quite close to dividing by 10 then dividing by 3 (0.033 *10 / 3 /).

(def divide #(arity 2 / %))

(reductions apply-to-stack [70] [dup [dup 2 divide add] dip 10 divide 3 divide subtract])

No operation here is particularly difficult to do mentally: adding half a number to itself, dividing by 10, dividing by 3, subtracting. It’s also good for working memory: at most two values in the data stack vary from problem to problem; all the other values that end up there are part of the algorithm, and stay the same no matter what number you start with.

Most conversions work this way, breaking hard numbers and operations down into easier ones. I like to factor the algorithms until I’m not multiplying or dividing by anything other than 2, 3, 5, or a power of 10.

Square roots can be iteratively calculated using the Babylonian method. The data stack starts with the value you want the square root of and an initial guess. The final data stack retains the original number, followed by a better guess.

(reductions apply-to-stack [21 5] [[dup] dip dup [divide] dip add 2 divide])

If your initial guess is close enough to the actual square root, one iteration gets you quite close to the actual square root. You can also repeat it a few times.

(defn times [data]
  (let [count (-> data peek)
        instructions (-> data pop peek)]
    (loop [data (-> data pop pop)
           count count]
      (if (pos? count)
        (recur (reduce apply-to-stack data instructions) (dec count))

(reduce apply-to-stack [21 5] [[[dup] dip dup [divide] dip add 2 divide] 3 times])

Probably my favorite mental math algorithm is John Conway’s Doomsday rule. It’s non-trivial, but powerful. We’ll need several more instructions. Before defining them, I’ll define another helper function to construct operations: arity+. It’s just like arity, except instead of pushing a single value onto the data stack, it expects a sequence of values, and it pushes each value onto the data stack.

(defn arity+ [n f data]
  (vec (concat (drop-last n data)
               (apply f (take-last n data)))))

(def floor #(arity 1 js/Math.floor %))
(def remainder #(arity 2 mod %))

(defn swap [data]
  (arity+ 2 #(vector %2 %1) data))

(defn day-of-week [data]
  (arity 1 #(get '[Sunday Monday Tuesday Wednesday Thursday Friday Saturday] %) data))

(defn century-and-year [data]
  (arity+ 1
          (fn [year]
            (let [century (js/Math.floor (/ year 100))]
              [century (- year (* 100 century))]))

(defn divmod [data]
  (arity+ 2
          (fn [dividend divisor]
            (let [quotient (js/Math.floor (/ dividend divisor))]
              [quotient (- dividend (* divisor quotient))]))

(defn doomsday [data]
  (arity 2
         (fn [month leap-year?]
             (and (= 1 month) leap-year?) 4
             (= 1 month) 3
             (and (= 2 month) leap-year?) 1
             (= 2 month) 0
             (= 3 month) 0
             (= 4 month) 4
             (= 5 month) 9
             (= 6 month) 6
             (= 7 month) 11
             (= 8 month) 8
             (= 9 month) 5
             (= 10 month) 10
             (= 11 month) 7
             (= 12 month) 12))

(defn leap-year? [data]
  (arity 1
         (fn [year]
           (or (zero? (mod year 400))
               (and (zero? (mod year 4))
                    (not (zero? (mod year 100))))))

The algorithm starts with a date, month, and year on the data stack:

(def doomsday-rule [dup century-and-year 12 divmod dup 4 divide floor add add
                    [4 remainder 10 multiply 2 divide 2 add 7 remainder] dip
                    add 7 remainder
                    [leap-year? doomsday subtract 7 remainder] dip
                    add day-of-week])

(reduce apply-to-stack [15 2 1975] doomsday-rule)

This algorithm is certainly longer, and reductions shows the step-by-step working memory you’ll need to work it out. There’s some complexity hidden within individual instructions like leap-year? and doomsday, but with practice these steps quickly become trivial, hardly worse than multiplying by 2, 3, or 5. The second line in the definition can be replaced with some memorization, as it only ever produces four different numbers. The algorithm could be altered to require only one 7 remainder instead of three, but splitting it up keeps the numbers in working memory a bit smaller and more manageable.

The Doomsday rule is certainly more work than a multiplicative conversion or even finding square roots, but considering that it can give you the day of the week for any date on the Gregorian calendar, the complexity is worth the result. Stack-based programming isn’t necessarily the most concise or elucidating form for the algorithm, but this form allows us to inspect the algorithm’s impact on working memory.

(apply max-key count (reductions apply-to-stack [15 2 1975] doomsday-rule))

Eight numbers is a stretch for the average working memory, but hardly impossible. The first three values are the original date, so the burden on working memory is less if you can look at the date as you do the math.

Stack-based programming has been a useful way for me not only to understand mental math, but to improve how I do it. I hope you find it helpful too.