CalcGPT.io

Web
Vue.js
Astro
JavaScript
TypeScript
Sass
Machine Learning
LLM
CI/CD

See it in action

Reaching #1 on Hacker News

https://news.ycombinator.com/item?id=41092460

The Project’s Inception

In 2017, a whimsical idea struck me: what if I created a calculator that boasted all the popular tech-industry buzzwords of the time, yet was worse than a regular calculator? The project was not so much about designing an efficient tool as a light-hearted jab at the buzzword frenzy in the tech world. It was all about marrying sophisticated terms with a simple, mundane task, and watching it fail at that straightforward task.

Machine Learning

Update: There’s a similar blog post by Lim Yung-Hui: Discover How Mistral Large 2, Claude 3.5, GPT-4o and Gemini 1.5 Pro Stack Up in an Addition Accuracy Test

Around the mid-to-late 2010s, machine learning entered the tech buzzword dictionary. AlphaGo had beaten the world champion in Go, GANs were generating realistic images, and the tech industry was abuzz with the potential of machine learning. I decided to incorporate machine learning into my calculator somehow.

These are the models I’ve tried over the years:

Linear Regression

The first model I tried was a simple linear regression model. A linear model finds the best weight $w_i$ for each input $x_i$ such that $prediction = w_1 x_1 + w_2 x_2 + ... + w_n x_n$. I trained it to predict the sum of two numbers, on a dataset of all possible combinations of two numbers $a, b \in [0, 100]$ and their sums. Unsurprisingly, the model performed perfectly by setting both weights to 1. The error matrix below shows the error for each prediction of $a+b$ (the absolute difference between the prediction and the actual sum).

Error matrix of a linear regression model trained to predict the sum of two numbers
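A minimal sketch of this experiment, assuming scikit-learn (the original training code isn’t part of this post):

```python
# Sketch: linear regression over all pairs (a, b) in [0, 100] and their sums.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([(a, b) for a in range(101) for b in range(101)])
y = X.sum(axis=1)  # the target: a + b

model = LinearRegression().fit(X, y)
print(model.coef_)  # ~[1. 1.]: both weights learned as 1

errors = np.abs(model.predict(X) - y)  # the values behind the error matrix
print(errors.max())  # effectively 0
```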

However, this model was too good and too fast; the user couldn’t even sense its presence. I needed something slower and less predictable.

GPT-2

In 2019, OpenAI released GPT-2, a model that astounded people with its ability to generate text that closely resembled human writing. As the public grappled with its potential implications concerning fake news and misinformation, I got busy making it do math. Little did I realize this would be the beginning of a long journey.

Tokenization

The GPT family of language models is trained on vast amounts of text, most of it gathered from the internet. GPTs, along with many other language models, are primarily designed to predict a sequence of words. However, in order to analyze raw language data effectively, a language model must first break it down into smaller, meaningful components. This process is called tokenization. Tokens are most easily understood as the smallest units of meaning: sometimes a token represents a word, but it can also represent part of a word or multiple words (for example, the “un” prefix is a token in the GPT tokenizer). For further information about tokenization, you can check out this article. When a GPT model is given a prompt (a piece of text), it generates a set of tokens along with the likelihood of each being the next in the sequence, much like your phone keyboard’s autocomplete feature.
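A quick way to poke at GPT-2’s tokenizer, assuming the Hugging Face transformers library:

```python
# Sketch: inspecting GPT-2's tokenizer with Hugging Face transformers.
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
print(tokenizer.tokenize("unbelievable"))  # a rare word splits into subword pieces
print(tokenizer.tokenize("1+1="))          # digits and operators are separate tokens
```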

Prompt Engineering

Since GPT models generate the most likely next tokens, I need to come up with a context in which the next tokens are the answer to a math problem. (To keep it simple, I’ll set the model to always choose the most likely next token.) The process of crafting a prompt that generates the desired output is called prompt engineering. One obvious approach is to give the model the problem and see what it fills in next. I used a pre-trained GPT-2 model on Hugging Face. Let’s give that a try:

1+1=1+1=1+1=1+
Prompt GPT-2 {max_length: 10, temperature: 0.0000000000001}

Hmm … not quite what I wanted. What if I try using words?

What is 1+1? 1+1 is the number of times
Prompt GPT-2 {max_length: 10, temperature: 0.0000000000001}

I like that it’s giving nonsensical answers, but I do want it to give the correct answer sometimes. Looks like we need to give the model more context.
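As an aside, the completions above can be reproduced with Hugging Face’s transformers pipeline. A minimal sketch, where greedy decoding (do_sample=False) stands in for the near-zero temperature shown in the captions:

```python
# Sketch: reproducing the GPT-2 completions above with transformers.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# Greedy decoding approximates temperature ~ 0 (always the most likely token).
result = generator("What is 1+1? 1+1 is", max_new_tokens=10, do_sample=False)
print(result[0]["generated_text"])
```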

Few-Shot Learning

In a 2020 paper, OpenAI popularized the idea of few-shot learning: give a language model a few examples of the desired output and let it figure out the pattern. This is similar to how humans learn. We don’t need to see every possible example of a concept to understand it; we can infer the pattern from a few examples.

Let’s try giving the model a few examples of the desired output.

1+1=2
5-2=3
2*4=8
9/3=3
3+8=3
9/3=3
9/
Prompt GPT-2 {max_length: 10, temperature: 0.0000000000001}

Even though the model gives an incorrect answer, it at least generates output in a structured manner, which is useful for extracting the model’s answer and displaying it to the user. The stop_token parameter in the API documentation caught my eye: it lets me designate a token at which the model should stop generating. I can use it to stop the model before it starts a new line.

1+1=2
5-2=3
2*4=8
9/3=3
3+8=3
Prompt GPT-2 {temperature: 0.0000000000001, stop_token: "\n"}
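Support for stop sequences varies across APIs and library versions, so here’s a minimal sketch that emulates stop_token by truncating the completion at the first newline:

```python
# Sketch: emulating stop_token="\n" by truncating at the first newline.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
prompt = "1+1=2\n5-2=3\n2*4=8\n9/3=3\n3+8="

text = generator(prompt, max_new_tokens=10, do_sample=False)[0]["generated_text"]
answer = text[len(prompt):].split("\n")[0]  # keep only the first generated line
print(answer)
```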

Fine Tuning

Readers familiar with machine learning may be wondering why I haven’t mentioned fine-tuning yet. Fine-tuning is the process of training a model on a specific task after it has been pre-trained on a general one. I created a dataset of random equations and their answers and fine-tuned the model to predict the next token. The fine-tuned model has been lost to time, but I remember it not being much better than the pre-trained model.
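The fine-tuning code is lost along with the model, but generating the dataset would have looked something like this sketch (the file name and dataset size are made up for illustration):

```python
# Sketch: generating a dataset of random equations for fine-tuning.
import random

ops = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,  # division skipped to keep answers integral
}

with open("equations.txt", "w") as f:  # hypothetical output file
    for _ in range(10_000):
        a, b = random.randint(0, 100), random.randint(0, 100)
        op = random.choice(list(ops))
        f.write(f"{a}{op}{b}={ops[op](a, b)}\n")
```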

GPT-3

Fortunately, by this time OpenAI had launched GPT-3 via an API, a much more complex model than GPT-2. I was eager to test it out; maybe somewhere in its added complexity, it had picked up the ability to do math.

1+1=2
5-2=3
2*4=8
9/3=3
3+8=11
Prompt GPT-3 (davinci) {temperature: 0, stop_token: "\n"}

In fact, GPT-3 (davinci) can solve the problem without any examples (zero-shot):

3+8=11
Prompt GPT-3 (davinci) {temperature: 0, stop_token: "\n"}
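These completions came from OpenAI’s completions endpoint. A minimal sketch with the legacy Python client (my reconstruction, matching the parameters in the captions; the davinci base model has since been retired):

```python
# Sketch: zero-shot davinci via the legacy completions API (openai-python < 1.0).
import openai

openai.api_key = "sk-..."  # your API key

response = openai.Completion.create(
    model="davinci",
    prompt="3+8=",
    max_tokens=5,
    temperature=0,
    stop="\n",  # stop before the model starts a new line
)
print(response.choices[0].text)
```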

Along with GPT-3 (davinci), OpenAI launched three other smaller, cheaper models (curie, babbage, and ada). I wanted to find the smallest model that could solve the problem, to keep costs down. After all, I’m paying for the API calls.

GPT-3 (curie)

3+8=11
Prompt GPT-3 (curie) {temperature: 0, stop_token: "\n"}

GPT-3 (babbage)

3+8=0.5+0.5+0.5+0.5+0.5+0.5+0.5+0.5+0.5+0.5+0.5+0.5+0.5+0.5+0.5+0.5+0.5+0.5+0.5+0.5+0.5+0.5+0.5+0.5+0.5+0.5+0.5+0.5+0.5+0.5+0.5+0.5+0.5+0.5+0.5+0.5+0.5+0.5+0.5+0.5+0.5+0.5+0.5+0.5+0.5+0.5+0.5+0.5+0.5+0.5+0.5+0.5+0.5+0.5+0.5+0.5+0.5+0.5+0.5+0.5+0.5+0.5+0.5
Prompt GPT-3 (babbage) {temperature: 0, stop_token: "\n"}

Looks like GPT-3 (babbage) can’t solve the problem without examples. Let’s try giving it some examples.

1+1=2
5-2=3
2*4=8
9/3=3
3+8=11
Prompt GPT-3 (babbage) {temperature: 0, stop_token: "\n"}

GPT-3 (ada)

1+1=2
5-2=3
2*4=8
9/3=3
3+8=10
Prompt GPT-3 (ada) {temperature: 0, stop_token: "\n"}

GPT-3 (ada) can’t solve the problem even with examples. So GPT-3 (babbage) seems to be the smallest model that can handle basic arithmetic, at least when given a few examples.

To test its ability further, I ran the test data I used for linear regression through it using the following prompt:

1+1=2
5-2=3
2*4=8
9/3=3
{a}+{b}=
Prompt GPT-3 (babbage) {temperature: 0, stop_token: "\n"}
Error matrix of GPT-3 (babbage) at temperature 0, prompted to predict the sum of two numbers

Yay! 🎉 It’s making mistakes. Now I just need to figure out how to control the model’s output and make it more unpredictable. It’s interesting that the model seems to do worse with large numbers, especially when the problem is {large_number}+{small_number}, but not as much when it’s {small_number}+{large_number}. Presumably this is because people tend to write the smaller number first, so the model saw more examples of that order.
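For completeness, a sketch of the sweep behind the matrix above, again assuming the legacy OpenAI client; the predict_sum helper is hypothetical, not code from the original project:

```python
# Sketch: sweeping all (a, b) pairs to build the error matrix.
import numpy as np
import openai

FEW_SHOT = "1+1=2\n5-2=3\n2*4=8\n9/3=3\n"

def predict_sum(a: int, b: int) -> str:
    """Hypothetical helper: ask babbage for a+b with the few-shot prompt."""
    response = openai.Completion.create(
        model="babbage",
        prompt=f"{FEW_SHOT}{a}+{b}=",
        max_tokens=5,
        temperature=0,
        stop="\n",
    )
    return response.choices[0].text.strip()

errors = np.full((101, 101), np.nan)
for a in range(101):
    for b in range(101):
        out = predict_sum(a, b)
        if out.lstrip("-").isdigit():
            errors[a, b] = abs(int(out) - (a + b))
        # non-numeric completions stay NaN (my assumption for plotting)
```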

Temperature

While I want the model to be able to generate the correct answer, I also want it to sometimes generate incorrect ones. Remember how I mentioned that the model generates a set of tokens along with the likelihood of each being the next in the sequence? The temperature parameter controls how likely the model is to choose a lower-likelihood token: the higher the temperature, the more often less likely tokens get picked. Let’s try increasing the temperature to 1.

Note: The temperature parameter can’t actually be set to 0 because it’s used as the denominator in a division operation. But OpenAI’s API allows setting it to 0 and replaces it with a very small number internally.
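Concretely, temperature is the standard softmax trick (not specific to OpenAI): the model’s raw scores (logits) $z_i$ are divided by the temperature $T$ before being turned into probabilities, which is exactly where that division happens:

$$P(\text{token}_i) = \frac{e^{z_i / T}}{\sum_j e^{z_j / T}}$$

As $T \to 0$, the distribution collapses onto the single most likely token (greedy decoding); as $T$ grows, the distribution flattens and unlikely tokens get picked more often.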

Error matrix of GPT-3 (babbage) at temperature 1, prompted to predict the sum of two numbers

The model is making more mistakes. I think it’s a good mix of correct and incorrect answers for my calculator idea. It’s also outputting more non-numeric tokens, which is more amusing for a calculator.

GPT-4

In 2023, OpenAI launched GPT-4 via a conversational “Chat Completions” API. While more expensive than GPT-3 and lacking a completions endpoint, it excels at zero-shot learning. As it turns out, the most natural way for users to provide context is a conversation. Let’s see if it’s any good at math. I’ll use the system prompt "You are a calculator" and feed it the problem as the user message ({a}+{b}).
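A sketch of that call with the legacy chat completions client (my reconstruction; the model name gpt-4-0613 and the temperature come from the caption below):

```python
# Sketch: GPT-4 as a "calculator" via the chat completions API
# (legacy openai-python < 1.0 interface assumed).
import openai

response = openai.ChatCompletion.create(
    model="gpt-4-0613",
    temperature=2,  # the maximum the API allows
    messages=[
        {"role": "system", "content": "You are a calculator"},
        {"role": "user", "content": "3+8"},
    ],
)
print(response.choices[0].message.content)
```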

Error matrix of GPT-4 (gpt-4-0613) at temperature 2, prompted to predict the sum of two numbers

Even with temperature set to the max of 2, it’s getting most of the answers correct. Although it’s not what I’m looking for, it’s impressive how far we’ve come in a few years. I will stick with GPT-3 for this project for now.