Smallest transformer that can add two 10-digit numbers

(github.com)

125 points | by ks2048 1 day ago

19 comments

reerdna 9 minutes ago
I couldn't help but laugh out loud at the notion of a "held-out test set" for addition of 10-digit numbers.
alexlitz 3 hours ago
I made a blogpost on my submission (currently the top handwritten one at 36 parameters) https://alexlitzenberger.com/blog/building_a_minimal_transfo...
[-]
- ks2048 28 minutes ago
  I didn't look at all the details, but wanted to see how you did the initial embedding, and I see you do have a 14x5 matrix there. I guess when you are setting things by hand (rather than learning them), the definition of counting "parameters" is a bit unclear. One could say all those are parameters, even if they are set in a straightforward way.
- sowbug 3 hours ago
  I ask this question as someone who can't do much more than confirm that your blog post is written in English by someone who knows math.
  Does this result suggest that if we had N clever humans manually building an LLM, they might come up with something as smart as a frontier model, but potentially 45 times smaller? (1644 / 36 ~= 45, N = very large, time not specified)
  [-]
  - alexlitz 3 hours ago
    I imagine getting things to be polysemantic in a way that does not interfere would lead to sublinear scaling. Also there are smaller ones that were trained so would still be more like 311/36 ~= 8.6.
    [-]
    - Lerc 2 hours ago
      >I imagine getting things to be polysemantic in a way that does not interfere would lead to sublinear scaling.
      True, but with even smarter humans, you could exploit the interactions for additional calculations.
      While it sounds a bit silly, it is one of the hypotheses behind a fast takeoff. An AI that is sufficiently smart could design a network better than a trained one and could make something much smarter than itself on the same hardware. The question then becomes if that new smarter one can do an even better job. I suspect diminishing returns, but then again I am insufficiently smart.
    - sowbug 3 hours ago
      Thanks!
      (I see the Trained Weights results now, thanks.)
prng2021 29 minutes ago
How is anyone predicting timelines for AGI when these systems can’t do basic addition of 2 arbitrary numbers with 100% accuracy?
[-]
- wmf 15 minutes ago
  LLMs should use tool calling (which is 100% reliable) instead of doing math internally. But in general it would be nice to be able to teach a process and have the AI execute it deterministically. In some sense, reliability between 99% and 100% is the worst because you still can't trust the output but the verification feels like wasted effort. Maybe code gen and execution will get us there.
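The tool-calling route described above can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical JSON tool-call format and a hypothetical `run_tool_call` dispatcher; it is not any particular vendor's API. The point is only that once the model emits the call, the arithmetic itself is done by ordinary deterministic code.

```python
import json

# Hypothetical tool registry: the only tool here is exact integer addition.
TOOLS = {"add": lambda a, b: a + b}


def run_tool_call(message):
    """Execute a JSON tool call such as {"tool": "add", "args": [1, 2]}.

    The message format is an assumption made for this sketch.
    """
    call = json.loads(message)
    func = TOOLS[call["tool"]]
    # The result is serialized so it could be fed back to the model as text.
    return json.dumps({"result": func(*call["args"])})


reply = run_tool_call('{"tool": "add", "args": [9999999999, 1]}')
```

The round trip (emit the call, pause, feed the result back) is the overhead that other comments in the thread weigh against doing the addition inside a forward pass.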
amelius 5 hours ago
> In short: if you can swap in a different set of weights and use the exact same inference code for a different task, your setup is legitimate. If the inference code is inseparable from the algorithm, it's not.
I wonder why they don't just write the code themselves, so by design the focus can be on the model.
delta_p_delta_x 3 hours ago
Very cool, but can I suggest the `add` CPU instruction instead? Supports 64-bit numbers, and it's encoded in hardware, and no need to cross a PCIe interface into a beefy, power-hungry GPU and back again. And chances are it's cross-platform, because basically every ISA since the very first has had `add`.
[-]
- ACCount37 22 minutes ago
  No. You cannot. It's the wrong tool for the problem.
  That little "add" of yours has the overhead of: having an LLM emit it as a tool call, having to pause the LLM inference while waiting for it to resolve, then having to encode the result as a token to feed it back.
  At the same time, a "transformer-native" addition circuit? Can be executed within a single forward pass at a trivial cost, generate transformer-native representations, operate both in prefill and in autoregressive generation, and more. It's cheaper.
- nurettin 2 hours ago
  I mean, yeah, no need to put a bunch of high powered cars in a circular track to watch them race really close to each other at incredible speeds, causing various hazards, either. Especially since city buses have been around for ages.
  [-]
  - delta_p_delta_x 1 hour ago
    I would similarly criticise a race car being used to do a city bus' job of getting a lot of people from point A to B.
    Although the converse would be interesting, racing city buses.
    [-]
    - pitaj 32 minutes ago
      Nobody has suggested using this for addition tasks in production. It's an academic exercise. What are you on about?
- mcdeltat 2 hours ago
  "smallest supercomputing cluster that can add two numbers"
vicchenai 1 hour ago
The leaderboard framing is clever - forces apples-to-apples comparison on a task where you can verify correctness deterministically. What I find interesting is the architectural constraints: 10-digit addition requires maintaining ~20 digits of working state across the carry chain, which is fundamentally sequential. The fact that tiny transformers can learn this at all (rather than just memorizing) suggests they are finding some form of positional carry representation in their attention patterns. Would love to see ablations on how attention head count vs depth trade off here - my intuition is that carry propagation needs depth more than width.
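The carry chain mentioned above can be made concrete with a minimal sketch of schoolbook addition over digit lists. This is ordinary Python, not the mechanism any of the submitted transformers use; the names `add_digits`, `to_digits`, and `from_digits` and the least-significant-digit-first convention are assumptions chosen for illustration. Each step needs the carry from the previous step, which is the sequential dependency in question.

```python
def add_digits(a_digits, b_digits):
    """Add two equal-length digit lists, least significant digit first."""
    if len(a_digits) != len(b_digits):
        raise ValueError("digit lists must have the same length")
    result = []
    carry = 0
    for a, b in zip(a_digits, b_digits):
        total = a + b + carry      # depends on the carry from the previous digit
        result.append(total % 10)  # digit written at this position
        carry = total // 10        # carry passed on to the next position
    result.append(carry)           # the sum may need one extra digit
    return result


def to_digits(n, width=10):
    """Split a non-negative integer into `width` digits, least significant first."""
    return [(n // 10**i) % 10 for i in range(width)]


def from_digits(digits):
    """Inverse of to_digits."""
    return sum(d * 10**i for i, d in enumerate(digits))


# Worst case for carry propagation: the carry ripples through all ten positions.
worst_case = from_digits(add_digits(to_digits(9999999999), to_digits(1)))
```

In the worst case shown, a change in the lowest digit affects every output digit, so a model that handles it correctly has to move information across the whole sequence.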
cantalopes 29 minutes ago
Interesting, is this just a fun competition or would this also have some practical applications, I wonder?
E-Reverance 5 hours ago
Not sure how much this fits into the rules, but I saw on twitter someone claimed 28 params: https://gist.github.com/SeuperHakkerJa/da3050739bea97aabd86e...
medi8r 5 hours ago
You can do that in a single matmul of course.
[-]
- hyperhello 5 hours ago
  So can you take an arbitrary transformer and somehow turn it into a compact set of low-power fast gates by some algorithm?
  [-]
  - measurablefunc 5 hours ago
    I think you're misunderstanding the joke.
    [-]
    - medi8r 4 hours ago
      Yes, the joke is:
      [A B]
      times
      [1]
      [1]
      is
      [A+B]
      [-]
      - hyperhello 4 hours ago
        From context then, I infer that a transformer is not comprised of matrix multiplications, because it would simply be one that adds two 10-digit numbers.
        [-]
        - medi8r 4 hours ago
          A transformer tokenizes input, then does a bunch of matmul and relu set up in a certain way. It doesn't get to see the raw number (just like you don't when you look at 1+1: you need the visual cortex etc. first).
          [-]
          - Lerc 3 hours ago
            Notably the difference is that ten digits is not the same thing as a number. One might say that turning it into a number might be the first step, but neural nets being what they are, they are liable to produce the correct result without bothering to have a representation any more pure than a list of digits.
            I guess the analogy there is that a 74ls283 never really has a number either and just manipulates a series of logic levels.
          - Filligree 2 hours ago
            So the question is, why do we tokenise it in such a way that it makes everything harder?
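To spell out the joke and the follow-up point in this subthread: a 1x2 by 2x1 matrix product adds two numbers only when the numbers themselves are the inputs, whereas a transformer is handed a sequence of tokens. A sketch in plain Python, where the helper names `matmul` and `tokenize` are hypothetical and character-level tokenization is an assumption (the leaderboard's actual input format may differ):

```python
def matmul(x, w):
    """Multiply matrix x (m x k) by matrix w (k x n), both given as nested lists."""
    return [[sum(x[i][t] * w[t][j] for t in range(len(w)))
             for j in range(len(w[0]))]
            for i in range(len(x))]


def tokenize(a, b, width=10):
    """Turn an addition problem into a list of single-character tokens."""
    return list(f"{a:0{width}d}+{b:0{width}d}=")


a, b = 1234567890, 9876543210

# The joke: [A B] times the column vector [1, 1] is [A + B].
total = matmul([[a, b]], [[1], [1]])[0][0]

# What a model would see instead under this assumption: 22 separate symbols.
tokens = tokenize(a, b)
```

The matmul has the value of each number available as a single scalar; the token list only has digits at positions, so place value and carries have to be reconstructed by the network.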
i000 4 hours ago
Would it make sense to embed such single-purpose network with fixed weights within a LLM before pre-training?
[-]
- ACCount37 12 minutes ago
  Good question.
  It might work; I considered running a test like this. But it does demand certain things.
  The subnetwork has to be either crafted as "gradient resistant" or remain frozen. Not all discovered or handcrafted circuits would survive gradient pressure as is. Especially the kind of gradients that fly in early pre-training.
  It has to be able to interface with native representations that would form in a real LLM during pre-training, which is not trivial. This should happen early enough in pre-training. Gradients must start routing through our subnetwork. We can trust "rich get richer" dynamics to take over from there, but for that, we need the full network to discover the subnetwork and start using it.
  And finally, it has to start being used for what we want it to be used for. It's possible that an "addition primitive" structure would be subsumed for something else, if you put it into the training run early enough, when LLM's native circuitry is nonexistent.
  Overall, for an early test, I'd spray 200 frozen copies of the same subnetwork into an LLM across different layers and watch the dynamics as it goes through pre-training. Roll extra synthetic addition problems into the pre-training data to help discovery along. Less of a principled solution and more of an engineering solution.
nextlevelwizard 40 minutes ago
Here: eval()
You are welcome
ks2048 5 hours ago
So, hand-coded weights can do it with 36 params and 311 for trained weights - did anyone try the former architecture, but starting with random weights and learning?
[-]
- alexlitz 3 hours ago
  For one, the specific 36-parameter version is impossible without float64, so you might guess the corollary that it is not exactly amenable to being found by gradient descent. I think the question of how you can structure transformers and neural nets in general so that they can both very parsimoniously represent things like this and have it be amenable to learning by gradient descent is an interesting one.
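One thing that can be checked independently of the blog post is that float32 cannot hold every 10-digit integer exactly, while float64 can. Whether this is the actual reason the 36-parameter construction needs float64 is an assumption here, not something stated in the comment above. The helper `round_to_float32` is a hypothetical name; it uses the standard `struct` module to round a value to single precision.

```python
import struct


def round_to_float32(x):
    """Round a Python float (a float64) to the nearest float32 and convert back."""
    return struct.unpack("f", struct.pack("f", x))[0]


n = 9_999_999_999                      # largest 10-digit operand

as_f64 = float(n)                      # exact: float64 has a 53-bit significand
as_f32 = round_to_float32(float(n))    # rounded: float32 has a 24-bit significand

f64_is_exact = int(as_f64) == n
f32_is_exact = int(as_f32) == n
```

Near 10^10 adjacent float32 values are 1024 apart, so the operand above rounds to 10000000000.0, while float64 represents all integers up to 2^53 exactly, well beyond the largest possible sum of 19,999,999,998.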
- bitwize 3 hours ago
  "Minsky, why did you close your eyes?"
  "So that the room will be empty."
munro 4 hours ago
>=99% accuracy wtf?!?
I was initially excited until I saw that, because it would reveal some sort of required local min capacity, and then the further revelation that this was all vibe coded, with no arXiv paper, makes me feel I should save my attn for another article.
computersuck 2 hours ago
this is the dumbest fking thing to do math with
1over137 4 hours ago
Now wrap it all in an Electron app!
[-]
- cantalopes 27 minutes ago
  And npm install llm-is-odd to divide and conquer!
MarcLore 4 hours ago
The gap between 36 hand-coded params and 311 trained params is fascinating and honestly underappreciated. It mirrors something we see repeatedly in ML: gradient descent finds solutions in a fundamentally different region of parameter space than a human engineer would design.
When you hand-code the weights, you're essentially implementing a known algorithm (carry-propagation) directly into the network topology. But trained networks often discover distributed representations that spread the computation across more parameters in ways that are harder to interpret but more robust to input distribution shifts.
I'd be curious whether the 311-param trained model generalizes better to bases other than 10, or to addition with different digit counts than it was trained on. In my experience, the 'messier' learned solutions sometimes capture more structural regularity than the clean engineered ones, precisely because they aren't locked into a single algorithmic strategy.
MarcLore 1 hour ago
[dead]
jaunt7632 4 hours ago
[dead]
Sophira 3 hours ago
I get that this is technically interesting, for certain, but the sheer amount of energy and associated global warming risk needed to do something with >=99% accuracy that we've been able to do easily for decades with a guaranteed 100% accuracy seems to me to be wasteful to the extreme.
[-]
- Lerc 3 hours ago
  What would be an acceptable amount of energy to spend on something that someone has done in a different manner before? Would you rather we stick with all of the current known ways to do things?
  Does this boil down to a condemnation of all scientific endeavours if they use resources?
  Would it change things if the people who did it enjoyed themselves? Would they have spent more energy playing a first person shooter to get the same degree of enjoyment?
  How do you make the calculation of the worth of a human endeavour? Perhaps the greater question is why are you making a calculation of the worth of a human endeavour.
  [-]
  - mcdeltat 2 hours ago
    Ok I don't really care either way but to play devil's advocate, what exactly is this specific challenge of adding numbers with a transformer model demonstrating/advancing? The pushback from people, albeit a little aggressive, does have a grain of truth. We're demonstrating that a model which uses preexisting addition instructions can add numbers? I mean yeah you can do it with arbitrarily few parameters because you don't need a machine learning model at all. Not exactly groundbreaking so I reckon the debate is fair.
    Now if you said this proof of addition opens up some other interesting avenue of research, sure.
    [-]
    - Lerc 2 hours ago
      >what exactly is this specific challenge of adding numbers with a transformer model demonstrating/advancing?
      Well for starters, it puts the lie to the argument that a transformer can only output examples it has seen before. Performing the calculation on examples that haven't been seen demonstrates generalisation of the principles and not regurgitation.
      While this misconception persists in a large number of people, counterexamples can always serve a useful purpose.
      [-]
      - mcdeltat 31 minutes ago
        Are people usually claiming that it strictly cannot produce any output it hasn't seen before? I wouldn't agree, I mean clearly they are generating some form of new content. My argument would be that while they can learn to some extent, the power of their generalisation is still tragically weak, particularly in some domains.
      - qsera 1 hour ago
        >it puts the lie to the argument
        But it does not, right? You can either show it something, or modify the parameters in a way that resembles the result of showing it something.
        You can claim that the model didn't see the thing, but that would mean nothing, because you are making the same effect with parameter tweaks indirectly.
- userbinator 56 minutes ago
  Because it's fun. Life is meant to be enjoyed.
  Those who worry about an imaginary risk and live their lives in constant fear have turned into nothing more than machines enslaved by propaganda.
- mapontosevenths 57 minutes ago
  > the sheer amount of energy and associated global warming risk
  I think that's one very good reason to make them more efficient, and that's part of the point of contests like this one.
- coolsunglasses 3 hours ago
  >Hacker News
  not any more, eh?
- nradov 3 hours ago
  Wait until you see the quantum computer that it takes to factor the integer 15.
- thereisnospork 3 hours ago
  You need to recalibrate your sense of scale if you think that this is a geologically relevant usage of energy.