12 comments

  • pteetor 4 hours ago
    In case you are unfamiliar with Karpathy's Loop[1], it is a genetic algorithm[2] where the genetic "mutations" are clever-but-random ideas generated by an LLM agent, aimed at improving a system.

      (1) Let the LLM randomly perturbate the system.
      (2) Measure the system's performance.
      (3a) If the perturbation improved performance, keep the change.
      (3b) Otherwise, don't.
      (4) Repeat
    
    [1] https://github.com/karpathy/autoresearch

    [2] https://en.wikipedia.org/wiki/Genetic_algorithm
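
    A minimal sketch of steps (1)-(4) above, with a random tweak standing in for the LLM's proposal. The names `run_loop`, `toy_propose`, and `toy_measure` are hypothetical and are not taken from the autoresearch repository. With a single candidate and no crossover, this is arguably closer to hill climbing than to a population-based genetic algorithm.

```python
import random

def run_loop(system, propose, measure, steps, rng):
    """Greedy keep-if-better loop over a single candidate."""
    best_score = measure(system)
    for _ in range(steps):
        candidate = propose(system, rng)   # (1) perturb; an LLM call in the real thing
        score = measure(candidate)         # (2) measure
        if score > best_score:             # (3a) keep the change if it improved
            system, best_score = candidate, score
        # (3b) otherwise the candidate is simply dropped
    return system, best_score              # (4) the for-loop is the "repeat"

# Toy stand-ins (assumptions): the "system" is a list of numbers, and
# performance is higher the closer every number is to 3.0.
def toy_measure(params):
    return -sum((p - 3.0) ** 2 for p in params)

def toy_propose(params, rng):
    out = list(params)
    i = rng.randrange(len(out))
    out[i] += rng.gauss(0.0, 0.5)
    return out

rng = random.Random(0)
start = [0.0, 0.0, 0.0]
final, final_score = run_loop(start, toy_propose, toy_measure, 500, rng)
```

    Swapping `toy_propose` for an LLM call and `toy_measure` for a real benchmark gives the shape of the loop; the measurement is the only thing deciding what survives.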

    • faangguyindia 1 hour ago
      i actually do it differently

      > (1) Let the LLM randomly perturbate the system.

      instead of this i ask the LLM what's least likely to improve performance, and then measure that.

      sometimes big gains come from the places you thought were least likely.

    • 2001zhaozhao 2 hours ago
      Wtf, this has a name now? I thought of this exact idea literally months ago but never had the time to do any experiments on it.

      At the time I dismissed it as potentially incredibly expensive for the improvement you get, and as prone to the typical pitfalls of evolutionary algorithms. In the same way that evolution doesn't let an organism grow a wheel, your LLM evolution algorithm will never come up with something that requires a far bigger leap than what you allow the LLM to perturb in a single step. The genetic algorithm will also probably produce a vibecoded mess of short-sighted decisions, just like evolution creates a spaghetti genome in real life.

      I'll definitely need to look into how people have improved the idea and whether it is practical now.

      • beepdyboop 2 hours ago
        This is not a new idea at all. Many, many people have had it; no one can really claim it.
      • stingraycharles 1 hour ago
        Genetic algorithms have existed since the 60s / 70s, e.g. computers learning to play a game. LLMs aren’t particularly good at it.

        I think hyperparameter tuning may actually be a kind of genetic algorithm.

        • janalsncm 20 minutes ago
          Hyperparameter tuning could be done by genetic algorithm. I think it’s a bit of a category error to say that it is a genetic algorithm though.

          Hyperparam tuning is usually done by Bayesian Optimization though.
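
          To make the distinction concrete, here is a rough sketch of a genetic algorithm used as one possible search method for a tuning problem. The objective `toy_score` is a made-up stand-in for a validation score, and `genetic_search` is a hypothetical helper, not any library's API. The tuning problem is `toy_score` plus `bounds`; the genetic algorithm is just the search strategy wrapped around it.

```python
import random

def genetic_search(fitness, bounds, pop_size=20, generations=30, seed=0):
    """Tiny genetic algorithm: selection, uniform crossover, Gaussian mutation."""
    rng = random.Random(seed)

    def random_individual():
        return [rng.uniform(lo, hi) for lo, hi in bounds]

    def crossover(a, b):
        # Each gene comes from one of the two parents.
        return [rng.choice(pair) for pair in zip(a, b)]

    def mutate(ind):
        out = []
        for x, (lo, hi) in zip(ind, bounds):
            x += rng.gauss(0.0, 0.1 * (hi - lo))
            out.append(min(hi, max(lo, x)))   # clamp back into bounds
        return out

    population = [random_individual() for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]     # the fitter half survives unchanged
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents)))
            for _ in range(pop_size - len(parents))
        ]
        population = parents + children
    return max(population, key=fitness)

# Made-up objective: pretend the score peaks at hyperparameters (0.3, 0.7).
def toy_score(h):
    return -((h[0] - 0.3) ** 2 + (h[1] - 0.7) ** 2)

best = genetic_search(toy_score, bounds=[(0.0, 1.0), (0.0, 1.0)])
```

          The same `toy_score` could be handed to random search or a Bayesian optimizer instead, which is why calling the tuning itself a genetic algorithm reads as a category error.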

      • naveen99 1 hour ago
        You know this doesn’t work most of the time…
  • sho_hn 4 hours ago
    Salient on the value of the verifier. Matches my experience in the last two quarters.

    Nice detail on the encountered failures. Very similar experiences with my own loops against testsuites.

    Great post. A snapshot in time.

  • Havoc 24 minutes ago
    Seems like this could be applied to many things. Database optimisation etc
  • fc417fc802 4 hours ago
    Extremely interesting, but I don't understand why it was written by an LLM. Either the frontier models are far better than I realized, or writing this document required a lot of manual work regardless, at which point why not keep it in your own voice?

    > The agent did not know that would also halve the LUT count. It found out by doing it and watching the synthesizer.

    So I guess this is an example of an LLM anthropomorphizing and making wild conjectures about the internal workings of a different LLM.

  • osti 3 hours ago
    > propose, implement, measure, keep the wins

    That's pretty much what I did when I let Codex with gpt5.4xhigh improve my fairly complex CUDA kernel, which resulted in a 20x throughput improvement.

    • hackyhacky 3 hours ago
      Concretely, what interesting changes did it make to achieve such a significant improvement?
      • osti 1 hour ago
        A lot of it was beyond me, but these are the branch names for everything it tried, most of it unsuccessful of course. About a 10x perf improvement came from architectural changes, and then 2x from micro-optimizations.

        https://pastebin.com/eac0SAYg

  • outside1234 3 hours ago
    Has anyone actually written a verifier for a business / project?
    • faeyanpiraat 11 minutes ago
      It's tangential, but: I’m currently doing a rewrite of the backend of a project, and the verifier is basically the instruction to “maintain v1 functionality as observed externally from the API side”. This allows making a lot of tests based on existing data in the system and on how the frontend expects data.
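
      One way such a check could look, as a sketch only: compare a recorded v1 response against the v2 response for the same request and report every path that differs. `diff_json`, the sample payloads, and the ignored `served_at` field are all hypothetical.

```python
def diff_json(old, new, path="$", ignore=frozenset()):
    """Return the paths at which two JSON-like values differ."""
    if path in ignore:
        return []
    if isinstance(old, dict) and isinstance(new, dict):
        diffs = []
        for key in sorted(set(old) | set(new)):
            child = f"{path}.{key}"
            if child in ignore:
                continue
            if key not in old or key not in new:
                diffs.append(child)           # key added or removed
            else:
                diffs.extend(diff_json(old[key], new[key], child, ignore))
        return diffs
    if isinstance(old, list) and isinstance(new, list):
        if len(old) != len(new):
            return [path]
        diffs = []
        for i, (a, b) in enumerate(zip(old, new)):
            diffs.extend(diff_json(a, b, f"{path}[{i}]", ignore))
        return diffs
    return [] if old == new else [path]

# Hypothetical recorded v1 response and the v2 response to the same request.
v1 = {"id": 7, "name": "Ada", "tags": ["a", "b"], "served_at": "2024-01-01T00:00:00Z"}
v2 = {"id": 7, "name": "Ada", "tags": ["a", "c"], "served_at": "2024-06-01T00:00:00Z"}

regressions = diff_json(v1, v2, ignore={"$.served_at"})   # ["$.tags[1]"]
```

      An empty list means v2 is indistinguishable from v1 for that request; the ignore set is where volatile fields such as timestamps would go.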
    • sho_hn 3 hours ago
      I'd say "a verifier" here is a loose term. A great testsuite is a verifier. I've done reverse-engineering projects that involved generating trace logs from the object under test, having a reimplementation emit the same logs, and running strict comparisons.

      OP's post is basically pointing out what certainly many others have independently discovered: Your agent-based dev operation is as good as the test rituals and guard rails you give the agents.
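
      A minimal sketch of the strict trace comparison described above, assuming each trace is a list of already-normalized log lines. The function name and the log format are made up for illustration.

```python
from itertools import zip_longest

def first_divergence(reference_trace, candidate_trace):
    """Return (index, expected, actual) for the first mismatch, or None if identical.

    A missing line on either side shows up as None in the result.
    """
    pairs = zip_longest(reference_trace, candidate_trace)
    for index, (expected, actual) in enumerate(pairs):
        if expected != actual:
            return index, expected, actual
    return None

# Hypothetical traces from the object under test and from a reimplementation.
reference = ["pc=0000 op=LOAD r1", "pc=0004 op=ADD r1", "pc=0008 op=STORE r1"]
wrong_op = ["pc=0000 op=LOAD r1", "pc=0004 op=SUB r1", "pc=0008 op=STORE r1"]
truncated = reference[:2]

mismatch = first_divergence(reference, wrong_op)
# (1, "pc=0004 op=ADD r1", "pc=0004 op=SUB r1")
```

      Reporting only the first divergence keeps the feedback to the agent small and actionable, since later differences are often just consequences of the first one.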

    • dataviz1000 3 hours ago
      Can you explain your question a little more? Recursive agents will find the minimum that satisfies the deterministic termination condition, including by cheating. In other words, the result will be literally correct yet wrong. I would go so far as to call it malicious compliance.

      I have a recursive agent that finds trading strategies after recreating academic research and probing the model using its training on everything. It works really well, but I have to force it to write out every line and write a proof that data from after the wall-clock time didn't enter the system. Even then, some stupid thing like not converting the timezone for daylight saving time will allow it to peek 1 hour into the future. These types of bugs are almost impossible to find. Now there needs to be another agent whose only purpose is to write out every line, explaining that the timezone for that line of code was correct.
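
      A small illustration of that kind of one-hour leak, under assumed data: a naive local wall-clock time in July is converted with the standard-time offset instead of the daylight-time offset. Fixed UTC offsets are used here so the sketch needs no time zone database; real code would more likely use `zoneinfo`. All names and rows are hypothetical.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc
EST = timezone(timedelta(hours=-5))   # New York standard time
EDT = timezone(timedelta(hours=-4))   # New York daylight time

def visible_rows(rows, now_utc):
    """Rows a backtest may use: timestamped at or before `now_utc`."""
    return [r for r in rows if r["ts"] <= now_utc]

def assert_no_lookahead(rows_used, now_utc):
    """Guard: fail loudly if any row used is newer than the true clock."""
    leaked = [r for r in rows_used if r["ts"] > now_utc]
    if leaked:
        raise AssertionError(f"{len(leaked)} row(s) newer than {now_utc.isoformat()}")

rows = [
    {"ts": datetime(2024, 7, 1, 13, 30, tzinfo=UTC), "price": 100.0},
    {"ts": datetime(2024, 7, 1, 14, 30, tzinfo=UTC), "price": 101.0},
]

wall_clock = datetime(2024, 7, 1, 10, 0)   # naive local time; July means DST applies

true_now = wall_clock.replace(tzinfo=EDT).astimezone(UTC)    # 14:00 UTC
buggy_now = wall_clock.replace(tzinfo=EST).astimezone(UTC)   # 15:00 UTC, one hour ahead

honest = visible_rows(rows, true_now)    # only the 13:30 row
leaky = visible_rows(rows, buggy_now)    # also the 14:30 row, which is in the future
```

      Calling `assert_no_lookahead(leaky, true_now)` raises, but only because the guard is given the correct clock; if the guard reuses the buggy conversion, the leak passes silently, which is part of why these bugs are so hard to find.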

  • thin_carapace 4 hours ago
    > "If you can write the rules down, an agent will satisfy them faster than your team will."

    a fantastic opportunity to become the next next big thing and write a verifier verifier.

    at the hypothesized inflexion point where AI instantly performs exactly as commanded, what happens to heavily regulated industries like medical? do we get huge leaps and bounds everywhere EXCEPT where it matters, or is regulation going to be handed over to a verifier verifier?

    • _carbyau_ 3 hours ago
      > performs exactly as commanded

      The devil is in the details. There are an amazing number of details in a good [thing]. Someone somewhere has to say exactly what this [thing] being built actually is.

      Read almost any story about wishes from a genie. Simple statements don't work.

  • DeathArrow 2 hours ago
    Is this related to autoresearch? https://github.com/karpathy/autoresearch
  • bsder 1 hour ago
    > The frontier is the verifier.

    Um, yes? The big value that AMD had in the x86 market over competitors was their verification model. This has been known for decades.

    > 3-seed nextpnr P&R on a Gowin GW2A-LV18 (Tang Nano 20K) — median Fmax × CoreMark iter/cycle = fitness
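
    Read literally, the quoted fitness is the median Fmax over three place-and-route seeds multiplied by CoreMark iterations per cycle. A sketch of that arithmetic follows; the numbers are invented for illustration and are not results from the post.

```python
from statistics import median

def fitness(fmax_hz_per_seed, coremark_iter_per_cycle):
    """Median Fmax across P&R seeds times CoreMark iterations per cycle.

    With Fmax in Hz, the product has units of CoreMark iterations per second.
    """
    return median(fmax_hz_per_seed) * coremark_iter_per_cycle

# Invented numbers: three seeds, median Fmax of 100 MHz.
fmax_runs = [100e6, 110e6, 90e6]
score = fitness(fmax_runs, coremark_iter_per_cycle=2.5e-6)
```

    Taking the median rather than the best seed keeps one lucky placement from being counted as an improvement.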

    Every single "improvement" is basically about routing around how absolutely abysmally bad the Gowin FPGAs are. Kudos to that, I guess?

    Gowin FPGAs have extraordinarily bad carry chain and block-to-block routing systems. They are literally so bad that a 32-bit ripple carry is almost as fast as the carry-skip version, even if you manually route it. Jump prediction is almost entirely about avoiding arithmetic computation (which most other FPGAs would have no problem with).

    Memory accesses are super slow and locked to clock edges rather than level-sensitive (which is why ID/RF and WB take entire cycles, and no optimization could change that). The additions are all routing around that (note the immutability of the ID and WB phases).

    To top it off, the 5-stage pipeline is an annoying quirk of the RISC-V architecture having an immediate value offset on its load instruction. If the RISC-V load mandated 0 as the offset, the MEM read phase could overlap the EX phase, since no ALU would be necessary (store doesn't care, because the result goes to memory rather than back to the register file, so RF writeback isn't an issue). The absolutely horrific add performance of the Gowin FPGAs makes this acute.

    Finally, try to put this on a board. I found that anything above about 175MHz out of nextpnr failed to execute on actual hardware (please correct me if this isn't valid; it's been a year or more since I tried nextpnr on the Sipeed Tang Primer 20K). That's simply right around where a 32-bit add plus some routing sits on these FPGAs. There's something a bit off in nextpnr's timing analysis code, and the AI is almost certainly optimizing into it.

    That having been said: I would LOVE somebody to bounce AI off of reversing the architecture and bitstreams for the stupid-ass closed-source FPGAs. Now THAT would be a project worth throwing a couple of grad students and a bunch of subsidized AI tokens at.
