> I realized I looked at this more from the angle of a hobbyist paying for these coding tools. Someone doing little side projects—not someone in a production setting. I did this because I see a lot of people signing up for $100/mo or $200/mo coding subscriptions for personal projects when they likely don’t need to.
Are people really doing that?
If that's you, know that you can get a LONG way on the $20/month plans from OpenAI and Anthropic. The OpenAI one in particular is a great deal, because Codex usage is metered at a much lower rate than Claude's.
The time to cough up $100 or $200/month is when you've exhausted your $20/month quota and you are frustrated at getting cut off. At that point you should be able to make a responsible decision by yourself.
My monthly spend on AI models is < $1.
I'm not cheap, just ahead of the curve. With the collapse in inference costs, everything will eventually be this cheap.
I'll basically do
$ man tool | <how do I do this with the tool>
or even
$ cat source | <find the flags and give me some documentation on how to use this>
Things I used to do intensively I now do lazily.
I've even made an IEITYuan/Yuan-embedding-2.0-en database of my manpages with Chroma, so I can just ask my local documentation how to do something conceptually, get the relevant man pages, inject them into a local Qwen context window using my mansnip LLM preprocessor, forward the prompt, and get usable, real results.
In practice it's this:
$ what-man "some obscure question about nfs"
...chug chug chug (about 5 seconds)...
<answer with citations back to the doc pages>
Essentially I'm not asking the models to think, just do NLP and process text. They can do that really reliably.
It helps combat a frequent tendency for documentation authors to bury the most common and useful flags deep in the documentation and lead with those that were most challenging or interesting to program instead.
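If anyone wants the shape of that pipeline, here's a minimal sketch. It uses chromadb with its default embedder (not the Yuan model above) and assumes an OpenAI-compatible local server on localhost:8080, so treat the names, paths, and port as placeholders rather than my exact setup.

    # Sketch: index man pages into a local vector store, then answer questions from them.
    import glob
    import subprocess

    import chromadb
    from openai import OpenAI

    store = chromadb.PersistentClient(path="./manpage-index")
    pages = store.get_or_create_collection("manpages")

    # Index step: dump each man page to plain text and store it.
    for path in glob.glob("/usr/share/man/man1/*"):
        name = path.split("/")[-1].split(".")[0]
        text = subprocess.run(["man", name], capture_output=True, text=True).stdout
        if text:
            pages.upsert(ids=[name], documents=[text[:20000]])

    def what_man(question: str) -> str:
        # Retrieve the most relevant pages and hand them to the local model.
        hits = pages.query(query_texts=[question], n_results=3)
        context = "\n\n".join(hits["documents"][0])
        llm = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")
        resp = llm.chat.completions.create(
            model="local",  # whatever model your local server is serving
            messages=[
                {"role": "system",
                 "content": "Answer only from the provided man pages and cite the page names."},
                {"role": "user", "content": context + "\n\nQuestion: " + question},
            ],
        )
        return resp.choices[0].message.content

    print(what_man("some obscure question about nfs"))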
I understand the inclination; it's just not all that helpful for me.
Is your RAG manpages thing on github somewhere? I was thinking about doing something like that (it's high on my to-do list, but I haven't actually done anything with LLMs yet).
If you aren't using coding models you aren't ahead of the curve.
There are free coding models. I use them heavily. They are ok but only partial substitutes for frontier models.
The limits for the $20/month plan can be reached in 10-20 minutes when having it explore large codebases, even with directed prompts. It’s also easy to blow right through the quota if you’re not managing context well (waiting until it fills up and then auto-compacting, or even using /compact frequently instead of /clear, or the equivalent in different tools).
For most of my work I only need the LLM to perform a structured search of the codebase or to refactor something faster than I can type, so the $20/month plan is fine for me.
But for someone trying to get the LLM to write code for them, I could see the $20/month plans being exhausted very quickly. My experience with trying “vibecoding” style app development, even with highly detailed design documents and even providing test case expected output, has felt like lighting tokens on fire at a phenomenal rate. If I don’t interrupt every couple of commands and point out some mistake or wrong direction it can spin seemingly for hours trying to deal with one little problem after another. This is less obvious when doing something basic like a simple React app, but becomes extremely obvious once you deviate from material that’s represented a lot in training materials.
Not for Codex. Not even for Gemini/Antigravity! I am truly shocked by how much mileage I can get out of them. I recently bought the $200/mo OpenAI subscription but could barely use 10% of it. Now for over a month, I use codex for at least 2 hrs every day and have yet to reach the quota.
With Gemini/Antigravity, there’s the added benefit of switching to Claude Code Opus 4.5 once you hit your Gemini quota, and Google is waaaay more generous than Claude. I can use Opus alone for the entire coding session. It is bonkers.
So having subscribed to all three at their lowest subscriptions (for $60/mo) I get the best of each one and never run out of quota. I’ve also got a couple of open-source model subscriptions but I’ve barely had the chance to use them since Codex and Gemini got so good (and generous).
The fact that OpenAI is only spending 30% of their revenue on servers and inference despite being so generous is just mind boggling to me. I think the good times are likely going to last.
My advice - get Gemini + Codex lowest-tier subscriptions. Add some credits to your Codex subscription in case you hit the quota and can’t wait. You’ll never be spending over $100 even if you’re building complex apps like me.
> I recently bought the $200/mo OpenAI subscription but could barely use 10% of it
This entire comment is confusing. Why are you buying the $200/month plan if you’re only using 10% of it?
I rotate providers. My comment above applies to all of them. It really depends on the work you’re doing and the codebase. There are tasks where I can get decent results and barely make the usage bar move. There are other tasks where I’ve seen the usage bar jump over 20% for the session before I get any usable responses back. It really depends.
This is why it’s confusing, though. Why start with the highest plan as the starting point when it’s so easy to upgrade?
It's worth noting that the Claude subscription seems notably less than the others.
Also there are good free options for code review.
Not the same poster, but apparently they tried the $200/mo subscription, but after seeing they don't need it, they "subscribed to all three at their lowest subscriptions (for $60/mo)" instead.
Yes, we are doing that. These tools help make my personal projects come to life, and the money is well worth it. I can hit Claude Code limits within an hour, and there's no way I'm giving OpenAI my money.
As a third option, I've found I can do a few hours a day on the $20/mo Google plan. I don't think Gemini is quite as good as Claude for my uses, but it's good enough and you get a lot of tokens for your $20. Make sure to enable the Gemini 3 preview in gemini-cli though (not enabled by default).
Huge caveat: For the $20/mo subscription Google hasn't made clear if they train on your data. Anthropic and OAI on the other hand either clearly state they don't train on paid usage or offer very straightforward opt-outs.
https://geminicli.com/docs/faq/
> What is the privacy policy for using Gemini Code Assist or Gemini CLI if I’ve subscribed to Google AI Pro or Ultra?
> To learn more about your privacy policy and terms of service governed by your subscription, visit Gemini Code Assist: Terms of Service and Privacy Policies.
> https://developers.google.com/gemini-code-assist/resources/p...
The last page only links to generic Google policies. If they didn't train on it, they could've easily said so, which they've done in other cases - e.g. for Google Studio and CLI they clearly say "If you use a billed API key we don't train, else we train". Yet for the Pro and Ultra subscriptions they don't say anything.
This also tracks with the fact that they enormously cripple the Gemini app if you turn off "apps activity" even for paying users.
If any Googlers read this, and you don't train on paying Pro/Ultra, you need to state this clearly somewhere as you've done with other products. Until then the assumption should be that you do train on it.
That's good to know, thanks. In my case nearly 100% of my code ends up public on GitHub, so I assume everyone's code models are training on it anyway. But would be worth considering if I had proprietary codebases.
To me, it doesn’t matter how cheap OpenAI Codex is, because that tool just burns up tokens trying to switch to the wrong version of Node using nvm on my machine. It spirals in a loop and never makes progress for me, no matter how explicitly or verbosely I prompt.
On the other hand, Claude has been nothing but productive for me.
I’m also confused why you don’t assume people have the intelligence to only upgrade when needed. Isn’t that what we’re all doing? Why would you assume people would immediately sign up for the most expensive plan that they don’t need? I already assumed everyone starts on the lowest plan and quickly runs into session limits and then upgrades.
Also coaching people on which paid plan to sign up for kinda has nothing to do with running a local model, which is what this article is about
I spent about 45 mins trying to get both Claude and ChatGPT to help get Codex running on my machine (WSL2) and on a Linux NUC, they couldn't help me get it working so I gave up and went back to Claude.
Because somewhere inside its little non-deterministic brain, the phrase "switch to node version xxx" was the most probable response to the previous context.
Me. Currently using Claude Max for personal coding projects. I've been on Claude's $20 plan and would run out of tokens. I don't want to give my money to OpenAI. So far these projects have not returned their value back to me, but I am viewing it as an investment in learning best practices with these coding tools.
What I find perplexing is the otherwise very respectable people who pay for those subscriptions to produce clearly sub-par work that I'm sure they wouldn't have done themselves.
And when pressed on “this doesn't make sense, are you sure this works?” they ask the model to answer, it gets it wrong, and they leave it at that.
When you look at how capable Claude is, vs the salary of even a fresh graduate, combined with how expensive your time is… Even the maximum plan is a super good deal.
Claude's $20 plan should be renamed to "trial". Try Opus and you will reach your limit in 10 minutes. With Sonnet, if you aren't clearing the context very often, you'll hit it within a few hours. I'm sympathetic to developers who are using this as their only AI subscription: while I was working on a challenging bug yesterday, I reached the limit before it had even diagnosed the problem and had to switch to another coding agent to take over. I understand you can't expect much from a $20 subscription, but the next jump up costing $80 is demotivating.
That hasn't been true with Opus 4.5. I usually hit my limit after an hour of intense sessions.
Session limit that resets after 5 hours timed from the first message you sent. Most people I’ve seen report between 1 to 2 hours of dev time using Opus 4.5 on the Pro plan before hitting it unless you’re feeding in huge files and doing a bad job of managing your context.
I half agree, but it should be called “Hobbyist” since that’s what it’s good for. 10 minutes is hyperbolic; I average 1h30m even when using plan mode first and front-loading the context with dev diaries, git history, milestone documents and important excerpts from previous conversations. Something tells me your modules might be too big and need refactoring. That said, it’s a pain having to wait hours between sessions and jump when the window opens to make sure I stay on schedule and can get three in a day, but that works ok for hobby projects since I can do other things in between. I would agree that if you’re using it for work you absolutely need Max, so that should be what’s called the Pro plan, but what can you do? They chose the names so now we just need to add disclaimers.
the only thing that matters is whether or not you are getting your money’s worth. nothing else matters. if claude is worth $100 or $200 per month to you, it is an easy decision to pay. otherwise stick with $20 or nothing
Do you mean that users should start a new chat for every new task, to save tokens? Thanks.
Short answer is yes. Not only is it more token-friendly and potentially lower latency, it also prevents weird context issues like forgetting Rules, compacting your conversation and missing relevant details, etc.
> The time to cough up $100 or $200/month is when you've exhausted your $20/month quota and you are frustrated at getting cut off. At that point you should be able to make a responsible decision by yourself.
leo dicaprio snapping gif
These kinds of articles should focus on use case, because mileage may vary depending on maturity of the idea, testing, and a host of other factors.
If the app, service, or whatever is unproven, that's a sunk cost on a MacBook vs. 4 weeks to validate an idea, which is a pretty long time.
If the idea is sound then run it on a macbook :)
And as a hobbyist the time to sign up for the $20/month plan is after you've spent $20 on tokens at least a couple times.
YMMV based on the kinds of side projects you do, but it's definitely been cheaper for me in the long run to pay by token, and the flexibility it offers is great.
> If that's you, know that you can get a LONG way on the $20/month plans from OpenAI and Anthropic.
> The time to cough up $100 or $200/month is when you've exhausted your $20/month quota and you are frustrated at getting cut off. At that point you should be able to make a responsible decision by yourself.
These are the same people, by and large. What I have seen is users who purely vibe code everything and run into the limits of the $20/m models and pay up for the more expensive ones. Essentially they're trading learning coding (and time, in some cases, it's not always faster to vibe code than do it yourself) for money.
If this is the new way code is written then they are arguably learning how to code. Jury is still out though, but I think you are being a bit dismissive.
I wouldn't change definitions like that just because the technology changed, I'm talking about the ability to analyze control flow and logic, not necessarily put code on the screen. What I've seen from most vibe coders is that they don't fully understand what's going on. And I include myself, I tried it for a few months and the code was such garbage after a while that I scrapped it and redid it myself.
I've been a software developer for 25 years, and 30ish years in the industry, and have been programming my whole life. I worked at Google for 10 of those years. I work in C++ and Rust. I know how to write code.
I don't pay $100 to "vibe code" and "learn to program" or "avoid learning to program."
I pay $100 so I can get my personal (open source) projects done faster and more completely without having to hire people with money I don't have.
I'm talking about the general trend, not the exceptions. How much of the code do you manually write with the 100 dollar subscription? Vibe coding is a descriptive, not a prescriptive, label.
I review all of it, but hand write little of it. It's bizarre how I've ended up here, but yep.
That said, I wouldn't / don't trust it with something from scratch, I only trust it to do that because I built -- by hand -- a decent foundation for it to start from.
Sure, you're like me, you're not a vibe coder by the actual definition then. Still, the general trend I see is that a lot of actual vibe coders do try to get their product working, code quality be damned. Personally, same as you, I stopped vibe coding and actually started writing a lot of architecture and code myself first then allowing the LLM to fill in the features so to speak.
Came here to write something similar (other than working at Google, of course) and saw your comments reflecting my views.
Yes, it's worth spending $200/month on Claude to get my personal project ideas to come to life with better quality and finish.
Time is my limiting factor, especially on personal projects. To me, this makes any multiplying effect valuable.
Not a serious question but I thought it's an interesting way of looking at value.
I used to sell cars in SF. Some people wouldn't negotiate over $50 on a $500 a month lease because their apartment was $4k anyway.
Other people WOULD negotiate over $50 because their apartment was $4k.
When I consider it against my other hobbies, $100 is pretty reasonable for a month of supply. That being said, I wouldn’t do it every month. Just the months I need it.
From my personal experience it's around 50:50 between Claude and Codex. Some people strongly prefer one over the other. I couldn't figure out yet why.
(I also have the same MBP the author has and have used Aider with Qwen locally.)
I just can't accept how slow codex is, and that you can't really use it interactively because of that. I prefer to just watch Claude code work and stop it once I don't like the direction it's taking.
From my point of view, you're either choosing between instruction following or more creative solutions.
Codex models tend to be extremely good at following instructions, to the point that they won't do any additional work unless you ask for it. GPT-5.1 and GPT-5.2, on the other hand, are a little bit more creative.
Models from Anthropic, meanwhile, are a lot more loosey-goosey with instructions, and you need to keep an eye on them much more often.
I'm using models interchangeably from both providers all the time depending on the task at hand. No real preference for one being better than the other; they're just specialized in different things.
Bit the bullet this week and paid for a month of Claude and a month of ChatGPT Plus. Claude seems to have much lower token limits, both aggregate and rate-limited, and GPT-5.2 isn't a bad model at all. $20 for Claude is not enough even for a hobby project (after one day!); OpenAI looks like it might be.
I feel like a lot of the criticism the GPT-5.x models receive only applies to specific use cases. I prefer these models over Anthropic's because they are less creative and less likely to take liberties interpreting my prompts.
Sonnet 4.5 is great for vibe coding. You can give it a relatively vague prompt and it will take the initiative to interpret it in a reasonable way. This is good for non-programmers who just want to give the model a vague idea and end up with a working, sensible product.
But I usually do not want that; I do not want the model to take liberties and be creative. I want the model to do precisely what I tell it and nothing more. In my experience, the GPT-5.x models are a better fit for that way of working.
I’ve been using VS Code Copilot Pro for a few months and never really had any issue; once you hit the limit for one model, you generally still have a bunch more models to choose from. Unless I was vibe coding massive amounts of code without looking or testing, it’s hard to imagine I will run out of all the available pro models.
It helps that Codex is so much slower than Anthropic models; a 4.5-hour Codex session might as well be a 2-hour Claude Code one. I use both extensively FWIW.
Claude Code is a whole lot less generous though.
It really depends. When building a lot of new features it happens quite fast. With some attention to context length I was often able to go for over an hour on the $20 Claude plan.
If you're doing mostly smaller changes, you can go all day with the $20 Claude plan without hitting the limits. Especially if you need to thoroughly review the AI changes for correctness, instead of relying on automated tests.
I find that I use it on isolated changes where Claude doesn’t really need to access a ton of files to figure out what to do, and I can easily use it without hitting limits. The only time I hit the 4-5 hour limit is when I’m going nuts on a prototype idea and vibe coding absolutely everything, and usually when I hit the limit I’m pretty mentally spent anyway, so I use it as a sign to go do something else. I suppose everyone has different styles and different codebases, but for me I can pretty easily stay under the limit, so it’s hard to justify $100 or $200 a month.
If I wasn’t only using it for side projects I’d have to cough up the $200 out of necessity.
This, provided you don't mind hopping around a lot: five $20-a-month accounts will typically get you way more tokens. Good free models also show up from time to time on OpenRouter.
Codex $20 is a good deal but they have nothing in between $20 and $200.
The $20 Anthropic plan is only enough to whet my appetite; I can't finish anything.
I pay for the $100 Anthropic plan, and keep a $20 Codex plan in my back pocket for getting it to do additional review and analysis on top of what Opus cooks up.
And I have a few small $ of misc credits in DeepSeek and Kimi K2 AI services, mainly to try them out, for tasks that aren't as complicated, and for writing my own agent tools.
$20 Claude doesn't go very far.
Incidentally, wondering if anyone has seen this approach of asking Claude to manage Codex:
https://www.reddit.com/r/codex/comments/1pbqt0v/using_codex_...
I'm curious what the mental calculus was that led to thinking a $5k laptop would benchmark competitively against SOTA models for the next 5 years.
Somewhat comically, the author seems to have made it about 2 days. Out of 1,825. I think the real story is the folly of fixating your eyes on shiny new hardware and searching for justifications. I'm too ashamed to admit how many times I've done that dance...
Local models are purely for fun, hobby, and extreme privacy paranoia. If you really want privacy beyond a ToS guarantee, just lease a server (I know they can still be spying on that, but it's a threshold.)
I agree with everything you said, and yet I cannot help but respect a person who wants to do it himself. It reminds me of the hacker culture of the 80s and 90s.
Agreed,
Everyone seems to shun the DIY hacker nowadays, saying things like “I’ll just pay for it”.
It’s not just about NOT paying for it, but about doing it yourself and learning how to do it, so that you can pass the knowledge on and someone else can do it.
My 2023 Macbook Pro (M2 Max) is coming up to 3 years old and I can run models locally that are arguably "better" than what was considered SOTA about 1.5 years ago. This is of course not an exact comparison but it's close enough to give some perspective.
> I'm curious what the mental calculus was that led to thinking a $5k laptop would benchmark competitively against SOTA models for the next 5 years.
Well, the hardware remains the same but local models get better and more efficient, so I don't think there is much difference between paying 5k for online models over 5 years vs getting a laptop (and well, you'll need a laptop anyway, so why not just get a good enough one to run local models in the first place?).
Even if intelligence scaling stays equal, you'll lose out on speed. A sota model pumping 200 tk/s is going to be impossible to ignore with a 4 year old laptop choking itself at 3 tk/s.
Even still, right now is when the first generation of chipsets designed purely for LLMs is getting into data centers.
At a certain point, tokens per second stop mattering because the time to review stays constant. Whether it shits out 200 tokens a second versus 20, it doesn't much matter if you need to review the code that does come out.
If you have inference running on this new 128GB RAM Mac, wouldn't you still need another separate machine to do the manual work (like running IDE, browsers, toolchains, builders/bundlers etc.)? I can not imagine you will have any meaningful RAM available after LLM models are running.
No? First of all you can limit how much of the unified RAM goes into VRAM, and second, many applications don't need that much RAM. Even if you put 108 GB to VRAM and 16 to applications, you'll be fine.
Is that really the case? This summer there was "Frontier AI performance becomes accessible on consumer hardware within a year" [1] which makes me think it's a mistake to discount the open weights models.
[1] https://epoch.ai/data-insights/consumer-gpu-model-gap
But for SOTA performance you need specialized hardware. Even for Open Weight models.
40k in consumer hardware is never going to compete with 40k of AI specialized GPUs/servers.
Your link starts with:
> "Using a single top-of-the-line gaming GPU like NVIDIA’s RTX 5090 (under $2500), anyone can locally run models matching the absolute frontier of LLM performance from just 6 to 12 months ago."
I highly doubt a RTX 5090 can run anything that competes with Sonnet 3.5 which was released June, 2024.
With RAM prices spiking, there's no way consumers are going to have access to frontier quality models on local hardware any time soon, simply because they won't fit.
That's not the same as discounting the open weight models though. I use DeepSeek 3.2 heavily, and was impressed by the Devstral launch recently. (I tried Kimi K2 and was less impressed). I don't use them for coding so much as for other purposes... but the key thing about them is that they're cheap on API providers. I put $15 into my deepseek platform account two months ago, use it all the time, and still have $8 left.
I think the open weight models are 8 months behind the frontier models, and that's awesome. Especially when you consider you can fine tune them for a given problem domain...
I don’t think I’ve ever read an article where the reason I knew the author was completely wrong about all of their assumptions was that they admitted it themselves and left the bad assumptions in the article.
The above paragraph is meant to be a compliment.
But justifying it based on keeping his Mac for five years is crazy. At the rate things are moving, coding models are going to get so much better in a year that the gap is going to widen.
Also, in the case of his father who is working for a company that must use a self-hosted model, or any other company that needed it, would a $10K Mac Studio with 512GB RAM be worth it? What about two Mac Studios connected over Thunderbolt using the newly released support in macOS 26?
https://news.ycombinator.com/item?id=46248644
This story talks about MLX and Ollama but doesn't mention LM Studio - https://lmstudio.ai/
LM Studio can run both MLX and GGUF models but does so from an Ollama style (but more full-featured) macOS GUI. They also have a very actively maintained model catalog at https://lmstudio.ai/models
but people should use llama.cpp instead
I suspect Ollama is at least partly moving away from open source as they look to raise capital; when they released their replacement desktop app they did so as closed source. You're absolutely right that people should be using llama.cpp - not only is it truly open source but it's significantly faster, has better model support, many more features, is better maintained, and the development community is far more active.
and why should that affect usage? it's not like ollama users fork the repo before installing it.
MLX is a lot more performant than Ollama and llama.cpp on Apple Silicon, comparing both peak memory usage + tok/s output.
edit: LM Studio benefits from MLX optimizations when running MLX compatible models.
LMStudio? No, it's the easiest way to run an LLM locally that I've seen, to the point where I've stopped looking at other alternatives.
It's cross-platform (Win/Mac/Linux), detects the most appropriate GPU in your system and tells you whether the model you want to download will run within its RAM footprint.
It lets you set up a local server that you can access through API calls as if you were remotely connected to an online service.
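For example, once the local server is running, anything that already speaks the OpenAI API can point at it. A rough sketch (the port is the default it suggests and the model name depends on what you've loaded, so adjust both):

    # Sketch: talk to a locally hosted model through its OpenAI-compatible endpoint.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed-locally")

    resp = client.chat.completions.create(
        model="local-model",  # use the identifier the server lists for your loaded model
        messages=[{"role": "user", "content": "Explain what a unified memory architecture is."}],
    )
    print(resp.choices[0].message.content)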
The tradeoff is a somewhat higher learning curve, since you need to manually browse the model library and choose the model/quantization that best fits your workflow and hardware. OTOH, it's also open-source, unlike LMStudio which is proprietary.
[edit] Oh and apparently you can also directly run some models from HuggingFace: https://huggingface.co/docs/hub/ollama
I mean, what's the point of using local models if you can't trust the app itself?
and you think ollama doesn't do telemetry/etc. just because it's open source?
Most LLM sites are now offering free plans, and they are usually better than what you can run locally, so I think people are running local models for privacy 99% of the time.
"This particular [80B] model is what I’m using with 128GB of RAM". The author then goes on to breezily suggest you try the 4B model instead of you only have 8GB of RAM. With no discussion of exactly what a hit in quality you'll be taking doing that.
This is like if an article titled "A guide to growing your own food instead of buying produce" explained that the author was using a four-acre plot of farmland but suggested that that reader could also use a potted plant instead. Absolutely baffling.
The money argument is IMHO not super strong here, as that Mac depreciates more per month than the subscription they want to avoid.
There may be other reasons to go local, but I would say that the proposed way is not cost effective.
There's also a fairly large risk that this HW may be sufficient now, but will be too small in not too long. So there is a large financial risk built into this approach.
The article proposes using smaller/less capable models locally. But this argument also applies to online tools! If we use less capable tools even the $20/mo subscriptions won't hit their limit.
In my experience the latest models (Opus 4.5, GPT 5.2) are _just_ starting to keep up with the problems I'm throwing at them, and I really wish they did a better job, so I think we're still 1-2 years away from local models not wasting developer time outside of CRUD web apps.
Eh, these things are trained on existing data. The further you are from that the worse the models get.
I've noticed that I need to be a lot more specific in those cases, up to the point where being more specific is slowing me down, partially because I don't always know what the right thing is.
For sure, and I guess that's kind of my point -- if the OP says local coding models are now good enough, then it's probably because he's using things that are towards the middle of the distribution.
similar for me -- also how do you get the proper double dashes -- anyway, I’d love to be able to run CLI agents fully local, but I don’t see it being good enough (relative to what you can get for pretty cheap from SOTA models) anytime soon
Buying a maxed out MacBook Pro seems like the most expensive way to go about getting the necessary compute. Apple is notorious for overcharging for hardware, especially on ram.
I bet you could build a stationary tower for half the price with comparable hardware specs. And unless I'm missing something you should be able to run these things on Linux.
Getting a maxed out non-apple laptop will also be cheaper for comparable hardware, if portability is important to you.
On Linux your options are the NVidia Spark (and other vendor versions) or the AMD Ryzen AI series.
These are good options, but there are significant trade-offs. I don't think there are Ryzen AI laptops with 128GB RAM for example, and they are pricey compared to traditional PCs.
You also have limited upgradeability anyway - the RAM is soldered.
Not an Apple fanboy, but I was under the impression that having access to up to 512GB of usable GPU memory was the main feature in favour of the Mac.
And now with Exo, you can even break the 512GB barrier.
I wouldn't run local models on the development PC. Instead run them on a box in another room or another location. Less fan noise and it won't influence the performance of the pc you're working on.
Latency is not an issue at all for LLMs, even a few hundred ms won't matter.
It doesn't make a lot of sense to me, except when working offline while traveling.
It's interesting to notice that here https://metr.org/blog/2025-03-19-measuring-ai-ability-to-com... we default to measuring LLM coding performance by how long a human task (~5h) a model can complete with a 50% success rate (with an 80% fallback for the second chart, ~0.5h), while here it seems that for actual coding we really care about the last 90-100% of the costly model's performance.
Cline + RooCode and VSCode already work really well with local models like qwen3-coder or even the latest gpt-oss. It is not as plug-and-play as Claude, but it gets you to a point where you only have to do the last 5% of the work.
I use it to build some side-projects, mostly apps for mobile devices. It is really good with Swift for some reason.
I also use it to start off MVP projects that involve both frontend and API development, but you have to be super verbose, unlike when using Claude. The context window is also small, so you need to know how to break the work up into parts that you can put together on your own.
I didn’t try it long because I got frustrated waiting for it to spit out wrong answers.
But I’m open to trying again.
Under current prices buying hardware just to run local models is not worth it EVER, unless you already need the hardware for other reasons or you somehow value having no one else be able to possibly see your AI usage.
Let's be generous and assume you are able to get a RTX 5090 at MSRP ($2000) and ignore the rest of your hardware, then run a model that is the optimal size for the GPU. A 5090 has one of the best throughputs in AI inference for the price, which benefits the local AI cost-efficiency in our calculations. According to this reddit post it outputs Qwen2.5-Coder 32B at 30.6 tokens/s.
https://www.reddit.com/r/LocalLLaMA/comments/1ir3rsl/inferen...
It's probably quantized, but let's again be generous and assume it's not quantized any more than models on OpenRouter. Also we assume you are able to keep this GPU busy with useful work 24/7 and ignore your electricity bill. At 30.6 tokens/s you're able to generate 993M output tokens in a year, which we can conveniently round up to a billion.
Currently the cheapest Qwen2.5-Coder 32B provider on OpenRouter that doesn't train on your input runs it at $0.06/M input and $0.15/M output tokens. So it would cost $150 to serve 1B tokens via API. Let's assume input costs are similar since providers have an incentive to price both input and output proportionately to cost, so $300 total to serve the same amount of tokens as a 5090 can produce in 1 year running constantly.
Conclusion: even with EVERY assumption in favor of the local GPU user, it still takes almost 7 years for running a local LLM to become worth it. (This doesn't take into account that API prices will most likely decrease over time, but also doesn't take into account that you can sell your GPU after the breakeven period. I think these two effects should mostly cancel out.)
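Spelling that arithmetic out (same generous assumptions; minor rounding differences from the 1B figure above):

    # Back-of-the-envelope breakeven for a local RTX 5090 vs. paying an API provider.
    gpu_cost = 2000               # 5090 at MSRP, ignoring the rest of the machine and electricity
    tokens_per_sec = 30.6         # Qwen2.5-Coder 32B output speed from the linked post
    tokens_per_year = tokens_per_sec * 3600 * 24 * 365          # ~965M if the GPU never idles
    cost_per_m_output = 0.15      # cheapest no-training OpenRouter provider
    api_cost_per_year = 2 * cost_per_m_output * tokens_per_year / 1e6   # ~2x to also cover input

    print(f"~{tokens_per_year / 1e6:.0f}M tokens/year, ~${api_cost_per_year:.0f}/year via API")
    print(f"breakeven after ~{gpu_cost / api_cost_per_year:.1f} years of 24/7 use")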
In the real world in OP's case, you aren't running your model 24/7 on your MacBook; it's quantized and less accurate than the one on OpenRouter; a MacBook costs more and runs AI models a lot slower than a 5090; and you do need to pay electricity bills. If you only change one assumption and run the model only 1.5 hours a day instead of 24/7, then the breakeven period already goes up to more than 100 years instead of 7 years.
Basically, unless you absolutely NEED a laptop this expensive for other reasons, don't ever do this.
I appreciate the author's modesty but the flip-flopping was a little confusing. If I'm not mistaken, the conclusion is that by "self-hosting" you save money in all cases, but you cripple performance in scenarios where you need to squeeze out the kind of quality that requires hardware that's impractical to cobble together at home or within a laptop.
I am still toying with the notion of assembling an LLM tower with a few old GPUs but I don't use LLMs enough at the moment to justify it.
If you want to do it cheap, get a desktop motherboard with two PCIe slots and two GPUs.
Cheap tier is dual 3060 12G. Runs 24B Q6 and 32B Q4 at 16 tok/sec. The limitation is VRAM for large context. 1000 lines of code is ~20k tokens; 32k tokens is ~10G VRAM.
Expensive tier is dual 3090 or 4090 or 5090. You'd be able to run 32B Q8 with large context, or a 70B Q6.
For software, llama.cpp and llama-swap. GGUF models from HuggingFace. It just works.
If you need more than that, you're into enterprise hardware with 4+ PCIe slots which costs as much as a car and the power consumption of a small country. You're better to just pay for Claude Code.
I was going to post snark such as “you could use the same hardware to also lose money mining crypto”, then realized there are a lot of crypto miners out there that could probably make more money running tokens than they do on crypto. Does such a marketplace exist?
I’ve been using Qwen3 Coder 30B quantized down to IQ3_XXS to fit in < 16GB of VRAM. Blazing fast, 200+ tokens per second on a 4080. I don’t ask it anything complicated, but one-off scripts to do something I’d normally have to do manually by hand or spend an hour writing myself? Absolutely.
These are no more than a few dozen lines I can easily eyeball and verify with confidence - that’s done in under 60 seconds and leaves Claude Code with plenty of quota for significant tasks.
My takeaway is that clock is ticking on Claude, Codex et al's AI monopoly. If a local setup can do 90% of what Claude can do today, what do things look like in 5 years?
I think they have already realized this, which is why they are moving towards tool use instead of text generation. Also explains why there are no more free APIs nowadays (even for search)
I do not spend $100/month. I pay for one Claude Pro subscription and then a (much cheaper) z.ai Coding Plan, which is about one fifth the cost.
I use Claude for all my planning, create task documents and hand over to GLM 4.6. It has been my workhorse as a bootstrapped founder (building nocodo, think Lovable for AI agents).
I have heard about this approach elsewhere too. Could you please provide some more details on the set up steps and usage approach. I would like to replicate. Thanks.
I just got a RTX 5090, so I thought I'd see what all the fuss was about these AI coding tools. I've previously copy pasted back and forth from Claude but never used the instruct models.
So I fired up Cline with gpt-oss-120b, asked it to tell me what a specific function does, and proceeded to watch it run `cat README.md` over and over again.
I'm sure it's better with the Qwen Coder models, but it was a pretty funny first look.
Can anyone give any tips for getting something that runs fairly fast under ollama? It doesn't have to be very intelligent.
When I tried gpt-oss and qwen using ollama on an M2 Mac the main problem was that they were extremely slow. But I did have a need for a free local model.
If privacy is your top priority, then sure spend a few grand on hardware and run everything locally.
Personally, I run a few local models (around 30B params is the ceiling on my hardware at 8k context), and I still keep a $200 ChatGPT subscription cause I'm not spending $5-6k just to run models like K2 or GLM-4.6 (they’re usable, but clearly behind OpenAI, Claude, or Gemini for my workflow)
I got excited about aescoder-4b (a model that specializes in web design only) after its DesignArena benchmarks, but it falls apart on large codebases and is mediocre at Tailwind.
That said, I think there’s real potential in small, highly specialized models, like a 4B model trained only for FastAPI, Tailwind or a single framework. Until that actually exists and works well, I’m sticking with remote services.
If you scale that setup and add a couple of used RTX 3090s with heavy memory offloading, you can technically run something in the K2 class.
Best choice will depend on use cases.
I think the long term will depend on the legal/rent-seeking side.
Imagine having the hardware capacity to run things locally, but not the necessary compliance infrastructure to ensure that you aren't committing a felony under the Copyright Technofeudalism Act of 2030.
This is not really a guide to local coding models which is kinda disappointing. Would have been interested in a review of all the cutting edge open weight models in various applications.
Some just like privacy and working without internet. I, for example, travel regularly by train and like to have my laptop usable when there isn't good WiFi.
Are people really so naive to think that the price/quality of proprietary models is going to stay the same forever? I would guess sometime in the next 2-3 years all of the major AI companies are going to increase the price/enshittify their models to the point where running local models is really going to be worth it.
Not worth it yet. I run a 6000 Blackwell for image and video generation, but local coding models just aren't on the same level as the closed ones.
I grabbed Gemini for $10/month during Black Friday, GPT for $15, and Claude for $20. Comes out to $45 total, and I never hit the limits since I toggle between the different models. Plus it has the benefit of not dumping too much money into one provider or hyper focusing on one model.
That said, as soon as an open weight model gets to the level of the closed ones we have now, I'll switch to local inference in a heartbeat.
> Imagine buying hardware that will be obsolete in 2 years
Unless the PC you buy is more than $4,800 (24 x $200) it is still a good deal. For reference, a MacBook M4 Max with 128GB of unified RAM is $4,699. You need a computer for development anyway, so the extra you pay for inference is more like $2-3K.
Besides, it will still run the same model(s) at the same speed after that period, or even maybe faster with future optimisations in inference.