GLM-5.1: Towards Long-Horizon Tasks

(z.ai)

380 points | by zixuanlimit 7 hours ago

24 comments

dvt 3 minutes ago

Every single day, three things are becoming more and more clear:

    (1) OpenAI/Anthropic are absolutely cooked; it's clear they have no moat
    (2) Local/private inference is the future of AI
    (3) There's *still* no killer product yet (so get to work!)

simonw 2 hours ago
Not only did this one draw me an excellent pelican... it also animated it! https://simonwillison.net/2026/Apr/7/glm-51/
[-]
- ipsum2 2 hours ago
  It made it realistic. A pelican is much more likely to be flying in the sky than riding a bicycle.
- _pdp_ 1 hour ago
  Simon, you need to come up with improved benchmarks soon.
  [-]
  - lemonish97 52 minutes ago
    Agree. But you can keep the pelican theme in whatever new benchmark you choose to come up with. Iconic at this point.
Yukonv 6 hours ago
Unsloth quantizations are available on release as well. [0] The IQ4_XS is a massive 361 GB with the 754B parameters. This is definitely a model your average local LLM enthusiast is not going to be able to run even with high end hardware.
[0] https://huggingface.co/unsloth/GLM-5.1-GGUF
[-]
- zozbot234 5 hours ago
  SSD offload is always a possibility with good software support. Of course you might easily object that the model would not be "running" then, more like crawling. Still you'd be able to execute it locally and get it to respond after some time.
  Meanwhile we're even seeing emerging 'engram' and 'inner-layer embedding parameters' techniques where the possibility of SSD offload is planned for in advance when developing the architecture.
  [-]
  - adrian_b 3 hours ago
    For conversational purposes that may be too slow, but as a coding assistant this should work, especially if many tasks are batched, so that they may progress simultaneously through a single pass over the SSD data.
    [-]
    - QuantumNomad_ 3 hours ago
      Three hour coffee break while the LLM prepares scaffolding for the project.
      [-]
      - pbhjpbhj 1 hour ago
        Like computing used to be. When I first compiled a Linux kernel it ran overnight on a Pentium-S. I had little idea what I was doing, probably compiled all the modules by mistake.
        [-]
        drowsspa 50 minutes ago
        At least the compiler was free
      - cyanydeez 2 hours ago
        [flagged]
        [-]
        dcreater 1 hour ago
        @dang
    - zozbot234 3 hours ago
      Batching many disparate tasks together is good for compute efficiency, but makes it harder to keep the full KV-cache for each in RAM. You could handle this in an emergency by dumping some of that KV-cache to storage (this is how prompt caching works too, AIUI) and offloading loads for that too, but that adds a lot more overhead compared to just offloading sparsely-used experts, since KV-cache is far more heavily accessed.
alex7o 6 hours ago
To be honest I am a bit sad as, glm5.1 is producing mich better typescript than opus or codex imo, but no matter what it does sometimes go into shizo mode at some point over longer contexts. Not always tho I have had multiple session go over 200k and be fine.
[-]
- disiplus 5 hours ago
  When it works and its not slow it can impress. Like yesterday it solved something that kimi k2.5 could not. and kimi was best open source model for me. But it still slow sometimes. I have z.ai and kimi subscription when i run out of tokens for claude (max) and codex(plus).
  i have a feeling its nearing opus 4.5 level if they could fix it getting crazy after like 100k tokens.
- InsideOutSanta 3 hours ago
  I just set the context window to 100k and manage it actively (e.g. I compact it regularly or make it write out documentation of its current state and start a new session).
  For me, Opus 4.6 isn't working quite right currently, and I often use GLM 5.1 instead. I'd prefer to use peak Opus over GLM 5.1, but GLM 5.1 is an adequate fallback. It's incredible how good open-weight models have gotten.
- MegagramEnjoyer 5 hours ago
  Why is that sad? A free and open source model outperforming their closed source counterparts is always a win for the users
  [-]
  - KaoruAoiShiho 5 hours ago
    The non-awesome context window is the sad part, but I think a better harness can deal with this.
- DeathArrow 5 hours ago
  After the context gets to 100k tokens you should open a new session or run /compact.
- cmrdporcupine 5 hours ago
  I honestly still hold onto habits from earlier days of Claude & Codex usage and tend to wipe / compact my context frequently. I don't trust the era of big giant contexts, frankly, even on the frontier models.
  [-]
  - calgoo 4 hours ago
    I also feel like its helping me on the big models these days with claude giving so many issues.
- varispeed 4 hours ago
  Isn't the same with opus nowadays?
johnfn 5 hours ago
GLM-5.0 is the real deal as far as open source models go. In our internal benchmarks it consistently outperforms other open source models, and was on par with things like GPT-5.2. Note that we don't use it for coding - we use it for more fuzzy tasks.
[-]
- sourcecodeplz 5 hours ago
  Yep, haven't tried 5.1 but for my PHP coding, GLM-5 is 99% the same as Sonnet/Opus/GPT-5 levels. It is unbelievably strong for what it costs, not to mention you can run it locally.
- deepsquirrelnet 4 hours ago
  I am working on a large scale dataset for producing agent traces for Python <> cython conversion with tooling, and it is second only to gemini pro 3.1 in acceptance rates (16% vs 26%).
  Mid-sized models like gpt-oss minimax and qwen3.5 122b are around 6%, and gemma4 31b around 7% (but much slower).
  I haven’t tried Opus or ChatGPT due to high costs on openrouter for this application.
- epolanski 3 hours ago
  Same thing I noticed.
  My use cases are not code editing or authoring related, but when it comes to understanding a codebase and it's docs to help stakeholders write tasks or understand systems it has always outperformed american models at roughly half the price.
kamranjon 3 hours ago
I'm crossing my fingers they release a flash version of this. GLM 4.7 Flash is the main model I use locally for agentic coding work, it's pretty incredible. Didn't find anything in the release about it - but hoping it's on the horizon.
minimaxir 5 hours ago
The focus on the speed of the agent generated code as a measure of model quality is unusual and interesting. I've been focusing on intentionally benchmaxxing agentic projects (e.g. "create benchmarks, get a baseline, then make the benchmarks 1.4x faster or better without cheating the benchmarks or causing any regression in output quality") and Opus 4.6 does it very well: in Rust, it can find enough low-level optimizations to make already-fast Rust code up to 6x faster while still passing all tests.
It's a fun way to quantify the real-world performance between models that's more practical and actionable.
dryarzeg 1 hour ago
A bit off-topic, but for some reason, even though I don't use LLMs for my job or for my hobbies, or in daily life frequently (and when I do, it's mostly some kind of "rubber duck brainstorm"), when I see open-weight releases like this one or the recent Gemma 4 (which is very good for local models); the first time was with DeepSeek-R1 (this one, despite being blamed for "censorship", was heavily censored only via DeepSeek API, the local model - full-weight 685B, not the distilled ones - was pretty much unhinged regarding censorship on any topic)... there's always one song coming to mind and I simply can't get rid of it no matter how hard I try.
"I am the storm that is approaching, provoking..." : )
winterqt 5 hours ago
Comments here seem to be talking like they've used this model for longer than a few hours -- is this true, or are y'all just sharing your initial thoughts?
[-]
- KaoruAoiShiho 5 hours ago
  Blog post is new but the model is about 2 weeks in public.
- stavros 4 hours ago
  My local tennis court's reservation website was broken and I couldn't cancel a reservation, and I asked GLM-5.1 if it can figure out the API. Five minutes later, I check and it had found a /cancel.php URL that accepted an ID but the ID wasn't exposed anywhere, so it found and was exploiting a blind SQL injection vulnerability to find my reservation ID.
  Overeager, but I was really really impressed.
  [-]
  - disiplus 4 hours ago
    Yeah it seems they did not align it to much, at least for now. Yesterday it helped me bypass the bot detection on a local marketplace. that i wanted to scrap some listing for my personal alerting system. Al the others failed but glm5.1 found a set of parameters and tweaks how to make my browser in container not be detected.
    [-]
    - ReptileMan 2 hours ago
      Model doing what the user wants with high quality is definitely aligned in my book.
  - arcanemachiner 4 hours ago
    That is both amazing and terrifying.
  - mikkupikku 3 hours ago
    Unfathomably based.
  - bglazer 4 hours ago
    This is insane, I love it.
- BeetleB 5 hours ago
  It's been out for a while.
RickHull 6 hours ago
I am on their "Coding Lite" plan, which I got a lot of use out of for a few months, but it has been seriously gimped now. Obvious quantization issues, going in circles, flipping from X to !X, injecting chinese characters. It is useless now for any serious coding work.
[-]
- unicornfinder 6 hours ago
  I'm on their pro plan and I respectfully disagree - it's genuinely excellent with GLM 5.1 so long as you remember to /compact once it hits around 100k tokens. At that point it's pretty much broken and entirely unusable, but if you keep context under about 100k it's genuinely on par with Opus for me, and in some ways it's arguably better.
  [-]
  - airstrike 5 hours ago
    100k tokens it's basically nothing these days. Claude Opus 4.6M with 1M context windows is just a different ball game
    [-]
    - wild_egg 4 hours ago
      The Dumb Zone for Opus has always started at 80-100k tokens. The 1M token window just made the dumb zone bigger. Probably fine if the work isn't complicated but really I never want an Opus session to go much beyond 100k.
    - plandis 4 hours ago
      Claude Opus can use a 1M context window but I’ve found it to degrade significantly past 250k in practice.
    - braebo 5 hours ago
      The cost per message increases with context while quality decreases so it’s still generally good to practice strategic context engineering. Even with cross-repo changes on enterprise systems, it’s uncommon to need more than 100k (unless I’m using playwright mcp for testing).
    - bredren 5 hours ago
      I had thought this, but my experience initially was that performance degradation began getting noticeable not long after crossing the old 250k barrier.
      So, it has been convenient to not have hard stops / allow for extra but I still try to /clear at an actual 25% of the 1M anyhow.
      This is in contrast to my use of the 1M opus model this past fall over the API, which seemed to perform more steadily.
    - syntaxing 5 hours ago
      I’m genuinely surprised. I use copilot at work which is capped at 128K regardless of model and it’s a monorepo. Admittedly I know our code base really well so I can point towards different things quickly directly but I don’t think I ever needed compacting more than a handful in the past year. Let alone 1M tokens.
    - operatingthetan 5 hours ago
      The context windows of these Chinese open-source subscriptions (GLM, Minimax, Kimi) is too small and I'm guessing it's because they are trying to keep them cheap to run. Fine for openclaw, not so much for coding.
    - arcanemachiner 4 hours ago
      Personal opinions follow:
      Claude Opus at 150K context starts getting dumber and dumber.
      Claude Opus at 200K+ is mentally retarded. Abandon hope and start wrapping up the session.
    - thawab 5 hours ago
      Don’t want to disappoint you, but above 200k opus memory is like a gold fish. You need to be below 150k to get good research and implementation.
      [-]
      - arcanemachiner 4 hours ago
        Oh nice, I just wrote pretty much the same comment above yours.
    - epolanski 3 hours ago
      Quality degrades fast with context length for all models.
      If you want quality you still have to compact or start new contextes often.
  - kay_o 5 hours ago
    Is manual compation absolutely mandatory ?
    [-]
    - jauntywundrkind 5 hours ago
      I haven't screenshotted to alas, but it goes from being a perfectly reasonable chatty LLM, to suddenly spewing words and nonsense characters around this threshold, at least for me as a z.ai pro (mid tier) user.
      For around a month the limit seemed to be a little over 60k! I was despondent!!
      What's worse is that when it launched it was stable across the context window. My (wild) guess is that the model is stable but z.ai is doing something wonky with infrastructure, that they are trying to move from one context window to another or have some kv cache issues or some such, and it doesn't really work. If you fork or cancel in OpenCode there's a chance you see the issue much earlier, which feels like some other kind of hint about kv caching, maybe it not porting well between different shaped systems.
      More maliciously minded, this artificial limit also gives them an artificial way to dial in system load. Just not delivering the context window the model has reduces the work of what they have to host?
      But to the question: yes compaction is absolutely required. The ai can't even speak it's just a jumbled stream of words and punctuation once this hits. Is manual compaction required? One could find a way to build this into the harness, so no, it's a limitation of our tooling that our tooling doesn't work around the stated context window being (effectively) a lie.
      I'd really like to see this improved! At least it's not 60-65k anymore; those were soul crushing weeks, where I felt like my treasured celebrated joyful z.ai plan was now near worthless.
      There's a thread https://news.ycombinator.com/item?id=47678279 , and I have more extensive history / comments on what I've seen there.
      The question is: will this reproduce on other hosts, now that glm-5.1 is released? I expect the issue is going to be z.ai specific, given what I've seen (200k works -> 60k -> 100k context windows working on glm-5.1).
      [-]
      - calgoo 4 hours ago
        I have gone back to having it create a todo.md file and break it into very small tasks. Then i just loop over each task with a clear context, and it works fine. a design.md or similar also helps, but most of the time i just have that all in a README.md file. I was also suspicious around the 100k almost to the token for it to start doing loops etc.
      - disiplus 4 hours ago
        basically my expirience as well. Sometimes it can break past 100k and be ok, but mostly it breaks down.
- kay_o 6 hours ago
  I am on the mid tier Coding plan to trying it out for the sake of curiosity.
  During off peak hour a simple 3 line CSS change took over 50 minutes and it routinely times out mid-tool and leaves dangling XML and tool uses everywhere, overwriting files badly or patching duplicate lines into files
  [-]
  - harias 3 hours ago
    Off peak for China or US
    [-]
    - kay_o 3 hours ago
      Off peak for China. Off peak times are only in one timezone
- InsideOutSanta 3 hours ago
  My impression is that different users get vastly different service, possibly based on location. I live in Western Europe, and it works perfectly for me. Never had a single timeout or noticeable quality degradation. My brother lives in East Asia, and it's unusable for him. Some days, it just literally does not work, no API calls are successful. Other days, it's slow or seems dumber than it should be.
- satvikpendem 5 hours ago
  Every model seems that way, going back to even GPT 3 and 4, the company comes out with a very impressive model that then regresses over a few months as the company tries to rein in inference costs through quantization and other methods.
- wolttam 5 hours ago
  This is surprising to me. Maybe because I'm on Pro, and not Lite. I signed up last week and managed to get a ton of good work done with 5.1. I think I did run into the odd quantization quirk, but overall: $30 well spent
- Mashimo 5 hours ago
  I'm also on the lite plan and have been using 5.1 for a few days now. It works fine for me.
  But it's all casual side projects.
  Edit: I often to /compact at around 100 000 token or switch to a new session. Maybe that is why.
- LaurensBER 5 hours ago
  I'm on their lite plan as well and I've been using it for my OpenClaw. It had some issues but it also one-shotted a very impressive dashboard for my Twitter bookmarks.
  For the price this is a pretty damn impressive model.
- cmrdporcupine 5 hours ago
  Is there any advantage to their fixed payment plans at all vs just using this model via 3rd party providers via openrouter, given how relatively cheap they tend to be on a per-token basis?
  Providers like DeepInfra are already giving access to 5.1 https://deepinfra.com/zai-org/GLM-5.1
  $1.40 in $4.40 out $0.26 cached
  / 1M tokens
  That's more expensive than other models, but not terrible, and will go down over time, and is far far cheaper than Opus or Sonnet or GPT.
  I haven't had any bad luck with DeepInfra in particular with quantization or rate limiting. But I've only heard bad things about people who used z.ai directly.
  [-]
  - Lalabadie 4 minutes ago
    I use GLM 5 Turbo sporadically for a client, and my Openrouter expense might climb over a dollar per day if I insist. At about 20 work days per month it's an easy choice.
- esafak 5 hours ago
  I'm on their Lite plan and I see some of this too. It is also slow. I use it as a backup.
- benterix 5 hours ago
  > Obvious quantization issues
  Devil's advocate: why shouldn't they do it if OpenAI, Anthropic and Google get away with playing this game?
  [-]
  - cmrdporcupine 3 hours ago
    I think what Anthropic is doing is more subtle. It's less about quantizing and more about depth of thinking. They control it on their end and they're dynamically fiddling with those knobs.
- margorczynski 5 hours ago
  It has been useless for long time when compared to Opus or even something like Kimi. The saving grace was that it was dirt cheap but that doesn't matter if it can't do what I want even after many repeated tries and trying to push it to a correct solution.
mark_l_watson 3 hours ago
I can’t wait to try it. I set up a new system this morning with OpenClaw and GLM-5, and I like GLM-5 as the backend for Claude Code. Excellent results.
blazespin 1 hour ago
Anthropic's reply? A model you can't use.
[-]
- minimaxir 1 hour ago
  Mythos is most definitely not in response to this announcement.
kirby88 5 hours ago
I wonder how that compare to harness methods like MAKER https://www.cognizant.com/us/en/ai-lab/blog/maker
gavinray 5 hours ago
I find the "8 hour Linux Desktop" bit disingenuous, in the fine print it's a browser page:
```
  > "build a Linux-style desktop environment as a web application"
```
They claim "50 applications from scratch", but "Browser" and a bunch of the other apps are likely all <iframe> elements.
We all know that building a spec-compliant browser alone is a herculean task.
[-]
- MrPowerGamerBR 3 hours ago
  In my opinion it would be way cooler if it actually created a real Linux desktop environment instead of only a replica.
  Would it succeed? Probably not, but it would be way more interesting, even if it didn't work.
  I find things like Claude's C compiler way more interesting where, even though CCC is objectively bad (code is messy, generates very bad unoptimized code, etc) it at least is something cool and shows that with some human guideance it could generate something even better.
- bredren 5 hours ago
  It is a big claim without the source and prompting.
DeathArrow 5 hours ago
I am already subscribed to their GLM Coding Pro monthly plan and working with GLM 5.1 coupled with Open Code is such a pleasure! I will cancel my Cursor subscription.
tgtweak 4 hours ago
Share the harness for that browser linux OS task :)
epolanski 3 hours ago
I was very satisfied with GLM5, I'm not gonna lie.
Excited to test this.
bigyabai 6 hours ago
It's an okay model. My biggest issue using GLM 5.1 in OpenCode is that it loses coherency over longer contexts. When you crest 128k tokens, there's a high chance that the model will start spouting gibberish until you compact the history.
For short-term bugfixing and tweaks though, it does about what I'd expect from Sonnet for a pretty low price.
[-]
- cassianoleal 5 hours ago
  I've done some very long sessions on OpenCode with Dynamic Context Pruning. Highly recommend it.
  https://github.com/Opencode-DCP/opencode-dynamic-context-pru...
- embedding-shape 6 hours ago
  > It's an okay model. My biggest issue using GLM 5.1 in OpenCode is that it loses coherency over longer contexts
  Since the entire purpose, focus and motivation of this model seems to have been "coherency over longer contexts", doesn't that issue makes it not an OK model? It's bad at the thing it's supposed to be good at, no?
  [-]
  - wolttam 6 hours ago
    long(er) contexts (than the previous model)
    It does devolve into gibberish at long context (~120k+ tokens by my estimation but I haven't properly measured), but this is still by far the best bang-for-buck value model I have used for coding.
    It's a fine model
    [-]
    - disiplus 4 hours ago
      i have glm and kimi. kimi was in most of the cases better and my replacement for claude when i run out of tokens. Now im finding myself using glm more then kimi. Its funny that glm vs kimi, is like codex vs claude. Where glm and codex are better for backend and kimi and claude more for frontend.
      as kimi did a huge amount of claude distilation it seems to be somewhat based in data
      https://www.anthropic.com/news/detecting-and-preventing-dist...
    - verdverm 6 hours ago
      Have you tried gemma4?
      I'm curious how the bang for buck ratio works in comparison. My initial tests for coding tasks have been positive and I can run it at home. Bigger models I assume are still better on harder tasks.
- whimblepop 6 hours ago
  That's pretty few, at least for the way I'm currently using LLMs. I have them do some Nix work (both debugging and coding) where accuracy and quality matters to me, so they're instructed to behave as I would when it comes to docs, always consulting certain docs and source code in a specific order. It's not unusual for them to chew through 200k - 600k tokens in a single session before they solve everything I want them to. That's what I currently think of when I think of "long horizon within a single context window".
  So I need them to not only not devolve into gibberish, but remain smart enough to be useful at contexts several times longer than that.
- nkko 4 hours ago
  Yes, this is frustrating, but it doesn’t occur in CC. I run the conversation logs through an agent and opencode source, and it identified an issue in the reasoning implementation of opencode for Zai models. Consequently, I ceased my research and opted to use CC instead.
- jauntywundrkind 6 hours ago
  Chiming in to second this issue. It is wildly frustrating.
  I suspect that this isn't the model, but something that z.ai is doing with hosting it. At launch I was related to find glm-5.1 was stable even as the context window filled all the way up (~200k). Where-as glm-5, while it could still talk and think, but had forgotten the finer points of tool use to the point where it was making grevious errors as it went (burning gobs of tokens to fix duplicate code problems).
  However, real brutal changes happened sometimes in the last two or three months: the parent problem emerged and emerged hard, out of nowhere. Worse, for me, it seemed to be around 60k context windows, which was heinous: I was honestly a bit despondent that my z.ai subscription had become so effectively useless. That I could only work on small problems.
  Thankfully the coherency barrier raised signficiantly around three weeks go. It now seems to lose its mind and emits chaotic non-sentance gibberish around 100k for me. GLM-5 was already getting pretty shaky at this point, so I feel like I at least have some kind of parity. But at least glm-5 was speaking & thinking with real sentances, I could keep conversing with it somewhat, where-as glm-5.1 seems to go from perfectly level headed working fine to all of a sudden just total breakdown, hard switch, at such a predictable context window size.
  It seems so so probable to me that this isn't the model that's making this happen: it's the hosting. There's some KV cache issue, or they are trying to expand the context window in some way, or to switch from one serving pool of small context to a big context serving pool, or something infrastructure wise that falls flat and collapses. Seeing the window so clearly change from 200k to 60k to 100k is both hope, but also, misery.
  I've been leaving some breadcrumbs on Bluesky as I go. It's been brutal to see. Especially having tasted a working glm-5.1. I don't super want to pay API rates to someone else, but I fully expect this situation to not reproduce on other hosting, and may well spend the money to try and see. https://bsky.app/profile/jauntywk.bsky.social/post/3mhxep7ek...
  All such a shame because aside from totally going mad & speaking unpuncutaed gibberish, glm-5.1 is clearly very very good and I trust it enormously.
  [-]
  - ummzokbro 2 hours ago
    This.
    GLM5 also had this issue. When it was free on Openrouter / Kilo the model was rock solid though did degrade after 100k tokens gracefully. Same at launch with Zai aside from regular timeouts.
    Somewhere around early-mid March zai did something significant to GLM5 - like KV quanting or model quanting or both.
    After that it's been russian roulette. Sometimes it works flawlessly but very often (1/4 or 1/5 of the time) thinking tokens spill into main context and if you don't spot it happening it can do real damage - heavily corrupting files, deleting whole directories.
    You can see the pain by visiting the zai discord - filled with reports of the issue yet radio silence by zai.
    Tellingly despite being open source not a single provider will sell you access to this model at anything approaching the plans zai offers. The numbers just don't work so your choice is either pay per token significantly more and get reliability or put up with the bait and switch.
  - throwdbaaway 2 hours ago
    https://github.com/THUDM/IndexCache - Might be some expected issue when rolling out this. They don't have enough compute, and have to innovate.
  - girvo 2 hours ago
    This doesn’t help you, but GLM-5 stays coherent far longer on Alibaba’s coding plan/infra. You can’t get that coding plan anymore though unfortunately!
  - esseph 5 hours ago
    > "aside from totally going mad & speaking unpuncutaed gibberish [...] I trust it enormously."
    The bar is very low :(
    [-]
    - jauntywundrkind 5 hours ago
      I see where you are coming from.
      But I used 70m tokens yesterday on glm-5.1 (thanks glm for having good observability of your token usage unlike openai, dunno about anthropic). And got incredible beautiful results that I super trust. It's done amazing work.
      This limitation feels very shady and artificial to me, and i don't love this, but I also feel like I'm working somewhat effectively within the constraints. This does put a huge damper on people running more autonomous agentic systems, unless they have Pi or other systems that can more self adaptively improve the harness.
- HumanOstrich 5 hours ago
  I wonder if running the compaction in a degraded state produces a subpar summary to continue with.
  [-]
  - gunalx 2 hours ago
    Indeed it does. Once i see degraded state i revert to last task and run a compact, before starting up again.
- azuanrb 6 hours ago
  Have you compared it with using Claude Code as the harness? It performs much better than opencode for me.
jaggs 5 hours ago
How does it compare to Kimi 2.5 or Qwen 3.6 Plus?
[-]
- eis 5 hours ago
  The blog post has a benchmark comparison table with these two in it
  [-]
  - jaggs 4 hours ago
    Thanks, I missed that. It's very interesting. They're quite close, but I found Qwen 3.6 plus was just marginally better than Kimi 2.5. But looking at the stats I'll definitely give GLM 5.1 a try now. [edit: even though looking at it, it's not cheap and has a much smaller context size.And I can't tell about tool use.]
- DeathArrow 5 hours ago
  Compared to Kimi 2.5 or Qwen 3.6 Plus I don't know, but I ran GLM 5 (not 5.1) side by side with Qwen 3.5 Plus and it was visibly better.
dang 6 hours ago
[stub for offtopicness]
[[you guys, please don't post like this to HN - it will just irritate the community and get you flamed]]
[-]
- smith7018 6 hours ago
  Hmm, three spam comments posted within 9 minutes of each other. The accounts were created 15 minutes ago, 51 days ago, and 3 months ago.
  Interesting.
  Hopefully these aren't bots created by Z.AI because GLM doesn't need fake engagement.
  [-]
  - dang 6 hours ago
    These comments are probably either by friends of the OP or perhaps associated with the project somehow, which is against HN's rules but not the kind of attack we're mostly concerned with these days. Old-fashioned voting rings and booster comments aren't existential threats and actually bring up somewhat nostalgic feelings at the moment!
    Thanks for watching out for the quality of HN...
    [-]
    - ray__ 5 hours ago
      Would love to read a Tell HN post about the kinds of attacks you are concerned with!
      [-]
      - dang 1 hour ago
        For example, there are rings of accounts posting generated comments, presumably in order to build karma for spammy or (let's be kind) promotional reasons. There are also plenty of spam rings that create tons of accounts and whatnot.
        These are different from the submitter-passed-a-link-to-friends kind of upvoting and booster comments, which feel quaint by comparison. In this case people usually don't know they are breaking HN's rules, which is why they don't try to hide it.
  - tadfisher 6 hours ago
    I moderate a medium-sized development subreddit. The sheer volume of spam advertising some AI SaaS company has skyrocketed over the past few months, like 10000%. Comment spam is now a service you can purchase [0][1], and I would not be surprised if Z.ai engaged some marketing firm which ended up purchasing this service.
    There are YC members in the current batch who are spamming us right now [2]. They are all obvious engagement-bait questions which are conveniently answered with references to the SaaS.
    [0]: https://www.reddit.com/r/DoneDirtCheap/comments/1n5gubz/get_...
    [1]: https://www.reddit.com/r/AIJobs/comments/1oxjfjs/hiring_paid...
    [2]: https://www.reddit.com/r/androiddev/comments/1sdyijs/no_code...
  - greenavocado 6 hours ago
    Z.ai Discord is filled to the brim with people experiencing capacity issues. I had to cancel my subscription with Z.ai because the service was totally unusable. Their Discord is a graveyard of failures. I switched to Alibaba Cloud for GLM but now they hiked their coding plan to $50 a month which is 2.5x more expensive than ChatGPT Plus. Totally insane.
    [-]
    - sourcecodeplz 5 hours ago
      Everyone has started either hiking their prices or limiting the tokens, gravy train is over. Glad we have open models that we can host; Sad RAM is so expensive..
- zendi 6 hours ago
  [flagged]
- louszbd 7 hours ago
  [flagged]
- seven2928 7 hours ago
  [flagged]
aplomb1026 6 hours ago
[dead]
EddyAI 4 hours ago
[dead]
andrewmcwatters 6 hours ago
[dead]
maxdo 4 hours ago
One of the bench maxed models . Every time I tried it , it’s not on par even with other open source models .
[-]
- wallmountedtv 1 hour ago
  Feeling very much the same. Attempting to use it through Claude Code as a model it just completely lost all context on what it was doing after a few months and kept short circuiting even with the most helpful prompts I could give, outside of just writing out the answer myself. I really do not get the praise for this model.
  Being "better than Opus 4.6" is not really something a benchmark will tell you. It's much more a consensus of users liking the flavor of an answer, rather than fueling x% correct on a benchmark.