Ever since November 2025, the so-called "inflection point", I've been wondering for whom coding agents became "really good".
All I've observed is that they got better at tool calls and at answering questions about big codebases, especially when the question involves a vague pattern to search for, and they're super useful for that! But for generating production code, even with a lot of steering and babysitting?
Absolutely not. Not quite there, not even close, in my experience.
But we should stop talking in 1s and 0s, especially with the marketing hype trains around. There is a gradient of capabilities, and where an agent falls on it really depends on the intricacies of the codebase you're working on. I think everyone has yet to discover how best to apply these tools in their day-to-day work.
But that totally collides with the current narrative, which flattens our work into something that is always the same and can be easily automated in every case. It's not!
That's why the debate is so polarized IMO: there isn't a shared experience.
The polarization comes from the very disparate coding experiences and output quality that different people find when using these tools.
For example, I've had the opposite experience of yours, generating very high quality work using Claude (such as https://github.com/kstenerud/yoloai). Just in dealing with all the bugs and idiosyncrasies in the technologies I'm using, the agent has been a godsend in discovering and cataloguing them so that the implementation phase doesn't keep tripping over them: https://github.com/kstenerud/yoloai/blob/main/docs/dev/backe...
And the agents keep getting better all the time. Even in the past month I've noticed a considerable jump in its ability to anticipate issues and correctly infer implications as we build out research, design, architecture and planning docs. By the time it comes to coding, it's mostly a mechanical process that can be passed off to sonnet with a negligible defect rate.
I don't want to offend (it's AI coded anyway :)) but that does not scream "high quality" to me. The headline gif on that repo just paints a terrible picture. It can't draw a box correctly, there's random underscores all over the screen. The UI itself is just incredibly incoherent. I don't even know what I'm looking at.
Like, no it doesn't seem like very high quality work... It just seems like a vibe coded tool.
Don't want to be rough, but I'd like to read about experiences with novel ideas that solve real problems for real people; your project is just about selling new shovels.
As I commented on another thread
> If you're trying to solve a HARD problem people REALLY have, it's a novelty that agents can't help with, otherwise if it gets 97% there MAYBE it's just a signal that your idea isn't that novel!
This is a pretty wild take. What percentage of human engineers are creating novel solutions for hard problems, you think? I work in R&D and even my work is 90% doing things that other people already solved. If you are really doing cutting edge SOTA work that has never been done by another human in some form or another, then kudos to you and I want your job.
> What percentage of human engineers are creating novel solutions for hard problems, you think?
IMO every engineer should try spending time at a company that tries to solve new problems.
Otherwise we will be stuck, as we are now, with big tech paying you mountains of money for doing nothing, incentivizing you to embark on useless activities so that other managers can have a career, to fear layoffs, and, when they happen, to complain that "I've been looking for a new job for a year" while expecting the same compensation and environment. Web development jobs are particularly affected by that.
In the game industry, for example, if you don't do something interesting your game won't sell a copy.
Let me stress this again: if LLMs get you 97% there, maybe you should try another idea.
As a random example of a "hard" problem solved by AI that I couldn't have realistically done myself, despite having decades of wide industry experience:
Reverse engineering a proprietary protocol from a binary executable.
I heard about people finding security vulnerabilities in compiled code with the combination of Claude Mythos wired up to a disassembler like NSA's Ghidra. Someone here mentioned that GPT 5.5 "extra high" is just as capable, I had a problem to solve, spare token quota for the week, so... I gave it a go.
My problem was that I'm working with a product that uses a legacy 1990s style network appliance output log format that is proprietary, undocumented, and has no publicly available decoders other than an app by the same vendor, and that app has fundamental limitations. (I.e.: it's nothing like Splunk or Elastic.)
Codex with a Ghidra MCP bridge figured it all out: the framing, bit and byte packing, endian order, field names, data types, etc. It made me a neat little protocol parser in a modern language that I can use to spit out something sane like NDJSON or OTLP protobufs.
There is no way I could have reverse engineered this myself from compiled C++ code and/or packet captures! The format isn't self-describing and is incredibly dense (similar to NetFlow). In a hex viewer it looks like line noise!
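For readers who haven't seen this kind of work, the general shape of such a parser (framing, byte packing, endianness, NDJSON output) can be sketched as below. The record layout is entirely hypothetical and chosen only for illustration; it is not the vendor's format, and the field names and the `encode_record` helper are assumptions made up for this example.

```python
import json
import struct

# Hypothetical record layout (NOT the real proprietary format):
#   2 bytes  big-endian unsigned length of the payload that follows
#   4 bytes  big-endian unsigned Unix timestamp
#   1 byte   severity
#   2 bytes  big-endian unsigned source port
#   N bytes  UTF-8 message (the rest of the payload)
HEADER = struct.Struct(">H")
FIXED = struct.Struct(">IBH")


def parse_records(buf: bytes):
    """Yield one dict per complete record found in buf."""
    offset = 0
    while offset + HEADER.size <= len(buf):
        (length,) = HEADER.unpack_from(buf, offset)
        start = offset + HEADER.size
        end = start + length
        if length < FIXED.size or end > len(buf):
            break  # truncated or malformed record: stop instead of guessing
        ts, severity, port = FIXED.unpack_from(buf, start)
        message = buf[start + FIXED.size:end].decode("utf-8", errors="replace")
        yield {"ts": ts, "severity": severity, "src_port": port, "msg": message}
        offset = end


def to_ndjson(buf: bytes) -> str:
    """Render every parsed record as one JSON object per line."""
    return "\n".join(json.dumps(record) for record in parse_records(buf))


def encode_record(ts: int, severity: int, port: int, msg: str) -> bytes:
    """Build one record in the hypothetical layout (handy for round-trip tests)."""
    payload = FIXED.pack(ts, severity, port) + msg.encode("utf-8")
    return HEADER.pack(len(payload)) + payload
```

The hard part described above is discovering the layout in the first place; once it is known, the decoder itself tends to be this small.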
Claude wrote me a little Python script to help me sort and rank all the AI videos I've generated. It also extracted the metadata and organized it into a CSV. I sent it some hex dumps of the header and it got it on the first try. The header structure of WebMs generated by Comfy is pretty novel.
> It's since November 2025, the so called "inflection point", that I'm still wondering for who coding agents become "really good".
You can dig up my past comments semi-arguing with simonw that AI just isn't good enough yet, but lately I've been using Codex mostly just to review existing Godot/GDScript code: https://github.com/InvadingOctopus/comedot
and now I'd say that in this day and age one would have to be dumb to not use AI in SOME way :)
It's helped me catch a lot of bugs that would have taken me a long time to even notice on my own. I guess it helps that my project is modular enough where each file can be considered standalone, with just 1-2 dependencies and well-commented already, so the AI can look at each file on its own one at a time. You can see the AGENTS.md I use on that repo.
Most of my productivity in the last 3 or so months has been thanks to AI, though none of the code there is AI generated. I even bought a MacBook Neo just to use as an "AI thin client" while on travel! even though I already had a beefy MacBook Pro M2 Max that I just keep at home/hotel as a desktop. Codex's recent remote control features have made it more useful for the moments when I get a cool idea while out at a cafe or on a walk.
I don't just copy-paste the AI's output, because it's almost always inefficient anyway, but I use its findings to manually clean up my shit. Maybe they're not that good with GDScript yet which is a bit of a jank language anyway.
So my main framework is wholly made by meat, but I do have fun now and then telling Codex to make experimental games using only the library of modular components I have written so far, to test my framework and also the AI's abilities. This kind of work seems like a surprisingly good match for AI sometimes: It just has to put existing blocks together, that already have well-defined interfaces and contracts etc.
I've been on the $20 ChatGPT plan for about a year now, and only started using Codex since like maybe 4 months ago, almost always on the latest model with "Extended Thinking" or "Extra High", because I want my shared code to be as correct as possible because everything else I do depends on it, and I only hit limits like 2 times in the last 3 months.
Grok is OK for general stuff, never tried it for coding.
Gemini's UI/UX, lack of privacy and the AI itself are so terrible that I've tried it maybe 2 times ever...and it refused to work, on Google's own Flights website and reverse image search! (it told me to do it myself)
Thanks for sharing your experience! I totally agree that if you "own your code", as in you're invested in it, coding it and documenting it, these tools can be really valuable for review, bug fixing and maintenance. It pushes you to do better, maybe one piece at a time like you said, with a well-modularized codebase. I think more devs should share experiences like that; we should push back on the marketing and on the "I don't code anymore since X" narratives.
You are experiencing a reverse Dunning–Kruger effect.
For someone who only dabbled in coding before, it went from AI building 80% and them struggling to finish the remaining 20% when trying to build an app/website,
to AI building something like 97% and them struggling with the last 3%. Yes, it'll look rough around the edges when evaluated by a senior dev, but being able to build MVP-level things to completion with ease helps you stay engaged and motivated to continue and learn.
Please do not cite Dunning–Kruger effect at random.
Who needs to generate a dumb demo of a 97% done CRUD app? We had code generators for those. Every time I read claims like that and ask for more detail, I discover it's people who were not productive before who are generating the so-called "MVP level things to completion with ease".
If you're trying to solve a HARD problem people REALLY have, it's a novelty that agents can't help with, otherwise if it gets 97% there MAYBE it's just a signal that your idea isn't that novel!
The obvious pushback to all of the slop is: coding was never hard. Learning resources were abundant and free.
If these people had a burning desire to build things prior to LLMs and couldn’t put in the effort to learn to build them (which is also fun!) then why would they ever put the effort into anything to understand it and make it good??
This is really my biggest worry when it gets to consumer AI. People already have a hard time informing themselves properly. Now we have technology that just boosts the already existing confirmation bias people have. It's sickening.
At that point their ground truth is completely skewed (already for some folk), everything is relative. Some of them will probably die off in self-induced Darwin award winning ways, but sadly certain skewed world views may persist.
It’s the opposite, non-creatives (if such roles even exist in those industries) should be worried. All those models offset technical skills, allowing to get from idea to implementation through a different route (which can be easier or harder depending on idea and model - good luck tweaking that pelican’s exact pose and movements to match your imagination precisely). Nothing touches creativity, not even in the slightest.
But there’s a lot of panicking, fear-mongering and all sorts of nonsense around this whole subject.
My mother has started watching 100% AI generated stories on YouTube. They are good enough to be entertaining even if they include random errors like messing up the main character’s name.
The thing is the creative economy is all about people’s attention and pocketbooks, it doesn’t need to be great just good enough.
Willison chose this task because (unlike actual images of pelicans) it was clearly not in the training data, but could be reasoned about and composed from what's there. But just like those "how many golf balls can you fit in a 747?" interview questions, it should now be retired.
HN has a mechanism that causes popular blogs to stay popular.
It's a winner-takes-all karma prize for being first to post the article.
This causes a rush of people to post.
HN has a mechanism by which duplicate submissions count as upvotes toward the first submission.
This is a positive feedback for the desire to be first, which increases duplicate submissions and in turn the karma reward.
This effect means that good blogs stay well upvoted. This isn't altogether a bad thing, but it does mean some blogs require a string of poorly received posts before that effect wears off and people no longer rush to be first.
One way to fix this would be to attribute all karma to user simonw himself ( and do similar where attribution to an HN user is known. )
Years ago I used to read his blog on Django and found it quite interesting despite being neither a Django nor even a python user - this must have been at least 10 years ago and perhaps more.
When he resurfaced in my feeds as an AI commentator it took me quite a long while to join the dots that he was the same person!
I wonder how much the 'inflection point' is a thing vs marketing. I'm sure the models got somewhat better, but even now when I'm trying to 'vibe code' a game with the latest models (combination of Codex w/ gpt5.5 and gpt5.3-codex), they really do struggle.
They definitely get something barebones up and running, but it's far from a fully fledged application.
Paradox: you can get multiple inflection points even as systems start to show diminishing marginal returns in core capability. I think this is due to 'threshold crossing', where something 'becomes good enough for a specific purpose' and that simply unlocks capabilities.
'Nail guns' used to be heavy, required heavy power cords, and were extremely expensive. As they got lighter, cheaper and battery-powered, at some point they blended seamlessly into the roofer's process and dramatically multiplied the work that can be done. Marginal improvements beyond that may not yield the same 'unlocks' because the threshold has been crossed.
I remember this very clearly myself. Before opus 4.5, I was doing a lot of hand holding and was coding a lot myself, but I have not written code since that day more or less.
I did write some code myself just to learn how the Enigma encryption machine worked. But professionally, I stopped coding in November.
A pattern I've settled into is to write code but leave a TODO for every narrow thing I want the LLM to do for me. Then just tell the agent to fix the todos. It's often faster and easier to give "instructions" this way
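As an illustration of that workflow, here is a minimal sketch that gathers `# TODO` comments from a source file into a single instruction list for an agent. The comment convention, the function names and the prompt wording are all assumptions made for this example, not a feature of any particular agent.

```python
import re

# Matches "# TODO: text" or "# TODO text" anywhere on a line.
TODO_RE = re.compile(r"#\s*TODO:?\s*(.+)")


def collect_todos(source: str, filename: str = "<memory>"):
    """Return (filename, line number, text) for every TODO comment in source."""
    items = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        match = TODO_RE.search(line)
        if match:
            items.append((filename, lineno, match.group(1).strip()))
    return items


def build_prompt(items) -> str:
    """Turn the collected TODOs into one narrow, explicit instruction list."""
    lines = ["Resolve each TODO below. Change nothing else."]
    lines += [f"- {name}:{lineno}: {text}" for name, lineno, text in items]
    return "\n".join(lines)


sample = (
    "def area(w, h):\n"
    "    # TODO: validate that w and h are non-negative\n"
    "    return w * h\n"
)
print(build_prompt(collect_todos(sample, "geometry.py")))
```

Each TODO sits next to the code it refers to, so the agent gets the location and the surrounding context without a long written explanation.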
In most cases you could work around that. For instance write the code yourself and make the AI write the tests. Or keep it busy writing superfluous documentation. Very few people are micromanaged to the extent that they can’t subvert the system.
I don't feel the need to justify my salary, since I'm simply lucky in that regard. But I'm pretty sure you couldn't do my job just because you had access to a coding agent. Most of my time at the office is spent discussing high-level architecture and strategy, ideas, customer requests, backward compatibility, safety, security, quality assurance, etc.
Writing the actual code is a significant part of that, but the codebase is so complex that even Opus 4.7 and GPT-5.5 struggle with it without being fed a *lot* of context and constraints. And even then, they need a *lot* of steering due to making bad decisions that only someone with an intimate knowledge of the theory behind our software is able to catch.
I can only assume that people who think coding agents can completely replace an actual developer mostly deal with trivial software regarding both scope and the type of customers they serve (individuals instead of big companies in industry).
How do you justify your salary given that you're just using an OSS compiler/editor that any of us could use for free in your role?
AI just changed how I edit code - I still see coworkers (senior developers) failing with Claude/Codex and get stuck when there are trivial solutions if you understand the full problem space. Right now AI is just a productivity tool.
Usually I describe the problem, explore a bit with LLM iteratively. Then I switch to creating a plan when I have enough insight (and the LLM has it in context/same session as exploration), specifying all the things I'm trying to accomplish.
Then I just iterate with LLM - I let it start writing stuff in YOLO mode and check on what it's doing in the code steering it in the direction I want.
Usually the code LLM generates will work but is kind of garbage - but I can easily steer it towards better implementations.
Sometimes using an LLM is theoretically slower than hand-rolling: if I just sat down and focused, I could outperform the iteration and the waiting, especially considering how stupid agents are about running expensive builds/test suites (even with a bunch of explicit instructions in skills/claude/agents.md). But the practical improvement of going with an LLM is that you have a bunch of thinking traces saved as part of your iteration process, so it's really easy to get back into flow. This is a huge productivity win for me given how many interruptions I have in my work day. As so many people like to point out, writing code ends up being less and less of your time as you level up in your career.
This is _the_ question we must all be able to answer, so here goes my attempt - we all have access to the same tools, before stackoverflow it was forums, books/manuals, so it's always been about “getting there, showing up, figuring it out”
your hypothetical boss has other things to do than kick a LLM around at that price
Please see Ben Evans’ podcast for a good take on this. Coding is just one of the tasks you do in your job; it is not the job, or at least it probably is not. You do not get paid to code, you get paid to make a set of decisions that create value for the company. If those decisions are automated then yes, sadly, your salary is not justified.
I totally agree. I loved coding because of its closed feedback loop. Since last November, I also delegated it mostly to agents. Now I concentrate more on the design part, which is not the same. However, you move with the times and hope something else will become exciting. I do not know a more worthwhile and satisfying way than computing to spend my work hours.
I agree, but the reality is that most people work to make a living, not to have fun. If you enjoy your job because you mostly get to write code in a tight feedback loop instead of doing the "hard" work of planning, writing and reviewing specs, balancing customer requirements, and the lot, you have a very privileged life. And those jobs are probably going to get fewer now.
It's kind of sad. But on the other hand, I am glad I don't have to write every little line of code myself *on top* of having to do all the other stuff.
To me, LLM's free up time for me so that I can spend time on the fun parts of coding. Less boilerplate, more focus on the interesting problems. This is no different from using high level languages. The problem domain is less around memory management and garbage collection and closer to the problem you're actually trying to solve.
But we’ve had tools to automate out the boilerplate for years. We don’t need ai for that. It’s seriously like we all forgot we could run one command and scaffold a project. AI isn’t even that great at it. Last I tried a month ago it used a really out of date version of nextjs and picked all sorts of random deps that weren’t in the plan.
I could have just used the next project scaffold tool and been on my way before the ai even started returning output.
I agree with this. I feel like there’s a false dichotomy right now in a lot of these discussions where one can only vibe code or only code by hand. It is possible to do both…
Someone competent using them is a requirement today, and for a while that will make the marginal utility of skilled workers greater than that of unskilled ones. The justification is that they are much more productive than they were before.
You can build things quickly with AI, but you can’t delegate your responsibilities to AI. Once the AI starts struggling, you’ll need to take over and figure it out.
They're using a tool that anyone can use for $20 an hour, sure. But that's not what they're "just" doing. This is what is so insane about non-technical people talking about code - writing the actual syntax is not really the hard part.
What you're saying is like "how do you justify your salary as a NASA engineer when anyone can use Simulink and generate the code?"
How do you justify your salary given that you sit in a chair all day, likely making the world worse, and make 5x as much as someone saving lives, building houses, or teaching kids how to read?
Supply and demand. Not many people are good at programming and it's highly in demand.
The question is how many people will be good at vibe coding? If the answer is "lots" then we can definitely expect programming salaries to return to "normal" levels. His question is very relevant; you can't dismiss it as easily as that.
it can be easily dismissed because "anyone can use the tool that costs $20" makes no meaningful sense.
this was always true; in fact, $20 is more than the $0 that notepad++ costs.
it's a flippant statement. Go down the line of any tool; its cost has basically nothing to do with the skill it takes to operate it. See basically everything. There are levels.
I have no idea what you're trying to say. If anyone really can vibe code then programming salaries are pretty much guaranteed to come down. The critical question is whether it really is true that anyone can do it, or if it still requires rare skill.
are you a programmer? it 100% requires skill. AI or not.
i'm trying to say there's levels to this. if you don't agree then you don't agree. but i can buy commodity tools for any skill and that doesn't make me professional grade at that skill.
Because the tool will happily give you a "solution" that kinda works for a few inputs. It will happily correct itself when you give it more incorrect tests.
It will almost never converge on the general solution that will pass tests you haven't given it yet.
This is why AI is sooo good at Javascript and related slop. A solution that "kinda works" is good enough 9 times out of 10 and if some tests fail well ... YOLO and the web page will probably render anyway.
Contrast that to using Scheme or Lisp where AI will have trouble simply keeping the parentheses balanced.
While I certainly like parentheses highlighting and rainbow parentheses, I've programmed Clojure without syntax highlighting and while it’s not as nice as it would be with, it’s fine.
I’ve also written C++ and Java in Notepad long ago. Not ideal, but hardly a problem.
I find it gets you past the starting line but when you dig into the code it’s a mess of duplicated code, muddled responsibilities, poor architecture, 10k line files that eat your tokens, etc.
I’m building something using LLMs to scrape websites/socials for unstructured event data from combined text/images and the only way I’ve managed to get 100% consistent results for a reasonable cost is to break the task down into very small pieces that reduce the scope of mistakes significantly.
At present, for reasonably complex tasks, Codex/Claude will happily code you into an expensive corner.
Indeed. To add to this, the obvious solution (ask the AI to break down the tasks to whatever METR says they'd be capable of 80% of the time) is of limited utility, as the AIs are only so-so at estimating task complexity.
(Even when they're getting the planning part right, I do also recommend checking the LLM-generated unit tests, because in my experience some of those are "regex the source code" not "execute functions and check outputs").
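To make that distinction concrete, here is a small sketch contrasting the two styles of test. The `slugify` function and its deliberate bug are invented for illustration: the "regex the source code" check passes on the buggy version, while the check that executes the function catches it.

```python
import re

# Hypothetical implementation with a bug: the lowercased string is discarded.
BUGGY_SOURCE = '''
def slugify(title):
    title.lower()  # bug: result is never used
    return title.replace(" ", "-")
'''

FIXED_SOURCE = '''
def slugify(title):
    return title.lower().replace(" ", "-")
'''


def source_regex_test(source: str) -> bool:
    """Weak test: only checks that the source text mentions .lower()."""
    return re.search(r"\.lower\(\)", source) is not None


def behavioral_test(source: str) -> bool:
    """Stronger test: executes the function and checks its output."""
    namespace = {}
    exec(source, namespace)
    return namespace["slugify"]("Hello World") == "hello-world"


for label, src in (("buggy", BUGGY_SOURCE), ("fixed", FIXED_SOURCE)):
    print(label, "source-regex:", source_regex_test(src), "behavioral:", behavioral_test(src))
```

A test suite made only of the first kind can report full coverage and still say nothing about whether the code works.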
I've "vibed" some non-trivial stuff lately using a combination of Codex with 5.5 and Claude Code with Opus 4.7.
Key has been to spend a fair amount of time on initial overall design document, which is split into tangible and limited phases. I go back and forth between them on this document until we're all happy.
For each phase an implementation plan is made. At the end, a summary document of what was delivered and what was discovered. This becomes input to next phase.
I do check the documents, and what they're doing. I also check the tests, some more thoroughly than others. And I do some spot checks on the code to see if I like the structure.
I have mainly used Claude for coding and Codex for design and code review after phases. I ask both to check test coverage after phases.
Managed to implement some tools and libraries without writing a single line of code this way, which have been very beneficial to us.
Since it's so async I can work on other stuff while they plod along.
I think it's not universal though. But stuff that can be tested easily and which you have a firm grasp of what you want to achieve, but not necessarily exactly how, that I've been impressed with.
Currently just manual. I'm not pushing the frontier here, just getting my feet wet.
While both Claude Code and Codex are capable harnesses, I definitely think there's a lot more to be gained from the harnesses. Quite a few of the times I needed to nudge the steering wheel it was things that a separate agent with the right prompt could have picked up on.
Waterfall was famous for wasting developer time and extending delivery dates in exchange for simplifying management. If Claude time is comparatively inexpensive, but human oversight remains necessary, we will switch back to waterfall because the relative importance of the two resources will invert.
It's vibing in the sense that I'm not really writing code, and I'm leaving a lot of decisions to the models. I let them drive a lot of the design document details, I just made sure it contained the salient points. Implementation plans I just skimmed. Didn't write any code, just did some checks here and there.
But yes, I did think that it sorta felt like being a team lead for some eager programmers.
It's software development, but with much less actual programming (in my case none).
When I said I check the documents, the initial design document was the only one I really took a hard look at. The intermediary ones I just skimmed, looking for red flags or something I had forgotten to tell them. Those documents served as a basis for their work, and as a record of what was done.
Overall I spent perhaps a few hours on each project, over the course of a few days. I'd check in every half hour or whenever I had time, tell Claude "Great, let's do the next deliverable", or GPT "We're done with phase 4, please do a detailed code review, reference the design document and documentation of previous phases". Then I'd leave them cooking.
Also the least fun part of development. Maybe I’m the weird one but I like to just jump right in, planning every last detail before writing code is boring.
For me, the fun in programming is sometimes to actually write code, solving a problem in a specific way or try some new approach. Other times the fun is to create something that works, and the code is more a means to an end.
The first case I'll probably still do by hand, like handmade vases that still get made even though factory-made ones are cheap and readily available.
For the second case I think these newfangled tools have made it even more fun, since writing lots of boiler plate, repetitive event handles and whatnot is not my idea of fun.
> planning every last detail before writing code is boring
Not only that but you can't really plan everything. It is impossible. Without LLMs, with every line of code you are making a decision or discovering something new that must be dealt with or realizing how the current thing might impact something else and so on.
There is no way for a programmer to consider all of these little things ahead of time and if an attempt is made, it will take as long as actually writing that code.
It wasn't trivial in that I used a lot of my programming and domain knowledge, both when iterating on the design document and skimming implementation plans.
I didn't use it often, but when it was needed it was needed.
While some people have got it to work better, for me vibe coding games still hasn't reached the level of regular sites/web apps. Physics, creativity, assets and UI/UX still need a lot of hand-holding with the models. Games that are more interface-based, like point-and-click games or something like Reigns, are easier though.
It's very real. Just in the past 2 months or so, IMO, there's been a pretty big improvement in Claude for local dev (although I think a lot of that is less model strength and more harness capability). The 1M context is a huge difference (~30 min vs 2.5 hr between compactions significantly increases the scope of what I can get the AI to do before it goes stupid). The other big difference I've noticed is a better balance between actually doing the work and pushing back on bad ideas. I want the AI to tell me if it thinks the thing I am telling it is wrong or a bad idea, but if I confirm, I want it to do that anyway. A couple of months ago, Claude was a lot more likely to say "This is too much work, I'm not going to do all of it", tell me the idea was genius (and then pretend to do it), or do something equally useless.
It's real for me as a non-coder. Previously, uploading a Python script and asking it to add this function or that function used to break it; now it usually just works, at least with Claude and ChatGPT models. Google Gemini still breaks stuff, but rumors are that their new Flash model, which will be announced soon, is very good. I am usually working with data in CSV files and generating spreadsheets, PDFs, etc., and the results for that have improved dramatically.
That’s me. Built a scraper to dump stuff to a CSV listing images for further OCR and OpenCV processing. Now I have a convenient list of hits once I run the batch, which used to be a loooot of manual sifting.
Once I work out the kinks, I’ll be able to further automate it.
Would have taken 10-100x as long for me to build it without AI and the AI version is probably better.
But yeah, I have enough knowledge to know what prompts are needed, to figure out those “oh, I think it’s running slow or failing because of xyz” moments, and to prompt further based on what I think it should do instead.
And I know where to make slight changes without burning my allotments.
It is all marketing. The easiest way to tell is that a year ago the same people said the inflection point was X or Y model.
When people claim LLMs just don't work for them, the first question is whether they're using the latest model or not, and if not, dismissing the poster.
The thing is that that same question was being asked a year ago, and even a year before that, but with the models that lead to a dismissal today.
Just make the experiment yourself, wait 6 months, say LLMs just aren't working for the software engineering that you do, and people will dismiss you if you say that you use Opus 4.5 and not the latest model Claude MegaMind 8.8 pro max gigathinking. Despite this model being touted as the inflection point in this article.
I think it's because both sides are talking about different things. If you go in expecting it to be good enough to make developers obsolete today (a reasonable impression to get from the way a lot of people hype it), you will be disappointed, and after the first couple of tries every few months you probably won't try it much with the next generations. That's reasonable if it's considered a dichotomy.
But a lot of people excited about new generations (including me, now) are not seeing it as a dichotomy but rather as a spectrum where models are getting better, and indeed once a year, or even every 6 months at times, there comes a sudden jump which feels like an inflection point compared with what came before. Practically, it's a tool like any other: you evaluate it based on whether the benefit is worth the effort and cost, and if it is and it has a good DX, you use it. If the calculation doesn't work for you, it doesn't. For me, it has gone from a novelty; to good for some kinds of quick manual search; to "I guess it can debug some kinds of errors at times in very specific conditions"; to "hey, I think I'm getting a bit addicted to the IDE autocomplete, which is becoming indispensable even if I don't use these models for anything intelligent"; to "it's good for areas I lack expertise in"; to "agentic sucks, I will stick to discussing algorithms and architecture with it on greenfield projects"; to "holy shit, it can do agentic work decently well now, though I'm skeptical about giving it access beyond limited cases"; to "I'm getting close to letting it run free on my device in the not-so-distant future, I guess". Some of these were big jumps, and at each point I was skeptical of further growth. Every time, I thought the growth would slow down now, from the days of 2k context windows to millions now, from basic chat completion to working on complex adaptive systems, game-theoretic modelling, heuristics and constraint modelling, and other things I throw at it. I am still needed in the loop; it can be so smart at times and then do something so stupid, but the frequency of stupidity is rapidly decreasing. I am still needed; I don't think it could have accomplished alone all that it has done for me. But at times I do lie awake at night reflecting on my self-worth on the potential day when I no longer add that value. When I have a harder time keeping up.
Also, had someone told me even in 2019 that in 2026 we would have NLP models doing what they do today, I would have dismissed it all as sci-fi, and here I am waking up in awe of the world we live in and how quickly we adapt.
Pure vibe coding won't work. You need to define an excellent architecture, have great specs, a solid plan, divide the plan into small phases that fit well in a context window, use TDD and automated code reviews for implementing each phase, do QA and some code review.
At every point you need to have agents review, verify and test the other agents' output and iterate until the output is perfect.
And also, have good e2e tests.
IMO, if you don't spend at least a few tens of millions of tokens per day, you aren't doing it properly.
I'm curious how the 6 months have looked from a non-programmer's perspective. What kind of co-working tools and similar optimizations have people from other fields experienced?
- pre GPT-5.4: very limited use; some smart people got some mileage out of the models, but it always required serious work and a very suitable problem. Of course the models could solve homework problems, but that felt more like a downside to us who teach.
- since GPT-5.4 (Mar 2026): the "wow" release; suddenly answering MathOverflow-level problems that have previously been stumping experts. Still prone to hallucinations, but smart enough to use the built-in Python skill to verify its claims on small examples when possible. Probably a lot better at formula-heavy math than at the abstract "philosophical" kind.
- GPT-5.5: gave me a fascinating, significantly nontrivial and highly instructive "proof from the book" on an MO-hard problem that I'm in the process of writing up. Might have been luck and good prompting, though. Didn't really feel like a qualitative leap from 5.4, but I take quantitative any time. Still requires suitable problems, but it's much harder to rule out suitability from the get-go.
Claude and Gemini have been also-rans the whole time and still are. I use Claude for secretary-like tasks; occasionally it finds an easy proof too, but usually because I've missed something obvious.
Oh, and GPT, and to a lesser extent Claude, are great at hunting errors in maths. Probably 90% of my prompts so far have been for proofreading my writings.
I am an instructor who helps deliver an apprenticeship. My new boss has been in our industry for about 20 years and is one of the most respected people in our company. She has just joined us to teach and is off doing a two week course. On the first day she was told to let AI write all of her lesson plans, and then feed the lesson plans to AI to make her slides...
Hopefully she rejects all this out of hand, but if she doesn't it'll mean that none of our trainees get the benefit of her experience, who she is as a person, and what she has to pass onto them.
We have 6 monthly reviews as instructors where we are told the same thing. "How could you use AI for your teaching?"
They don't even feel the need to justify why this would be desirable, or is needed at all. It's just pure bandwagoning. Unbelievably, most of my coworkers are extremely positive about AI, although none of them have told me they use it for anything besides preparing their lessons for them: they just use it instead of having to think, or spend time preparing... the only important thing they do at work.
I’m teaching a class at a university in Japan (on AI-related issues, as it happens). I’ve been teaching for more than 40 years, but at 106 registered students this is by far the largest class I have ever taught. AI tools are very helpful for class management, such as keeping track of attendance and homework submissions.
I have to consciously avoid using AI for more cognitive tasks, though. It would be very tempting to have Claude, ChatGPT, or Gemini summarize, classify, and grade the students’ assignments, write individual feedback, prepare my lesson plans, etc. However, I know that my engagement with the material and with the students would suffer. I also want to show the students that they are learning together with me and with each other, not with bots.
I am semiretired and have a light teaching load that gives me plenty of time to prepare for class. I can see that full-time teachers might find it hard to resist the lure of offloading their thinking to AI.
I've been a teacher (most of the time a college professor) for...a long time. Nowadays, when preparing a new course, I definitely work with AI: "Here's what I want, and who my audience is - give me a course outline".
That gives me a starting point. Of course, I modify it. Maybe I bounce back and forth to the AI for further refinements and suggestions, but ultimately I have to be happy with the result.
When prepping the individual lessons, the biggest time saver is coming up with examples to illustrate particular points. I could do this alone, but sometimes that involves staring at a blank screen for a while. It is faster to ask the AI for suggestions, pick the one I like, and refine it further myself.
I work at a company that deploys AI to enterprises
The average office worker is amazed at Copilot (not in the IDE - but the app bundled with Windows), and they mostly copy paste material into their enterprise provided ChatGPT / Gemini, and get tips from Facebook / Instagram on their top 5 best prompts for work productivity
Showing them agents that automate work at scale is a very magical experience
And then everyone that has to deal with their copy pasted output is too nice to say how bad it is and how much work it just offloads to the next person that’ll probably get frustrated and have an agent handle it.
Claude in Office was a tipping point for nontechnical folks around me. Everyone’s slides decks are immaculate now. Finance isn’t needing nearly as much BI help. It’s pretty impressive.
I find it really troubling finance are relying on LLMs (word generators!) for financial analysis - I mean I guess it means there will never be any annoying gaps in the data.
I use it a lot now for knocking up grafana charts etc. It’s not so much that the LLM is feeding the numbers through. You can still use real tools to analyse and summarise the numbers, it’s just much quicker at driving them.
As ever with data analysis, two things will continue to be true. Real insights come from spotting something that looks off and digging into it deeper. Secondly, it’s really easy to connect data in a misleading way.
I’ve had a Claude analysis handed to me this morning including a summary list of actions we’re going to take next which falls into this very trap.
The insights you’ll get from your data will only be as deep as the curiosity of the person at the helm.
Interesting. I don't have to use PowerPoint much, but I hate it when I do. I don't want the llm to write the words but I do want it to make things look nice. So does this work well now?
My pipeline for this is vscode + prompts + markdown templates + GitHub copilot -> markdown docs -> pandoc to produce .docx -> copilot in word for “nice” formatting -> copilot in ppt for nice decks. LLMs all the way down.
I find it’s easier to version control and diff the .md artefacts, those remain my authoritative source.
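For anyone who wants to script the "markdown docs -> pandoc -> .docx" step, here is a minimal sketch rather than the commenter's actual setup. The function names and the idea of passing a styled reference document are assumptions for illustration; `pandoc <input> -o <output>` and `--reference-doc` are regular pandoc options.

```python
import subprocess
from pathlib import Path
from typing import List, Optional


def build_pandoc_command(md_path: str, reference_doc: Optional[str] = None) -> List[str]:
    """Build the pandoc argv that turns a Markdown file into a .docx next to it."""
    source = Path(md_path)
    target = source.with_suffix(".docx")
    command = ["pandoc", str(source), "-o", str(target)]
    if reference_doc is not None:
        # --reference-doc tells pandoc to borrow styles from an existing .docx.
        command.append(f"--reference-doc={reference_doc}")
    return command


def convert(md_path: str, reference_doc: Optional[str] = None) -> None:
    """Run pandoc (must be installed); raises CalledProcessError on failure."""
    subprocess.run(build_pandoc_command(md_path, reference_doc), check=True)
```

Keeping the command construction separate from the call means the .md file stays the only thing under version control, and the .docx can be regenerated at any time.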
With a little bit of work, it works very well. You can generate powerpoint directly with Codex or Claude Cowork. There is also Canva support for these tools and it has its own AI integration. Another useful tool in this space is the Gemini integration in Google slides.
If you are a bit technical, reveal.js is actually really nice for this. I one-shotted a PDF export for it that uses a headless browser. I've used that a few times now.
What works well for me is to take an existing presentation and then some raw input and generate a new presentation in the same style as the old one from the raw input. After that, I can go in and tweak individual slides.
Another thing I did recently was take somebody's existing pitch deck and fix it with a one line prompt: "this deck is a bit meh, pimp it!" that worked unreasonably well. I like using shitty prompts like that. Codex often manages to do the right thing if you don't overthink your prompts.
Classic deck of somebody that used way too much text and only bullets. It did a great job of presenting the content in a simpler and better structured way. Pulling out key facts and highlighting those, simplifying text, etc. Doing that manually would have taken hours.
If you don't want an LLM to write the words, surely you also want to decide on the data and graphs to show by yourself? Isn't that 90% of a presentation? The "looking nice" part doesn't matter as much, it could be black text on a white background and it would be fine.
The important part is the presentation matching your presenting cadence, which is something LLM generated presentations never get right. I don't have a problem with people generating presentations, but most of the time they just end up reading whatever is on the screen when presenting.
Purely anecdotal, but in my team of 20 data analysts, we've seen a bunch of them become quite productive in producing tools and apps. These are analysts with mostly domain knowledge, and not so much programming knowledge - meaning that they knew the basics to write scripts, and wrangle data programmatically, but not enough to actually engage in software engineering.
Some of these are now contributors.
I also have a friend (beware, N=1 study) with zero prior programming knowledge that has released his first app.
In business: using coworking tools to review and propose filing of emails; manage my files and folders; on a daily basis scour the intranet for interesting and relevant content.
Personal: my wife tutors in her native language to non-native primary and high school kids. They are all now using these tools to generate fresh content for practice based on school lesson plans. The kids are improving much more quickly now than they were just a few months ago.
Copilot Cowork in the M365 ecosystem. It inherits all the permissions from my account, has access to exchange to send me emails, and OneDrive to save each day’s summary for posterity and future refinement.
I think Claude Cowork through the Microsoft thing which was copilot but is now named M365 (or something?) is likely creating every powerpoint presentation within our organisation at this point.
We have whatever AI is in teams transcribe every meeting, and it's scarily good at it. It's also extremely good at summarizing or finding things from previous meetings when tasked. One disadvantage in this is that I can see how stupid I sound in writing. I'll go "yeah, hmm, yeah, that's, yeah", but it really is pretty good.
I assume we're going to see a massive increase in AI with this Cowork inside the Microsoft client. We actually have a better tool available through a librechat where you can create and configure your own agents with the same filesystem access to your one drive, and a lot more tools and models than just Claude. Almost nobody has been capable of figuring out how to use it though, so they've been using the regular office365 copilot and it sucks so bad that a lot of people stopped believing in AI.
It's ironic that Microsoft fumbling the ball on AI while being very good with enterprise customers (especially non-IT) means that they'll likely be the company that sells us AI tools people will actually use. I have no idea why it's so hard for people to pick up the Librechat tool we're given access to through our equity fund. It's quite literally a copy of ChatGPT where you can point-and-click configure an agent, but we're seeing that even employees who use a lot of ChatGPT privately don't use this tool professionally. Meanwhile everyone has been capable of using the Microsoft thing (that I personally think is less user friendly since you will need to add your configuration files to every prompt).
"I have no idea why it's so hard for people to pick up the Librechat tool we're given access to through our equity fund"
That's because M365 is integrated with the whole Office/Exchange environment, especially in terms of security policies, etc. MS also guarantees that the data are private; this is very important for many companies, both from the IP protection perspective and because of the liability of exposing users'/customers' data (think of GDPR regulations in Europe).
I don't know who is behind Librechat, probably some good and friendly folks, but when it comes to privacy/security Microsoft has much more to lose, and if shit happens it is easier to sue them than some random VC-financed company from the USA.
As a former data scientist, I started to use a coding agent 3 months ago. Before that, I used chat completion on the web. Now I do nearly everything that outputs documents with a coding agent.
I’m not him, but I’ve started using them to do the analysis (SQL, Python etc.) and then output the report as Quarto HTML which can be hosted on GitHub Pages. It works well for this analysis style work.
Once I was going to send some figures to leadership so I checked the queries myself and not only had it done it correctly, but it had also included a lot of sanity checks with other places in the database which as a human I doubt I’d have had the time or inclination to do.
Even for modelling work it can be good to check your ETL queries, or write one itself and then check it etc.
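To make "sanity checks with other places in the database" concrete, here is a small sketch using Python's built-in `sqlite3`. The `orders`/`customers` schema, the report query, and the function names are all made up for illustration. The check compares the total of a joined report against the source table, which catches joins that silently duplicate or drop rows.

```python
import sqlite3

# Hypothetical report query: revenue per region via a join.
REPORT_SQL = """
SELECT c.region, SUM(o.amount) AS revenue
FROM orders AS o
JOIN customers AS c ON c.id = o.customer_id
GROUP BY c.region
"""


def report_matches_source(conn: sqlite3.Connection) -> bool:
    """True if the report's grand total equals the raw total in orders."""
    report_total = sum(revenue for _region, revenue in conn.execute(REPORT_SQL))
    source_total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
    return report_total == source_total


def demo_connection(duplicate_customer: bool) -> sqlite3.Connection:
    """Build a tiny in-memory database, optionally with a duplicated customer row."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER, region TEXT)")
    conn.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER, amount INTEGER)")
    customers = [(1, "north"), (2, "south")]
    if duplicate_customer:
        customers.append((2, "south"))  # duplicate row inflates the join
    conn.executemany("INSERT INTO customers VALUES (?, ?)", customers)
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, 1, 100), (2, 2, 50), (3, 2, 25)],
    )
    return conn
```

With clean data the check passes; with the duplicated customer the "south" revenue is counted twice and the check fails, which is exactly the kind of thing that is tedious to verify by hand.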
Somewhere right now some human artist is being tasked with drawing illustrations of pelicans riding bicycles to be used as training data at a big AI lab.
Every modern image-generation model can generate a pelican on a bicycle trivially. The point of the test is to generate SVG text that represents an image, which is more complicated.
Yes, there are ways to convert raster images to SVG for use in training data but it's not a good use of anyone's time.
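To illustrate why the SVG version is a text-generation task rather than an image-generation one: an SVG is just markup that has to be written coordinate by coordinate. The sketch below hand-writes a very crude bicycle frame (no pelican); the shapes and coordinates are arbitrary choices made for illustration.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def bicycle_svg() -> str:
    """Return SVG markup for two wheels joined by a simple frame."""
    wheels = [(60, 120), (180, 120)]
    parts = [f'<svg xmlns="{SVG_NS}" viewBox="0 0 240 160">']
    for cx, cy in wheels:
        parts.append(f'<circle cx="{cx}" cy="{cy}" r="30" fill="none" stroke="black"/>')
    # The frame is a single polyline between the two wheel hubs.
    parts.append('<polyline points="60,120 100,70 160,70 180,120" fill="none" stroke="black"/>')
    parts.append("</svg>")
    return "\n".join(parts)


# Parsing the result confirms it is well-formed XML, not that it looks right.
root = ET.fromstring(bicycle_svg())
```

A model answering the prompt has to get every one of those numbers roughly right without ever seeing the rendered picture, which is what makes the test harder than it sounds.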
The quality of the Gemini pelican was such a step change in one iteration, while the other benchmarks remained quite flat, that I think you are right. Although whether they targeted Pelicans in particular or just svg, I can't say.
Simon mentions further along in his article that given Jeff Dean’s post referencing the pelican-riding-a-bike task (and how good current models are at doing it), that it’s no longer a great benchmark to use. Enter the opossum riding an e-scooter!
Whether it turns out to be a good change or not remains to be seen.
The half-full view is that the models are so good at finding vulns that if you plug them into your build-pipeline then the amount of new vulns introduced will go down towards zero.
The half-empty view is that we're now producing more junior-level code with less review, so everything will have more vulns; also it's cheaper and easier to find them, so prepare for chaos.
Short term there is sure to be chaos either way as the models are clearly good enough to find all the old bugs, and not everyone has the resources or will to try to stay ahead of the curve like Mozilla is trying to do with their Mythos access https://blog.mozilla.org/en/firefox/ai-security-zero-day-vul...
There's a major caveat to the half-full view: You'll only stop adding new vulns that your model can find.
A threat actor with access to a better model or more money to burn on tokens may yet find more. Some of them have deep pockets, and not nearly every project will get the Glasswing treatment of free Mythos tokens.
Three deterministic Linux LPEs in a week, an LPE in BSD in execve (of all things...), nginx vulnerabilities, one or two new gnarly supply chain attacks. Linus noting that the linux-security mailing list is getting flooded with duplicated, AI-driven reports of varying quality. There are pretty crazy keycloak vulnerabilities getting discovered.
We're most likely entering a year or two of rapid vulnerability discovery, patching, as well as reducing and minimizing system footprints just to survive the onslaught of strange vulnerabilities from e.g. ancient and widely unused kernel modules.
I met a few people at PyCon this week who have been part of Glasswing (they're just starting to be allowed to talk about it) and it really does drive down the cost of finding vulnerabilities.
Wouldn't it drive up the cost of finding vulnerabilities when all the low hanging fruit has already been scanned and patched? Like the new baseline for finding a vulnerability will be something an LLM couldn't find.
Broadly, I'm talking about the shift from building elaborate vulnerability research harnesses towards using the frontier models and their RL-optimized harnesses to build simpler vulnerability discovery pipelines. And then: the ensuing carnage.
Not op but just look at HN posts in the last couple weeks: supply chain worms, zero-day LPEs for all OSes seemingly every other day, researchers on X and here openly saying they’ve got more valid findings than they know what to do with
December 2025 was the breakthrough for me.
January Claude was euphoric, ChatGPT was up there. February Gemini cooked for a second there. March amazing. April the big bad nerf. May GPT 5.5 is just pure bliss, although with 2x limits temporarily; not sure about Claude, it's sort of okay, still not as good as it felt before, slowly increasing limits with more compute and rebuilding good will.
I think Opus 4.6 at its peak was the "how can anyone not get that this is good" for me.
Then the nerf, and the massive uplift in tokens for 4.7, a model which I find lazy and prone to hallucinate.
It's probably time to try GPT5.5. Like many I'm pretty heavily invested in the anthropic ecosystem at this point, which I suppose gives another strong reason to make the switch.
I was a dedicated Claude user but in March/April I started using GPT5.5 on a new project that Claude had tried and failed to execute successfully. GPT knocked it out of the park, and was able to do it within my subscription allocation of tokens. I'd recommend giving it a go at least. Something like OpenClaude can let you use the Claude tools you're used to
I only used Claude first time in April, previously only ChatGPT and Gemini. And I struggle to see what the hype is all about - yes it seems a tiny bit smarter than the pack, but on the 20$ subscription it runs out of tokens in 5-20 minutes, and then you need to wait 3-4h.
ChatGPT 5.5 seems capable, although a bit stingy with “thinking” compared to earlier models, and I never run into session limits.
Haven’t noticed much significant progress in LLMs myself in 6 months (significant as in new or vastly improved capabilities or understanding, not new releases, there are plenty of those).
I feel like if anything people started to realise the significant limitations of LLMs when you try to use them as ‘agents’ which was the big direction LLM companies tried to push recently.
Best use of LLMs so far IMO is finding vulnerabilities (with human help) and pattern matching in other domains. For generating code and prose they are still mediocre and somewhat unreliable and for use as personal assistant agents I wouldn’t trust them.
So what’s happening with openclaw, the biggest experiment in agentic, vibe coded by the agents themselves? The thing that was so hot a few months ago.
> Google released the Gemma 4 series of models, which are the most capable open weight models I’ve seen from a US company.
Implying another country has a better model? I'm being pokey here because I'm very curious! I know Gemma is efficient, but I also remember Qwen and Kiwi being referred to as optimized. The difference being that Gemma is using fewer tokens, but maybe Qwen/Kiwi's quality is higher? I don't know.
The consensus right now is that Qwen3.6 in its 27B and 35B-A3B versions is better for coding whereas Gemma4 is stronger when it comes to OCR, audio transcription and the like. Margins are slim though and the harness at these model sizes is the most important factor.
My goal post for "AI will definitely replace most SWEs" was to reproduce a particular 90s programming game one shot and then add multiplayer support with minimal prompting.
I tried this a while ago, haven’t tried again recently. The models were producing code that was clearly lifted from stuff in their training data, and what I ended up with was a fairly decent game in html and js after a bit of tidy up, though it felt like several code paradigms smooshed together rather than a coherent whole, but it mostly worked. Not something I’d want to maintain but it was impressive at the time.
They were able to one-shot famous games (like asteroid or pong), I suspect because they had been trained on multiple versions of that game. So like producing Harry Potter, with the right prompt it was able to produce a license stripped version of code it had seen. I tried another arcade game like frogger and it failed really badly and took a lot longer, never got it working.
The whole exercise left me feeling they have a long way to go, I don’t see how anyone could think they would replace SWE unless they didn’t look at the code produced, even now.
Why is there no talk about the world already being run by AI by proxy? I.e. bureaucrats using chatgpt to make their speeches, decisions, shopfront designs etc. I just don't seem to read about this; instead it's more this nebulous specific date in the future.
It definitely seems like the point of no return has been passed.
The size of the codebase doesn't matter anymore. In fact, I am finding that the larger the codebase the better the performance. Starting from scratch with vague ambition is not the same as solving a specific stack trace over a mountain of decade-old code. The latter performs better and is also more exciting for the business. It would seem more callers = more constraints to verify against.
For the last 3 months I've felt like I've been dropping gps guided bombs from orbit. No one can tell the difference between AI authored and my hand written code, other than via the implication of the radically increased daily work volume. There's definitely AI in there, but it's like a homogeneous cybernetic blend of my work and the computer's. I own all of it, can explain all of it immediately, but I only wrote maybe 10% of it by hand.
The development team should be mostly "solved" by now with regard to the AI transformation. If you are still at Home Depot picking out your proverbial hammer, it's time to start heading for the self checkout. The rest of the business is where the real money and headlines will be made at this point. AI writing code is ancient news now. Custom harnesses that business people can use to automate workflows will print a lot more money. Bringing some bacon to the rest of the business may also help to preserve your career path in these uncertain times.
Remember what Jobs said about the customer. A lot of times, people don’t know what they want until you show it to them. Most people wouldn't have believed the iPhone was even remotely possible until the moment it was publicly revealed and made available for purchase. I am finding the same effect in the business with AI. What it can actually do when well engineered and applied to the domain will usually outperform the expectations of its users by a wide margin. All these fears about alignment, hallucinations, cost, ethics, the environment, my ego/career, etc., seem to melt away like some kind of luxurious chocolate once the performance becomes clear to the executive staff. I was able to convince the board with an unsolicited, 5 minute demo I didn't even personally deliver. I've never seen these people sign contracts so quickly.
What are your thoughts on software engineer replacement? My team has already seen big reductions. The QA team is gone. Software engineers reduced by a third.
Scared for the future
Ditching the QA team when the single highest challenge is verifying that vibe-coded systems do what they're meant to is extraordinarily short-sighted.
Personally, the more time I spend working with coding agents the less worried I am for my career. Getting the best results out of them is really hard. They amplify existing skills and experience, so the more experience you have the better.
I believe that many of those saying that they "never write code anymore" or are experiencing "10x productivity," are heavily underestimating (or outright misrepresenting) how much they are guiding the model, and ignoring everything else that goes into shipping fit for purpose software. I frequently see zero measurements or factual arguments supplied to support such claims. I also see many people say that they are "vibe coding," when they are almost certainly reviewing, editing, or otherwise steering the output.
I wonder why there is such a mad dash to trump up the capabilities of coding agents. And why such loose terminology and lack of rigor? I thought programmers were supposed to be rational people (har har!)
Have you seen the automated tests that QA members deliver? My experience is that they are horrible, and it's not so hard to beat that low quality bar with an LLM.
I have a theory: if they were good at writing automated tests, they would have been developers instead of QA engineers.
Not saying that there aren't any high quality QA engineers, I worked with some. But LLM's raised the bar in a way that most QA engineers can't reach.
If you're famous, you'll be fine. If you're at retirement age, you don't care. Otherwise, good luck! We put ourselves on the street by not protesting what is happening.
I think there will be larger markets, more companies, more jobs than before due to AI, but also a very painful transition period
AI reduces the cost of producing software (and other intellectual tasks), which greatly improves the viability for more and more ambitious projects. As far as we know the amount of problems software (and humanity) can solve is unbounded
It feels like the market has shifted in SWE yet again to heavily prioritize a new set of skills, of which those in the top quartile are desired more than ever
This is the magic question that I'm very eager to hear the answer to.
Fundamentally, steering LLMs requires the same structured, logical thought process that is required to write code, regardless of abstraction level. Unlike what HN would have you believe this is not a skill that is equally distributed across the population.
But given the rapid pace at which this technology is evolving, "steering" may very well be ceded to the clankers. LLM agents are fantastic at logical reasoning & any inefficiencies relative to human experts can be circumvented by sheer compute.
Am I crazy, or are these differences between the best models so marginal that you’d get roughly the same performance if you use the same high-quality harness (ie preloaded instructions from md files, including custom skills)?
You will immediately notice the difference if you use it at the threshold.
It's like most people just watching a 'starting nba player' (not a superstar, just a starting player) vs one that sits on the bench.
If you were just watching them play, work out, shoot, you'd never notice the difference.
Put them head to head and it's 98-54 and you start to see the patterns.
It's pretty interesting actually, someone tell me what the 'science' for this is, I'm sure there is some kind of information theory at work here.
Software has innumerable kinds of problems at varying level of complexity and so it provides the perfect testbed for seeing how far models can go in practice.
Should add: you're very right to hint that harness, tooling, and models tuned to both the harness and the kinds of things people do on the harness, as well as some other things, do make an enormous difference.
By and large, SOTA Codex/Claude Code are substantially better - at least for now. That may change.
You have correctly identified that getting a "high-quality harness (ie preloaded instructions from md files, including custom skills)" is the (or at least a) hard part.
Because you have to adjust the harness to your problem space and provide that so you can say it is high-quality.
Many people will stop that discussion at the claude code vs. codex vs. opencode level and then merge that with discussing model performance.
And that is also why "Generate an SVG of a pelican riding a bicycle" is still a benchmark worth discussing. Because at least it is a defined problem space.
No you're not wrong. Many people will see what you see. Enthusiasts will see it as monumental squeezing out that last drop of performance. In my opinion I think it is okay for enthusiasts to feel that way. I'm just satisfied with getting a tool as an aid.
Personal opinion: we need to focus more on efficiency instead of how large or complex a model can get, as that model creeps into more resource requirements. If the goal is to cost a billion dollars to operate then we've really lost the idea of what models are supposed to be achieving.
By definition the differences between "best models" are small. It's a tautology. If a model is significantly dumber than the others then it's not one of the best models.
"there’s zero chance any AI lab would train a model for such a ridiculous task"
Hmm, given how small the nerd community is and how often I met that task either on Hacker News, or on various Substacks, I am not so sure that the AI labs would ignore it completely.
Also LinkedIn wars of people trying to claim throne as most AI-pilled, throwing down strawmen stories of luddites yelling at data centres who'll lose their job to a single person doing 100x work.
As someone who uses AI daily (not in agent mode, just user-interactive), I have definitely noticed major quality improvements over the past few months. And that's surprising, because when you use something daily, you tend to overlook the big jumps.
I haven't looked into any sort of "agent" mode, just because I don't yet quite trust the AI not to do something dumb. Also, I don't use M365, where Copilot is integrated, so I suppose I would have to set it up myself.
Yes, with good RLVR at scale you can greatly improve performance especially on benchmarks
The hope was that good RLVR on relatively contrived datasets (like benchmarks) would be generalized to good software taste, which has somewhat succeeded but also the models fail in horrible ways still
And the hope beyond that is that good skills in fundamental problem solving tasks (coding, math) would generalize to tasks beyond math and code, which did happen but less so
I would say that most improvements are in easily verifiable things like code or math. At least that's where all the amazing results seem to be coming from.
Other domains I am not sure about, but I've heard from people like Cal Newport that the rate of improvement outside of code and math is not as impressive.
100% true - I only had five minutes so I had to edit it down to just a couple, but all of those models are excellent and keep leap-frogging each other.
'Producing Images' or even 'Some Code that is Valid and Compiles' is in some ways one of the most misleading ways we assess quality of the AI.
It is getting very good at producing code that compiles - at the algorithmic level.
This is definitely noteworthy - and the AI is crossing a critical 'productivity threshold'.
But 'Drawing of a Proper Duck' is almost arbitrary because it may have nothing to do with the 'Specific Duck You Wanted'.
Everyone has tried to get AI to 'Draw The Thing They Want' and you notice immediately how it's almost impossible to 'adjust the image' along the vector you want - because ... and this is key:
-> the AI doesn't really understand what a Duck is, its components, or fully how it made the duck <-
It just knows how to 'incant' the duck.
This becomes very clear when you try to get the AI to write proper documentation - it fails so miserably, even with direct guidance.
This is really strong evidence of how poorly the AI is generalizing, and that it is not 'understanding' rather it's 'synthesizing' from patterns.
We already kind of knew that - but we have not yet built an intuition for that until now.
Only now can we see 'how amazing the pattern synthesis' is - it's almost magic, and yet how it falls off a cliff otherwise
This has deep implications for the 'road ahead' and the kinds of things we're going to be able to do with AI.
In short: the AI is 'Wizard Level Code Helper, Researcher, and Worker' - but it very clearly lacks capabilities even one level of abstraction above the code itself.
LLMs were first trained by 'text' and now ... they are 'trained by our compilers'. Basically g++, javac, tsc are the 'Verifiable Rewards' in the post-training and reinforcement learning - and the AI is getting extremely good at producing 'code that compiles', but that's definitely an indirection from 'code that does what we want'.
It's astonishing that it took us all this time to internalize and start to discover what I think will be in hindsight a very obvious 'threshold' of its capabilities.
We are constantly 'amazed' at the work that it can do, and therefore over-project its capabilities.
I have no doubt that even with these limitations - the AI will unlock a lot more as it gets better - and - that it will 'creep up' the layers of abstraction of its understanding.
But I strongly believe that the AI is going to get much 'wider' (pattern matching dominance) before it gets 'higher' (intrinsic understanding) - and - that this may be a fundamental limitation.
This may be 'the LeCun' insight - when he talks about the limitations of LLMs in detail - I believe this is that insight writ large.
Even the term AI - or certainly 'AGI' may be a misleading metaphor - were we to have always called it 'Stochastic Algorithms' or something along those lines, it's possible that our intuition would be framed a bit better.
The most interesting thing is how it is definitely amazing, world changing, novel and powerful in some ways - and obviously useless in others at the same time. That's the 'threshold' we need to better understand.
> But 'Drawing of a Proper Duck' is almost arbitrary because it may have nothing to do with the 'Specific Duck You Wanted'.
That might be the case, but Simon's case "Generate an SVG of a pelican riding a bicycle" is very different.
The model actually has to understand what parts of a pelican and bicycle come together in something like an anatomically plausible way. That's a higher level of abstraction than something like passing the same prompt to Stable Diffusion etc
(The new Nano Banana/GPT Image 2.0 models are different though - they have significant world knowledge baked in)
No, it's not because it's seen 'anatomy' for Pelicans, Animals - even how it's represented in Animals.
If you try to get the AI to actually decompose it and start to 'draw pelicans' in very obscure ways, it will immediately fail.
Try to get the AI to draw the pelican from a very odd angle - like underneath, to the right, one wing extended, one wing not ... 0% chance.
Precisely because it does not understand those things.
FYI it's a slightly unfair case because it does not have a 'world model' yet, which will actually solve that problem, but even then not through very much abstracting.
We're a long way away - but in the meantime, there's lots to unpack.
Opencode has free access to Qwen 3.6 and Deepseek v4 Flash right now.
They're on par with Claude and Codex imo - when you still design architecture and know what the output should be. Claude and GPT 5.5 need less guidance with vibe coding, but we're not yet at a point where that's sustainable anyway even with those models.
It depends on what you’re comparing it against. For $20, OpenAI is still probably the best value for SOTA models. In terms of limits, you can use GPT-5.4 instead of 5.5. The intelligence feels similar, but it’s cheaper. You can also experiment with other harnesses like pi. It’s lightweight but capable enough, and its token usage is definitely much more efficient.
I mean yeah? It was a marketing campaign to boost the model providers and give Steinberger a cozy job at OpenAI. Hook, line and sinker.
Wake me up when we have an agent with constant learning and changing weights that I can have personally, not some LLM that can always fall prone to jailbreak and context injection attacks.
You think most of this stuff here is organic? Oh boy..
I think that there's a lot to be improved in harnesses and the way the models are interacting with harnesses. For example, the harness should be able to steer the model when thinking.
I'm so glad Simon is documenting this. The field is evolving so fast, so rapidly, so hungry for data and money, that few are willing to zoom out and document everything big picture so we can see the changes over time.
I mean do you guys remember "Do anything now"? Just a distant memory, a funny party trick.
There's something fitting about the mystical nature of LLMs and scrolling through a bunch of goofy pelicans on bicycles representing report cards for the bleeding edge of technology.
How are these even graded? Qwen3.6-35B-A3B gets high marks for a pelican with a gaping hole in its bill?
edit: Just noticed its feet are disconnected from its legs as well (but right on the pedals!). Pardon my French but that's Chinese af.
> The coding agents got really good
It's since November 2025, the so called "inflection point", that I'm still wondering for who coding agents become "really good".
All I observe is that they got better at tool calls and at answering questions about big codebases, especially if the question has a vague pattern to search for, and they're super useful for that! For generating production code, even with a lot of steering and babysitting?
Absolutely not, not quite there, not even close in my experience.
But we should stop talking in 1s and 0s, especially with marketing hype trains. There exists a gradient of capabilities that agents have, which really depends on the intricacies of the codebase you're working on. I think everyone has yet to discover how to better apply these tools in their day-to-day work.
But that totally collides with the current narrative, which flattens out our work as always the same and easily automated in every case. It's not!
That's why the debate is so polarized imo: there isn't a shared experience.
The polarization comes from the very disparate coding experiences and output quality that different people find when using these tools.
For example, I've had the opposite experience of yours, generating very high quality work using Claude (such as https://github.com/kstenerud/yoloai). Just in dealing with all the bugs and idiosyncrasies in the technologies I'm using, the agent has been a godsend in discovering and cataloguing them so that the implementation phase doesn't keep tripping over them: https://github.com/kstenerud/yoloai/blob/main/docs/dev/backe...
And the agents keep getting better all the time. Even in the past month I've noticed a considerable jump in its ability to anticipate issues and correctly infer implications as we build out research, design, architecture and planning docs. By the time it comes to coding, it's mostly a mechanical process that can be passed off to sonnet with a negligible defect rate.
I don't want to offend (it's AI coded anyway :)) but that does not scream "high quality" to me. The headline gif on that repo just paints a terrible picture. It can't draw a box correctly, there's random underscores all over the screen. The UI itself is just incredibly incoherent. I don't even know what I'm looking at.
Like, no it doesn't seem like very high quality work... It just seems like a vibe coded tool.
Don't want to be rough, but I'd like to read experiences about novel ideas that solve people's real problems in the real world; your project is just about selling new shovels.
As I commented on another thread
> If you're trying to solve a HARD problem people REALLY have, it's a novelty that agents can't help with, otherwise if it gets 97% there MAYBE it's just a signal that your idea isn't that novel!
This is a pretty wild take. What percentage of human engineers are creating novel solutions for hard problems, you think? I work in R&D and even my work is 90% doing things that other people already solved. If you are really doing cutting edge SOTA work that has never been done by another human in some form or another, then kudos to you and I want your job.
> What percentage of human engineers are creating novel solutions for hard problems, you think?
IMO every engineer should try spending time in a company that tries to solve new problems.
Otherwise we will be stuck, as we are now, with big tech paying you mountains of money for doing nothing, incentivizing you to embark on useless activities so that other managers can have a career, to fear layoffs, and, when they happen, to complain that "I've been looking for a new job for a year" while expecting the same compensation and environment. Web development jobs are particularly affected by that.
In the game industry, for example, if you don't do something interesting your game won't sell a copy.
Let me stress this again: if LLMs get you 97% there, maybe you should try another idea.
As a random example of a "hard" problem solved by AI that I couldn't have realistically done myself, despite having decades of wide industry experience:
Reverse engineering a proprietary protocol from a binary executable.
I heard about people finding security vulnerabilities in compiled code with the combination of Claude Mythos wired up to a disassembler like NSA's Ghidra. Someone here mentioned that GPT 5.5 "extra high" is just as capable, I had a problem to solve, spare token quota for the week, so... I gave it a go.
My problem was that I'm working with a product that uses a legacy 1990s style network appliance output log format that is proprietary, undocumented, and has no publicly available decoders other than an app by the same vendor, and that app has fundamental limitations. (I.e.: it's nothing like Splunk or Elastic.)
Codex with a Ghidra MCP bridge figured it all out: the framing, bit and byte packing, endian order, field names, data types, etc. It made me a neat little protocol parser in a modern language that I can use to spit out something sane like NDJSON or OTLP protobufs.
There is no way I could have reverse engineered this myself from compiled C++ code and/or packet captures! The format isn't self-describing and is incredibly dense (similar to NetFlow). In a hex viewer it looks like line noise!
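For anyone wondering what "framing, bit and byte packing, endian order" looks like once it has been worked out, here is a minimal Python sketch. The record layout below is entirely made up for illustration (it is not the vendor's format, which isn't given above): a 2-byte big-endian length prefix, then a body with a type byte, a 32-bit timestamp, an IPv4 address, and one byte packing two 4-bit fields.

```python
import json
import struct

# Hypothetical layout, invented for this example only.
LENGTH = struct.Struct(">H")    # framing: 2-byte big-endian length prefix
BODY = struct.Struct(">BI4sB")  # type, unix timestamp, IPv4 bytes, packed flags


def parse_records(buf):
    """Yield one dict per length-prefixed record found in buf."""
    offset = 0
    while offset + LENGTH.size <= len(buf):
        (length,) = LENGTH.unpack_from(buf, offset)
        offset += LENGTH.size
        body = buf[offset:offset + length]
        if len(body) < length or length < BODY.size:
            raise ValueError("truncated or malformed record")
        rec_type, timestamp, addr, flags = BODY.unpack_from(body)
        yield {
            "type": rec_type,
            "timestamp": timestamp,
            "src": ".".join(str(b) for b in addr),
            "severity": flags >> 4,    # high nibble
            "protocol": flags & 0x0F,  # low nibble
        }
        offset += length


def to_ndjson(buf):
    """Render every record as one JSON object per line."""
    return "\n".join(json.dumps(rec) for rec in parse_records(buf))
```

The hard part in the story above is discovering the layout in the first place; once the field order and endianness are known, the decoder itself is usually this small.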
Claude wrote me a little python script to help me sort and rank all the AI videos I've generated. It also extracted the metadata and organized it into a CSV. I sent it some hex dumps of the header and it got it first try. The header structure of webms generated by comfy is pretty novel.
> It's since November 2025, the so called "inflection point", that I'm still wondering for who coding agents become "really good".
You can dig up my past comments semi-arguing with simonw that AI just isn't good enough yet, but lately I've been using Codex mostly just to review existing Godot/GDScript code: https://github.com/InvadingOctopus/comedot
and now I'd say that in this day and age one would have to be dumb to not use AI in SOME way :)
It's helped me catch a lot of bugs that would have taken me a long time to even notice on my own. I guess it helps that my project is modular enough where each file can be considered standalone, with just 1-2 dependencies and well-commented already, so the AI can look at each file on its own one at a time. You can see the AGENTS.md I use on that repo.
Most of my productivity in the last 3 or so months has been thanks to AI, though none of the code there is AI generated. I even bought a MacBook Neo just to use as an "AI thin client" while on travel! even though I already had a beefy MacBook Pro M2 Max that I just keep at home/hotel as a desktop. Codex's recent remote control features have made it more useful for the moments when I get a cool idea while out at a cafe or on a walk.
I don't just copy-paste the AI's output, because it's almost always inefficient anyway, but I use its findings to manually clean up my shit. Maybe they're not that good with GDScript yet which is a bit of a jank language anyway.
So my main framework is wholly made by meat, but I do have fun now and then telling Codex to make experimental games using only the library of modular components I have written so far, to test my framework and also the AI's abilities. This kind of work seems like a surprisingly good match for AI sometimes: It just has to put existing blocks together, that already have well-defined interfaces and contracts etc.
I've been on the $20 ChatGPT plan for about a year now, and only started using Codex since like maybe 4 months ago, almost always on the latest model with "Extended Thinking" or "Extra High", because I want my shared code to be as correct as possible because everything else I do depends on it, and I only hit limits like 2 times in the last 3 months.
Claude on the other hand, terrible: https://i.imgur.com/jYawPDY.png
Grok is OK for general stuff, never tried it for coding.
Gemini's UI/UX and lack of privacy and the AI itself is so terrible I tried it just maybe 2 times ever...and it refused to work, on Google's own Flights website and reverse image search! (it told me to do it myself)
Thanks for sharing your experience! I totally agree that if you "own your code", as in you're invested in it, coding it and documenting it, these tools can be really valuable for review, bug fixing and maintenance. It pushes you to do better, maybe one piece at a time like you said, with a well-modularized codebase. I think more devs should share experiences like that; we should push back on the marketing and on the "I don't code anymore since X" narratives.
I set up a hook that reviews every commit and highlights potential bugs (async) and writes a report to a dir.
Then I have a script that summarises that I usually run before pushing or at end of day.
Works quite well for both improving my code and the code ai wrote.
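The summarising half of a setup like this can be very small. Below is a rough Python sketch, assuming (my assumption, the comment above doesn't say) that the hook writes one JSON file per commit into the report dir, each with a `findings` list whose items carry a `severity` field. The file naming and the schema are hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def summarise(report_dir):
    """Count findings by severity across all per-commit JSON reports."""
    counts = Counter()
    for path in sorted(Path(report_dir).glob("*.json")):
        report = json.loads(path.read_text())
        # Hypothetical schema: {"findings": [{"severity": "high", ...}, ...]}
        for finding in report.get("findings", []):
            counts[finding.get("severity", "unknown")] += 1
    return dict(counts)
```

Running something like `summarise("reports/")` before pushing gives a quick count of what the reviewer flagged since the last look, without reading every report.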
You are experiencing a reverse Dunning–Kruger effect.
For someone that just dabbled in coding prior, it went from AI building 80%, and struggling through to finish the 20% when trying to build an app/website.
now it's like 97% and struggling with last 3%. Yes it'll look rough around the edges when evaluated by a senior dev, but being able to build MVP level things to completion with ease helps you stay engaged and motivated to continue and learn.
Please do not cite Dunning–Kruger effect at random.
Who needs to generate a dumb demo of a 97% done CRUD app? We had code generators for those. Every time I read claims like that and ask for further explanation, I discover it's people who were not productive before, generating the so-called "MVP level things to completion with ease".
If you're trying to solve a HARD problem people REALLY have, it's a novelty that agents can't help with, otherwise if it gets 97% there MAYBE it's just a signal that your idea isn't that novel!
LLMs can effectively validate your business idea
What would you consider a "hard" problem?
The obvious pushback to all of the slop is: coding was never hard. Learning resources were abundant and free.
If these people had a burning desire to build things prior to LLMs and couldn’t put in the effort to learn to build them (which is also fun!) then why would they ever put the effort into anything to understand it and make it good??
I asked Gemini for a video of 'pelican riding a unicycle in hyde park' - I was blown away by the output:
https://gemini.google.com/share/55e250c99693
I'm surprised by Grok as well:
https://grok.com/imagine/post/8d1eab88-737f-4d46-ba92-9b6502...
Interesting that it does better at making the pelican pedal in the video generation than in image generation.
That’s really impressive, and slightly worrying for creatives involved in film, animation or modelling.
Even more worrying are the implications for fake news, propaganda, fraud, deception and mental health.
This is really my biggest worry when it gets to consumer AI. People already have a hard time informing themselves properly. Now we have technology that just boosts the already existing confirmation bias people have. It's sickening.
Maybe short term yes. But longer term people will finally put their guard up against deception that’s been around for decades.
At that point their ground truth is completely skewed (already for some folk), everything is relative. Some of them will probably die off in self-induced Darwin award winning ways, but sadly certain skewed world views may persist.
People will still believe what they want to believe.
If they haven’t in the past I don’t see why they would now.
It’s the opposite: non-creatives (if such roles even exist in those industries) should be worried. All those models offset technical skills, allowing you to get from idea to implementation through a different route (which can be easier or harder depending on idea and model - good luck tweaking that pelican’s exact pose and movements to match your imagination precisely). Nothing touches creativity, not even in the slightest.
But there’s a lot of panicking, fear-mongering and all sorts of nonsense around this whole subject.
My mother has started watching 100% AI generated stories on YouTube. They are good enough to be entertaining even if they include random errors like messing up the main character’s name.
The thing is the creative economy is all about people’s attention and pocketbooks, it doesn’t need to be great just good enough.
The truly excellent weavers will be fine?
only SVG counts tho, don't know why
Willison chose this task because (unlike actual images of pelicans) it was clearly not in the training data, but could be reasoned about and composed from what's there. But just like those "how many golf balls can you fit in a 747?" interview questions, it should now be retired.
Does this guy have a "publish to front page of HN" button on his blog editor?
HN has a mechanism that causes popular blogs to stay popular.
It's a winner-takes-all karma prize for being first to post the article.
This causes a rush of people to post.
HN has a mechanism by which duplicate submissions count as upvotes toward the first submission.
This is a positive feedback for the desire to be first, which increases duplicate submissions and in turn the karma reward.
This effect means that good blogs stay well upvoted. This isn't altogether a bad thing, but it does mean some blogs require a string of poorly received posts before that effect wears off and people no longer rush to be first.
One way to fix this would be to attribute all karma to user simonw himself ( and do similar where attribution to an HN user is known. )
He’s pretty well known in the HN community. https://en.wikipedia.org/wiki/Simon_Willison
that's a cool wiki picture
I liked the article, so if he has such a button I hope he keeps clicking it.
He's one of the main developers behind Django.
Years ago I used to read his blog on Django and found it quite interesting despite being neither a Django nor even a python user - this must have been at least 10 years ago and perhaps more.
When he resurfaced in my feeds as an AI commentator it took me quite a long while to join the dots that he was the same person!
he usually has good posts so people usually upvote
it's better than the ex-Google CEO spam I see astroturfed everywhere else
I wonder how much the 'inflection point' is a thing vs marketing. I'm sure the models got somewhat better, but even now when I'm trying to 'vibe code' a game with the latest models (combination of Codex w/ gpt5.5 and gpt5.3-codex), they really do struggle.
They definitely get something barebones up and running, but it's far from a fully fledged application.
Paradox - you can get multiple inflection points even as systems start to have diminishing marginal returns in core capability. I think this is due to 'threshold crossing', where something 'becomes good enough for a specific purpose' - it just unlocks capabilities.
'Nail Guns' used to be heavy, required heavy power cords, and were extremely expensive. When they got lighter, cheaper, battery-powered ... at some point, they blended seamlessly into the roofer's process and dramatically multiplied the work that could be done. Marginal improvements beyond that may not yield the same 'unlocks' because the threshold has been crossed.
Nitpick but commercial roofers prefer pneumatic over battery.
This is a great analogy. Jan/Feb this year was when the models crossed from useful to essential.
I remember this very clearly myself. Before opus 4.5, I was doing a lot of hand holding and was coding a lot myself, but I have not written code since that day more or less.
I did write some stuff myself just to learn how the Enigma encryption machine worked, so that was written by hand, to learn. But professionally, I stopped coding in November.
It is sad. I like programming; if I couldn't do it and had to write text instead (which I do hate, I'm not a writer), it would make for quite a sad world.
A pattern I've settled into is to write code but leave a TODO for every narrow thing I want the LLM to do for me. Then just tell the agent to fix the todos. It's often faster and easier to give "instructions" this way
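A tiny helper can make this pattern a bit more mechanical by listing the TODOs you left for the agent, so you can paste them into the prompt or check that none were missed. This is only a sketch: the `TODO(llm):` marker is a convention invented for this example, not something any agent recognizes natively.

```python
import re

# Hypothetical marker: only TODOs tagged "(llm)" are meant for the agent.
TODO_RE = re.compile(r"#\s*TODO\(llm\):\s*(.+)")


def collect_todos(source):
    """Return (line_number, instruction) pairs for every agent-directed TODO."""
    todos = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        match = TODO_RE.search(line)
        if match:
            todos.append((lineno, match.group(1).strip()))
    return todos
```

Tagging the marker keeps ordinary human TODOs out of the agent's scope, which fits the "narrow thing" idea: each instruction sits next to the code it applies to.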
Nothing stopping you from doing that in a post-LLM world
Of course you can always program by hand, no one is stopping you.
Not sure this is true for all of us. I bet many/some (unsure here) are told to use ai for their daily programming tasks.
Plenty of companies are forcing people to use AI.
In most cases you could work around that. For instance write the code yourself and make the AI write the tests. Or keep it busy writing superfluous documentation. Very few people are micromanaged to the extent that they can’t subvert the system.
How do you justify your salary given that you're just using a tool that any of us could use for $20 an hour in your role?
I don't feel the need to justify my salary, since I'm simply lucky in that regard. But I'm pretty sure you couldn't do my job just because you had access to a coding agent. Most of my time at the office is spent discussing high-level architecture and strategy, ideas, customer requests, backward compatibility, safety, security, quality assurance, etc.
Writing the actual code is a significant part of that, but the codebase is so complex that even Opus 4.7 and GPT-5.5 struggle with it without being fed a *lot* of context and constraints. And even then, they need a *lot* of steering due to making bad decisions that only someone with an intimate knowledge of the theory behind our software is able to catch.
I can only assume that people who think coding agents can completely replace an actual developer mostly deal with trivial software regarding both scope and the type of customers they serve (individuals instead of big companies in industry).
How do you justify your salary given that you're just using an OSS compiler/editor that any of us could use for free in your role?
AI just changed how I edit code - I still see coworkers (senior developers) failing with Claude/Codex and get stuck when there are trivial solutions if you understand the full problem space. Right now AI is just a productivity tool.
Can you share how you use it to edit code? I've seen a couple approaches, curious what you are doing:
1. Spec -> plan -> code (all agent driven, maybe with grill-me or ultraplan)
2. Handwritten spec -> agent driven plan -> agent driven code
3. Agent driven spec -> vibed code -> Fix by handholding until ok-ish
4. Vibed throwaway prototypes -> extract useful patterns -> rewrite with handholding
5. Generate file structure with handholding -> manual TODO comments -> Fill in blanks with handholding
Usually I describe the problem, explore a bit with LLM iteratively. Then I switch to creating a plan when I have enough insight (and the LLM has it in context/same session as exploration), specifying all the things I'm trying to accomplish.
Then I just iterate with LLM - I let it start writing stuff in YOLO mode and check on what it's doing in the code steering it in the direction I want.
Usually the code LLM generates will work but is kind of garbage - but I can easily steer it towards better implementations.
Sometimes using an LLM is theoretically slower than hand-rolling - if I just sat down and focused I could outperform the iteration and the waiting, especially considering how stupid agents are at running expensive builds/test suites (with a bunch of explicit instructions in skills/claude/agents.md). But the practical improvement of going with LLM is that you have a bunch of thinking traces saved as a part of your iteration process - it's really easy to get back into flow. This is a huge productivity win for me given how many interruptions I have in my work day. Like so many people like to point out - writing code ends up being less and less of your time as you level up in your career.
This is _the_ question we must all be able to answer, so here goes my attempt - we all have access to the same tools; before stackoverflow it was forums, books/manuals, so it's always been about “getting there, showing up, figuring it out”. Your hypothetical boss has other things to do than kick an LLM around at that price
Please see Ben Evans’ podcast for a good take on this. Coding is just one of the task you do in your job, it is not the job or at least it probably is not. You do not get paid to code, you get paid to make a set of decisions that create value to the company. If this is automated then yes sadly your salary is not justified.
> Coding is just one of the task[s] you do in your job
But it's by far the most fun part and the only reason to take such a job...
I totally agree. I loved coding because of its closed feedback loop. Since last November, I also delegated it mostly to agents. Now I concentrate more on the design part, which is not the same. However, you move with the times and hope something else will become exciting. I do not know a more worthwhile and satisfying way than computing to spend my work hours.
I agree, but the reality is that most people work to make a living, not to have fun. If you enjoy your job because you mostly get to write code in a tight feedback loop instead of doing the "hard" work of planning, writing and reviewing specs, balancing customer requirements, and the lot, you have a very privileged life. And those jobs are probably going to get fewer now.
It's kind of sad. But on the other hand, I am glad I don't have to write every little line of code myself *on top* of having to do all the other stuff.
To me, LLM's free up time for me so that I can spend time on the fun parts of coding. Less boilerplate, more focus on the interesting problems. This is no different from using high level languages. The problem domain is less around memory management and garbage collection and closer to the problem you're actually trying to solve.
But we’ve had tools to automate out the boilerplate for years. We don’t need ai for that. It’s seriously like we all forgot we could run one command and scaffold a project. AI isn’t even that great at it. Last I tried a month ago it used a really out of date version of nextjs and picked all sorts of random deps that weren’t in the plan.
I could have just used the next project scaffold tool and been on my way before the ai even started returning output.
I agree with this. I feel like there’s a false dichotomy right now in a lot of these discussions where one can only vibe code or only code by hand. It is possible to do both…
Which episode?
Someone competent using them is today a requirement and for a while will make the marginal utility of skilled workers greater than that of unskilled. The justification is that they are much more productive than they were before.
You can build things quickly with AI, but you can’t delegate your responsibilities to AI. Once the AI starts struggling, you’ll need to take over and figure it out.
They're using a tool that anyone can use for $20 an hour, sure. But that's not what they're "just" doing. This is what is so insane about non-technical people talking about code - writing the actual syntax is not really the hard part.
What you're saying is like "how do you justify your salary as a NASA engineer when anyone can use Simulink and generate the code?"
It is extremely ignorant.
I don't think you understand how programming as a job works, writing code is the final output of the process but it's not the job in itself.
How do you justify your salary given that you sit in a chair all day, likely making the world worse, and make 5x as much as someone saving lives, building houses, or teaching kids how to read?
Supply and demand. Not many people are good at programming and it's highly in demand.
The question is how many people will be good at vibe coding? If the answer is "lots" then we can definitely expect programming salaries to return to "normal" levels. His question is very relevant; you can't dismiss it as easily as that.
it can be easily dismissed because "anyone can use the tool that costs $20" makes no meaningful sense.
this was always true; in fact $20 is more than the nothing that notepad++ costs
it's a flippant statement. Go down the line of any tool; its cost has basically nothing to do with the skill needed to operate it. See basically everything. There's levels.
I have no idea what you're trying to say. If anyone really can vibe code then programming salaries are pretty much guaranteed to come down. The critical question is whether it really is true that anyone can do it, or if it still requires rare skill.
are you a programmer? it 100% requires skill. AI or not.
i'm trying to say there's levels to this. if you don't agree then you don't agree. but i can buy commodity tools for any skill and that doesn't make me professional grade at that skill.
no engineers on staff and stakeholders think the company is incompetent
Coinbase is paying the price for that for every UX glitch, after the CEO was gleeful about HR personnel shipping production code
There is no good justification for anyone's salary really, except perhaps doctors and underwater welders.
They don't need to justify it!
Because the tool will happily give you a "solution" that kinda works for a few inputs. It will happily correct itself when you give it more incorrect tests.
It will almost never converge on the general solution that will pass tests you haven't given it yet.
This is why AI is sooo good at Javascript and related slop. A solution that "kinda works" is good enough 9 times out of 10 and if some tests fail well ... YOLO and the web page will probably render anyway.
Contrast that to using Scheme or Lisp where AI will have trouble simply keeping the parentheses balanced.
To be fair, take away a human's paren highlighting and see how well they do.
While I certainly like parentheses highlighting and rainbow parentheses, I've programmed Clojure without syntax highlighting and while it’s not as nice as it would be with, it’s fine.
I’ve also written C++ and Java in Notepad long ago. Not ideal, but hardly a problem.
Not everyone is a "coder" you know, some of us are engineers.
You adjust pretty quickly. Taking away compiler error messages would be fun though.
I find it gets you past the starting line but when you dig into the code it’s a mess of duplicated code, muddled responsibilities, poor architecture, 10k line files that eat your tokens, etc.
I’m building something using LLMs to scrape websites/socials for unstructured event data from combined text/images and the only way I’ve managed to get 100% consistent results for a reasonable cost is to break the task down into very small pieces that reduce the scope of mistakes significantly.
At present, for reasonable complex tasks, Codex/Claude will happily code you into an expensive corner.
Indeed. To add to this, the obvious solution (ask the AI to break down the tasks to whatever METR says they'd be capable of 80% of the time) is of limited utility, as the AIs are only so-so at estimating task complexity.
(Even when they're getting the planning part right, I do also recommend checking the LLM-generated unit tests, because in my experience some of those are "regex the source code" not "execute functions and check outputs").
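To make the "regex the source code" versus "execute functions and check outputs" distinction concrete, here is a minimal sketch. The `slugify` function, its deliberately broken twin, and the test helpers are all hypothetical names invented for illustration; they are not taken from any real project or from actual LLM output.

```python
import re

# Two versions of a hypothetical function, kept as source text so that
# both styles of test can be run against them.
GOOD_SOURCE = '''
def slugify(title):
    words = re.findall(r"[a-z0-9]+", title.lower())
    return "-".join(words)
'''

BROKEN_SOURCE = '''
def slugify(title):
    title.lower()  # result is discarded, so nothing is actually lowercased
    return title
'''


def load(source):
    """Execute the source text and return the slugify function it defines."""
    namespace = {"re": re}
    exec(source, namespace)
    return namespace["slugify"]


def regex_test(source):
    """Weak test: only checks that the source text mentions .lower()."""
    return re.search(r"\.lower\(\)", source) is not None


def behaviour_test(func):
    """Stronger test: calls the function and compares real outputs."""
    cases = {"Hello, World!": "hello-world", "  A  b ": "a-b", "": ""}
    return all(func(given) == expected for given, expected in cases.items())


if __name__ == "__main__":
    for label, source in (("good", GOOD_SOURCE), ("broken", BROKEN_SOURCE)):
        print(label, regex_test(source), behaviour_test(load(source)))
```

The source-matching test passes for both versions, while the behavioural test only passes for the working one, which is the reason the generated tests are worth a skim.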
I've "vibed" some non-trivial stuff lately using a combination of Codex with 5.5 and Claude Code with Opus 4.7.
Key has been to spend a fair amount of time on initial overall design document, which is split into tangible and limited phases. I go back and forth between them on this document until we're all happy.
For each phase an implementation plan is made. At the end, a summary document of what was delivered and what was discovered. This becomes input to next phase.
I do check the documents, and what they're doing. I also check the tests, some more thorough. And some spot checks on the code to see if I like the structure.
I have mainly used Claude for coding and Codex for design and code review after phases. I ask both to check test coverage after phases.
Managed to implement some tools and libraries without writing a single line of code this way, which have been very beneficial to us.
Since it's so async I can work on other stuff while they plod along.
I think it's not universal though. But stuff that can be tested easily and which you have a firm grasp of what you want to achieve, but not necessarily exactly how, that I've been impressed with.
Do you use anything to orchestrate multiple agents pitted against each other (coder, reviewer, tester, etc)?
Currently just manual. I'm not pushing the frontier here, just getting my feet wet.
While both Claude Code and Codex are capable harnesses, I definitely think there's a lot more to be gained from the harnesses. Quite a few of the times I needed to nudge the steering wheel it was things that a separate agent with the right prompt could have picked up on.
That’s not vibing, but waterfall development.
Waterfall was famous for wasting developer time and extending delivery dates in exchange for simplifying management. If Claude time is comparatively inexpensive, but human oversight remains necessary, we will switch back to waterfall because the relative importance of the two resources will invert.
It's vibing in the sense that I'm not really writing code, and I'm leaving a lot of decisions to the models. I let them drive a lot of the design document details, I just made sure it contained the salient points. Implementation plans I just skimmed. Didn't write any code, just did some checks here and there.
But yes, I did think that it sorta felt like being a team lead for some eager programmers.
> Key has been to spend a fair amount of time on initial overall design document, which is split into tangible and limited phases.
> For each phase an implementation plan is made. At the end, a summary document of what was delivered and what was discovered.
> I do check the documents, and what they're doing. I also check the tests, some more thorough.
Sounds like programming, but with extra steps.
It's software development, but with much less actual programming (in my case none).
When I said I check the documents, the initial design document was the only one I really took a hard look at. The intermediate ones I just skimmed, looking for red flags or something I had forgotten to tell them. Those documents served as a basis for their work, and as a record of what was done.
Overall I spent perhaps a few hours on each project, over the course of a few days. I'd check in every half hour or whenever I had time, tell Claude "Great, let's do the next deliverable", or GPT "We're done with phase 4, please do a detailed code review, reference the design document and documentation of previous phases". Then I'd leave them cooking.
Also the least fun part of development. Maybe I’m the weird one but I like to just jump right in, planning every last detail before writing code is boring.
For me, the fun in programming is sometimes to actually write code, solving a problem in a specific way or try some new approach. Other times the fun is to create something that works, and the code is more a means to an end.
The first case I'll probably still do by hand, like handmade vases even though factory-made ones are cheap and readily available.
For the second case I think these newfangled tools have made it even more fun, since writing lots of boilerplate, repetitive event handlers and whatnot is not my idea of fun.
> planning every last detail before writing code is boring
Not only that but you can't really plan everything. It is impossible. Without LLMs, with every line of code you are making a decision or discovering something new that must be dealt with or realizing how the current thing might impact something else and so on.
There is no way for a programmer to consider all of these little things ahead of time and if an attempt is made, it will take as long as actually writing that code.
None of it is non-trivial tho. You might think so, but it’s not.
It wasn't trivial in that I used a lot of my programming and domain knowledge, both when iterating on the design document and skimming implementation plans.
I didn't use it often, but when it was needed it was needed.
Opus 4.5 in November 2025 was legitimately, unironically an inflection point and is the sole reason for the current hysteria.
GPT 5.5 is a significant improvement over GPT 5.4 but I wouldn't call it an inflection.
5.2 and the first codex model were step function changes in capability
While some people got it to work better, for me vibe coding games still didn't reach the point of regular sites/web apps. Physics, creativity, assets and UI/UX still need a lot of handholding with the models. Games that are more interface based, like point and click or something like Reigns, are easier though.
I feel the change. It went from an autocomplete tool, to an agent running 5 tasks in parallel while I just supervise. The improvement is enormous.
It's very real. Just in the past 2 months or so IMO there's been a pretty big improvement in Claude for local dev (although I think a lot of that is less model strength and more harness capability). 1m context is a huge difference (~30 min vs 2.5hr between compact significantly increases the scope of what I get the AI to do before it goes stupid). The other biggest difference I've noticed is a better balance of actually doing the work vs pushing back on bad ideas. I want the AI to tell me if it thinks the thing I am telling it is wrong or a bad idea, but if I confirm, I want it to do that anyway. A couple months ago, Claude was a lot more likely to either say "This is too much work I'm not going to do all of it", tell me the idea was genius (and then pretend to do it) or something equally useless.
>1m context is a huge difference (~30 min vs 2.5hr between compact significantly increases the scope of what I get the AI to do before it goes stupid)
I think the smart zone stays within the first 100k tokens, no matter if the context window is 240k or 1 million.
I divide the work to fit within that 100k and use subagents for the tasks.
In my experience it's more like 400-500k tokens.
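For anyone wanting to try the "divide the work to fit a token budget" approach, a rough sketch follows. Everything in it is an assumption for illustration: the `Task` type, the per-task token estimates (which you would have to guess or measure yourself), and the default budget of 100k, which is just the rule of thumb quoted above. Pass a larger budget if your experience is closer to the 400-500k figure.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    estimated_tokens: int  # hypothetical estimate, not a measured value


def pack_tasks(tasks, budget=100_000):
    """Group tasks, in order, into batches whose estimates fit the budget.

    Each batch is meant to be handed to a fresh subagent or session.
    """
    batches, current, used = [], [], 0
    for task in tasks:
        if task.estimated_tokens > budget:
            raise ValueError(f"{task.name} alone exceeds the budget; split it further")
        if used + task.estimated_tokens > budget:
            # The next task would overflow, so close the current batch.
            batches.append(current)
            current, used = [], 0
        current.append(task)
        used += task.estimated_tokens
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    plan = [
        Task("parse config", 60_000),
        Task("write loader", 30_000),
        Task("add CLI", 50_000),
        Task("unit tests", 40_000),
        Task("docs", 20_000),
    ]
    for number, batch in enumerate(pack_tasks(plan), start=1):
        print(number, [task.name for task in batch])
```

The packing is deliberately order-preserving rather than optimal, on the assumption that later phases depend on earlier ones.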
It's real for me as a non-coder. Previously, uploading a Python script and asking it to add this function or that function used to break it; now it usually just works, at least with Claude and ChatGPT models. Google Gemini still breaks stuff, but rumors are that their new Flash model, which will be announced soon, is very good. I am usually working with data in CSV files and generating spreadsheets, PDFs etc., and the results for that have improved dramatically.
That’s me. Built a scraper to dump stuff to a CSV of a list of images for further OCR and OpenCV processing. Now I have a convenient list of hits once I run the batch that used to be a loooot of manual sifting.
Once I work out the kinks, I’ll be able to further automate it.
Would have taken 10-100x as long for me to build it without AI and the AI version is probably better.
But yeah, I have enough knowledge to know what prompts are needed and figure out those “oh, I think it’s running slow or failing because of xyz” and further prompt to improve it based on what I think it should do instead.
And I know where to make slight changes without burning my allotments.
"flash" or "fast" AI models are worse than useless at coding for me. they make my codebase much worse. It's a maintenance burden.
Gemini Pro on the other hand can be quite a pleasant experience.
It is all marketing. The easiest way to tell is that a year ago the same people said the inflection point was X or Y model.
When people claim LLMs just don't work for them, the first question is whether they're using the latest model or not, and if not, dismissing the poster.
The thing is that that same question was being asked a year ago, and even a year before that, but with the models that lead to a dismissal today.
Just run the experiment yourself, wait 6 months, say LLMs just aren't working for the software engineering that you do, and people will dismiss you if you say that you use Opus 4.5 and not the latest model Claude MegaMind 8.8 pro max gigathinking. Despite this model being touted as the inflection point in this article.
I think it's because both sides are talking about different things. If you go in expecting it is good enough to make developers obsolete today (a reasonable impression to get from the way a lot of people hype it) you would be disappointed, and after the first couple of tries every few months you would probably not try it much with next generations. Reasonable if it's considered a dichotomy.
But a lot of people excited about new generations (including me, now) are not seeing it as a dichotomy but rather a spectrum where models are getting better and indeed once a year or even 6 months at times there comes a sudden growth which feels like an inflection point from what came before. Practically, it's a tool like any other, you evaluate it based on if it's worth the effort and cost for the benefit you get from it and if it is and has a good DX you use it. If the calculation doesn't work for you, it doesn't. For me, it has gone from a novelty, to good for some kind of quick manual search, to I guess it can debug some kind of errors at times in very specific conditions, to hey I think I am getting a bit addicted to autocomplete in IDE provided by them even if I don't use them for anything intelligent but it's becoming indispensable now but only this part, to it's good for areas I lack expertise in, to agentic sucks I will stick with discussing algorithms and architecture with it on greenfield projects, to holy shit it can do agentic decently well now, I am skeptical about giving it access in more than limited cases, to now I am getting close to letting it run free on my device in the not so distant future I guess. Some of these were big jumps, at each point I was skeptical of growth. Every time I thought now the growth will slow down, from the days of 2k context windows to millions now. From basic chat completion to working on complex adaptive systems, game theoretic modelling, heuristics and constraint modelling and other things I throw at it. I am still needed in the loop, it can be so smart at times and then will do something so stupid, but the frequency of stupidity is rapidly decreasing. I am still needed, I don't think it could accomplish alone all that it has done for me. But I do at times at night remain awake reflecting on my self worth for the potential day when I don't add that value. When I have a harder time keeping up.
Also, had someone told me even in 2019 that in 2026 we could have NLP models do what they do today, I would have dismissed it all as sci-fi, and here I am waking up in awe of the world we live in and how quickly we adapt.
Pure vibe coding won't work. You need to define an excellent architecture, have great specs, a solid plan, divide the plan in small phases that fit well in a context window, use TDD and automated code reviews for implementing each phase, do QA and some code review.
At any point you need to have agents review, verify and test the other agents' output and iterate until the output is perfect.
And also, have good e2e tests.
IMO, if you don't spend at least a few tens of millions of tokens per day, you aren't doing it properly.
It sounds very self-confident to claim such a thing. Something like "If you don't do it the way I do, then you are doing it wrong".
I'm curious how the 6 months have looked from a non-programmer's perspective. What kind of co-working tools and similar optimizations have people from other fields experienced?
In pure maths:
- pre GPT-5.4: very limited use; some smart people got some mileage out of the models, but it always required serious work and a very suitable problem. Of course the models could solve homework problems, but that felt more like a downside to us who teach.
- since GPT-5.4 (Mar 2026): the "wow" release; suddenly answering MathOverflow-level problems that have previously been stumping experts. Still prone to hallucinations, but smart enough to use the built-in Python skill to verify its claims on small examples when possible. Probably a lot better at formula-heavy math than at the abstract "philosophical" kind.
- GPT-5.5: gave me a fascinating, significantly nontrivial and highly instructive "proof from the book" on an MO-hard problem that I'm in the process of writing up. Might have been luck and good prompting, though. Didn't really feel like a qualitative leap from 5.4, but I take quantitative any time. Still requires suitable problems, but it's much harder to rule out suitability from the get-go.
Claude and Gemini have been also-rans the whole time and still are. I use Claude for secretary-like tasks; occasionally it finds an easy proof too, but usually because I've missed something obvious.
Oh, and GPT, and to a lesser extent Claude, are great at hunting errors in maths. Probably 90% of my prompts so far have been for proofreading my writings.
I am an instructor who helps deliver an apprenticeship. My new boss has been in our industry for about 20 years and is one of the most respected people in our company. They've just joined us to teach and are off doing a two week course. On the first day she was told to let AI write all of her lesson plans, and then feed the lesson plans to AI to make her slides...
Hopefully she rejects all this out of hand, but if she doesn't it'll mean that none of our trainees get the benefit of her experience, who she is as a person, and what she has to pass onto them.
We have 6 monthly reviews as instructors where we are told the same thing. "How could you use AI for your teaching?"
They don't even feel the need to justify why this would be desirable, or is needed at all. It's just pure bandwagonning. Unbelievably, most of my coworkers are extremely positive about AI, although none of them have told me they use it for anything besides preparing their lessons for them — they just use it instead of having to think, or spend time preparing...the only important thing they do at work.
It makes no sense to me.
I’m teaching a class at a university in Japan (on AI-related issues, as it happens). I’ve been teaching for more than 40 years, but at 106 registered students this is by far the largest class I have ever taught. AI tools are very helpful for class management, such as keeping track of attendance and homework submissions.
I have to consciously avoid using AI for more cognitive tasks, though. It would be very tempting to have Claude, ChatGPT, or Gemini summarize, classify, and grade the students’ assignments, write individual feedback, prepare my lesson plans, etc. However, I know that my engagement with the material and with the students would suffer. I also want to show the students that they are learning together with me and with each other, not with bots.
I am semiretired and have a light teaching load that gives me plenty of time to prepare for class. I can see that full-time teachers might find it hard to resist the lure of offloading their thinking to AI.
I've been a teacher (most of the time a college professor) for...a long time. Nowadays, when preparing a new course, I definitely work with AI: "Here's what I want, and who my audience is - give me a course outline".
That gives me a starting point. Of course, I modify it. Maybe I bounce back and forth to the AI for further refinements and suggestions, but ultimately I have to be happy with the result.
When prepping the individual lessons, the biggest time saver is coming up with examples to illustrate particular points. I could do this alone, but sometimes that involves staring at a blank screen for a while. It is faster to ask the AI for suggestions, pick the one I like, and refine it further myself.
AI is a tool. Use it appropriately.
I work at a company that deploys AI to enterprises
The average office worker is amazed at Copilot (not in the IDE - but the app bundled with Windows), and they mostly copy paste material into their enterprise provided ChatGPT / Gemini, and get tips from Facebook / Instagram on their top 5 best prompts for work productivity
Showing them agents that automate work at scale is a very magical experience
And then everyone that has to deal with their copy pasted output is too nice to say how bad it is and how much work it just offloads to the next person that’ll probably get frustrated and have an agent handle it.
Claude in Office was a tipping point for nontechnical folks around me. Everyone’s slides decks are immaculate now. Finance isn’t needing nearly as much BI help. It’s pretty impressive.
I find it really troubling finance are relying on LLMs (word generators!) for financial analysis - I mean I guess it means there will never be any annoying gaps in the data.
Depends on how it’s done.
I use it a lot now for knocking up grafana charts etc. It’s not so much that the LLM is feeding the numbers through. You can still use real tools to analyse and summarise the numbers, it’s just much quicker at driving them.
As ever with data analysis, two things will continue to be true. Real insights come from spotting something that looks off and digging into it deeper. Secondly, it’s really easy to connect data in a misleading way.
I’ve had a Claude analysis handed to me this morning including a summary list of actions we’re going to take next which falls into this very trap.
The insights you’ll get from your data will only be as deep as the curiosity of the person at the helm.
Can I get Claude to view the slide decks for me so I don't waste my time?
Interesting. I don't have to use PowerPoint much, but I hate it when I do. I don't want the llm to write the words but I do want it to make things look nice. So does this work well now?
My pipeline for this is vscode + prompts + markdown templates + GitHub copilot -> markdown docs -> pandoc to produce .docx -> copilot in word for “nice” formatting -> copilot in ppt for nice decks. LLMs all the way down.
I find it’s easier to version control and diff the .md artefacts, those remain my authoritative source.
Wow. Seems like a headache compared to how I make slides the old fashioned way: copy and paste my figures into blank powerpoint.
With a little bit of work, it works very well. You can generate powerpoint directly with Codex or Claude Cowork. There is also Canva support for these tools and it has its own AI integration. Another useful tool in this space is the Gemini integration in Google slides.
If you are a bit technical, reveal.js is actually really nice for this. I one-shotted a PDF export for it that uses a headless browser. I've used that a few times now.
What works well for me is to take an existing presentation and then some raw input and generate a new presentation in the same style as the old one from the raw input. After that, I can go in and tweak individual slides.
Another thing I did recently was take somebody's existing pitch deck and fix it with a one line prompt: "this deck is a bit meh, pimp it!" that worked unreasonably well. I like using shitty prompts like that. Codex often manages to do the right thing if you don't overthink your prompts.
Classic deck of somebody that used way too much text and only bullets. It did a great job on that presenting the content in a more simple and better structured way. Pulling out key facts and highlighting those, simplifying text, etc. Doing that manually would have taken hours.
If you don't want an LLM to write the words, surely you also want to decide on the data and graphs to show by yourself? Isn't that 90% of a presentation? The "looking nice" part doesn't matter as much, it could be black text on a white background and it would be fine.
The important part is the presentation matching your presenting cadence, which is something LLM generated presentations never get right. I don't have a problem with people generating presentations, but most of the time they just end up reading whatever is on the screen when presenting.
Purely anecdotal, but in my team of 20 data analysts, we've seen a bunch of them become quite productive in producing tools and apps. These are analysts with mostly domain knowledge, and not so much programming knowledge - meaning that they knew the basics to write scripts, and wrangle data programmatically, but not enough to actually engage in software engineering.
Some of these are now contributors.
I also have a friend (beware, N=1 study) with zero prior programming knowledge that has released his first app.
In business: using coworking tools to review and propose filing of emails; manage my files and folders; on a daily basis scour the intranet for interesting and relevant content.
Personal: my wife tutors in her native language to non-native primary and high school kids. They are all using these tools now to generate fresh content for practice based on school lesson plans. The kids are improving much more quickly now than they were just a few months ago.
As someone who works somewhere where the intranet is a bit of a jungle: what tool do you use to scour the intranet?
Thanks!
Copilot Cowork in the M365 ecosystem. It inherits all the permissions from my account, has access to exchange to send me emails, and OneDrive to save each day’s summary for posterity and future refinement.
Thank you, I will try to find it. Thanks!
I think Claude Cowork through the Microsoft thing which was copilot but is now named M365 (or something?) is likely creating every PowerPoint presentation within our organisation at this point.
We have whatever AI is in Teams transcribe every meeting, and it's scarily good at it. It's also extremely good at summarizing or finding things from previous meetings when tasked. One disadvantage of this is that I can see how stupid I sound in writing. I'll go "yeah, hmm, yeah, that's, yeah", but it really is pretty good.
I assume we're going to see a massive increase in AI with this Cowork inside the Microsoft client. We actually have a better tool available through a Librechat where you can create and configure your own agents with the same filesystem access to your OneDrive, and a lot more tools and models than just Claude. Almost nobody has been capable of figuring out how to use it though, so they've been using the regular office365 copilot and it sucks so bad that a lot of people stopped believing in AI.
It's ironic that Microsoft, despite fumbling the ball on AI, is very good with enterprise customers (especially non-IT), which means that they'll likely be the company which is going to sell us AI tools that people will actually use. I have no idea why it's so hard for people to pick up the Librechat tool we're given access to through our equity fund. It's quite literally a copy of ChatGPT where you can point-and-click configure an agent, but we're seeing that even employees who use a lot of ChatGPT privately don't use this tool professionally. Meanwhile everyone has been capable of using the Microsoft thing (that I personally think is less user friendly since you will need to add your configuration files to every prompt).
"I have no idea why it's so hard for people to pick up the Librechat tool we're given access to through our equity fund"
That's because M365 is integrated with the whole Office/Exchange environment, especially in terms of security policies, etc. MS also guarantees that the data is private, which is very important for many companies both from the IP protection perspective and because of the liability for exposing user/customer data (think of GDPR regulations in Europe).
I don't know who is behind Librechat, probably some good and friendly folks, but when it comes to privacy/security Microsoft has much more to lose, and if shit happens it is easier to sue them than some random VC-financed company from the USA.
My day job is not in the tech industry. I am an editor. Literally nothing has changed for me in the last four years.
As a former data scientist, I started to use a code agent 3 months ago. Before that, I used chat completion on the web. Now I do nearly everything that outputs documents with a code agent.
Can you give a sanitized example or a hypothetical scenario of what you mean by “output documents with code agents”? Thanks.
I’m not him, but I’ve started using them to do the analysis (SQL, Python etc.) and then output the report as Quarto HTML which can be hosted on GitHub Pages. It works well for this analysis style work.
Once I was going to send some figures to leadership so I checked the queries myself and not only had it done it correctly, but it had also included a lot of sanity checks with other places in the database which as a human I doubt I’d have had the time or inclination to do.
Even for modelling work it can be good to check your ETL queries, or write one itself and then check it etc.
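As an illustration of the kind of sanity check described above, here is a small sketch that cross-checks a report query against a second table before the numbers go out. The schema (`orders`, `ledger`), the data, and the function names are all made up for the example, and it uses an in-memory SQLite database purely so that it runs anywhere; a real warehouse would look different.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database with hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL);
        CREATE TABLE ledger (order_id INTEGER, amount REAL);
        INSERT INTO orders VALUES (1, 'north', 100.0), (2, 'south', 250.0), (3, 'north', 50.0);
        INSERT INTO ledger VALUES (1, 100.0), (2, 250.0), (3, 50.0);
        """
    )
    return conn


def report_by_region(conn):
    """The figure that would go to leadership: revenue per region."""
    rows = conn.execute(
        "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
    ).fetchall()
    return dict(rows)


def sanity_checks(conn, report):
    """Cross-check the report against an independent source in the same database."""
    ledger_total = conn.execute("SELECT SUM(amount) FROM ledger").fetchone()[0]
    orphans = conn.execute(
        "SELECT COUNT(*) FROM ledger l LEFT JOIN orders o ON o.id = l.order_id "
        "WHERE o.id IS NULL"
    ).fetchone()[0]
    return {
        "totals_match": abs(sum(report.values()) - ledger_total) < 1e-9,
        "no_orphan_ledger_rows": orphans == 0,
    }


if __name__ == "__main__":
    conn = build_demo_db()
    report = report_by_region(conn)
    print(report)
    print(sanity_checks(conn, report))
```

Checks like these are cheap to ask an agent for and cheap to read, which makes them easier to review than the main query itself.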
Somewhere right now some human artist is being tasked with drawing illustrations of pelicans riding bicycles to be used as training data at a big AI lab.
Every modern image-generation model can generate a pelican on a bicycle trivially. The point of the test is to generate SVG text that represents an image, which is more complicated.
Yes, there are ways to convert raster images to SVG for use in training data but it's not a good use of anyone's time.
I don't understand this response. Human artists can and do make SVGs.
I wouldn't wish creating a svg pelican on a bicycle on my worst enemy
> Every modern image-generation model can generate a pelican on a bicycle trivially.
Mistral seems to be the exception. Their new model from a few weeks ago is worse than self-hosted Gemma.
The quality of the Gemini pelican was such a step change in one iteration, while the other benchmarks remained quite flat, that I think you are right. Although whether they targeted Pelicans in particular or just svg, I can't say.
> and there’s zero chance any AI lab would train a model for such a ridiculous task.
I'm not sure that's true anymore considering how popular Simon's blog is
> So maybe the AI labs have been paying attention after all!
> I think this mainly demonstrates that the pelican on the bicycle has firmly exceeded its limits as a useful benchmark.
As acknowledged in the article.
Gemini 3.1 basically takes it home on that benchmark, anyway, it's done.
It's practically a benchmark now. Some friends have been specifically training models to count the R's in "strawberry"
Simon mentions further along in his article that given Jeff Dean’s post referencing the pelican-riding-a-bike task (and how good current models are at doing it), that it’s no longer a great benchmark to use. Enter the opossum riding an e-scooter!
Banana man on the Segway
That bit probably works better in the talk, it was a setup for a joke later on.
All I see is mention of how various models generate image of "pelican riding bicycle(s)"
Yes, the "pelican riding a bicycle" is the ultimate test of not understanding how LLMs work.
Well, a combination of that and believing that replication of test data is a good measure of progress.
We all know the true test of AI is Will Smith eating spaghetti.
If you're a vulnerability researcher or a security person generally, there's a big inflection point from Spring of this year.
Whether it turns out to be a good change or not remains to be seen.
The half-full view is that the models are so good at finding vulns that if you plug them into your build-pipeline then the amount of new vulns introduced will go down towards zero.
The half-empty view is that we're now producing more junior-level code with less review, so everything will have more vulns; also, it's cheaper and easier to find them, so prepare for chaos.
Short term there is sure to be chaos either way as the models are clearly good enough to find all the old bugs, and not everyone has the resources or will to try to stay ahead of the curve like Mozilla is trying to do with their Mythos access https://blog.mozilla.org/en/firefox/ai-security-zero-day-vul...
There's a major caveat to the half-full view: You'll only stop adding new vulns that your model can find.
A threat actor with access to a better model or more money to burn on tokens may yet find more. Some of them have deep pockets, and not nearly every project will get the Glasswing treatment of free Mythos tokens.
I'm a security person and would love to hear other people's input here as I don't have that much experience with this
Can you be more specific?
Three deterministic Linux LPEs in a week, an LPE in BSD in execve (of all things...), nginx vulnerabilities, one or two new gnarly supply chain attacks. Linus noting that the linux-security mailing list is getting flooded with duplicated, AI-driven reports of varying quality. There are pretty crazy keycloak vulnerabilities getting discovered.
We're most likely entering a year or two of rapid vulnerability discovery, patching, as well as reducing and minimizing system footprints just to survive the onslaught of strange vulnerabilities from e.g. ancient and widely unused kernel modules.
The Claude Mythos / Project Glasswing thing is real: https://www.anthropic.com/glasswing
I met a few people at PyCon this week who have been part of Glasswing (they're just starting to be allowed to talk about it) and it really does drive down the cost of finding vulnerabilities.
I've been collecting notes on that here: https://simonwillison.net/tags/ai-security-research/
People in my company sounded underwhelmed by it. It was usually finding issues that came from not understanding the deployment (or not being fed that info).
A friend of mine had hands-on experience: it’s not the intelligence of it, it’s the speed.
You used to have a couple of days to close a breach, now it's 2 hours.
Wouldn't it drive up the cost of finding vulnerabilities when all the low hanging fruit has already been scanned and patched? Like the new baseline for finding a vulnerability will be something an LLM couldn't find.
Broadly, I'm talking about the shift from building elaborate vulnerability research harnesses towards using the frontier models and their RL-optimized harnesses to build simpler vulnerability discovery pipelines. And then: the ensuing carnage.
Not op but just look at HN posts in the last couple weeks: supply chain worms, zero-day LPEs for all OSes seemingly every other day, researchers on X and here openly saying they’ve got more valid findings than they know what to do with
Are you referring to Claude Mythos?
Something that’s largely been ignored: DeepSeek has made context caching virtually free with V4-Flash.
December 2025 was the breakthrough for me. January Claude was euphoric, ChatGPT was up there. February Gemini cooked for a second there. March amazing. April the big bad nerf. May GPT 5.5 is just pure bliss, although with 2x limits temporarily; not sure about Claude, it's sort of okay, still not as good as it felt before, slowly increasing limits with more compute and rebuilding good will.
I find your emotional language truly quite fascinating. I've heard people talk like that about drugs.
I actually thought it was a joke comment, but I'm worried now that it's not the case.
Similarly, I've heard people talk like that about things that are not drugs.
You can get a dopamine rush from anything, from drugs to using LLMs.
I think Opus 4.6 at its peak was the "how can anyone not get that this is good" for me.
Then the nerf, and the massive uplift in tokens for 4.7, a model which I find lazy and prone to hallucinate.
It's probably time to try GPT5.5. Like many, I'm pretty heavily invested in the Anthropic ecosystem at this point, which I suppose gives another strong reason to make the switch.
I was a dedicated Claude user but in March/April I started using GPT5.5 on a new project that Claude had tried and failed to execute successfully. GPT knocked it out of the park, and was able to do it within my subscription allocation of tokens. I'd recommend giving it a go at least. Something like OpenClaude can let you use the Claude tools you're used to.
I only used Claude for the first time in April; previously only ChatGPT and Gemini. And I struggle to see what the hype is all about. Yes, it seems a tiny bit smarter than the pack, but on the $20 subscription it runs out of tokens in 5-20 minutes, and then you need to wait 3-4h.
ChatGPT 5.5 seems capable, although a bit stingy with “thinking” compared to earlier models, and I never run into session limits.
I couldn't imagine using CC on the basic tier!
Even operations and GTM are all at "professional" level (which I think is vaguely equivalent to 5x).
Haven’t noticed much significant progress in LLMs myself in 6 months (significant as in new or vastly improved capabilities or understanding, not new releases, there are plenty of those).
I feel like if anything people started to realise the significant limitations of LLMs when you try to use them as ‘agents’ which was the big direction LLM companies tried to push recently.
Best use of LLMs so far IMO is finding vulnerabilities (with human help) and pattern matching in other domains. For generating code and prose they are still mediocre and somewhat unreliable and for use as personal assistant agents I wouldn’t trust them.
So what’s happening with openclaw, the biggest experiment in agentic, vibe coded by the agents themselves? The thing that was so hot a few months ago.
https://github.com/openclaw/openclaw/pulse?period=daily
279 commits to main from 77 authors in the last 24 hours.
Why is there so much churn, and how could you trust it with your data? These are the changes from ONE day!
If these are useful changes, surely it’d be superhuman by now given months of this pace.
What are people using this for?
What real world problem is closely linked to the skill of drawing a pelican riding a bicycle?
> Google released the Gemma 4 series of models, which are the most capable open weight models I’ve seen from a US company.
Implying another country has a better model? I'm being pokey here because I'm very curious! I know Gemma is efficient, but I also remember Qwen and Kiwi being referred to as optimized. The difference being that Gemma uses fewer tokens, but maybe Qwen/Kiwi's quality is higher? I don't know.
The consensus right now is that Qwen3.6 in its 27B and 35B-A3B versions is better for coding whereas Gemma4 is stronger when it comes to OCR, audio transcription and the like. Margins are slim though and the harness at these model sizes is the most important factor.
In my experience the Qwen models are best locally, but the Gemma ones have always been good. Gemma 4 is a notable improvement.
My goalpost for "AI will definitely replace most SWEs" was to reproduce a particular 90s programming game in one shot and then add multiplayer support with minimal prompting.
Opus 4.5 hit that point in November.
I tried this a while ago, haven’t tried again recently. The models were producing code that was clearly lifted from stuff in their training data, and what I ended up with was a fairly decent game in html and js after a bit of tidy up, though it felt like several code paradigms smooshed together rather than a coherent whole, but it mostly worked. Not something I’d want to maintain but it was impressive at the time.
They were able to one-shot famous games (like Asteroids or Pong), I suspect because they had been trained on multiple versions of those games. So, like reproducing Harry Potter, with the right prompt it was able to produce a license-stripped version of code it had seen. I tried another arcade game like Frogger and it failed really badly and took a lot longer; I never got it working.
The whole exercise left me feeling they have a long way to go, I don’t see how anyone could think they would replace SWE unless they didn’t look at the code produced, even now.
Why is there no talk about how the world is already run by AI by proxy? I.e. bureaucrats using ChatGPT to make their speeches, decisions, shopfront designs, etc. I just don't seem to read about this; instead it's more about this nebulous specific date in the future.
It definitely seems like the point of no return has been passed.
The size of the codebase doesn't matter anymore. In fact, I am finding that the larger the codebase the better the performance. Starting from scratch with vague ambition is not the same as solving a specific stack trace over a mountain of decade-old code. The latter performs better and is also more exciting for the business. It would seem more callers = more constraints to verify against.
For the last 3 months I've felt like I've been dropping GPS-guided bombs from orbit. No one can tell the difference between AI-authored and my handwritten code, other than via the implication of the radically increased daily work volume. There's definitely AI in there, but it's like a homogeneous cybernetic blend of my work and the computer's. I own all of it, can explain all of it immediately, but I only wrote maybe 10% of it by hand.
The development team should be mostly "solved" by now with regard to the AI transformation. If you are still at Home Depot picking out your proverbial hammer, it's time to start heading for the self checkout. The rest of the business is where the real money and headlines will be made at this point. AI writing code is ancient news now. Custom harnesses that business people can use to automate workflows will print a lot more money. Bringing some bacon to the rest of the business may also help to preserve your career path in these uncertain times.
Remember what Jobs said about the customer. A lot of times, people don’t know what they want until you show it to them. Most people wouldn't have believed the iPhone was even remotely possible until the moment it was publicly revealed and made available for purchase. I am finding the same effect in the business with AI. What it can actually do when well engineered and applied to the domain will usually outperform the expectations of its users by a wide margin. All these fears about alignment, hallucinations, cost, ethics, the environment, my ego/career, etc., seem to melt away like some kind of luxurious chocolate once the performance becomes clear to the executive staff. I was able to convince the board with an unsolicited, 5 minute demo I didn't even personally deliver. I've never seen these people sign contracts so quickly.
What are your thoughts on software engineer replacement? My team has already seen big reductions. The QA team is gone. Software engineers were reduced by a third. Scared for the future.
Ditching the QA team when the single highest challenge is verifying that vibe-coded systems do what they're meant to is extraordinarily short-sighted.
Personally, the more time I spend working with coding agents the less worried I am for my career. Getting the best results out of them is really hard. They amplify existing skills and experience, so the more experience you have the better.
I believe that many of those saying that they "never write code anymore" or are experiencing "10x productivity," are heavily underestimating (or outright misrepresenting) how much they are guiding the model, and ignoring everything else that goes into shipping fit for purpose software. I frequently see zero measurements or factual arguments supplied to support such claims. I also see many people say that they are "vibe coding," when they are almost certainly reviewing, editing, or otherwise steering the output.
I wonder why there is such a mad dash to trump up the capabilities of coding agents. And why such loose terminology and lack of rigor? I thought programmers were supposed to be rational people (har har!)
Have you seen the automated tests that QA members deliver? My experience is that they are horrible, and it's not so hard to beat that low quality bar with an LLM.
I have a theory: if they were good at writing automated tests, they would have been developers instead of QA engineers.
Not saying that there aren't any high quality QA engineers, I worked with some. But LLMs raised the bar in a way that most QA engineers can't reach.
Huh, never thought about QA writing unit tests.
In my limited experience they write test cases, test each story, do regression testing, and verify bugs from customers. All by hand.
At my current job I wouldn't want to be without them.
If you're famous, you'll be fine. If you're at retirement age, you don't care. Otherwise, good luck! We put ourselves on the street by not protesting what is happening.
I think there will be larger markets, more companies, more jobs than before due to AI, but also a very painful transition period.
AI reduces the cost of producing software (and other intellectual tasks), which greatly improves the viability of more and more ambitious projects. As far as we know, the number of problems software (and humanity) can solve is unbounded.
It feels like the market has shifted in SWE yet again to heavily prioritize a new set of skills, of which those in the top quartile are desired more than ever.
The problems in any domain are infinite. But, alas, money is not.
What are these skills?
This is the magic question that I'm very eager to hear the answer to.
Fundamentally, steering LLMs requires the same structured, logical thought process that is required to write code, regardless of abstraction level. Unlike what HN would have you believe, this is not a skill that is equally distributed across the population.
But given the rapid pace at which this technology is evolving, "steering" may very well be ceded to the clankers. LLM agents are fantastic at logical reasoning & any inefficiencies relative to human experts can be circumvented by sheer compute.
Being able to work with an infinite amount of dumb interns that work super fast and have a vast amount of knowledge.
Am I crazy, or are these differences between the best models so marginal that you’d get roughly the same performance if you use the same high-quality harness (ie preloaded instructions from md files, including custom skills)?
You will immediately notice the difference if you use it at the threshold.
It's like most people watching a 'starting NBA player' (not a superstar, just a starting player) vs. one that sits on the bench.
If you were to just watch them play, work out, shoot, you'd never notice the difference.
Put them head to head and it's 98-54 and you start to see the patterns.
It's pretty interesting actually, someone tell me what the 'science' for this is, I'm sure there is some kind of information theory at work here.
Software has innumerable kinds of problems at varying levels of complexity, so it provides the perfect testbed for seeing how far models can go in practice.
Should add: you're very right to hint that the harness, tooling, and models tuned to both the harness and the kinds of things people do on the harness, as well as some other things, do make an enormous difference.
By and large, SOTA Codex/Claude Code are substantially better - at least for now. That may change.
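One possible way to put numbers on the "you only see it head to head" point (an illustration added here, not something the commenter claimed): if a long task is a chain of steps that all have to go right, a small per-step edge compounds. The sketch below assumes steps are independent and that a single failure sinks the whole task, which is a simplification, and the 99% and 95% per-step rates are made-up values, not measurements of any model.

```python
def task_success(per_step_success: float, steps: int) -> float:
    """Chance of finishing a task in which every step must succeed.

    Simplifying assumption: steps are independent and there is no
    recovery from a failed step. Real agent runs are messier.
    """
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_success ** steps


if __name__ == "__main__":
    # Hypothetical per-step rates for two models that look alike on short tasks.
    for steps in (1, 10, 50):
        strong = task_success(0.99, steps)
        weak = task_success(0.95, steps)
        print(f"{steps:>2} steps: {strong:.3f} vs {weak:.3f}")
```

On a single step the two are nearly indistinguishable; under these assumptions the gap only becomes obvious once the task is long enough, which would fit the observation that the difference shows up "at the threshold".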
Head to head is interesting. I had not tried 2 agents on the same task simultaneously with 2 models.
You have correctly identified that getting a "high-quality harness (ie preloaded instructions from md files, including custom skills)" is the (or at least a) hard part.
Because you have to adapt the harness to your problem space before you can call it high-quality.
Many people will stop that discussion at the claude code vs. codex vs. opencode level and then merge that with discussing model performance.
And that is also why "Generate an SVG of a pelican riding a bicycle" is still a benchmark worth discussing. Because at least it is a defined problem space.
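For readers who have not looked inside a harness, here is a minimal sketch of the "preloaded instructions from md files, including custom skills" part. Everything in it is an assumption for illustration: `build_system_prompt` is a hypothetical helper, the `skills/` directory layout is invented, and real tools such as Claude Code or Codex do considerably more than concatenate files.

```python
from pathlib import Path
from typing import Sequence


def build_system_prompt(
    root: Path,
    instruction_files: Sequence[str] = ("AGENTS.md", "CLAUDE.md"),
    skills_dir: str = "skills",  # hypothetical layout, not a standard
) -> str:
    """Concatenate project instruction files and skill files into one prompt.

    Missing files are skipped, so the same function works for any project.
    """
    sections = []
    for name in instruction_files:
        path = root / name
        if path.is_file():
            text = path.read_text(encoding="utf-8").strip()
            sections.append(f"## {name}\n{text}")

    skills = root / skills_dir
    if skills.is_dir():
        # Sort for a stable order, so two runs see the same prompt.
        for path in sorted(skills.glob("*.md")):
            text = path.read_text(encoding="utf-8").strip()
            sections.append(f"## skill: {path.stem}\n{text}")

    return "\n\n".join(sections)
```

Feeding the same output of a helper like this to two different models is one way to hold the harness constant while comparing them; the hard part the comment describes, writing instructions that actually fit the problem space, is not something code like this solves.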
No you're not wrong. Many people will see what you see. Enthusiasts will see it as monumental squeezing out that last drop of performance. In my opinion I think it is okay for enthusiasts to feel that way. I'm just satisfied with getting a tool as an aid.
Personal opinion: we need to focus more on efficiency instead of how large or complex a model can get, as that model creeps into more resource requirements. If the goal is to cost a billion dollars to operate, then we've really lost the idea of what models are supposed to be achieving.
To an extent. I've had GPT 5.5 solve problems that Opus 4.7 struggled with, using an identical AGENTS.md/CLAUDE.md and no skills.
The difference is very noticeable as your codebase gets bigger and you give higher and higher level tasks.
I've certainly had things that Opus fixed using some kind of workaround that GPT-5.5 actually solved.
And the difference between the Sonnet/Gemini/DeepSeek tier and the Opus/GPT-5.5 tier is immediately obvious.
By definition the differences between "best models" are small. It's a tautology. If a model is significantly dumber than the others then it's not one of the best models.
"there’s zero chance any AI lab would train a model for such a ridiculous task"
Hmm, given how small the nerd community is and how often I've come across that task either on Hacker News or on various Substacks, I am not so sure that the AI labs would ignore it completely.
Also LinkedIn wars of people trying to claim the throne as most AI-pilled, throwing down strawman stories of luddites yelling at data centres who'll lose their jobs to a single person doing 100x work.
Spot on. Building our tool, we found AI is magic at scraping competitor data, but terrible at market validation. The 'why' is strictly human.
So, the best way to use LLMs is to wait for your competitors to do market validation and then scrape their data.
Hmmm......
It's always been much easier to copy an existing product than to make a new one nobody's thought of before.
Sorry, but how does this comment relate to the post?
As someone who uses AI daily (not in agent mode, just user-interactive), I have definitely noticed major quality improvements over the past few months. And that's surprising, because when you use something daily, you tend to overlook the big jumps.
I haven't looked into any sort of "agent" mode, just because I don't yet quite trust the AI not to do something dumb. Also, I don't use M365, where Copilot is integrated, so I suppose I would have to set it up myself.
Is RLVR the key breakthrough for the uplift or is there more to it?
Does that suggest the uplift was only for things that are easily verifiable like code?
Yes, with good RLVR at scale you can greatly improve performance, especially on benchmarks.
The hope was that good RLVR on relatively contrived datasets (like benchmarks) would generalize to good software taste, which has somewhat succeeded, but the models also still fail in horrible ways.
And the hope beyond that is that good skills in fundamental problem solving tasks (coding, math) would generalize to tasks beyond math and code, which did happen but less so.
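To make "verifiable" concrete, here is a toy sketch of the kind of reward signal being discussed: the score comes from running checks rather than from a human rating. This is an illustration under stated assumptions, not how any lab actually implements RLVR; `verifiable_reward` and the example snippets are hypothetical, and running untrusted code with `exec` like this would need a real sandbox.

```python
from typing import Any, Sequence, Tuple

Case = Tuple[tuple, Any]  # (positional arguments, expected return value)


def verifiable_reward(source: str, func_name: str, cases: Sequence[Case]) -> float:
    """Score candidate code in [0, 1] by the fraction of test cases it passes.

    Code that does not compile, does not load, or does not define
    `func_name` scores 0.0.
    """
    try:
        code = compile(source, "<candidate>", "exec")
    except SyntaxError:
        return 0.0

    namespace: dict = {}
    try:
        exec(code, namespace)  # unsafe outside a sandbox; fine for a toy
    except Exception:
        return 0.0

    func = namespace.get(func_name)
    if not callable(func) or not cases:
        return 0.0

    passed = 0
    for args, expected in cases:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crash on this case counts as a failure
    return passed / len(cases)


CASES = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]
CORRECT = "def add(a, b):\n    return a + b\n"
COMPILES_BUT_WRONG = "def add(a, b):\n    return a - b\n"
DOES_NOT_COMPILE = "def add(a, b) return a + b"
```

Here `COMPILES_BUT_WRONG` is valid Python yet passes only the `(0, 0)` case, which is why a reward based on tests is a stronger signal than one based on compilation alone, and also why the reward is only as good as the test cases behind it.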
I would say that most improvements are in easily verifiable things like code or math. At least that's where all the amazing results seem to be coming from.
Other domains I am not sure about, but I've heard from people like Cal Newport that the rate of increase outside of code and math is not as impressive.
We're gonna find out RL will get abandoned cuz we don't even know what is getting "aligned". Just my naive gut feeling, don't take it seriously.
Apart from GLM 5.1 and Qwen 3.6, there are other Chinese models that are noteworthy: Kimi K2.6, Xiaomi MiMo V2.5 Pro, DeepSeek V4 and MiniMax M2.7.
100% true - I only had five minutes so I had to edit it down to just a couple, but all of those models are excellent and keep leap-frogging each other.
Looking forward to next time, hoping you mention speculative decoding and MTP :)
It would support your point about the performance of 20GB local models.
'Producing Images' or even 'Some Code that is Valid and Compiles' is in some ways one of the most misleading ways we assess quality of the AI.
It is getting very good at producing code that compiles - at the algorithmic level.
This is definitely noteworthy - and the AI is crossing a critical 'productivity threshold'.
But 'Drawing of a Proper Duck' is almost arbitrary because it may have nothing to do with the 'Specific Duck You Wanted'.
Everyone has tried to get AI to 'Draw The Thing They Want' and you notice immediately how it's almost impossible to 'adjust the image' along the vector you want - because ... and this is key:
-> the AI doesn't really understand what a Duck is, its components, or fully how it made the duck <-
It just knows how to 'incant' the duck.
This becomes very clear when you try to get the AI to write proper documentation - it fails so miserably, even with direct guidance.
This is really strong evidence of how poorly the AI is generalizing, and that it is not 'understanding'; rather, it's 'synthesizing' from patterns.
We already kind of knew that - but we have not yet built an intuition for that until now.
Only now can we see 'how amazing the pattern synthesis' is - it's almost magic, and yet how it falls off a cliff otherwise.
This has deep implications for the 'road ahead' and the kinds of things we're going to be able to do with AI.
In short: the AI is 'Wizard Level Code Helper, Researcher, and Worker' - but it very clearly lacks capabilities even one level of abstraction above the code itself.
LLMs were first trained on 'text' and now ... they are 'trained by our compilers'. Basically g++, javac, tsc are the 'Verifiable Rewards' in the post-training and reinforcement learning - and the AI is getting extremely good at producing 'code that compiles', but that's definitely an indirection from 'code that does what we want'.
It's astonishing that it took us all this time to internalize and start to discover what I think will be in hindsight a very obvious 'threshold' of its capabilities.
We are constantly 'amazed' at the work that it can do, and therefore over-project its capabilities.
I have no doubt that even with these limitations - the AI will unlock a lot more as it gets better - and - that it will 'creep up' the layers of abstraction of its understanding.
But I strongly believe that the AI is going to get much 'wider' (pattern matching dominance) before it gets 'higher' (intrinsic understanding) - and - that this may be a fundamental limitation.
This may be 'the LeCun' insight - when he talks about the limitations of LLMs in detail - I believe this is that insight writ large.
Even the term AI - or certainly 'AGI' may be a misleading metaphor - were we to have always called it 'Stochastic Algorithms' or something along those lines, it's possible that our intuition would be framed a bit better.
The most interesting thing is how it is definitely amazing, world changing, novel and powerful in some ways - and obviously useless in others at the same time. That's the 'threshold' we need to better understand.
> But 'Drawing of a Proper Duck' is almost arbitrary because it may have nothing to do with the 'Specific Duck You Wanted'.
That might be the case, but Simon's case "Generate an SVG of a pelican riding a bicycle" is very different.
The model actually has to understand what parts of a pelican and bicycle come together in something like an anatomically plausible way. That's a higher level of abstraction than something like passing the same prompt to Stable Diffusion etc.
(The new Nano Banana/GPT Image 2.0 models are different though - they have significant world knowledge baked in)
"That's a higher level of abstraction"
No, it's not, because it's seen 'anatomy' for Pelicans, Animals - even how it's represented in Animals.
If you try to get the AI to actually decompose it and start to 'draw pelicans' in very obscure ways, it will immediately fail.
Try to get the AI to draw the pelican from a very odd angle - like underneath, to the right, one wing extended, one wing not ... 0% chance.
Precisely because it does not understand those things.
FYI it's a slightly unfair case because it does not have a 'world model' yet, which will actually solve that problem, but even then not through very much abstracting.
We're a long way away - but in the meantime, there's lots to unpack.
> Try to get the AI to draw the pelican from a very odd angle - like underneath, to the right, one wing extended, one wing not ... 0% chance.
Proof by existence?
https://gist.github.com/nlothian/50241d34a654fcf0caa280d4475...
Looks pretty good to me. ChatGPT in "Thinking" mode.
Edit: I've added the Opus version on the same link.
Those are just awful compared to the side view of a pelican on a bike.
Are we a long way away?
https://chatgpt.com/share/e/6a0bf28b-e198-8012-9a88-c777d965...
Link doesn't work - maybe not public?
> That might be the case, but Simon's case "Generate an SVG of a pelican riding a bicycle" is very different.
When it was new, sure. Right now, models can be trained on that because everybody uses it as a benchmark.
Wow! Actually a sensible comment under all the astroturfing that even this place is so full of now.
Starting from zero today, how would someone quickly get up to speed with the latest and greatest AI tooling on an extremely limited budget?
Is the only choice to pay for the "max" plans?
Or just read so much about it that you bs your way through an interview and then use the company's resources?
Simon, I'm curious too: how much do you invest each month researching all the latest and greatest AI tech?
Opencode has free access to Qwen 3.6 and Deepseek v4 Flash right now.
They're on par with Claude and Codex imo - when you still design architecture and know what the output should be. Claude and GPT 5.5 need less guidance with vibe coding, but we're not yet at a point where that's sustainable anyway even with those models.
The $20 ChatGPT pro plan gives pretty generous usage of both Codex and general chat.
Ah I'd read so much about the downgrading of that plan I didn't think that was still true?
It depends on what you’re comparing it against. For $20, OpenAI is still probably the best value for SOTA models. In terms of limits, you can use GPT-5.4 instead of 5.5. The intelligence feels similar, but it’s cheaper. You can also experiment with other harnesses like pi. It’s lightweight but capable enough, and its token usage is definitely much more efficient.
The claw thing really came and went fast lol
I just started a new job and the person I report to was just excited to tell me about it, here in mid-May.
"and then you have to get a mac mini, and then, and then"
smile and nod, it pays weekly
I mean yeah? It was a marketing campaign to boost the model providers and give Steinberger a cozy job at OpenAI. Hook, line and sinker.
Wake me up when we have an agent with constant learning and changing weights that I can have personally, not some LLM that can always fall prey to jailbreak and context injection attacks.
You think most of this stuff here is organic? Oh boy..
I think that there's a lot to be improved in harnesses and the way the models are interacting with harnesses. For example, the harness should be able to steer the model when thinking.
I'm so glad Simon is documenting this. The field is evolving so fast, so rapidly, so hungry for data and money, that few are willing to zoom out and document everything big picture so we can see the changes over time. I mean do you guys remember "Do anything now"? Just a distant memory, a funny party trick.
It's good to see dates being hard-coded re: improvements in the models that should deliver material gains.
As time progresses, one now has a yardstick to measure progress against. No more excuses - show me the money baby.
I met Simon for the first time this year at PyCon. Wow, what a great guy.
There's something fitting about the mystical nature of LLMs and scrolling through a bunch of goofy pelicans on bicycles representing report cards for the bleeding edge of technology.
How are these even graded? Qwen3.6-35B-A3B gets high marks for a pelican with a gaping hole in its bill?
edit: Just noticed its feet are disconnected from its legs as well (but right on the pedals!). Pardon my French but that's Chinese af.