122 comments:
This reminds me of Antirez's "Don't fall into the anti-AI hype" [0]
In a sentence: These foundation models are really good at optimizing extremely high-level, extremely well-defined problem spaces (e.g., multiplying matrices faster). In Antirez's case, it's "make Redis faster".
There have been two reactions: "Oh it would never work for me" and "I have seen months of my life accomplished in an hour", and I think they're both right. I think we should be excited for Antirez (who has since been popping off [1]), and I think the rest of us should rest easy knowing that LLMs can't (and maybe were never meant to) tackle the tacit-knowledge-filled, human-system-centric, ambiguously-defined-problem-space jobs most mortals work.
[0] https://antirez.com/news/158 [1] https://antirez.com/news/164
>I think the rest of us should rest easy knowing that LLMs can't (and maybe were never meant to) tackle the tacit-knowledge-filled, human-system-centric, ambiguously-defined-problem-space jobs most mortals work
I don't believe that anymore, to be honest. Models are starting to get good at ambiguity. Claude Code now asks me when something is ambiguous. Soon, all meetings will be recorded, transcribed and stored in a well-indexed place for the agents to search when faced with ambiguity (free startup idea here!). If they can ask you now, they'll be able to search for the answers themselves once that's possible. In fact, they already do it now if you have a well-documented Notion/Confluence; it's just that nobody has one.
It's probably harder to RL for "identify ambiguity" than to RL for performant algorithms, sure, but it's not impossible and it's in the works. It's just a matter of time now.
> Models are starting to get good at ambiguity
That's fair, and something I've observed too. I wish I had written "the rest of us shouldn't freak out and quit software today".
But here's another data point: At the biotech I work for, writing good code has never been the bottleneck. I actually told my boss that a paid Claude subscription vs. the free one wouldn't be that much value, because even if it took every piece of code or algorithm we've ever written and 10x-ed the hell out of them, we'd still be bottlenecked by the biology and physics, which dictate that we wait 24 days for our histology assay pipeline.
I have a hunch most fields outside of software are this way. And I'm personally not planning to quit anytime soon.
Ok, but your job is clearly not a good sample for a "job most mortals work".
> Soon, all meetings will be recorded, transcribed and stored in a well-indexed place for the agents to search when faced with ambiguity (free startup idea here!)
We were doing that over at Vowel a few years back; unfortunately, it didn't pan out because you're competing directly against Zoom, Google Meet, Microsoft Teams, etc. They are all (slowly) catching up to where we were as a scrappy startup 4 years ago.
It was truly game-changing to have all of your meetings in an easily searchable database. Even as a human.
Tacit knowledge is definitionally not recorded in any of these systems. This proposes to solve the problem of tacit knowledge by getting rid of it. It is not clear to me if that solution is either possible or desirable.
The labs are spending hundreds of millions of dollars hiring people doing many fairly random (but economically valuable) tasks to collect this tacit knowledge for RL.
It works really well.
It ceases to be tacit as soon as it is collected.
Maybe this rephrase will help: the proposed solution is to render all knowledge explicit.
> It ceases to be tacit as soon as it is collected.
I'm not sure.
If it is collected via preferences, then it isn't necessarily something that can be communicated (except in the LLM's latent space).
That still feels tacit to me.
To simplify that argument, the relationship between King and Queen in the Word2Vec latent space can be easily explicitly labelled.
But the relationship between Napoleon and Tsar Alexander I also exists, encodes much of the tacit knowledge about their relationship, and isn't as easily labelled (e.g., Google AI Mode says "Napoleon I and Tsar Alexander I had a volatile 'bromance' that shifted from mutual admiration to deep animosity, acting as a defining conflict of the Napoleonic Wars").
Word2Vec is a very simple model. In a more complex LLM, that deeper knowledge can be queried by asking questions, but you can never capture it all. Isn't that what "tacit knowledge" is?
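For anyone who wants to poke at the distinction, here's a minimal sketch using gensim's pretrained GloVe vectors (the model name is just a small convenient choice, and the exact neighbors will vary by embedding):

```python
# The explicitly labellable relation vs. the "tacit" one, in vector space.
import gensim.downloader

kv = gensim.downloader.load("glove-wiki-gigaword-50")  # small pretrained embeddings

# The explicit, labellable relation: king - man + woman ~= queen.
print(kv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# -> typically [('queen', ...)]

# The "tacit" part: the Napoleon/Alexander relation exists as geometry too,
# but there is no clean arithmetic label for it; we can only probe it.
# (Assumes both tokens are in this embedding's lowercase vocabulary.)
print(kv.similarity("napoleon", "alexander"))
```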
So self-chosen total surveillance and transparency, so your fav LLM can be better?
Could always use a local LLM for stuff like that. One of my relatives works for one of the big audit firms and that's what they do.
Sure. Still, from what he said, your company wants every communication from you stored somewhere, ready for analysis. I don't think unfiltered data acquisition is good; my interpretation and decision making are also part of my work. Also, meetings may share some personal details that I would never tell on the record.
Full transparency has a cost, and we cannot afford it.
Why record when it can be built in real time as the meeting is going on?
Slack is kinda there with Salesforce - you can do a lot already in Agentforce and Slackbot, but the two aren't integrated just yet and Slackbot doesn't support group chats/channels. One interesting aspect in this will be: who has superiority - boss, client, analyst, or developer?
In coding, the ambiguity is very, very limited and constrained compared to any non-dev job that involves any decision making.
That's... not even close to being the case. It's literally a series of ambiguous questions and strategic decisions.
Non-ambiguous is like a first semester algorithms class in university.
Unfortunately you can't record meetings in many jurisdictions, including court sessions. Hence we have to rely - for worse, or perhaps even for better - on human-driven note-taking.
You're downplaying the AI lobby here. They're eating away at copyright laws, something that seemed impossible just a couple of years ago. Screwing privacy laws is just the next step.
Also, we are seeing a cultural shift around that as well. Now people bring "AI notetakers" to Zoom calls without even asking for your permission. People are already acting like privacy laws don't exist anymore; it's going to be even easier for the AI lobby to take them down now. Just like piracy normalized copyright infringement, opening the path to the current rulings around "fair training".
Such invasive practices are pretty disgusting. But I don't think it will be pervasive. Once it spreads, AI vendors and abusive companies will be held accountable. There is also an obvious conflict: the surveillance will likely be very selective. Programmers have to record everything, while middle managers have a choice to sign off everything. Senior management will of course do whatever but have full insight into the data. This will create even more backlash. Of course the social culture will turn stone cold and hostile overnight with such installments.
Thanks for the downvote, anon. It's an inconvenient conversation.
I disagree but it wasn't me who downvoted, just so you know.
Yeah I wasn't accusing you. Was likely that you disagree, I can deal with that.
I have found Claude et al good at quickly and effectively implementing the algorithm I have in mind, as long as I ask lots of control questions and check the code. They aren't good at inventing non-mainstream algorithms, though, and often slip staggeringly short-term shortcuts in. They are still a tool and not yet the craftsman who wields tools effectively. This will steadily change, and the corners where the obscure algorithm wins will erode further too.
> I think the rest of us should rest easy knowing that LLMs can't [...]
What if (when?) (AI-assisted) research moves AI beyond LLMs? Do you think that can't happen?
Not in the next decade. Won't get funded.
Private investment in the US has grown from 100 billion USD in 2024 to almost 300 billion in 2025 [0]. Add public investments worldwide and private investments in at least China and Europe.
I'm pretty sure money is not going to be the blocker.
[0] https://hai.stanford.edu/ai-index/2026-ai-index-report
The money will go to LLMs.
Why not both? You don't need $1 trillion allocated before you have a proof of concept to demonstrate your non-LLM model, and once you have a PoC you will definitely have the larger investors interested.
You will need 100s of billions to make a viable PoC.
For a PoC? That sounds very unlikely. I think you’re off by at least 2–3 orders of magnitude
Let's wait 10 years and see.
You only need to train a range of small models in order to establish a plausible scaling law, IMO.
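And that's a cheap experiment: fit the usual power-law form to a handful of small runs and extrapolate. A sketch with made-up numbers (the model sizes and loss values below are purely illustrative, and the functional form L(N) = a*N^-b + c follows the standard scaling-law literature):

```python
# Fit a scaling law to a few small-model runs, then extrapolate.
import numpy as np
from scipy.optimize import curve_fit

params = np.array([1e6, 3e6, 1e7, 3e7, 1e8])   # model sizes (hypothetical)
losses = np.array([4.1, 3.6, 3.1, 2.8, 2.5])   # eval losses (hypothetical)

def scaling_law(n, a, b, c):
    return a * n**-b + c

(a, b, c), _ = curve_fit(scaling_law, params, losses, p0=(10.0, 0.1, 1.0))
print(f"fit: L(N) = {a:.2f} * N^-{b:.3f} + {c:.2f}")
print(f"extrapolated loss at 1B params: {scaling_law(1e9, a, b, c):.2f}")
```

Whether the fitted curve keeps holding at 1000x the compute is, of course, exactly the bet investors are being asked to make.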
Advanced Machine Intelligence (AMI), a new Paris-based startup cofounded by Meta’s former chief AI scientist Yann LeCun, announced Monday it has raised more than $1 billion to develop AI world models.
LeCun argues that most human reasoning is grounded in the physical world, not language, and that AI world models are necessary to develop true human-level intelligence. “The idea that you’re going to extend the capabilities of LLMs [large language models] to the point that they’re going to have human-level intelligence is complete nonsense,” he said. [0]
[0] https://www.wired.com/story/yann-lecun-raises-dollar1-billio...
Now check how much OpenAI got in their last funding round, and you have your answer.
I don't think it's valid to draw broad conclusions from the funding of a new company vs. an industry leader. If AMI builds something that looks impressive considering the funding they got, then they'll get plenty more in the next round.
He must be trolling.
AI is hands down the most researched topic in CS departments. Of the 10 largest companies (by market cap), only 3 aren't balls-deep in AI R&D. The fastest growing (private or public) companies by revenue are also almost all companies focused primarily on AI (Anthropic, OpenAI, xAI, Scale AI, Nvidia).
And the money isn't even the most important part. It's all about mindshare and collective research time. The architectural concepts can be researched and developed on top of open models, so even individual, relatively poor researchers unaffiliated with anything can make breakthroughs.
Even the computing required for the legendary "Attention is all you need" paper could probably be recreated on con-/prosumer hardware in a month's time.
$1B is what Microsoft invested in OpenAI in 2019 [0]. That was enough to get the ball rolling.
[0] https://en.wikipedia.org/wiki/OpenAI#Creation_of_for-profit_...
Why on earth would you start your AI startup in Paris? Of all places in western Europe, it's one of the hardest in which to find, attract, and keep talented people. The wages are super low, housing is expensive, and language is an issue.
Probably because LeCun is from there. But top AI talent needs to be paid top cash, and the taxes there are brutal, especially for high earners.
I mean, Google already has MuZero, which I'm willing to bet has evolved quite a bit in private, because if anything is going to get us closer to actual AI, it's that.
Realistically, one can build an AI capable of reasoning (i.e. recurrent loops with branches) using very basic models that fit on a 3090, with a multi-agent configuration along the lines of https://github.com/gastownhall/gastown. Nobody has done it yet because we don't know how many agents are required or what the prompts for those look like.
The fundamental philosophical problem is whether that configuration is possible to arrive at using training, or whether AI agents have to go through equivalent "evolution epochs" in a simulated environment to be able to do all that. Because in the case of those prompts and models, they have to be information-agnostic.
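To make the shape of that concrete, here's a hand-wavy sketch of such a loop (not gastown's actual API, which I haven't checked; `query_model` is a stand-in for any local inference call, and the role prompts and agent count are exactly the unknowns nobody has pinned down):

```python
# A "recurrent loop with branches": several small-model agents with different
# role prompts propose continuations, and a judge merges or terminates.
from dataclasses import dataclass

def query_model(prompt: str) -> str:
    raise NotImplementedError("plug in your local model (llama.cpp, vLLM, ...)")

@dataclass
class Agent:
    role_prompt: str  # information-agnostic role, per the comment above

    def step(self, state: str) -> str:
        return query_model(f"{self.role_prompt}\n\nCurrent state:\n{state}")

def reason(task: str, agents: list[Agent], max_loops: int = 5) -> str:
    state = task
    for _ in range(max_loops):
        branches = [a.step(state) for a in agents]      # branch
        verdict = query_model(                          # recur: judge & merge
            "Pick the most promising continuation and say DONE if solved:\n"
            + "\n---\n".join(branches)
        )
        if "DONE" in verdict:
            return verdict
        state = verdict
    return state
```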
I'd say it's a mix of:
1. Amazing, you just tweaked out 1% more efficiency.
2. You idiot, you just spent an hour trying to troubleshoot a hallucinated API.
On average, it's really hard to tell which one's going to win here.
It's not hard to tell at all: just look at how much it costs to run a 10T-param model (especially with parallelized agents). Those costs are not worth the occasional slot-machine-esque jackpot you get. For an entity like Google it might be worth it, but that's it. They definitely aren't going to let us use these things at the prices they charge now for much longer.
Imagine going back to 2020 and telling people that in 6 years they'd be able to spend $200.00 a month and spin up $2mm in GPUs at full throttle to respond to their emails. None of this makes sense.
You don't pay for a £200-a-month account to respond to your emails, and if you do, I would tell you that you're wasting your money.
I don't know, I guess it depends on a) how many hours per month you spend answering emails, and b) how much more revenue you could get in that same time. $200 should reasonably be 2-3 hours of work? So that's about the amount of saved time per month to break even on your subscription. It's a steal.
Whenever you solve any hard problem, you start off by finding a complicated solution, which you then scale down to a simpler solution.
LLMs are a "complicated solution" in the sense that they're expensive. Once you know what they're capable of, you can scale them down to something less expensive. There's usually a way.
Also, an important advantage of LLMs over other approaches is that it's easy to improve them by finding better ways of prompting them. Those prompting strategies can then get hard-coded into the models to make them more efficient. Rinse and repeat. Similarly, you can produce curated data to make them better in certain areas like programming or mathematics.
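As a sketch of that improve-then-bake-in loop (the `ask_llm` call, the prompt templates, and the toy eval set are all stand-ins, not any particular API):

```python
# Score a few prompt variants on a tiny labeled set and keep the winner;
# the winner's transcripts are exactly the curated data you could later
# distill back into the model.
def ask_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM call here")

eval_set = [("2+2", "4"), ("3*5", "15")]  # toy examples, purely illustrative

prompt_variants = [
    "Answer with just the number: {q}",
    "Think step by step, then give only the final number: {q}",
]

def score(template: str) -> float:
    hits = sum(ask_llm(template.format(q=q)).strip() == a for q, a in eval_set)
    return hits / len(eval_set)

best = max(prompt_variants, key=score)  # the strategy worth hard-coding
```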
They're not _complicated_, they're complex. And "solution" implies they're not hallucinating the goat and how to fix it.
Do you realize you're fighting a strawman or do you actually think this is a compelling argument?
oh, sorry, I'm not running a 10T param. Just local models for me. kk thx.
>I think the rest of us should rest easy knowing that LLMs can't (and maybe were never meant to) tackle the tacit-knowledge-filled, human-system-centric, ambiguously-defined-problem-space jobs most mortals work.
A statement all but guaranteed to look incredibly short-sighted by 2030.
The past few years have seen a great rise in casuals reminding us of AI's limitations, only to be proven wrong in 6 months. I don't think we're close to AGI, but in 2 years I've gone from AI doubter to AI convert. It's not perfect, but I don't need it to be.
The real question to me is if the system can pay for itself. Economics are racing against efficiency gains and it's anyone's guess which wins.
What are those limitations we're talking about? It seems most of the original limitations that people complained about were resolved through workarounds like tools and skills, which are more software engineering than LLM advancement.
Are Googlers themselves happy using Gemini coding agent instead of Claude Code or Codex? (no snark, I'm really asking)
Yes. The models are good, the models are fast, and the internal tooling has caught up at this point too. There's a lot of UI/UX/tooling stuff that's still being worked through, integrations with VCS, and solving deeper problems that I probably can't talk about, but I'd say the frustrations of most are about the rate of change much more than the actual abilities.
One thing that's interesting is a bunch of internal thought leaders who swear by the Flash models over the Pro models. Whether this is true or not doesn't really matter; the interesting bit to me is that we are at a point with the models where "better" models are not necessarily more useful, and that faster, with more work on the harnesses, may be a better trade-off.
> a bunch of internal thought leaders who swear by the Flash models over the Pro models
I'm coming around on this too. deepseek-v4-flash is impressive.
>One thing that's interesting is a bunch of internal thought leaders who swear by the Flash models over the Pro models.
I've seen people outside Google favoring the Flash Gemini models over Pro too.
There are also some benchmarks where Flash models have higher scores, so yes, apparently speed does matter.
You’re absolutely kidding yourself if you genuinely believe that.
Happy to chat internally if you want, feel free to reach out.
I see a lot of people swearing by one model, but without trying others. I see a lot of opinions based on a snapshot of tooling from ~January, when for example Claude Code was exceptional, but that don't appear to have been updated. In blind tests the models appear to be much closer than some folks would have you believe.
If you mean specifically the Gemini VS Code Extension: it's terrible compared to Claude Code or Codex. I don't know how they can get away with it. Just constant timeouts, weird failure modes, having to start a new chat to switch modes... but I don't think any of that is specific to Gemini the model; it seems to be the extension.
As for actual solutions to problems, ignoring the VS Code extension aspect, I find all three premier models to be excellent coding agents for my purposes.
The overall quality of LLM coding tools is shockingly bad. I haven't found a single one without major issues, and many have the same problems reappear every few months, sometimes bad enough to almost break the entire thing (e.g. 100% failure rate in editing files, broken for weeks, with the same cause each time, multiple times in a year).
I'd say I'm surprised by it, but uh
>The overall quality of LLM coding tools is shockingly bad
Most of them were vibecoded in days, so what do you expect? And new versions just add features, they never fix the old cruft.
Probably there would be some money to be made if someone actually takes the time to write a good agent harness.
Note that coding is not the only use of Gemini or any of these models. It's also not what this article is talking about. Gemini can be not the best coding agent, but very good at other things.
The point of dogfooding is exactly that: if we're unhappy, we're the ones positioned to improve it.
the engineers using gemini have no control over deepmind
Are you in the Gemini team?
Last month, Steve Yegge suggested that they are not: https://xcancel.com/Steve_Yegge/status/2043747998740689171
> He says the problem is that they can't use Claude Code because it's the enemy, and Gemini has never been good enough to capture people's workflows like Claude has, so basically agentic coding just never really took off inside Google. They're all just plodding along, completely oblivious to what's happening out there right now.
This is a bunch of gabagoo. Wrong on so many layers, it's not even worth reading further.
a) goog has agentic coding in both antigravity & cli forms. While it is not at the level of cc + opus, it's still decent.
b) goog has their own versions of models trained on internal code
c) goog has claude in vertex, and most definitely can set it up in secure zones (like they can for their clients) so they'd be able to use claude (at cost) within their own projects.
Agreed, however imo there are def some problems unique to Google which are making the internal experience less than ideal.
Hoping they can figure it out sooner rather than later.
Demis Hassabis chimed in on that thread and called it what it is: clickbait.
I’m not so sure. From talking to some of my own friends at google they feel that antigravity/gemini models are handicapping them and would much rather be using claude code (which only deepmind gets to use)
Sure, but there's a cavernous distance between "google = john deere" and "darn, I have to use Gemini".
He was entirely correct.
He made a follow up after the pushback by GDM.
Google’s businesses are very broad and durable. But Google being the only company in the world without access (except for GDM+labs) to a competent coding agent will take a toll.
We’ll see how long Google can hold out hoping for GDM to create something that is competitive.
I'm guessing that within 6 months Google will give up on coding and finally let their devs use Claude/Codex.
This isn’t a security problem, this is a GDM issue with GDM’s promises being far beyond their ability.
There is value in the "eating your own dog food".
If internal staff aren't happy with the tools they build, typically that should drive improvements to their own tools
This couldn't be further from the truth
I for one can't tell the difference between Claude and Gemini for coding. And the internal agent tooling is many times faster than Claude Code in my experience.
they use a web-based VS Code-like editor (Cider) with a custom agent
Antigravity comes to mind
they use claude code at deepmind
Not a Googler, but I use gemini in JetBrains Junie and have no issues with it. It's cheap, very fast and most importantly actually listens to you.
Codex?
The AI CEOs love to pontificate about AI curing cancer, but it seems like DeepMind is the only one actively working on these research problems, while OpenAI/Anthropic largely chase enterprise/coding revenue.
Google can self fund from their war chest while OpenAI and Anthropic are hat in hand.
AI improving itself (or at least the architecture it runs on): the singularity is near, as they say.
Do we have other examples of AI being used to improve the LLMs, apart for the creation of synthetic data and the testing of the models?
There is an apples and oranges difference between AI improving itself (becoming more capable) and AI optimizing software that happens to be used for AI training or inference.
A more efficient transformer just costs less to run.
"AI improving AI" would be if one generation of AI designed a next-gen AI that was fundamentally more capable (not just faster/cheaper) than itself. A reptilian brain that could autonomously design a mammalian brain.
Even when hooked up to a smart harness like AlphaEvolve, I don't think LLMs have the creativity to do this, unless the next-gen architecture is hiding in plain sight as an assemblage of parts that an LLM can be coaxed into predicting.
More likely it'll take a few more steps of human innovation, steps towards AGI, before we have an AI capable of autonomous innovation rather than just prompted mashup generation.
I don't think there is a fundamental divide between implementation speedups and algorithmic/architectural optimizations.
> Do we have other examples of AI being used to improve the LLMs
Yes: last year when they revealed AlphaEvolve, they used a previous Gemini model to improve kernels that were used in training this generation's models, netting them a 1% faster training run. Not much, but still.
Self-improvement doesn't necessarily imply singularity, right?
There could still be hard constraints that make singularity intractable, or just such a long time horizon that it's not practical, right?
I feel like the most viral lately is https://github.com/karpathy/autoresearch
> AI improving itself
This is the thing to look for in 2027, imho. All the big AI labs have big projects working on research agents, specifically also on improving AI (duh), and I expect a lot of that to get out of the experimental phases this year.
Next year they actually get to do a lot of work and I think we will see the first big effective architectural change co-invented by AI.
And then in 2028 we will be selling ice cream at the beach.
Shameless plug: https://huggingface.co/spaces/smolagents/ml-intern
It’s a simple harness around Opus, but with tight integration to Hugging Face infra, so the agent can read papers, test code and launch experiments
What are the benchmarks for this, in terms of computation cost and error? Cost to converge?
Re: hyperparameter tuning and autoresearch: https://news.ycombinator.com/item?id=47444581
Parameter-free LLMs would be cool
Singularities are a sign that you have a broken model.
The hard part about this is that for every few "WOW"s, there's a lineage of "you dumbass".
I mean, if you can create a harness to filter these two, sure, singularity away; it's really hard to see how someone's gonna do that.
An issue I have been noticing with Claude is that, for simple tasks, it gives extremely bloated code and artifacts, which sometimes do not even work. Gemini balances it quite well, giving a working solution with the exact amount of code and minimal complexity, which is easier to manage.
The only thing I go to Claude for these days is front-end code (HTML). Here too, it gives too much CSS (60% of the file size), but I'm OK with that as it gives a bit of a polished look, though it's heavy on file size.
All the *Evolve publications have very impressive results, but from the time I've spent on the published information, I feel that the attention goes to the LLM and AI side of things, although the reported outcomes are in almost all cases the result of very well-designed environments in which both the LLM and the evolutionary algorithm can work well.
This paper here is a great example of that, and it's worth a read.
Magellan: Autonomous Discovery of Novel Compiler Optimization Heuristics with AlphaEvolve https://arxiv.org/abs/2601.21096
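To make that point concrete, here's a stripped-down sketch of the evolve loop, with a toy mutation and fitness standing in for the LLM rewrite and the real scorer. Notice how much of the design burden sits in `evaluate`: in the actual systems it has to be a fast, hermetic, cheat-proof environment, which is exactly the part the papers gloss over.

```python
# Bare-bones AlphaEvolve-shaped loop: mutate, evaluate, keep the fittest.
import random

def llm_mutate(program: str) -> str:
    # Stand-in for "ask an LLM to rewrite the program"; here, a random bit flip.
    i = random.randrange(len(program))
    return program[:i] + ("1" if program[i] == "0" else "0") + program[i + 1:]

def evaluate(program: str) -> float:
    # The "well designed environment". In the real systems: compile, run
    # against held-out inputs, measure correctness and wall time, reject
    # reward hacking. Here: a toy one-max fitness.
    return program.count("1")

def evolve(seed: str, generations: int = 200, pop_size: int = 8) -> str:
    population = [seed]
    for _ in range(generations):
        child = llm_mutate(random.choice(population))
        population = sorted(population + [child], key=evaluate, reverse=True)[:pop_size]
    return population[0]

print(evolve("0" * 32))  # climbs toward all-ones under the toy evaluator
```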
How many more times do we have to hear about Erdős problems? :) It sounds like a great achievement for humanity at first, but after a while they keep coming back!
There are only some 700 open Erdős problems left, so when they're all solved you can finally rest.
There aren't a lot of opportunities in this space yet. This is the closest we can get to a high-degree solver kind of problem.
There are only 3 companies doing this to date: Google, Sakana AI and Autohand AI.
I wish that Google would focus on bringing their Gemini 3.x models to GA, and provide enough capacity such that one doesn't constantly have to fight with 429 errors.
It often feels like they do not want me to develop applications for corporate clients using their Vertex API. It is just such a shame, given that their models were so great for document analysis etc.
Are you doing it on a free plan? I noticed they serve way more 429s on the free plan.
No, for clients we use paid Vertex AI accounts. We often need to host workloads in an EU region, which rules out “global” models (and probably better capacity).
In the past, we used a wrapper that round-robined across multiple projects to get enough quota. Luckily, many of our workloads are workflow-style tasks, so we can simply keep retrying on 429s.
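For anyone in the same boat, the wrapper boiled down to something like this (a sketch: the project IDs are hypothetical and `call_vertex` stands in for your actual Vertex AI client call; in practice you'd catch the client library's specific quota-exceeded exception rather than string-matching):

```python
# Round-robin projects for quota, exponential backoff with jitter on 429s.
import itertools
import random
import time

PROJECTS = ["proj-a", "proj-b", "proj-c"]  # hypothetical project IDs
project_cycle = itertools.cycle(PROJECTS)

def call_vertex(project: str, prompt: str) -> str:
    raise NotImplementedError("your Vertex AI / google-genai call here")

def call_with_retry(prompt: str, max_attempts: int = 8) -> str:
    for attempt in range(max_attempts):
        try:
            return call_vertex(next(project_cycle), prompt)  # round-robin quota
        except Exception as e:
            if "429" not in str(e):  # crude check; use the real exception type
                raise
            time.sleep(min(60, 2**attempt) + random.random())  # backoff + jitter
    raise RuntimeError("out of retries")
```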
Fun fact: for one of their services, I think it was Stitch, I noticed that my paid key kept hitting quota while the free one worked fine. That blew my mind.
I've been seeing the same in my product: 429s in Vertex.
We generally avoid any Google AI for the most part because it's so unreliable.
This is crazy: the fact that it is helping with stuff like quantum too is huge!
I would be interested to see how exactly the agent helped. How was it used, where did it lead to the given improvement, and to what extent would a human have come to the same solution?
The blog post has many links to papers and preprints discussing this exact question.
The CANOS arxiv link says absolutely nothing about AlphaEvolve, Gemini, or LLMs. It seems to use purely traditional ML models. If AE did in fact write a quick script to test different configurations in order to optimize the results, they don't seem to have bothered to write about it.
I can't read the Nature paper about DeepConsensus, but from the summary, it doesn't really explain what role AE had in improving DC. It would be nice to be able to read about what role it actually played, and whether it used traditional or novel methods to do so.
seems like `karpathy/autoresearch` on steroids
A fantastically simple solution to improving algorithms, I wish I had this years ago in activation engineering: https://blog.n.ichol.ai/llm-activation-engineering-an-easy-f...
How do I access AlphaEvolve?
This is just a flex post. Be a billion dollar company or get out.
They'll likely make it available at some point, but for now one can use OpenEvolve [0] which is not quite as good but should be a good start to use the same LLM-driven evolutionary framework.
[0] https://github.com/algorithmicsuperintelligence/openevolve
There's also: https://github.com/inter-co/science-codeevolve and https://www.turintech.ai/
Your link seems completely unrelated. Why would you suggest that?
Not sure what you mean: OpenEvolve is an open source implementation of AlphaEvolve: https://huggingface.co/blog/codelion/openevolve
AlphaEvolve couples MAP-Elites with LLMs. It's a key step in machine learning, in the vein of DQN for reinforcement learning.
AE brings diversity from the genetic algorithms community to large-scale optimized deep learning and RL models.
It is a mandatory step for moving forward. The approach is clean and simple, while generic.
The only caveat is the per-problem definition of the MAP-Elites dimensions. But surely this will get tackled somehow over the next few years.
If you don't know about MAP-Elites, go look up Jean-Baptiste Mouret's work and talks; it's both very interesting and universal.
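For readers who haven't met MAP-Elites: instead of keeping one best solution, you keep the best solution per behavioral niche, which is where the diversity comes from. A toy, self-contained sketch (the descriptor and fitness functions here are arbitrary illustrative choices; AlphaEvolve's twist is using an LLM as the mutation operator):

```python
# Minimal MAP-Elites: an archive of elites keyed by behavioral descriptor.
import random

GRID = 10  # niches per descriptor dimension

def fitness(x: list[float]) -> float:
    return -sum(v * v for v in x)  # toy objective: maximize toward 0

def descriptor(x: list[float]) -> tuple:
    # Map a solution to a niche, e.g. by binning its first two coordinates.
    return tuple(min(GRID - 1, int((v + 1) / 2 * GRID)) for v in x[:2])

def mutate(x: list[float]) -> list[float]:
    return [min(1, max(-1, v + random.gauss(0, 0.1))) for v in x]

archive: dict[tuple, tuple[float, list[float]]] = {}

for _ in range(10_000):
    x = (mutate(random.choice(list(archive.values()))[1])
         if archive else [random.uniform(-1, 1) for _ in range(4)])
    key, fit = descriptor(x), fitness(x)
    if key not in archive or fit > archive[key][0]:
        archive[key] = (fit, x)  # new elite for this niche

print(f"{len(archive)} niches filled; "
      f"best fitness {max(f for f, _ in archive.values()):.3f}")
```

The per-problem part the comment flags as the caveat is `descriptor`: choosing which behavioral dimensions to bin over is still manual design work.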
We went from 'AI will replace programmers' to 'AI will help programmers' to 'AI writes code while other AI reviews it' in about 18 months. At this rate the humans are just providing the electricity.
RSI is here on the hardware level and on the software level. Sprinkle in a couple of algorithmic breakthroughs and the results are nigh unimaginable.
Meanwhile Gemini CLI has been broken for months!
https://github.com/google-gemini/gemini-cli/issues/22141
Welcome to HN @berlianta; TIL green username === new user in HN; Stories posted by new users are called noobstories [1];
[1]: https://news.ycombinator.com/noobstories
From the comments, it seems that this community (mostly career software people) is starting to move into a new phase of grief about the median software engineer losing their hoped-for permanent place in society.
- 2021-2024 was Denial
- 2024-2025 was Anger and Bargaining
- 2026 seems to be some combo of anger, bargaining, and acceptance, depending mostly on your class/age
I think we are still in the denial phase.
and yet Gemini still can't code
What I'm most curious about is how this translates to messy, real-world codebases without well-defined metrics. Most production software isn't chip design or kernel optimization - it's business logic with unclear success criteria. The infrastructure story is impressive, but I'd love to see how they handle domains where the evaluation function itself is ambiguous.
> In advertising and marketing, WPP used AlphaEvolve to refine AI model components, navigating complex, high-dimensional campaign data and achieving 10% accuracy gains over their competitive manual model optimizations.
Ah good, we're getting closer and closer to Venus, Inc. every day. /s