Saving money on tokens isn't something that's rewarded during performance reviews, particularly because it's difficult to quantify how much you saved versus hypothetically using a more expensive model.
Churning out useful code quickly is not solved by using more tokens per unit time. Most non-technical leaders can grasp this one and are likely more interested in the strategic game theoretical dynamics that are being forced by way of implied token consumption expectations (competition between developers).
If you want to hold out as long as possible and don't really care about anything other than the compensation package, you should at least play along with this new game in a half-assed manner. Try to goldilocks your token usage between any established extremes. You want to be in the statistical barycenter of every AI report that management can create.
Where we were 6mo ago is that a lot of big orgs realized they were behind, and needed some way of measuring if the tools were usable at all.
No sawdust at all on your job site, and you can tell nobody is cutting wood.
Now that tooling is more mature, you can measure things like % of diffs AI-generated, % of AI suggestions accepted vs edited, % of KB queries successful etc - all more useful than raw token count for quantifying how your org is using the tool.
So it’s a pragmatic metric that got a bit Goodharted.
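For what it's worth, the ratios mentioned above are cheap to compute once you have the raw counts. A minimal sketch, assuming you already log per-suggestion accept/edit flags and AI-attributed line counts somewhere; `Suggestion` and `adoption_metrics` are made-up names for illustration, not any vendor's API:

```python
from dataclasses import dataclass


@dataclass
class Suggestion:
    accepted: bool  # the developer accepted the suggestion
    edited: bool    # the developer changed it after accepting


def adoption_metrics(ai_lines, total_lines, suggestions):
    """Return the three percentages as a dict; empty inputs yield 0.0."""
    accepted = [s for s in suggestions if s.accepted]
    return {
        "pct_diff_ai_generated":
            100.0 * ai_lines / total_lines if total_lines else 0.0,
        "pct_suggestions_accepted":
            100.0 * len(accepted) / len(suggestions) if suggestions else 0.0,
        "pct_accepted_then_edited":
            100.0 * sum(s.edited for s in accepted) / len(accepted) if accepted else 0.0,
    }


# Hypothetical sample data, only to exercise the function.
sample = [
    Suggestion(accepted=True, edited=False),
    Suggestion(accepted=True, edited=True),
    Suggestion(accepted=False, edited=False),
    Suggestion(accepted=True, edited=False),
]
metrics = adoption_metrics(ai_lines=300, total_lines=1200, suggestions=sample)
```

None of these numbers says anything about whether the accepted code was any good; they only describe how the tool is being used.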
And the tragedy is that this isn't sustainable, and we all involved deeply in tech know this. There is eventually going to be a big reality check the companies will have to pay, because you can't force creativity and quality, not even with AI, because actual intelligence lies with us at least for now and for the foreseeable future. However when the rope eventually snaps these executives at best will fall upwards, with big severance bonuses and a list of "contributions" we have to be grateful for. We are the ones that will suffer through the next big layoffs.
In cost per line of code, we have verified this is always an error unless your time is worth less than the machine (unlikely unless you consider your time to have no cost rather than considering it as your hourly rate).
The worst thing for our productivity has been Claude Code or Claude Cowork taking a complex problem and turning around and writing bad instructions for dumb model agents then synthesizing the dumb answers into an orchestra of badness.
The single best fix for results-per-total-cost is to ensure it reads and thinks about whole content, not snippets, and thinks with the smartest model, not agents.
Agents should toil. Agents should neither think*, nor decide what to think about which itself is thinking.
* Agents should “think” like ants or bees or beavers think. Any human-like thinking, *especially* intuition-like thinking, should be thought by the best model available.
** Nobody should be “churning out code”. In a hierarchy of coders who translate detailed specs to some computer language, developers who write software that ships on a project timeline, and engineers who accomplish business goals, engineers should “churn out” engines structured for business outcomes.
Measured by that, the machine is leverage while reducing a variety of costs. At the same time, because most training data doesn't grok this, the machine doesn't grok it either. So it needs you to shape its toil.
The whole industry is adjusting to the reality that the expected output of an engineer is much higher than it used to be. It’s not local to one company. You may find a better environment for the time being, but this is the direction everything is headed.
I've been getting by on the $200/year plan by smoothing usage continuously over time.
The pay per use is for the API so does it mean you're using the API in a custom setup?
When you consider that xAI's old data center was enough to bring Anthropic back ahead, it tells me Microsoft could host their own on underutilized previous gen GPUs that are sitting there wasting server real estate.
I don't buy it. Old models such as GPT4.1 were faster than newer reasoning models, and their output was as good. Newer models end up wasting an ungodly amount of time with chain-of-thought steps which can be a complete waste of time if you have a structured prompt such as a plan or a spec.
My experience in the real world is that users have to ration requests, and x0 models actually tend to be used far more because expensive models are left for more complex tasks.
What they wanted was for them to use both and feedback which was better.
The developers voted with their feet and didn’t use Copilot.
What Microsoft were hoping was that the opposite would happen...
This was true in January -- since then, the Copilot CLI team has spent countless hours with engineering leaders and the biggest Claude Code users at the company to understand Copilot's shortcomings, define evals to properly test them head-to-head, and close the gap between the products.
The result? Claude Code usage was organically decreasing and Copilot CLI usage was organically increasing -- when this announcement was made, internal Copilot CLI usage had been greater than Claude Code usage for weeks!
I've tried throwing unsupervised agentic software factory workflows against the wall, and they burned through my tokens like nobody's business but didn't produce much.
Supervised, human-in-the-loop process on the other hand is much more productive but doesn't consume nearly as much. Maybe that's why everyone's pushing agentic approaches so much.
Dealt with that by going all out and making an agentic parallel code review skill. Basically an infinite TODO list generator. Now I'm definitely getting 100% of the usage I paid for. It really burns tokens like nobody's business, and catches a lot of issues while at it. I've been looping this review/fix process every week. It's dramatically reduced the amount of stuff I need to pay attention to during my human review sessions.
> I understand that Microsoft is planning to remove most of its Claude Code licenses and push many of its developers to use Copilot CLI instead. While Claude Code has been a popular addition, it has also undermined Microsoft’s new GitHub Copilot CLI coding tool — a command line version of GitHub Copilot that runs outside of development apps like Visual Studio Code.
And people here are interpreting this as related mainly to Claude burning too many tokens too quickly and suggesting Microsoft should rather use SomeOtherLLM©?
Is this Hacker News or rather Marketing Wars?
I've launched an internal demo of Claude Code and Deepseek on the same day and we burned through our monthly allowance for Claude in just over a week, with more than a half of that budget being spent in one day. With DS people are unable to go through that same amount of money in a month, not even close.
With that Claude feels like an expensive toy, while DS is a shovel, purely because developers do not feel like they are eating into a precious resource while using it. Also it does not feel like there is much of a difference in capability between Claude and DS-pro. DS-pro and flash do feel like sonnet/opus and haiku, but flash is still very-very capable.
Between Copilot, Claude, and Gemini, I still actually prefer Gemini. I do a lot of scientific writing in addition to coding and Gemini is the only model I can trust to “just be right”. This trust then transfers over to its code output.
This would never fly if stock market was rational. But it never is.
I expect the r/LocalLLaMA guys to be going nuts about this news.
> It was part of an effort to get project managers, designers, and other employees to experiment with coding for the first time.
I suspect they weren't as efficient as they could be with token use either. Sounds like they were trying to encourage non-developers to vibe code stuff
Github Copilot offered probably the best value and was IMO underappreciated for a long time; I've been an annual subscriber since day 1.
The changes announced a few days ago completely revoke that value proposition, I doubt I'll continue with it.
Arguably, Copilot is GPT 5? Not sure what the CLI offers behind the covers.
The CLI can swap to whatever model (/models) based on your subscriptions.
The copilots on desktop or Office Apps are likely just GPT5 nano or other tiny models with cheap inference
https://github.blog/news-insights/company-news/github-copilo...
Claude tokens are priced by GitHub at a disproportionately premium price compared to Gemini and OpenAI. I wonder why?
https://docs.github.com/en/copilot/reference/copilot-billing...
Speed without judgement always compounds badly.
2) Opus is not even unambiguously best at coding anymore. GPT 5.5 has split that title for some time now.
3) I would have probably done the same in his position. Dogfood the product.
Similarly companies seem to reward high token usage as a sign of someone willing to play ball with AI and again have forced higher costs on themselves for people reward hacking or using tokens out of spite.
Also it became very hard to convince management to keep both Claude code and GitHub Copilot enterprise licenses.
This is a warning to any company, not building their own AI, that AI assisted development could become really expensive really fast and most likely won't pay off. What Microsoft is suggesting is that the current price is too high, but it's still not high enough for e.g. Anthropic to be profitable, or AI coding tools are only as good as the developers using them. So you can't meaningfully do layoffs by replacing the developers with AIs, because the cost is too high.
How does Microsoft plan to fix CoPilot, so that the cost will be so much lower than Claude, that budget overruns won't be a problem for their own customer?
There are papers describing KV cache precomputation for commonly used documents (e.g. KVLink), but, of course, it's not a priority for model providers: they'd rather sell you more tokens, also they would rather get to AGI/ASI first than optimize usage of existing models...
I found Opus 4.7 to be slow and wasteful with token usage. It's shocking how inefficient it is with tasks like bash tool usage and web searching, delegating them to a dozen subagents only to get stuck and never return until you esc and intervene. That, in addition to all of the broken tooling Anthropic built in to limit token usage like the broken monitoring tool made managing Claude a chore. I was happy to pay $200/month for Opus 4.5 when they had more capacity, but 4.7 felt like a huge step back and no longer worth the price and inconvenience.
I remember an OpenAI employee comment on the GPT5.5 release post about how they specifically geared it towards long-horizon tasks, and it's been a breath of fresh air in that regard. I have five two-week long sessions going right now and there's been no degradation in performance or efficiency. It's much better at carrying rules/learnings forward even in long-running sessions and grounding/refreshing itself in verified facts when it loses context.
It's funny because in two weeks I've gotten way more done with GPT5.5 with way fewer tokens and way less handholding. I think this goes to show how important tooling and the harness are, and how a capable model like Opus 4.7 can be severely handicapped by bad product decisions.
If anything, it's forced dogfooding, i.e., forcing their own workforce to beta-test their product.
Side note, it's so frustrating that The Verge puts a paywall at the fold. It makes me feel like the rest of the story is not worth reading. I'm not inclined to pay $2 to read a link that was posted on an aggregator.
At least Codex is trying to win validation on merit.
% of AI suggestions accepted vs. edited is also a BS metric that Anthropic et al. like to push, similar to LoC, because they're large numbers and large numbers must be good, right?
Well guess what, I have auto-accept on and then adjust after it's "done". And I do it by telling it what changes to make and those have auto-accept on as well. That's quite a high "accept" rate, by definition. But in reality it may have churned on 50% of the lines it generated and auto-accepted first.
I disagree. It’s a valuable metric if you are building an agent / skill infra layer.
Think of it like error rate on your API. Green metric does not mean your system is healthy, but if it’s red you have an issue you definitely need to fix.
Your example scenario is detectable in the non-naive implementation anyway; the o11y layer (usually OTel these days) tracks the trajectories, links them to the diff, and attributes each hunk as coming from the session or not.
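To make the difference between the naive accept rate and a hunk-attribution approach concrete, here is a toy sketch. It is not how OTel or any vendor does it; it just compares the lines a session generated with the lines that ended up in the commit, matching on stripped line content. `naive_accept_rate` and `survival_rate` are hypothetical names.

```python
from collections import Counter


def naive_accept_rate(accepted, offered):
    """Share of offered suggestions that were accepted."""
    return accepted / offered if offered else 0.0


def survival_rate(session_lines, committed_lines):
    """Fraction of session-generated lines that still appear in the commit."""
    generated = Counter(l.strip() for l in session_lines if l.strip())
    final = Counter(l.strip() for l in committed_lines if l.strip())
    total = sum(generated.values())
    if total == 0:
        return 0.0
    # Multiset intersection: each committed line can vouch for one generated line.
    survived = sum((generated & final).values())
    return survived / total


# Auto-accept is on, so every generated line counts as "accepted"...
session = ["def f(x):", "    y = x * 2", "    print(y)", "    return y"]
# ...but the follow-up prompts rewrote most of it before the commit.
committed = ["def f(x):", "    return x * 2"]

accept = naive_accept_rate(accepted=len(session), offered=len(session))
survive = survival_rate(session, committed)
```

With auto-accept on, `accept` comes out at 1.0 while `survive` is 0.25, which is the gap the parent comment is complaining about. Real attribution would work on diff hunks rather than raw line text, so treat the matching here as a stand-in.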
I would ask you tho: What incentive do AI vendors have to even try and detect this? It's in their interest to use the most naive interpretation, i.e. what my original comment mentioned, as it shows how "good" their models are, coz nobody ever changes much if anything ;)
Never mind that they really can't unless they're going "creepy mode". If I use Claude/Codex et al. to agentically write something, then let the session just sit while I go about in my IDE changing things and then I commit and push, are you telling me that the vendors do or should track all of the changes made to the files they touched and report back to base what got overridden by me, the human?
But I do think you also need to say, "To be clear, don't game the system. Any token usage that is even remotely justifiable as useful for the business is fine, and we will give you a lot of latitude. But if you're in the top 10% of token users, we are going to review your token usage, and if we find that you have a dozen agents perpetually running writing slam poetry, you're going to get fired."
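The "top 10% gets a review" rule is simple enough to sketch. This assumes you have a per-user token total for the period; the function name and the usage figures are invented for illustration, and the output is only a list of people to look at, not a verdict.

```python
def users_to_review(usage, top_percent=10):
    """Return the heaviest token users, highest first, for a manual review.

    usage: dict mapping user name to tokens consumed in the period.
    top_percent: integer percentage of users to flag (at least one user).
    """
    if not usage:
        return []
    # Ceiling division in integers avoids float rounding surprises.
    count = max(1, -(-len(usage) * top_percent // 100))
    ranked = sorted(usage.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:count]]


# Hypothetical monthly totals for ten developers.
monthly_tokens = {
    "ana": 4_000_000, "ben": 3_500_000, "cho": 90_000_000, "dev": 5_200_000,
    "eli": 2_800_000, "fay": 6_100_000, "gus": 3_900_000, "hui": 4_400_000,
    "ivy": 1_200_000, "jon": 5_000_000,
}
flagged = users_to_review(monthly_tokens)
```

Here `flagged` contains only "cho". Whether that usage is slam poetry or a legitimately heavy workload is still something a human has to check.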
Remember that the entire mantra of "productivity is a measure of how many shovels you break and replace" is only ever echoed by the one selling the shovels.
this has to be the worst metric.
anytime the llm wants me to read a diff of one file, im just gonna send it forward so i can read the whole diff
We may be on the cusp of the AI age's new era of 'measure twice, cut once'.
When I'm working on code that was heavily vibecoded, most of my PRs are reducing LoC by a couple hundreds of lines while fixing bugs or implementing a new feature.
My job kind of feels like being a garbage man, luckily my current employer appreciates it. Personally I think the current style of vibecoding only kinda works, because models are getting better fast enough to keep the shitpile from overflowing completely. Betting on the harnesses + models getting good enough to clean up after themselves is a bet, and I don't like gambling, but even I admit the odds don't seem to be bad.
""" Steve Ballmer In IBM there's a religion in software that says you have to count K-LOCs, and a K-LOC is a thousand lines of code. How big a project is it? Oh, it's sort of a 10K-LOC project. This is a 20K-LOCer. And this is 50K-LOCs. And IBM wanted to sort of make it the religion about how we got paid. How much money we made off OS 2, how much they did. How many K-LOCs did you do? And we kept trying to convince them - hey, if we have - a developer's got a good idea and he can get something done in 4K-LOCs instead of 20K-LOCs, should we make less money? Because he's made something smaller and faster, less KLOC. K-LOCs, K-LOCs, that's the methodology. Ugh anyway, that always makes my back just crinkle up at the thought of the whole thing. """
They call themselves "risk takers" to justify their high pay.
> the companies will have to pay, because you can't force creativity and quality
Most companies do not care about quality.
_users_ who have to interact with that software will pay the price. Example from one of the wealthiest companies in existence, for one of its most strategic products: I was trying gemini-cli on some mcp servers just yesterday, with gemini-chat helping me configure everything. In less than 10 minutes, I stumbled upon 3 or 4 different bugs. Eventually, even gemini-chat recommended that I throw gemini-cli in the bin and move on to another agent... That's the new norm.
Have you seen the state of current corp software? I'd say a lot of creativity is still very much needed. Let's see how long this is sustainable.
> would anybody be really sad if this work is overtaken by LLMs?
I'd not be sad about the job itself, but the dev which had a mortgage to pay but now is substituted by a machine churning crap code while their superiors get sore from patting themselves on the back.
I know from personal experience that once you fix a bug introduced by Claude, Claude tries to recreate the bug every time he edits that code again!!
I don't care about cost, I care about getting good results fast.
Cost per line of code is not a suitable metric for anything. It's as silly as measuring engineers' performance by lines of code. More lines of code is worse than fewer lines of code. When you say "we have verified" whoever that "we" is makes a big difference, but you're posting pseudonymously, how are we to even guess at that "we"?
I get better results with some older cheaper models, faster. In particular older Claude models than Opus 4.7. Maybe the more expensive model churns out more lines, more complexity faster. That is a worse outcome for me. The complexity must be avoided at all costs. The simpler, smaller, answer is always better, and scales to bigger code bases. The more the model guesses at intent rather than checking intent, the more the model is clever rather than clear and simple, the worse the outcome, the more that the model turns into an architecture astronaut, the worse the outcome.
Only cost for effective* outcome matters. And if your lines of code have a cost, you would want fewer lines of code to achieve the outcome, not more.
Are you sure you disagree with that?
* If your place of work starts talking "efficiency"**, run. Find somewhere the conversation is *effectiveness* — at the goal/outcome level.
** Not to mention that "efficiencies" is MBA speak for "right sizing" away effectiveness.
I haven't seen "just absorb a giant ball of context and do the right thing the first time" be cracked yet, even for Opus 4.7.
At the end of the day, code is code, and we have decades of lessons about how to make code more reliable and maintainable. Composable small modules, not god methods, are still the way to go, and they reward devs who use them to get focused context for agents with faster - and often better - results.
While they obviously want a high quality product, no outages, a responsive system etc, I don’t think they necessarily understand why you need to avoid creating god-objects, need to reason about abstractions, etc.
Most environments only care about the output. In the case I'm thinking of, Software made it perfectly clear to Management, most of whom were former engineers, that the product desperately needed redesign in some ways. But as long as the cost of that redesign exceeded the cost to get the next version out, it could be postponed. This went on for years.
It’ll take production incidents, impacted customers, and brand damage to make the executives start to prioritize quality over quantity again.
Besides, it's probably counterproductive in the long run to think of strong worker rights as being opposed to the employer wanting higher productivity out of the worker.
But I'd agree that everyone can start planning a career shift that'll span a few months to some years in order to seek better working conditions. Passively accepting all work degradation because that's life and money is needed is partly responsible for the current situation too.
Coding faster leads to less understanding and higher long-term risk. Source-Code amnesia is real, and there’s a time requirement to really understand and appreciate what a system is actually doing.
I’ve been able to implement very large features using frontier models, but the code needs to always be revisited.
AI can do two things: find vulnerabilities, and prototype code. It cannot design software, and any appearance of such is an illusion at best.
We don’t need to produce faster to be successful, we need to create better, long lasting products.
This is why I have switched nearly all of my personal coding experiments over to Qwen3.6 27B. Opus makes it easy to gloss over too much and to delegate too much. And so I don't build sufficient memory of the code to provide long-term oversight.
But Qwen3.6 27B sits on a really interesting balance point. It understands code well enough to get 80% of the way to a good design, and it can fully implement a well-specified feature. But if my understanding of the code starts to weaken, things start going wrong much more quickly than they do with Claude.
Opus will happily take complex code beyond the point of salvation, if you allow it. I'm currently cleaning up a successful prototype code base right now, one that was partially vibe-coded and now needs to be put into production. And Opus generated massive amounts of tech debt. So clearly people who lean into vibe coding will need to keep upgrading their models for many years to keep up with the mess created by earlier models.
Copilot switches to API pricing starting next month (let's see how long it will last for our $39, and $19 since September), Anthropic switches all corps into API based pricing. From the most popular choices I think only Codex didn't switch yet (although it is hard to tell because I don't know their enterprise pricing).
I have DS-V4-Pro agents pretty much running 24/7. The cost is inconsequential. The same cannot be said for anything from Anthropic.
Consumer sentiment is in the gutters certainly. But objective measures of the economy like unemployment and real wages look good to excellent
Underlying model choice still has no restrictions. Opus 4.6 is by far the most popular. there's still big $$$ bills going anthropic's way.
I had far more hallucinations with 4.7 than 4.6.
I'll try it again after a few more months for them to get it right, but 4.6 is what changed my mind on LLMs as a tool, and 4.7 felt like a step backwards, so for now I'm sticking with something that has delivered me value, instead of arguing with a model ostensibly better that was making shit up 1 - 2 times a day. It was really disappointing.
I can give examples if needed, I screenshotted the most aggravating ones, but what worries me is which ones I didn't recognise.
Honestly I find GitHub Copilot CLI (and now also the new GitHub Copilot app) quite decent. I mostly use it with Opus 4.7, or rarely with GPT-5.5. The VSCode extension is ok, but CLI or app are the better experience IMO.
https://code.visualstudio.com/docs/copilot/reference/workspa...
MS thinks CoPilot is the Clark Griswold of LLMs when it's really Cousin Eddie...
These days I just use Claude Code Desktop or Claude Code in powershell. Standalone, not inside an IDE. Honestly, I'm using Desktop more and more as it gets more features.
The IDE is for me. No AI in it at all. If I want to get Claude to do something specific to a file I just @ the file.
Technically we're using Copilot and we're paying for it through Microsoft licenses, but it's using Opus 4.7. Even before this, most of our custom agents within m365 copilot were one of the GPT models.
Or maybe you're right and they want their developers to use the copilot models.
Obviously you want to be aware of what else is on the market, and use the right tool for the job -- but equally if you have a directly competing product, you'd prefer your org's telemetry and suggestions are directed towards improving your own software rather than your competitors'.
Compared to working at other big techs, where I was able to direct msg the engineers on the team for internal protobuf or datalake services in addition to user groups that were generally responsive it was just strange. Also Microsoft doesn't have a monorepo so you can't just commit patches to their service because you don't have access to their repos which I pretty regularly do elsewhere.
The Copilot CLI has ushered in the beginning of a change in this dogma -- I've helped dozens of Microsoft engineers get access to GitHub source code so they can contribute to Copilot CLI! It's fun to subvert expectations when a Microsoft IC pitches an improvement and I can respond with "submit a PR!"
There's a large (and growing!) contingent of people who don't write code these days. (Many don't even use the keyboard.)
I think Kiro might have some “first mover” advantage internally, but CC feels better to use.
GitHub Copilot is in a somewhat similar place as Microsoft's toy but still different -- it was more or less the first coding agent/assistant, and GitHub/VSCode/Microsoft has enough user base and impact to influence individual users and enterprises' choices.
For Amazon's coding agent -- I just never see anyone outside Amazon even mentions Kiro or Amazon Q. Maybe a little bit when Kiro was offering tons of free credits. But I don't think it's even remotely relevant these days. I don't see news about companies adopting Kiro.
To me, it's just a matter of time before they are sunset, like Chime or a bunch of AWS products.
For Kiro, I agree with you, it seems like wasted effort and Anthropic / OpenAI are miles ahead in their tooling.
I love AWS at the infrastructure level, but their PaaS tends to be meh, and their end-user directed stuff is usually atrocious.
There is this real danger that our thinking, and the things we make, become bloated without constraints.
IMO software has gone to shit since both mobile phones and laptops mostly have massive amounts of compute. We always seem to use it to the limit, just because it's there.
At least it's doing something productive instead of just sinking money into literal gambling simulators. Mercifully, unlike video games, automation is not "cheating".
https://github.com/matheusmoreira/.files/tree/master/~/.clau...
There are many "critics", one for each quality I want reviewed. Correctness, consistency, maintainability, security, testing... Everything I could think of, and I keep adding more.
https://github.com/matheusmoreira/.files/tree/master/~/.clau...
The scrutinize skill is the entry point. The Opus I'm talking to becomes an agent coordinator. He explores and autodiscovers the project's structure, subdivides it into logical sections.
Then he runs a truly absurd critic x section matrix against the entire project. Literally hundreds of these agents running in parallel, each focusing on one area. Ten minutes of this is enough to exhaust my Max 5x five hour window and put a serious dent in the weekly usage numbers.
It literally takes days to run a full agent sweep. I designed it around the rate limiting. The agents do file system style journaling in order to resume cleanly. They commit all of their findings as they go into an orphan branch in the repository. Further review runs can build on it and avoid searching for known issues.
The way it works in practice is I just run /scrutinize sweep and then go work on something else, or just go do my actual job, live my life, play video games, write an article for my blog or something. Come back five hours later to either resume the process or check the literally hundreds of issues that have been found by all the agents. Then Claude and myself will go in and evaluate and fix all of those issues one by one. Then review again. Then evaluate/fix again. I'm just gonna keep looping this over and over until zero issues are found. For all of my projects.
Going from solo hobbyist programmer to this was pretty insane. I can only imagine what these corporations with infinite money must be doing.
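For anyone wondering what a resumable critic x section sweep looks like mechanically, here is a stripped-down sketch. It is not the linked skill: the real thing dispatches Claude agents and commits findings to an orphan branch, while this stand-in calls a plain function and appends to a JSONL file. `sweep`, `load_done`, the `budget` parameter and the section names are all hypothetical.

```python
import itertools
import json
import tempfile
from pathlib import Path

# Critic names taken from the comment; the real list is longer.
CRITICS = ["correctness", "consistency", "maintainability", "security", "testing"]


def load_done(journal):
    """Read completed (critic, section) pairs from an append-only journal."""
    if not journal.exists():
        return set()
    done = set()
    for line in journal.read_text().splitlines():
        entry = json.loads(line)
        done.add((entry["critic"], entry["section"]))
    return done


def sweep(sections, journal, run_critic, budget):
    """Run up to `budget` pending cells, journaling each one as it finishes.

    Returns the number of cells still left, so the caller knows whether to
    resume once the rate limit window resets.
    """
    done = load_done(journal)
    pending = [c for c in itertools.product(CRITICS, sections) if c not in done]
    batch = pending[:budget]
    for critic, section in batch:
        findings = run_critic(critic, section)
        with journal.open("a") as fh:
            record = {"critic": critic, "section": section, "findings": findings}
            fh.write(json.dumps(record) + "\n")
    return len(pending) - len(batch)


def fake_critic(critic, section):
    # Stand-in for an agent call; returns a list of finding strings.
    return [f"{critic} issue in {section}"]


journal = Path(tempfile.mkdtemp()) / "journal.jsonl"
sections = ["parser", "cli", "storage"]
left_after_first = sweep(sections, journal, fake_critic, budget=10)   # window exhausted
left_after_second = sweep(sections, journal, fake_critic, budget=10)  # resumes cleanly
```

The journal is written after every cell, so killing the process mid-sweep loses at most the cell in flight. The second call only picks up the five cells the first one did not reach.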
Isn't this a (mildly exaggerated) description of AWS, which is a very successful service?
So your costs scale with the number of users you have.
That's an opex that you can explain.
For tokens for developers it's maybe closer, cost/outcome wise, to hiring an external consulting company to write your code: money paid scales with work done, no promise of delivery, arbitrary unpredictable external price changes.
It's not quite the same, though similarly lucrative for consultants.
Like the other commenter said: cloud spend can also spin out of control if you don't pay attention, yet we've found ways to keep it under control (training, guardrails, limits, transparency).
Personally, this feels like it's just trying to push the work of managers in allocating resources onto developers so that they have more work to do and can be blamed if anything goes wrong.
people still can't get over the unreasonable effectiveness of algorithms.
Colleague used Sonnet 4.6 on some pretty normal agentic coding tasks through AWS Bedrock to keep the data in the EU, 100 EUR usage in a single day. In comparison, the Mistral subscription costs about 20 EUR per month and we tested that for similar tasks it was okay, the usage got to around 10% of that monthly limit in a single day. Or Anthropic's own Max (5x) plan where you get way, way more tokens to do with as you please.
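Putting rough numbers on the comparison above: the three figures come straight from the comment, the 20 working days per month is my own assumption, and the tasks were only "similar", so read the ratio as an order of magnitude at best.

```python
BEDROCK_EUR_PER_DAY = 100           # from the comment: Sonnet 4.6 via Bedrock
MISTRAL_EUR_PER_MONTH = 20          # from the comment: "about 20 EUR per month"
MISTRAL_LIMIT_PERCENT_PER_DAY = 10  # from the comment: ~10% of the monthly limit
WORKING_DAYS_PER_MONTH = 20         # assumption, not from the comment

# Subscription cost attributable to one day of comparable usage.
mistral_eur_per_day = MISTRAL_EUR_PER_MONTH * MISTRAL_LIMIT_PERCENT_PER_DAY / 100

# How many times more expensive the pay-per-token day was.
cost_ratio = BEDROCK_EUR_PER_DAY / mistral_eur_per_day

# Days of that usage before the subscription limit runs out.
days_until_limit = 100 / MISTRAL_LIMIT_PERCENT_PER_DAY

# A full month of pay-per-token usage at the same daily rate.
bedrock_eur_per_month = BEDROCK_EUR_PER_DAY * WORKING_DAYS_PER_MONTH

print(f"subscription-equivalent cost per day: {mistral_eur_per_day:.2f} EUR")
print(f"pay-per-token is {cost_ratio:.0f}x more per day")
print(f"subscription limit reached after {days_until_limit:.0f} days")
print(f"pay-per-token month: {bedrock_eur_per_month} EUR")
```

Note the catch: at 10% per day the subscription runs dry after 10 days, so one seat would not cover a full working month at that pace.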
I feel like the sweet spot is having a monthly subscription with any of the providers (you're subsidized a bunch), but if you have to pay per tokens, now I'd just look in the direction of what tasks DeepSeek would be okay for, sadly probably not in the situation above. For a startup, though...
On the other hand, this feels a bit hypocritical:
> It was part of an effort to get project managers, designers, and other employees to experiment with coding for the first time, and sources tell me that Claude Code has proved very popular inside Microsoft over the past six months.
They're gonna say that the future is all AI... until they get the bill.
I upgraded my plan last night to Mistral Le Chat Teams. This now costs me €60 per month for two users. Limits have been reset, but I have no idea now if my per seat limit is higher than the Pro plan, or if the limit is shared between the seats, it’s really not clear. I guess I will find out next month. The limits reset on the first of the month and I really hope I don’t hit them in the next seven days.
I use Mistral Vibe CLI and I’ve written and implemented a couple of new skills[1]. Caveman, based on an idea I found online somewhere, this skill removes all extraneous response text, including articles. Makes for some fun reading, but supposedly reduces output tokens significantly. Hash-anchors, this one is based on a concept from Dirac[2], reduces search failures and also includes multi-file dispatch. It will be hard to measure, but Vibe tells me these two should result in roughly a 40% reduction in token burn.
The results for a function implementation and test of levenshtein distance in js are pretty similar but Mistral is 30x cheaper than Opus 4.7 and 4x faster than Sonnet 4.6.
I mean, they will continue to say so, they just want to be the ones being paid for the service, not anthropic :)
I tend to work with the agent, and observe what's going on as well as review/test and work through results/changes. I spend a lot more time planning tasks/features than the execution, even using the agent as part of planning and pre-documentation. It works really well. I don't think people burning through the 5hr allotment in under an hour are actually reviewing/QC/QA the results of what they're doing in any meaningful way, and likely producing as much garbage as good (slop).
I'm really curious as to HOW the MS employees were using the agents as much as what they were doing.
By buying a subscription and dealing with the limits, using claude code and paying per token seems like the fast lane to the poor house.
Me: We need to do this this that.
Claude: <random stuff that approximates human output>
Me: Are you sure?
Claude: Well actually there is a bug <more random stuff that looks right this time>
----- Now it is:
Me: We need to do this this that.
Claude: <random stuff that approximates human output>
Claude: Let me consult the advisor on that.
Claude: advisor came up with some advice, adjusting according to that. <more random stuff that looks right this time>
No public forum is naturally immune to the spread of (guerilla) marketing. [1]
[1] Internet Rule #48
That message from Carlos's son
After 2 weeks of Claude getting progressively worse and worse, today was the final straw.
I don't care if they have a phone app. The model is COMPLETE garbage after you subscribe long enough and they think they've "got you".
I can't code on my phone if the model literally moves in the wrong direction and does the opposite of what I tell it to. If I wanted to make my code worse, I'd just randomly commit garbage. I don't need a mobile app for that.
When you're on a mature codebase with 500k+ lines of code, I haven't seen anything else be as effective as 4.7.
It was given very simple ways to verify success. It simply didn't do that and said it's at a good stopping point, despite moving in the WRONG direction not even doing 1% of the task, and being told to see the task through to completion.
Meanwhile, Codex broke it down into 3 steps and just got it done...
No, "I'm going to give it to you straight, this is a large risky commit that could go sideways, so I'm just not going to do anything instead."
Claude worked on it for almost 200 commits over 2 weeks, needing to typically prompt it 3x to even TRY to make any progress instead of just wasting tokens to ignore me and tell me how big and risky it is.
Maybe Claude is just particularly terrible at this type of refactor. I'm not sure why that would be.
People heard "Claude is nerfed" and now they see it everywhere, they notice failures a lot more than they would have otherwise.
Doesn't matter that Claude is not, in fact, nerfed. Perception is powerful and most humans are not rational.
However, that's just it: you just need to improve your prompt and make it clearer, and it will perform just as well.
Opus has been dumb this week.
Claude was having a lot of capacity problems and downtime and then this week that has been much less obvious... and the model is dumber.
It could also just be luck and my impressions are false... who knows.
It's a good thing that hype-chasers are cancelling though. So we can use the services with a reasonable latency.
Tell it what to do.
Commit, push to origin, review on GitHub.
Tell it to make changes, amend the commit, push --force-with-lease.
I'm attempting to make a memory safe language like Rust but with a substantially lower learning curve and added safety (but non-zero cost abstractions) fully with AI, almost entirely from my phone, commuting, getting coffee, walking the dog, between sets at the gym, replacing doom scrolling before bed and during lunch, etc.
Mostly to test how much LLMs can actually scale development.
Depending on how long it takes them to clean up some architectural slop in the MIR lowering phase, the results could either be very impressive or not.
From a purely cost basis perspective, it's hard to argue they aren't killing it.
But from a multiplier perspective, it's up in the air how great they are.
It's proven to be a really nice experiment, because much of what I wanted to solve with a language is the problems inherent to LLM development.
So at the self hosting phase, I get a great opportunity to see if the language can actually deliver on what I dream for.
This was all supposed to be worked out prior to Cloud Next, but it wasn't. Ironically, they mentioned Claude in a few of their presentations at Next.
And that was our solution. We are a big GCP customer but our whole team is on Claude now and much happier.
1. right now, usage correlates with experimentation and learning; few, if any, know how to make these things effective on their own over long sessions of activity
2. long term, you should be using more than one agent at a time, because they are running in the background based on events (new direct message / something happened in e.g. GitHub)
I wonder if this will happen before they have some obligatory debloating of the investors' exposure to the company.
This is, in my opinion, tripe. SWEs are being laid off because of post-Covid over-hiring. The only evidence for labour destruction is in junior hires. But not because anyone is being fired, but because entry-level jobs are being cannibalised.
Nobody can make a profit with AI. Any clever idea can be cloned with AI, competition makes it unprofitable. No moat, no arbitrage opportunity. "During the gold rush, the only people making money were the men selling shovels."
We can definitely do amazing things with AI, and it makes us have superpowers, but so does everyone else. My competition also uses AI. I have to keep up with an AI powered competition now.
With research and hardware near guaranteed to bring the efficiency way up, I'm not scared here of massive price hikes.
There is no moat.
So you're getting 2 for the price of 1.5. Scale that up to 500 devs at a big company and it's a big chunk of change saved on payroll.
Keeping your headcount or hiring humans instead, AI would have to start to cost upwards of $15k/month/developer or more before it costs more than hiring. You're looking at about 4 billion tokens per month before humans start to break even or are cheaper.
But even taking a more realistic 1.25x (20% time savings) gain, let's say you drop from 500 to 400 devs, you'd have to hit around $4,000/dev/month in token spend before hiring humans again would break even.
Payroll is just expensive, in most companies it's by far the biggest expense. AI still has to cost drastically more before investors would call it out as being worse than increasing headcount, from a pure dollars perspective.
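To make the arithmetic above checkable, here is a rough sketch in Python. The fully loaded cost of $16,000 per developer per month is an assumption, back-derived from the ~$4,000 figure in the comment; it is not stated anywhere in the thread, so swap in your own number. The function names are made up for illustration.

```python
# Rough break-even sketch for "AI spend vs. headcount".
# ASSUMPTION: LOADED_COST_PER_DEV is hypothetical, chosen so that the
# 500 -> 400 dev scenario reproduces the ~$4,000/dev/month figure above.
LOADED_COST_PER_DEV = 16_000.0  # USD per developer per month (hypothetical)


def break_even_ai_spend(original_devs: int, speedup: float,
                        loaded_cost: float = LOADED_COST_PER_DEV) -> float:
    """Monthly AI spend per remaining dev at which the payroll saved
    by shrinking the team is fully eaten by token costs."""
    remaining_devs = original_devs / speedup
    payroll_saved = (original_devs - remaining_devs) * loaded_cost
    return payroll_saved / remaining_devs


def implied_price_per_million_tokens(monthly_usd: float,
                                     monthly_tokens: float) -> float:
    """Blended USD price per million tokens implied by a spend and a volume."""
    return monthly_usd / (monthly_tokens / 1_000_000)


if __name__ == "__main__":
    # 1.25x gain: 500 devs shrink to 400.
    print(break_even_ai_spend(500, 1.25))
    # 2x gain: 500 devs shrink to 250.
    print(break_even_ai_spend(500, 2.0))
    # $15k/month against 4 billion tokens/month, as quoted above.
    print(implied_price_per_million_tokens(15_000, 4_000_000_000))
```

Under this model the break-even simplifies to loaded cost times (speedup - 1), which is why the result is so sensitive to the productivity multiplier you believe in.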
While LLM Opex is "some future quarter" and very easy to co-mingle with other expenses.
Changes to GitHub Copilot individual plans
https://news.ycombinator.com/item?id=47838508
GitHub Copilot is moving to usage-based billing
https://news.ycombinator.com/item?id=47923357
Multipliers for annual subscribers:
https://docs.github.com/en/copilot/reference/copilot-billing...
When I am away from home I’ll run autossh on my dinosaur road laptop (which probably has 8MB video RAM lol) to connect to the home PC’s LLMs. Gemini assured me that this should run well over my intermittent cellular connection.
You just saved me some headache and money :-D
New pricing model changes that. I will still keep it around for autocompletion (for the rare times when I open up an editor).
It. is. so. bad.
It feels like it's at least 1-2 years behind the current top models.
Should Microsoft have stopped dogfooding because of Zune?
They are worth multiples of trillion dollars so I would wager they know a little more about success than you and me.
Fun fact, up until you face a consequence for crime, all crime is free! Have fun and go win the competition game against your co-workers.
Smaller companies will have departments that distill larger models into something more specifically manageable and useful for them. At least, that's my personal prediction :)
I do think your prediction makes sense, because the AI really isn't the product, it needs to be baked into something and licensing the models saves you the R&D and cost of implementing your own.
In order to do that they'd have to make a concrete business case to justify the headcount and compute costs. They'd be facing the same fundamental economic problems Anthropic, OpenAI, MSFT, etc are facing just at a department level instead of a megacorp level. I hope they try it, sunlight is the best disinfectant.
However, when the pressure is turned up and people have to actually show results--and, like, be accountable--instead of just buying a subscription and externalizing the accountability, I don't think we'll see so much enthusiasm about AI coding. Whether or not an engineer is actually more or less productive with AI (not merely whether they feel more productive) will begin to matter a lot more. I don't see how people continue using AI in this hypothetical small company under those adverse conditions.
There may be a spot of “good enough to pay for and make a profit” that exists.
The frontier model space costs 1000x as much to develop as the small language models, and is only 1.5 years ahead.
Factually, the frontier models have not paid for themselves. So, if you're MSFT and Apple, you don't need to run in a race where even the winner loses massively.
You can try to train models 1.5 years behind that are highly likely to be profitable, given your market position.
The average person is lagging behind what AI is capable of by 3+ years anyway...
So you can save 1000x on training and 10x on inference and just use SOTA small models.
Why spend $5B training a model that's for sure not going to make $5B (after inference costs) when you can spend $5M building one that WILL make far more than that after inference costs?
At one point there were rumours that they'd do that. They also have the rights to OpenAI models for a few more years still, so they could always use that, but apparently they're also compute starved (like everyone else).
Normally KV cache works only if your context prefix is identical, but there are papers which demonstrate documents can be cached between different contexts.
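A toy illustration of the prefix constraint, in Python. This only models which tokens could be reused, not the attention state itself, and it is not how any particular inference server implements caching; the token ids and function names are invented for the example.

```python
# Toy model of prefix-based KV cache reuse: only the leading run of tokens
# that is identical to a previously seen prompt can be served from cache.
# Token ids below are made up for illustration.

def shared_prefix_len(cached: list[int], new: list[int]) -> int:
    """Length of the common leading run of two token sequences."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n


def reusable_tokens(cache: list[list[int]], prompt: list[int]) -> int:
    """Best-case number of prompt tokens covered by any cached sequence."""
    return max((shared_prefix_len(c, prompt) for c in cache), default=0)


if __name__ == "__main__":
    system, document = [1, 2, 3], [7, 8, 9]
    cache = [system + document + [4]]  # an earlier request

    # Same system prompt and document, different question: long prefix hit.
    print(reusable_tokens(cache, system + document + [5]))
    # Same document moved to the front: the prefix differs from token 0,
    # so a plain prefix cache reuses nothing even though the text is shared.
    print(reusable_tokens(cache, document + system + [5]))
```

The second case is the one the papers mentioned above are trying to fix: the document is identical, but a plain prefix cache cannot see that.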
call me a luddite, i'll be wearing it as a badge of honor
Exactly.
No more than sitting down and writing code before a product concept or spec or architecture comes out right the first time, or fifth.
Absorb the concept, make a shape of outcome, then a spec, then hold its hand to architect a series of iterations, either component by component or thin vertical slice or whatever combination lets you iterate in working increments...
Your brain, machine leverage. After all, it types faster than you. But it should type what you want.
You know what it should type, right? If you don't, you're gonna have a bad time anyway.
The expectation of higher productivity measured by completely useless means, letting a highly qualified employee jump through hoops for the amusement and misconceptions of the C-level.
Opus is relegated to the planning / design phase.
Have you tried Claude Opus 4.7?
For example, you might have a great design/architecture session and then run out of context. The next agent tries to piece things together from fragments of conversation and such. But it often starts going off on tangents, searching too broadly to understand, missing cues and nuance, all the while burning tokens.
As other articles have put it: AI makes doing the easy things easier and the hard things harder. Because hard things require creativity.
To bring this back to the original post: companies need people, and they shouldn’t expect that they can fire half their workforce and replace it with AI. Quite the contrary. The faster companies move with AI, the more technical debt they’ll end up with; it’s a guarantee.
“If you want to travel fast, go alone. If you want to travel far, go together.”
Oh hell no, ever since the tail end of Biden the trend for unemployment is showing upwards when corrected for seasonal effects [1], and for real wage growth the situation has been worse for an even longer time [2] - if not for the effects of the post covid stimulus packages plus emergency wage raises following the energy cost explosion thanks to the Russian invasion of Ukraine.
The story the stonk markets tell is completely decoupled from reality, partially because the AI wash trading bubble keeps distorting the statistics, partially because no matter what the stonk markets only can grow up because pension contributions keep blowing up the market [3]. Not getting that difference was what blew up Biden's reelection and is now screwing over Trump.
[1] https://www.bls.gov/charts/employment-situation/civilian-une...
[2] https://www.atlantafed.org/research-and-data/data/wage-growt...
/model command returns only 4 choices for me: Opus 4.7, two Sonnet options and Haiku.
/model claude-opus-4-6[1M]
⎿ Set model to Opus 4.6 (1M context) · Billed as extra usage
Maybe this is because I'm on API pricing? (All new contracts for corps are pushed to that by Anthropic.)
For anyone else who may want this, use: export ANTHROPIC_MODEL=claude-opus-4-6
4.7 IMO is around 10-20% worse at understanding the intention behind your prompt. You need more effort to explain your intention clearly so it doesn't go off course.
Looking now I see that "Opus 4.6 Legacy" is an option that was not there before, so maybe Anthropic noticed that others are having the same difficulty.
Although GPT's been acting weird since Thursday...
I've spent the last couple of days building Swift bindings to a monster CPP lib and I've actually had fun.
And you get token-based pricing starting June 1.
Personally, I looked into Copilot's prompt and saw things that made me put it down immediately to start working on my own. I'm now using OpenCode for reasons and I like it better than any Big AI tool. Using OC with Qwen3.6-MoE (for context) and generally happy with the results.
At the moment it seems like the way it's been trained has been tightly coupled with grep.
It does feel bizarre though that it doesn't use the symbol servers.
Especially if you want effective results.
Actually, due to GitHub's stupid billing system, which charges per "premium request" instead of per token, you could, and still can, abuse it so it costs nothing. They're changing it to usage-based billing starting next month, though.
This doesn’t mean much if you are using a terminal editor.
if your response is "my prompts don't produce code that needs values flipped, ever." then I would wager you're only touching very simple things with an LLM.
for me, I don't care about the token cost and prompt writing so much as the fact that it's just faster to change 0 to 1, and it leaves me twiddling my thumbs waiting for llm output less.
On balance, and via dictation, it feels likely to be faster overall to just enact the changes I want 'inline' of the conversation thread.
Is this stuff any better now? I think current harnesses probably do have things like file change listeners that automatically inform agents before they act on a file they've previously engaged with if it has changed in the meantime.
Having said that, I fear what June 1st brings for Copilot. It might suddenly be very useless for me.
That said, I never tried copilot.
I can also click on a file referenced by the AI and have it open immediately in the IDE so that I can inspect it.
Finally, it is a pain to write long, multi-line prompts in a CLI where you can't easily click around to edit different parts.
The primary weakness I've found in IDE based UI is that it struggles to get through the corporate security in order to run commands.
Tab completion.
A smart model can cut down the time to write complex firewall yaml dramatically, relying both on the existing file and the ugly draft (eg comma delimited details of the rules I need) I put out. It makes it 5 minutes lead time and 20 presses of tab instead of writing a shell/python script full of edge cases or just copying existing rules as a template and laboriously editing them -- a smart model knows what the specific firewall needs.
But I'm not a developer, so I use both - haiku via github for tab completion and CC for cli.
All of them are valid use cases of the VSCode CC extension for me.
They were a manufacturing org and only managers had licenses to MS Office and users in Active Directory. Everybody else was registered on a separate OpenLDAP directory to avoid paying MS licenses.
Chime was cheaper per user than onboarding everybody into AD and paying Teams, and they could tack Chime usage into their AWS bill.
I'm currently, very painfully, removing a tiny bit of tech debt at a time from a massively complex project that we inherited from a 3rd-party vendor. Some of the tech debt is AI-related, some because it's a vendor who rarely has to maintain anything they create, some because when we first inherited it we had no grasp on the entire codebase and were just trying to change the plane wheels while flying (we still are).
What I'm doing now is the hardest kind of programming imo. I spend hours/week just meditating on how to chip away at this out-of-control codebase, figuring out how I can surgically remove some leaky abstraction that's spawned 5 cousins w/o disrupting the whole project. I'd be fascinated to see if the latest frontier model with a system like yours can actually help me. But I don't have the time or desire to invest the months of trial and error that I'm sure it took you to get to that point.
In my case Claude saw that code review was my main activity and that I was manually and repeatedly asking claude to "review X, Y, Z..." so he suggested turning it into a skill. So I fired up the superpowers:brainstorming skill and bikeshedded it until I ended up with this heavy duty massively parallel super reviewing super claude. Refined it a bit after a couple weeks of use and the result is what you see in my repository.
My lone lisp project gets the most love. I spend weeks reading, reviewing, restructuring and rewriting everything. It's the project where I'm concentrating all my efforts. Everything I push to master is absolutely my own work and I do want everyone to read it.
I had no trouble letting Claude take over maintenance of my static site generator and virtual machine orchestration scripts though. I wanted to care but... I didn't. I did glance over the finished product just to ensure it wasn't going to nuke my laptop the second it ran, but that's pretty much the extent of it.
https://www.amazon.com/Passion-Lubes-Natural-Water-Based-Lub...
> This product is out of stock
Ah, shoot, there go my weekend plans. Bummer.
Not if you're using it for running builds, running research jobs, model training, etc.
Yes, but in a "oops this is gonna take another two months to finish" kind of way, not the "oops this is the 12th time this month 8 developers have burned $2K in tokens in a single day and no one really knows how it happened" kind of way.
I get the anti/skeptic sentiment. I've been called a lot of horrible things by a vocal contingent when they hear that I help train folks to learn software engineering best practices and then apply AI to that.
Sidenote but I hope everyone realizes that 100 is kind of arbitrary here and does not mean the total chance to get something is 100%.
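A quick way to see this, assuming the parent was describing something like a 1-in-100 chance per attempt repeated 100 times (that reading is my assumption, as are the independence of attempts and the function name):

```python
# Chance of at least one success over repeated independent attempts.
# ASSUMPTION: attempts are independent and each has the same probability p.

def chance_of_at_least_one(p: float, attempts: int) -> float:
    """1 minus the probability that every single attempt fails."""
    return 1.0 - (1.0 - p) ** attempts


if __name__ == "__main__":
    # 1% per attempt, 100 attempts: well short of certainty.
    print(f"{chance_of_at_least_one(0.01, 100):.3f}")
    # Doubling the attempts still does not reach 100%.
    print(f"{chance_of_at_least_one(0.01, 200):.3f}")
```

With a 1% chance per attempt, 100 attempts land at roughly 63%, not 100%, and the curve only approaches certainty asymptotically.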
If this is the "analogy" you go for, you don't seem to be suited to make that comparison.
Levenshtein distance is not only a well-understood problem, it's small, self-contained, and extremely well-represented in the training data. The kind of problem where even small/bad models can excel. The golden standard for those tasks is just "use a library" so no wonder the beefy models are expensive: you're chartering a commercial airplane to go grocery shopping.
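For a sense of scale, this is roughly the whole task: a sketch of the textbook dynamic-programming version, written here in Python rather than the JS used in the comparison above. In practice you would reach for a library instead.

```python
# Textbook Levenshtein distance using two rolling rows of the DP table.
# Illustrative only; a maintained library is the sensible choice in practice.

def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the rows as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (0 if char_a == char_b else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
    print(levenshtein("flaw", "lawn"))
```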
My personal benchmarks are software engineering tasks (ideally spanning multiple packages in a monorepo) composed of many small decisions that, compounded, make or break the implementation and long-term maintainability.
There's where even frontier models struggle, which makes comparisons meaningful.
It’s making guesses not decisions, framing as decisions will lead you astray to wasted time and tokens.
It’s vaguely productive to tell them a ton of relevant info upfront attempting to minimise their need for load bearing guesses. I say vaguely because obedience is generally only around the level where it's good enough to lull you into a false sense of security, not to actually be obedient.
It’s a bit more productive to use the various loop mechanisms (hooks, /goal etc) to evaluate each end of turn against guard rails and reject with clear instruction on what's unacceptable. Obviously if you only do this without the front load of info then you’re likely to spend more tokens to reach a satisfactory end of iteration.
Breaking code up into composable chunks has worked well for me over 50+ years as a professional software developer, and I can't get away from the idea that it is still usually the way to go using agentic coding tools.
Anyways, please take your discourse of calling people you disagree with "shills" back to Reddit. I'd much rather engage with someone debating the merits of an argument.
You should also check your LLM prompt for HN comments, because the original comment you replied to was not anti-AI, and, in fact, very much pro-AI. The only criticism it had was about model being degraded, so they could not go as hard at AI-assisted development anymore as they used to before. I guess it's a bit difficult for LLMs to spot the difference and make proper conclusion for now.
Also even if taking you seriously — how does writing "no, model performance is not degraded because I say so" serve as correcting misinformation? It only does if you are shilling for Anthropic (which you do), otherwise it's just hot air.
LLMs don't really scale if you're still the bottleneck, or they only scale as much as you reviewing every line of code - that's not that much scaling...
So I try to only review certain parts, like making sure they aren't changing tests to allow architecturally broken code to slip through (because they regularly try, even when given explicit instructions not to). Or if I'm watching them make changes on my phone and see that they are clearly doing the exact opposite of what they're supposed to be doing (regularly if I'm watching).
#2 -> if commits are small, GitHub's setup is good enough that you can review code on your phone.
#3 -> if they're huge, I can just review on my laptop at lunch or something.
Theoretically, all of this can be solved easily with orchestration and require minimal oversight.
If you're using LLMs to write code and you're carefully reviewing every line with a jade-handled magnifying glass, you're not really scaling - at least to the degree I'm interested in.
This only works if there are no consequences if your code breaks. In the eyes of other humans you're responsible for what you commit. No amount of "scaling" will change that.
It's not a bad idea to skip it and wait until the next model release, but I personally will stick with 4.7.
Actually it’s really its own thing, I don’t think the slot machine analogy works too well, you also have fixed odds (and you know they aren’t in your favor), and a binary output
With employees, there are a lot of punishments in place so that people don't want to mess up. Loss of wages and reputation, prison time,... Startups do not fail because they have a bug-ridden product, they fail because of the market.
With AI, all bets are off. They're not aligned with your goals and it's very hard to discern when they go off unless you're an expert. And if you are one, at best it's just a slight boost in typing, especially with all the work involved in software development.
Either way, I don't see much point of intentional austerity in times of extreme growth. There will be time for austerity once the growth ends.
> "no, model performance is not degraded because I say so" serve as correcting misinformation?
Because zero evidence has been provided other than feelings. That is not evidence of degradation, and we know they don't serve quants.
Those people, unlike you, are actually using AI in development. And it is not a singular person who reports their frustration with the model being degraded after a certain period of time, so the anecdata does gradually become data. Your attempts at gaslighting are weak, you should really ask your bosses for a new guidebook on how to deal with reports of models performing at worse levels than before. Just writing "because I say so" is not cutting it.
> "we know they don't serve quants"
How do you know that unless you are working at Anthropic? Yet more evidence of you being an Anthropic shill.
> so the anecdata does gradually become data.
No, it does not. Countless social phenomena demonstrate how factually incorrect misconceptions spread rapidly. Frequency illusion is real and contagious.
> How do you know that [they are not serving quants]
Lots of ways to tell, if you weren't busy calling people shills.
First, Anthropic and OpenAI have both stated they don't serve quants. Weak protection, but it's there.
Second, no one has shown an A/B or eval proving a regression.
Third, and most importantly, the actual output measurably changes. Quants have a lower latency, higher TPS, and different token distribution. Despite having access to this data, no one has any evidence proving a quant has been served.
> You are an Anthropic shill
I'd explain the reasons I favor Anthropic over the others, but you'd just go back to yelling "shill" instead of engaging in a real conversation. That said, I am a fan of GDM as well, and think Gemini is better than Anthropic for everything other than code.
I've seen nothing resembling sane, reasoned thought from you in this thread. Just anger.
You haven't substantively debated a single point, it's like "shill" is the only word in your vocabulary. Again, this isn't Reddit.
> No, it does not.
Yes, it does, it is literally the definition of data - collection of points, observations, anything really. Try gaslighting harder, Anthropic shill. As I said, ask for a better playbook on how to deal with people actually experiencing degradation before replying again.
> First, Anthropic and OpenAI have both stated they don't serve quants.
What's the point of stating this other than trying to pad your baseless "proof"? LLM-level argument.
> Second, there have not been evals showing a real regression test proving that a quant was served
This is how I know you have no idea what you are talking about and resort to LLMs for all your argumentation. Benchmarks are gamed so hard that even quantized models would reliably achieve non-quantized-level scores on them. Moreover, benchmarks (that matter) are not run continuously all the time.
> Third, and most importantly, the actual output measurably changes. Quants have a lower latency, higher TPS, and different token distribution. Despite having access to this data, no one has any evidence proving a quant has been served.
You really are an LLM. What do you think different token distribution means? It literally means different, arguably worse performance in coding tasks. The evidence is in your face, but you have to keep it straight, since you are an Anthropic shill. You wrote yourself an argument why the models ARE quantized over time and did not even understand it. Makes sense, since you are paid to not understand stuff but peddle LLM-hype for Anthropic instead.
> I'd explain the reasons I favor Anthropic over the others
It is perfectly visible why you favor Anthropic, because you are an Anthropic shill and they pay you your salary, duh.
> real conversation
This is the type of conversation everyone should have whenever they read something written by an Anthropic shill. You are actively poisoning this forum by astroturfing for Anthropic, so we should take measures against it.
> You haven't substantively debated a single point
Obviously an Anthropic shill would ignore everything of substance I wrote and instead focus on being called out. Fortunately, it is not you who I have to convince of anything, since your very well-being relies on getting salary from Anthropic peddling LLM-hype on HN and elsewhere, so you are physically incapable of understanding pretty much anything that contradicts your talking points.
No, feelings are not reliable data when frequency bias and misinformation exist. There is a reason most experiments isolate out bias as much as possible.
> Moreover, benchmarks (that matter) are not run continuously all the time.
So there's no data?
> What do you think different token distribution means?
You clearly did not understand anything I said. Stated simply: If you were being served a quant, you'd be able to tell by looking at the token distribution, latency, and TPS. You don't need to trust the labs' word for it.
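A minimal sketch of what that kind of check could look like, in Python. Everything here is an assumption for illustration: the helper names are invented, the samples are stand-ins, and a real comparison would need many responses to fixed prompts with fixed sampling settings, plus a baseline for normal run-to-run noise. Nothing in this thread says anyone has actually run such a test.

```python
# Compare two batches of sampled output tokens by how far apart their
# empirical frequency distributions are (total variation distance),
# and compute throughput as tokens per second.
from collections import Counter


def token_frequencies(tokens: list[str]) -> dict[str, float]:
    """Empirical relative frequency of each token in a sample."""
    if not tokens:
        raise ValueError("need at least one token")
    counts = Counter(tokens)
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}


def total_variation(p: dict[str, float], q: dict[str, float]) -> float:
    """0.0 means identical distributions, 1.0 means no overlap at all."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def tokens_per_second(n_tokens: int, seconds: float) -> float:
    """Throughput of a single response."""
    return n_tokens / seconds


if __name__ == "__main__":
    # Stand-in samples; real ones would come from logged API responses.
    last_month = ["the", "fix", "is", "the", "lock"]
    this_week = ["the", "fix", "was", "the", "lock"]
    drift = total_variation(token_frequencies(last_month),
                            token_frequencies(this_week))
    print(f"distribution drift: {drift:.2f}")
    print(f"throughput: {tokens_per_second(512, 8.0):.1f} tok/s")
```

What counts as a meaningful shift is a judgement call; the point is only that these quantities are observable from the outside.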
> they pay you your salary, duh.
In fact, I get paid by a FAANG, though I do use Anthropic products heavily. Further, I don't really need money, I have more than enough. So much for reading my history.
> You are actively poisoning this forum
Your degenerate discussion - calling people shills instead of engaging with the argument, insulting them when your arguments are disproven, your inability to hold a rational debate that's not angry and emotionally charged - that is what is poisoning this forum.
Frankly, if you react this angrily and emotionally to a simple rational premise (that frequency bias leads to the perception of models being worse, rather than them actually being worse), you're ngmi unless you're already independently wealthy.
I would recommend a therapist, it helped me when I had similar behavioral issues. (Claude is a great therapist, by the way ;)
Nice gaslighting, Anthropic shill. No one said a word about feelings, only you (to derail the conversation). People reported their own experience and frustration with the model being unable to complete tasks they previously could. I said, get a better playbook before coming back. Or is it the best LLMs can do for now? Sad, then.
> No data
There is data, which you try to gaslight into being "feelings", Anthropic shill.
> Stated simply: If you were being served a quant, you'd be able to tell by looking at the token distribution, latency, and TPS.
Did you just repeat what you said before while ignoring the actual meaning of the words and my explanation of what YOU wrote? Is it what LLM told you to do, Anthropic shill? And you claim I have no substance. Maybe spend a week or so getting educated, before blindly copying and pasting LLM output, Anthropic shill?
> I get paid by a FAANG
Yeah, in your dreams maybe, Anthropic shill. I did read your comment history, and this is likely part of the story you try to build around your Anthropic shilling persona. Not a single fact that would prove that and believe me, I tried looking for it. Only endless claims of "I work at a FAANG" (no one who actually works here writes it like this).
> I use Anthropic products heavily
This is obvious, as 90% of your comments are LLM generated, Anthropic shill.
> calling people shills
Clanker, I called only you a shill, not people, tell your LLM to update its context. And I called you shill not because of any arguments, but because of your comment history unapologetically shilling for Anthropic and peddling LLM hype.
> arguments are disproven
You ignored half of my arguments, and for the rest you just repeated what you wrote before, not even understanding what the words you typed meant. Nice gaslighting, Anthropic shill.
> insulting
And you said you were not offended. Once again, Anthropic shill, being called a shill is not an insult. This is your fate, to be called an Anthropic shill, while you are on their payroll, astroturfing online communities with your LLM-bullshit peddling. Or do you expect being a propagandist to be a pleasant experience? People with no morals like you coming into this forum spreading their employer's bullshit deserve all the hate they get and more.
> you're ngmi. Hope you're already independently wealthy.
Your LLM outputs the same thing as in the other comment for no good reason. Can't Anthropic afford good models for its shills, or is it the best SOTA can do now?
I would recommend you abandon this account, because it's now burned for all shilling intents and purposes.
> There is data
Please show it.
> while ignoring the actual meaning of the words
It was an incoherent mess of insults, so I am still not sure what you're trying to say.
> Yeah, in your dreams maybe
So now I'm lying about my employment on an anonymous forum for... what, exactly? If you are actually this conspiratorial IRL, get help.
Once again, putting "AI bad" into my words. No, Anthropic shill, this is not what I am saying. Is your LLM malfunctioning or are you not really getting it? Stop gaslighting, Anthropic shill, and try to stick to the actual words I am saying. I understand that this is hard for you, because then you would have no real argument to make, but please try, Anthropic shill.
> Please show it.
I used an LLM to count reports from people whose experience with Opus 4.6 has degraded. There are literally several hundred such data points. This is data. These are people who are employed and actually use LLMs for coding, unlike you, Anthropic shill, who uses them only to poison online communities. Are you really going to disregard all that to claim it is mass psychosis or something? I guess you would, Anthropic shill, because that's your job: to peddle bullshit LLM hype not based on anything in reality.
> It was an incoherent mess of insults, so I am still not sure what you're trying to say.
Repeat after me, Anthropic shill: being called a shill is not an insult. You are a shill, so stop being obtuse and at least take some pride in your work of promoting LLM hype. Once again you are contributing nothing to the conversation except baseless accusations of insults, which I did not make, and you refuse to answer the actual arguments I made. I can provide them again, but you would likely ignore them because they just showcase how clueless you are about the topic.
Your words, not mine:
> Third, and most importantly, the actual output measurably changes. Quants have a lower latency, higher TPS, and different token distribution.
I asked if you understood what "different token distribution" meant. I can tell you what it means: models performing worse at coding tasks. So people report models being worse at coding tasks, YOU wrote that quantization indeed leads to that, and then you just "forgot" about it? Nice level of "objective" discussion, Anthropic shill.
> So now I'm lying about my employment on an anonymous forum for... what, exactly?
It is not an anonymous forum, as much as you would like it to be so that your shilling could not be dismissed as easily, Anthropic shill. For what? So that people would fall for the bullshit you are peddling. Are you really this dense, Anthropic shill?
> This is data.
Nope. Because frequency bias is a thing. If you hear on Twitter "model X got nerfed," your brain will look for that pattern and notice it more than usual. This will then confirm your suspicion, which leads to a vicious cycle. Then you tell your friends and the same phenomenon repeats.
None of this requires the model to get worse. It's a well understood psychological phenomenon.
> I can tell you what it means: models performing worse at coding tasks. So people report models being worse at coding tasks
The perception of a model performing worse at some coding task is not what "different token distribution" means. You should ask AI to explain my comment ;)
Latency and TPS can also tell you if you're getting a quant.
Anyways you should really get some help. Praying for you!
Gaslighting again, Anthropic shill. What does frequency bias have to do with the objective fact that hundreds of people reported their own experiences with LLMs being degraded over a short period of time? The very same tasks that the very same LLMs could do, they no longer can. You seem to ignore this FACT, this DATA, and instead have to gaslight and divert into "frequency bias" nonsense. I do understand why you are doing it, Anthropic shill, but at least have the guts to admit it.
> perception of a model
You once again ignore what your LLM outputted and you typed yourself and divert into "perception", Anthropic shill. You do not need to sample the entire token output to notice the distribution moving. If the LLM used to be able to achieve set goals and no longer could, it is already a sign of the distribution shift. And as you said yourself, different token distribution = model being quantized. This is reported in hundreds of separate instances, which is more than enough to conclude that the model was, in fact, quantized, and no amount of gaslighting can change that. But you are an Anthropic shill, so you have to peddle your bullshit, trying to twist facts to support your employer's narrative. And you deserve to be called out on that, Anthropic shill.
This isn't hard to understand:
A "model is nerfed" claim hits social media.
Someone else sees it, frequency bias makes them think their model is also nerfed, and they amplify the claim.
Now it spreads, like a virus, even if the model never changed.
Social dynamics like this are well understood psychologically.
> If the LLM used to be able to achieve set goals and no longer could, it is already a sign of the distribution shift.
The more likely explanation is that you're looking at older LLMs through rose-tinted glasses and misremembering what they could achieve.
Otherwise you could measure the shift in token distribution and see the higher TPS and lower latency.
Your own evals would trend down.
But no one, not one person, has presented empirical evidence of being served a quant. Just vibes.
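For what it's worth, here is a rough sketch of what "measuring the token shift" and TPS could look like. It assumes you saved completions for a fixed prompt set before and after the suspected change. The function names are made up for illustration, and whitespace splitting stands in for the provider's real tokenizer. Sampling noise alone produces a nonzero divergence, so you would want a same-day baseline (the same prompts run twice) to compare against before reading anything into the number.

```python
import math
from collections import Counter


def token_distribution(samples):
    """Empirical token frequencies over a list of tokenized completions."""
    counts = Counter(tok for sample in samples for tok in sample)
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint."""
    vocab = set(p) | set(q)
    # Mixture distribution; every token with mass in p or q has mass here too.
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in vocab}

    def kl(a):
        return sum(pa * math.log2(pa / m[t]) for t, pa in a.items() if pa > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


def tokens_per_second(num_tokens, elapsed_seconds):
    """Throughput for one completion, from token count and wall-clock time."""
    return num_tokens / elapsed_seconds


# Hypothetical saved completions (whitespace split is an assumption).
before = [s.split() for s in ["return a + b", "return a + b"]]
after = [s.split() for s in ["return a - b", "return a + b"]]

shift = js_divergence(token_distribution(before), token_distribution(after))
tps = tokens_per_second(num_tokens=200, elapsed_seconds=4.0)
```

A shift well above your baseline, together with a step change in TPS and latency and a downward trend in your own evals, would be the kind of empirical evidence being asked for here. Any one of those alone would not be.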