I launched internal demos of Claude Code and DeepSeek on the same day, and we burned through our monthly Claude allowance in just over a week, with more than half of that budget spent in a single day. With DS, people can't get through that same amount of money in a month, not even close.
With that, Claude feels like an expensive toy, while DS is a shovel, purely because developers don't feel like they are eating into a precious resource while using it. Also, there doesn't seem to be much of a difference in capability between Claude and DS-pro. DS-pro and flash do feel like sonnet/opus and haiku, but flash is still very, very capable.
After 2 weeks of Claude getting progressively worse and worse, today was the final straw.
I don't care if they have a phone app. The model is COMPLETE garbage after you subscribe long enough and they think they've "got you".
I can't code on my phone if the model literally moves in the wrong direction and does the opposite of what I tell it to. If I wanted to make my code worse, I'd just randomly commit garbage. I don't need a mobile app for that.
I've seen a lot of this sentiment over the previous six months from people on reddit. I have yet to experience this myself as a developer with over 20 years of experience.
As always, I think this happens more to vibe coders. They don't understand that a bigger project means worse AI performance. On top of that, Opus feels nerfed at understanding prompts, so if your spec is bad you won't get good results.
I see a lot of the "4.7 is a downgrade" sentiment. 4.7 does (mostly) what you ask it to do. 4.6 does what it thinks it should do. As someone with 20 years writing my own code I want the former, but the loud contingent online wants the latter.
On a mature codebase with 500k+ lines of code, I haven't seen anything else be as effective as 4.7.
I can tell you for a fact, Claude 4.7 was NOT doing what I told it to do on a pretty simple architectural refactor (in fact it did the clear and complete opposite, repeatedly), and Codex did better and DeepSeek much better.
It was given very simple ways to verify success. It simply didn't do that and said it was at a good stopping point, despite moving in the WRONG direction, not even doing 1% of the task, and being told to see the task through to completion.
Meanwhile, Codex broke it down into 3 steps and just got it done...
No "I'm going to give it to you straight, this is a large risky commit that could go sideways, so I'm just not going to do anything instead."
Claude worked on it for almost 200 commits over 2 weeks, and I typically needed to prompt it 3x to get it to even TRY to make any progress instead of just wasting tokens ignoring me and telling me how big and risky it is.
Maybe Claude is just particularly terrible at this type of refactor. I'm not sure why that would be.
Opus 4.7 has been a real downgrade for me. I’m back to mid-2025, when I had to catch all the completely intermediary goals/assumptions the model is creating for itself.
It's sort of good at thinking, writing specs, etc. Also debugging. But as a coder, I see no advantage over opus 4.6, and I already preferred sonnet over opus 4.6 most of the time.
My "mental scratchpad" needs to be as sharp as possible to maximize my intelligence. I think of the LLM as a scratchpad for my thinking, I hope the Anthropic team can see this.
Oh, Opus is nerfed, sure, but not that hard. Early this year opus 4.6 could understand your prompt and your intention easily; it got worse around mid-April. Opus 4.7 is even worse than that.
However, that's just it: you just need to improve your prompt and make it clearer, and it will perform just as well.
This account is an LLM-hype peddler, shilling for Anthropic (check comment history). If they say that Claude is not nerfed, then most likely it is, in fact, nerfed.
I wouldn't call correcting misinformation and FUD "peddling hype" or "shilling" but I suppose we are in a post-truth world, where if you push back against the anti-AI emotions and vibes with grounded facts, you must be a shill.
Anyways, please take your discourse of calling people you disagree with "shills" back to Reddit. I'd much rather engage with someone debating the merits of an argument.
If you are an LLM-hype peddler, you really should not be offended at being called out. Also, this is the merit you are ostensibly looking for — since you are a shill, everyone should know this first before taking your words seriously.
You should also check your LLM prompt for HN comments, because the original comment you replied to was not anti-AI, and, in fact, very much pro-AI. The only criticism it had was about the model being degraded, so they could not go as hard at AI-assisted development anymore as they used to. I guess it's a bit difficult for LLMs to spot the difference and draw the proper conclusion for now.
Also, even taking you seriously, how does writing "no, model performance is not degraded because I say so" serve as correcting misinformation? It only does if you are shilling for Anthropic (which you do), otherwise it's just hot air.
Not offended at all, but just ranting about how someone is a shill instead of responding to the substance of their argument is simply not the kind of discussion we have on HN. Read the guidelines.
> "no, model performance is not degraded because I say so" serve as correcting misinformation?
Because zero evidence has been provided other than feelings. That is not evidence of degradation, and we know they don't serve quants.
You are an Anthropic shill, and this is an explicit marker that needs to be added to all of your comments, so that all information you provide can be adjusted for that bias. But I do understand why you ignore this point since it devalues all your comments (as it should), and instead cling to "ranting how someone is a shill bla-bla-bla".
Those people, unlike you, are actually using AI in development. And it is not a singular person who reports their frustration with the model being degraded after a certain period of time, so the anecdata does gradually become data. Your attempts at gaslighting are weak, you should really ask your bosses for a new guidebook on how to deal with reports of models performing at worse levels than before. Just writing "because I say so" is not cutting it.
> "we know they don't serve quants"
How do you know that unless you are working at Anthropic? Yet more evidence of you being an Anthropic shill.
You have no substantive arguments other than calling people you disagree with shills.
> so the anecdata does gradually become data.
No, it does not. Countless social phenomena demonstrate how factually incorrect misconceptions spread rapidly. Frequency illusion is real and contagious.
> How do you know that [they are not serving quants]
Lots of ways to tell, if you weren't busy calling people shills.
First, Anthropic and OpenAI have both stated they don't serve quants. Weak protection, but it's there.
Second, no one has shown an A/B or eval proving a regression.
Third, and most importantly, the actual output would measurably change. Quants have lower latency, higher TPS, and a different token distribution. Despite having access to this data, no one has any evidence proving a quant has been served.
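For what it's worth, the kind of A/B evidence asked for in the second point is cheap to produce. A minimal sketch, assuming you ran the same fixed task set against the model on two different dates and recorded pass/fail per task: a pooled two-proportion z-test on the pass rates. The function name and the numbers in the usage note are hypothetical, not from any real eval.

```python
import math


def two_proportion_z_test(pass_a, n_a, pass_b, n_b):
    """Return (z, two_sided_p) for the difference in pass rates of two eval runs.

    pass_a / n_a: tasks passed and tasks attempted in run A (e.g. the earlier date).
    pass_b / n_b: the same counts for run B (e.g. the later date).
    Assumes tasks are independent and both runs use the same task set.
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("both runs need at least one task")
    p_a = pass_a / n_a
    p_b = pass_b / n_b
    # Pooled pass rate under the null hypothesis that nothing changed.
    pooled = (pass_a + pass_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both runs passed everything or failed everything: no detectable difference.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

Something like `two_proportion_z_test(90, 100, 60, 100)` gives a large positive z and a very small p-value, which would be actual evidence of a drop; a handful of bad sessions would not get anywhere near that.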
> You are an Anthropic shill
I'd explain the reasons I favor Anthropic over the others, but you'd just go back to yelling "shill" instead of engaging in a real conversation. That said, I am a fan of GDM as well, and think Gemini is better than Anthropic for everything other than code.
I've seen nothing resembling sane, reasoned thought from you in this thread. Just anger.
You haven't substantively debated a single point, it's like "shill" is the only word in your vocabulary. Again, this isn't Reddit.
What it does seem like is that they're tuning some knobs up and down or releasing new versions of models or system prompts that result in the model getting dumber and smarter in waves.
Opus has been dumb this week.
Claude was having a lot of capacity problems and downtime and then this week that has been much less obvious... and the model is dumber.
It could also just be luck and my impressions are false... who knows.
It’s because it’s not true, there’s no evidence for it that passes the sniff test. No lab is “shipping a worse model once they’ve got you”. People have a bad few days and blame the model providers instead of stepping back to fix their workflow.
Tell it to make changes, amend the commit, push --force-with-lease.
I'm attempting to make a memory-safe language like Rust but with a substantially lower learning curve and added safety (but non-zero-cost abstractions) fully with AI, almost entirely from my phone: commuting, getting coffee, walking the dog, between sets at the gym, replacing doom scrolling before bed and during lunch, etc.
Mostly to test how much LLMs can actually scale development.
Depending on how long it takes them to clean up some architectural slop in the MIR lowering phase, the results could either be very impressive or not.
From a purely cost basis perspective, it's hard to argue they aren't killing it.
But from a multiplier perspective, it's up in the air how great they are.
It's proven to be a really nice experiment, because much of what I wanted to solve with a language is the problems inherent to LLM development.
So at the self-hosting phase, I get a great opportunity to see if the language can actually deliver on what I'm hoping for.
#1 -> part of scaling is you can't review every single line of code.
LLMs don't really scale if you're still the bottleneck, or they only scale as far as you can review every line of code, and that's not much scaling...
So I try to only review certain parts, like making sure they aren't changing tests to allow architecturally broken code to slip through (because they regularly try, even when given explicit instructions not to). Or if I'm watching them make changes on my phone and see that they are clearly doing the exact opposite of what they're supposed to be doing (which happens regularly when I'm watching).
#2 -> if commits are small, GitHub's setup is good enough that you can review code on your phone.
#3 -> if they're huge, I can just review on my laptop at lunch or something.
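The "make sure they aren't changing tests" check from #1 is the easiest part to automate. A rough sketch, assuming the list of changed paths comes from something like `git diff --name-only` and that tests live in a `tests/` or `test/` directory or follow a `test_*` / `*_test` naming pattern (both conventions are assumptions here, adjust for your repo; `touches_tests` is a made-up name):

```python
from pathlib import PurePosixPath

# Assumed directory names that hold tests; not taken from any particular project.
TEST_DIR_NAMES = {"tests", "test"}


def touches_tests(changed_paths):
    """Return the subset of changed paths that look like test files.

    A path is flagged if any parent directory is named like a test directory,
    or if the file name starts with "test_" or its stem ends with "_test".
    """
    flagged = []
    for raw in changed_paths:
        path = PurePosixPath(raw)
        in_test_dir = any(part in TEST_DIR_NAMES for part in path.parts[:-1])
        looks_like_test = path.name.startswith("test_") or path.stem.endswith("_test")
        if in_test_dir or looks_like_test:
            flagged.append(raw)
    return flagged
```

A non-empty result doesn't mean the change is bad, only that this is a commit a human should look at before it lands.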
Theoretically, all of this can be solved easily with orchestration and requires minimal oversight.
If you're using LLMs to write code and you're carefully reviewing every line with a jade-handled magnifying glass, you're not really scaling - at least to the degree I'm interested in.
> LLMs don't really scale if you're still the bottleneck
This only works if there are no consequences when your code breaks. In the eyes of other humans you're responsible for what you commit. No amount of "scaling" will change that.
Gemini got a big reduction in usage limits this week. There was backlash and they added 3x usage for Antigravity a day later but I haven't really tried it out to get a feel for it yet.
Google rug pulled Code Assist and Gemini CLI. They're moving everything to Antigravity and we would need to reinstall all our tooling, reconfigure any automations, and the mechanism to subscribe via GCP is much clunkier.
This was all supposed to be worked out prior to Cloud Next, but it wasn't. Ironically, they mentioned Claude in a few of their presentations at Next.
And that was our solution. We are a big GCP customer but our whole team is on Claude now and much happier.