did miss it until now, cool to see it on device and first party. as soon as it lands I will see the impact on apfel.
but i definitely feel flattered, either my little project inspired them or that I reached the same conclusion at a similar time as a team at apple that "hey, this is totally missing"
dofm 17 hours ago [-]
Is ‘fm serve’ OpenAPI compatible?
franze 15 hours ago [-]
fm serve - "Start a Chat Completions API server"
chat completion is openai's api surface name.
but only when it is actually available we will see if it's a clean drop-in vs. just "chat-completions-ish".
one of my learnings from apfel is that is is very easy to get a kinda openAI api compatible server, and a lot of work to get it really totally compatible. sometimes i wonder if even the openai implementation of openai's api is openai api compatible to the core....
dofm 15 hours ago [-]
> chat completion is openai's api surface name
Ahh! I did not know that
> sometimes i wonder if even the openai implementation of openai's api is openai api compatible to the core….
It's a similar situation with "Arca-Swiss compatible" tripod plates in photography. There is really no such thing — Arca-Swiss didn't make a standard, so they didn't have to stick to it themselves, and while most things using this "standard" fix to most things, some things just won't fit, or won't stay put. Everyone implements it, and if they don't, people complain "why didn't you just put an Arca standard foot on it?" and then you have to sit them down and tell them.
ElFitz 18 hours ago [-]
Oh, neat. Totally missed it, thanks!
harrouet 17 hours ago [-]
Wait, is PPC open bar ?
robgough 12 hours ago [-]
They've said there are limits, and increased limits for those on iCloud+ ... so it seems that Apple is in the selling LLM access game now. I don't think there are any details yet on the nature of those limits, and whether they can be increased as required etc.
ABS 12 hours ago [-]
[dead]
jjice 10 hours ago [-]
Agreed. The idea of a system wide (and platform wide) on device model being a core part of OS APIs is very appealing. I do like my software more piecemeal, generally, but when it comes to Apple, I really love a lot of the out-of-the-box offerings they have. Just giving software access to something they know exists on these platforms and can use for various small (and likely increasingly large) gen AI tasks is so appealing.
mark_l_watson 10 hours ago [-]
Thanks apfel looks useful! I have been experimenting with Apple's foundation models for almost a year and they are useful for embedded applications. I have been taking a deeper dive into local agentic coding tools (starting with 'little-coder --model ollama/gemma4:12b-it-qat') and I put together a tiny free book with some setup advice that might save people a few minutes of setup time: https://leanpub.com/read/local-coding-agents
I have been fairly much pissed off at the "hype in hyperscaler" AI growth (data center environmental and other societal costs) and I support anything we can do to promote local and private AI.
dofm 17 hours ago [-]
Are you surprised they apparently didn’t adopt your idea and add an OpenAPI compatible endpoint in Core AI, even if just as a testing tool? I am.
I also really want to hear more about their containerisation/seatbelt strategy now that they are offering MCP support. Not seen any news about Darwin inside their containers system.
(Apfel is a cool project; it’s been the only thing tempting me to upgrade to Tahoe)
crancher 1 days ago [-]
Apfel is very useful, thanks for the effort.
cat5e 1 days ago [-]
I second this, I’m more excited about dumb local models than something I could never run locally.
trollbridge 1 days ago [-]
Thanks for building this! Something I grab on a regular basis, especially for doing simple education of folks about the basics of using LLMs by showing something that's not just a chatbot.
mips_avatar 20 hours ago [-]
Seems like they still won’t let you run models on GPU while the phone is closed or the user switches apps
tyre 19 hours ago [-]
This is good. Apps would not be respectful and end up draining users’ batteries to zero in no time.
"If your app uses model types other than neural networks, such as decision trees or tabular feature engineering, see Core ML."
trollbridge 1 days ago [-]
This is just a bit exciting, although I wonder how the performance of this will stack up next to the stuff we already do with, e.g., a metal-optimised model which we then load into llama-cpp or whatever. (unsloth is a good example of doing this for you "batteries included").
ElFitz 18 hours ago [-]
A few months back someone reverse-engineered private ANE APIs and shown some significant performance improvements compared to CoreML and Metal, on both inference and training.
seems they planning to replace it but overall now I'm really confused about this and mlx and coremltools. They should do better work explaining the benefits (and cons) of it and any feature parity between coreai, coreml and mlx.
ABS 12 hours ago [-]
looks to me like the docs don't give a feature-parity table, but they do draw the "role" lines once you read across them:
- Core ML narrows to classic, non-neural ML (its own docs now point you there for "decision trees or tabular feature engineering")
- Core AI takes neural nets and transformers (the new .aimodel format, the new
profiler)
- MLX stays the separate bring-your-own-weights track (its WWDC sessions draw no line back to Core AI at all)
coreai-opt is the successor to coremltools on the optimization side.
LoganDark 23 hours ago [-]
My reading of it is:
- Core ML is for models designed only for Apple platforms
- MLX is for models that don't need to be fast
- Core AI is for models that run everywhere already and also need to be fast
jkman 9 hours ago [-]
This view is a bit off. First, keep in mind that MLX was and will not be able to access the ANE, so it's a total non-starter for anything user-facing. Based on updates to coreml docs, they're trying to sell CoreML as the tool for tabular or domain-specific applications and CoreAI for NNs moving forward.
wahnfrieden 17 hours ago [-]
I use CoreML for models designed for other platforms. I port the models to it but it works for that without much trouble.
MLX is not for end user deployment.
wahnfrieden 1 days ago [-]
Requires OS 27+, so CoreML is still useful for backwards compatibility.
sgt 13 hours ago [-]
macOS users aren't that good at upgrading regularly, but iOS users are at least obsessive about upgrading to the latest OS. I guess the system almost forces us.
jjice 10 hours ago [-]
The workaround my friend uses (unintentionally) is being completely out of storage on her phone.
wahnfrieden 8 hours ago [-]
I still deploy CoreML features to iOS 15. Many devices in use can’t upgrade to 26/27
sgt 6 hours ago [-]
Of course, you caught me thinking like an Apple engineer. I forgot about older devices in circulation, and that's super important.
scosman 19 hours ago [-]
Free server-size model access for apps with <2M downloads, getting the same privacy guarantees. Hopefully they scale this up to all apps in time (I assume hardware/cost constrained, but larger devs would pay).
My guess based on the Apple Intelligence Extensions mentions is that they will not scale that up anytime soon, but they will allow developers to integrate with other providers that the user has an account with.
ABS 12 hours ago [-]
something I haven't seen highlighted anywhere yet, while I find it very interesting, is the distributed inference across Macs (JACCL over Thunderbolt 5), an OpenAI-compatible mlx_lm.server, agentic-on-Mac.
Apple keeps MLX (bring-your-own-weights) separate from Foundation Models / Core AI.
dvt 23 hours ago [-]
AI future is clearly local, and my recent pitch has been "infinite tokens." Because that's what my M1 MBP can do; and that's what my RTX3090 can do. I don't need to pay hundreds of dollars a month and no one else does either.
pmontra 10 hours ago [-]
In the 80s we thought that the future of computing was clearly local, home computers, PCs, Macs, the office server (Novell, then Windows NT with disk shares) etc. Add 40 years and we are back to a centralized infrastructure with the modern equivalent of smart terminals.
The AI future will be clearly... what it will be. Probably bouncing back and forth from local to centralized. However, if there are money to be made by selling things that people run locally, it seems that centralizing creates more power and hence more money.
ip26 19 hours ago [-]
Infinite tokens rate-limited to 10 tok/s is 26MTok per month.
doctorpangloss 18 hours ago [-]
10? think closer to 5. 13M is like ~7 codex sessions…
22 hours ago [-]
fedeb95 15 hours ago [-]
the real money is in the coding surrounding models to make them efficient at specialized tasks. Casual users want general purpose models, and AI chat apps will stay for them. Most programs can benefit from a specialized AI that can be local, and #programs >> #users.
AdamN 13 hours ago [-]
Also context - there's alot of context out there and it's faster to get it from servers.
It doesn't matter how good the model is if it doesn't have context from data sources.
ankit219 21 hours ago [-]
they are also working on activations (w4a8, w4a16 from what i know). if they deliver (and a big if), it means that given their market reach, they can dictate the way sub 100b parameter models are trained and served to a large extent, given their major usecase would be on device (macos and not ios for most of them).
ABS 12 hours ago [-]
[flagged]
an0malous 1 days ago [-]
This is why the AI companies are rushing to IPO. By the end of next year you’ll be running most of your AI on device. They have no moat, they’ve reached the limits of scaling, most of the magic can be distilled into smaller models, and they know it
hadlock 1 days ago [-]
Qwen's ~30B-class models are genuinely good enough for use if you can find a machine with enough memory bandwidth to run them at 30-90 tokens/second. It's been extremely telling that Qwen stopped releasing 120b class models. At some point in the next 10 years (maybe 3?) someone is going to release an Opus 4.5 class 256B model you can run locally. Right now our engineers use about $800/mo worth of opus tokens; at that rate the ROI for local LLM is ~10 months
horsawlarway 22 hours ago [-]
I want to echo this.
I've been on claude's opus 4.5/6/7 for work for a couple months, and I finally got back to running Qwen A3B 35B... it's incredibly performant and quite capable on semi-reasonable local hardware.
I get ~150 tokens/s on dual nvidia RTX 3090s and can fit the whole 300k context into gpu on a UD-Q4-K-XL quant gguf.
Combined with Pi as a harness, and I'm surprised to find that it feels about as capable as claude did 8 months ago (their 3.x models).
It's not Opus 4.5 levels yet, but it's good enough for a LOT of basic work. I actually downgraded my personal anthropic subscription because Qwen is absolutely fine for implementation work. I still let a better model write a plan, but then I can just switch over to Qwen to implement.
I don't think we're 10 years away from opus 4.5 levels running on cheap consumer hardware. I think we're probably closer to 18 months away, and I suspect it'll be in the 30-60b range, not the 256b range.
PC manufacturers also seem to be betting on local, with a LOT of focus on 64 to 128gb unified RAM machines.
dofm 16 hours ago [-]
I have come at this at a slightly different angle.
I am a fully-burned-out freelancer (in the last couple of years so severely and totally that I thought I had early onset dementia, and I am still not sure I don't). I don't really have an off-ramp to anything else yet, but the sea-change in the industry has been contributing to my feeling that I should knock it on the head.
I must get past broad understanding of AI to deep understanding, but I have to find a way to do this which sits well with freelancer ethics (sustainability, stability, control of destiny).
So I decided I would start out with that operating principle that ultimately this stuff is just going to be local: models will eventually hit some level of practicality for most tasks and technological progress guarantees that they will eventually run on desktops.
I decided to learn how to run models locally properly, see how far I get with opencode (and Pi and Zed experiments), and grow outwards from there to metered models (opencode go, openrouter etc.)
Knowledge first; what can I do that meaningfully changes my outcomes and confidence with no cost and no exposure to sudden change?
I have a secondhand M1 Max (excellent GPU bandwidth), and I am really shocked to find that arguably that level of practicality is already here.
Qwen 3.6 35B can really do a lot. And — not sure if you have tested it — but in some ways I think the Gemma 4 26B is better. Particularly for more commonplace dev tech — it is very knowledgeable about the sort of low-end web dev stack that is most common (Wordpress, PHP, MySQL).
I have been getting 75 tokens/sec with (GGUF) Gemma-4 26B QAT and MTP. (Can't get anywhere close with MLX, for some reason.)
A similar sort of speed with an MLX Qwen 3.6 35B. I have a sneaking suspicion that maybe llama.cpp is now faster than MLX on this older kit so I might try seeing what llama.cpp can do there, too.
Not blazing fast, but fast enough that there are plenty of experiments and small jobs I can do before I even get to using Big Pickle!
weirdcat 11 hours ago [-]
How are you running that GGUF, and how many tokens/sec are you getting without MTP? My M1 Max gives me 65 t/s for non-MTP unsloth/gemma-4-26B-A4B-it-qat-GGUF (UD-Q4_K_XL), but with MTP that actually goes down to 56 t/s (at 63% accepted drafts).
dofm 9 hours ago [-]
Just this guy's assistant running against the official Q4_0 GGUF:
I hadn't done any really radical testing so I've just had another look.
Without the MTP drafter, it is pretty consistently 75 tokens per second anyway, which is interesting.
With the MTP drafter it reaches well above 95 tokens per second handling the prompt and it will slowly drop to 65 or so with the output tokens as the prediction success rate slowly drops.
But with generated output it seems to me that the predictions are always going to drop dramatically over time.
I think my results here are broadly consistent with what people say about success rates with smaller and sparse models. I am going to test with n-max 4 in agentic situations at some point, and I may see whether it has much impact on the 31B model which is too slow to be practical otherwise.
I have a very unqualified feeling that MTP will matter more in agentic coding because of the larger prompts.
But my biggest issue since I installed it, I think, is that the combination is occasionally messing with markdown generation during thinking, and sometimes possibly losing the </think> at the end. I've seen it enough now to be fairly sure it is the Gemma MTP causing it. There is an open bug in the vLLM project about this and I wonder if something similar is going on in llama.cpp.
The speed without the MTP drafter is pretty solid so I am content to let more experienced people than me handle things while I learn other stuff, but I might go looking for some testing code that can prove it sometime.
Majority of my agentic setup is pi / Claude code where every single Chinese models are not as good except commercial 1T models .
Local is a pipe dream . If you can run it cheap occasionally why commercial companies can’t run it cheaper 24/7 and lower the costs ? The answer is simple. Use cases are more demanding and hence you need more from model not less .
Sure if you task is to do a narrow labeling task on 1m records small optimized model is good . If you want to do complex things , it shifts with models advancements
horsawlarway 10 hours ago [-]
Because I have a fixed expenditure on my local machine, and I can be absolutely sure of the costs over a long horizon (5+ years, for low end hardware life, 10+ years with moderate care). Not something that's true for cloud costs.
Your argument is actually really similar to an argument around the time Uber started kicking into gear and expanding.
It went:
---
"Why should I own a car when it's actually cheaper to just Uber for all my rides, compared to the cost of buying, maintaining, and insuring a car?"
---
And that wasn't an insane argument at that exact moment. Uber was pricing itself in the range of $5-$7 a ride, was novel and high quality.
Except take a look around today... Uber in my area went from ~$5 a ride to ~$27 a ride for the same trip. Uber's quality has also degraded quite a bit. It went from primarily high end, new cars with immaculately clean interiors to "average".
So want to make a wager on what's going to happen with cloud costs over the next decade for inference?
Because my strong hunch is they're going to follow exactly the same trend. They will stop being subsidized, providers WILL downgrade model quality to improve operating costs (and you'll have no control over this outside of enterprise contracts), and companies will start exploring "additional revenue options"... which means they'll shove ads and sponsored content into your results.
Is it worth being ~10-18 months behind the latest and greatest to avoid that entire set of shenanigans? I'd vote yes... I pay one time up front, and get usage limited by my hardware for the cost of electricity over a 10 year timeline. That's a decent deal with no surprises.
You're welcome to rent, but renting makes you subject to the whims of the owners. They're being very nice right now to attract all the flies. That's not a mistake, and it's absolutely a trap.
---
Side note - if you're only able to do labeling tasks with a local model... you're holding something very, very wrong.
hparadiz 21 hours ago [-]
This sounds like something someone at IBM in 1986 would say trying to sell their mainframes. "PCs will never be a thing. No one's gonna want a computer."
I'm seeing some impressive results from folks that can afford 10k+ GPUs right now. But those GPUs will all be hand me downs in 10 years. So pipe dream? Hmmm...... that's not how this industry works.
tyre 19 hours ago [-]
Those are not GPUs available on iPhones. Will we get there eventually? Maybe! Maybe we end up with GPU clusters built on the edge (e.g. cell towers) for offloading, maybe it’s never economical, maybe a different model architecture makes it simpler, who knows.
But it doesn’t seem anywhere imminent with our current world state.
hparadiz 19 hours ago [-]
My computer is 15,000 times faster and costs in inflation adjusted dollars half that of my computer in 1995. There's zero reason to think that won't happen over the next 30 years again.
For whatever reason every generations thinks they are the peak. Naw man. You're just a blip at the bottom of the logarithmic chart.
sgt101 17 hours ago [-]
For me there are a bunch of questions:
- was the pause in model scaling a result of the benefits of RL & SFT being easier to access and quicker than scaling, or was it genuinely the result of scaling being low ROI now?
- are power densities necessary to provide high quality on device inference possible? Can the best, technically feasible, architectures accomodate T scale models and run them off batteries that fit in your hand?
- will thing slow down enough to allow edge depoloyments to realise value vs. centralised deployments.
- do edge use cases drive enough revenue to get this to happen?
- can local inference make up for model scale? Does that make sense in a latency/power race with the central infrastructure? Is there a sweet spot here?
I am not sure about any of the answers...
wqaatwt 17 hours ago [-]
It has slowed down massively for CPUs at least. e.g. modern CPUs are hardly more than 3-5x faster than those from 10 years ago. There is zero reason to think won’t happen over the next 10 years again.
horsawlarway 10 hours ago [-]
This isn't an crazy statement (cpu performance metrics have mostly stalled their meteoric rise from prior to the 2000s)
But it also doesn't capture the entire picture.
CPU metrics mostly stalled for two reasons.
1. There wasn't much demand for the extra capacity. Even low end cpus from a decade ago are plenty capable for just browsing the web and typing up documents. It takes a novel use-case to drive demand again (or a desire to do things like play new games).
2. The interest in CPU development shifted in response to mobile. Given point #1 and the state of battery development.... the blocker wasn't "performance". It was "performance per watt". And on that metric you couldn't be more wrong.
Since ~2005, MIPS per watt has improved 15x to 30x.
Also - fun news is that the traditional CPU pipeline really isn't the bottleneck for AI workloads. So we're going to see incredible interest in things like memory bandwidth and other inference related hardware bottlenecks, which haven't already been optimized.
BeetleB 9 hours ago [-]
> There wasn't much demand for the extra capacity. Even low end cpus from a decade ago are plenty capable for just browsing the web and typing up documents.
It stalled before the rise of PC-as-Internet-portal.
I bought a high end PC in 2003, and 5 years later the PCs were not much faster - probably not even 2x. Around 2008-2010 was when most people started using PCs as a way to connect to the Internet.
It stalled because scaling got a lot more challenging. Not because of lack of demand.
horsawlarway 4 hours ago [-]
Yes, but it only stalled along a single dimension - Single core clock speed.
I was building gaming machines in the early 2000s, I absolutely remember the 4ghz wall that cpus hit.
But it wasn't a real wall... because we then got one of the arguably most influential processors ever in the Core 2 duo. Which... blew the limit away by giving you two processors clocked at 2.93 GHz each.
And honestly, even then - it was lack of demand (we could go to 4+ghz, but we didn't want to pay the power bill for the rest of the system - the planned pentium 5 was 7-10ghz on paper, but they canceled the project because keeping it fed and cool was too hard for personal desktop machines).
Of Note - we did reach these speeds on consumer hardware (ex - in 2012, Andre Yang hit 8.794Ghz on an AMD FX-8350)
So it was never "impossible" to keep scaling. It just wasn't worth it compared to going multi-core.
---
And maybe it's because I was in my formative years at this time, but you're off by 5+ years with this:
> Around 2008-2010 was when most people started using PCs as a way to connect to the Internet.
Gmail was a web only email client released in 2004. Wikipedia was released in 2001. Web browsing was very much one of the "killer" apps for computers by the 2000s. What do you think the damn 2000s dot-com bubble crash was?
at the risk of aging myself - I was born in '89, and I literally do not remember a time where we didn't have DSL speeds and above (friends houses often still had dial-up until ~2005, though).
BeetleB 2 hours ago [-]
> Gmail was a web only email client released in 2004
Well, Gmail was actually one of the last web based email clients people used :-) Yahoo mail, Hotmail, and so many others predate Gmail by years.
> Web browsing was very much one of the "killer" apps for computers by the 2000s.
One of them. People still used non-browser apps for all kinds of things: Media consumption (people didn't watch movies on Youtube), Office (Google Docs was very much a niche thing for many years), photo-editing (lots of pirated versions of Photoshop/Lightroom years after the iPhone release), etc.
Most non-mail, non-social media, non-shopping stuff people do on the web these days was a dedicated SW from the vendor in those days. Want to make a photobook? Download this Windows binary and set it up there. It will then communicate with the server for the order (no browser utilized).
> at the risk of aging myself - I was born in '89, and I literally do not remember a time where we didn't have DSL speeds and above (friends houses often still had dial-up until ~2005, though).
Spring chicken! My first online experience was on a 340 baud modem :-)
nozzlegear 7 hours ago [-]
Depends on what you're doing, of course, but for the small and focused tasks where I'm using agentic AI, local models on my M1 Mac Studio are superb.
iwontberude 11 hours ago [-]
Keep working on your agentfu because there is a sweet spot with subagents and parallelizable plans. It’s not about better, it’s about efficiency and picking the right model for the job. You can achieve the same results as frontier models with the right type of planning and context management on local Chinese models.
iwontberude 11 hours ago [-]
I was freaked out being stuck with OpenAI and Anthropic. I setup qwen3.6:35b-mlx on my Mac Studio M1 Ultra and was blown away really. I am no longer afraid that Anthropic or OpenAI will be able to control the market.
strictnein 22 hours ago [-]
Didn't Qwen stop releasing their more powerful models because they're commercializing them?
mswphd 20 hours ago [-]
Yes and no.
Qwen 3.5 was released 3/2/2026. It includes models up to a 397B-A17B model
This does not include any particularly large models. But the models it contains (Qwen3.6 27B and Qwen3.6 35B-A3B) are the local models people have been very excited about lately. So they didn't release any larger models, and the models people praise so much are from this most recent release.
tyre 19 hours ago [-]
If they stop releasing their larger models because they want to monetize, would we expect them to release better small models that can outcompete those?
sealeck 1 days ago [-]
Have we reached the limits of scaling? Sadly it appears that larger model still equals better model
mikestorrent 24 hours ago [-]
Well, let's not forget that text models are not the only models! Video models are much slower and need comparatively more resources, and all they can do even at that size is generate videos a few seconds long. Clearly a ton more work is going to go into those, and demand for them will probably increase as more creative tools get authored using them as a central part of the workflow. Low-res local rendering for preview might be a thing, but the lion's share of the work for high-res, near-realtime rendering is going to be done on huge clusters for a long time yet.
niek_pas 13 hours ago [-]
This is definitely a good point. I imagine the max capacity for video models is significantly lower than for text models (there just aren't as many professionals in video as there are people who write text or code) but I could be wrong.
pixelready 1 days ago [-]
I think there’s still an open question around are the ultra-large next-gen models worth it? For those of us without early access to Mythos, it’s hard to verify whether it’s been held back from the public due to actually being “too dangerously powerful to release yet” as implied or because the gains aren’t outpacing the costs.
mindwok 24 hours ago [-]
I think GPT 4.5 showed that there is indeed a practical limit we're close too. That was supposedly a high-trillions of parameter model that was deprecated almost immediately because it was slow, insanely expensive, and had questionable benefits over the smaller models. Though apparently the new Mythos and whatever GPT Spud is (if it wasn't 5.5) are back up in the high trillions.
XenophileJKO 24 hours ago [-]
Actually having used it a bit, I'm quite excited to see a modern model of similar size.
I think what people didn't realize was, just because the GPT-4.5 model didn't get better on the benchmarks, didn't mean the model wasn't different than the earlier models. It was being compared to thinking models that were being developed at the same time.
The GPT 4.5 model still has some of the most "human" like abilities in communication even though it isn't particularly good a problem solving. It hadn't under gone the same type of reinforcement training.
I still use GPT 4.5 sometimes, in creative exercises it can be surprisingly effective. The model is still available.
adgjlsfhk1 21 hours ago [-]
yes and no. We've reached the point where larger models are higher quality, but they're also too expensive and slow to be used broadly. The giant models, however are still useful for training smaller models that are actually deployable.
stogot 1 days ago [-]
It’s still diminishing returns yes? It isn’t Moore’s Law
hajile 8 hours ago [-]
In the coding realm, I think we'll be seeing 35, 70, and 150B models sold where you pay a few hundred to a few thousand dollars up front and get a year of monthly/bi-monthly updates where they've trained it on new coding documentation and repos.
cat5e 1 days ago [-]
Huzzah, they’ve lost their stranglehold. Viva la revolution!
viccis 23 hours ago [-]
I just want a tiny tiny model that runs on device that knows for autocomplete that, for example, I want to say "I'll be right back" instead of "I'll be right Brian". That's my #1 AI ask right now. Please, Apple.
cush 23 hours ago [-]
I want Siri to let me “add to my calendar, dinner Peter’s house Sunday at 5pm” and not assume the location is the restaurant called Peter’s House in another state. It’s astounding how poor Siri is at using the data I’ve given it access to
maxdo 21 hours ago [-]
Why on earth I should switch from a top tier model to much worse local model ? Why do I need to suffer my battery ?
truncate 19 hours ago [-]
You can switch to local models for tasks/use-cases where you don't need top tier models.
romanovcode 14 hours ago [-]
Right now there is no reason since tokens are subsidized heavily. However when OpenAI/Anthropic will drop the $200/month pricing since most likely it eventually will become unsustainable you'd rather get MacBook Pro M6 Ultra with 128GB ram and go local then pay thousands every month for tokens.
ActorNightly 24 hours ago [-]
Very false.
I use small models exclusively. They aren't a replacement for large models. You need decent hardware to run those models efficiently, as smaller parameter models plain suck and are still slow on macbooks. And affordability of higher end hardware is very limited.
Even at non VC subsidized $/token prices, its still much cheaper to run cloud based models.
dvt 23 hours ago [-]
> Even at non VC subsidized $/token prices, its still much cheaper to run cloud based models.
On a price-per-wattage level, this is not true, people have done the math on /r/LocalLLaMA many times over[1]. Local models, while not as good as premier models (GPT 5.5, etc.), are like ~80%+ of the way there, and often converge to a similar solution after a few dead ends.
Maybe not per watt, but unless you already happen to own a 3900 cited by that post, you'd have to buy that as well, which is currently selling for around $1400 used.
strictnein 22 hours ago [-]
3090s are running $1400 now? Wowsers. I thought I was overspending when I bought 6x of them for around $800 a pop.
Might be time to sell, to be honest. It's fun to have that at home, but I can't justify having $10k (with memory, mobo, cpu, etc) sitting in my basement without being fully utilized.
karim79 21 hours ago [-]
I'll take two of them. A thousand a piece.
dvt 23 hours ago [-]
I do have a 3090 Ti on my gaming PC, but even my old M1 MBP (with a mere 32gb of RAM) is quite competent and can run a quantized `Gemma4-26B-A4B` in the background while I do other stuff.
ActorNightly 14 hours ago [-]
The MBP running Gemma4 is absolutely is useless for any real work.
nozzlegear 7 hours ago [-]
What is "real work"?
ActorNightly 4 hours ago [-]
Where you are developing software. Its significantly faster to use google gemini and copy paste code back and forth compared to having gemini edit files for you.
ClikeX 15 hours ago [-]
To be fair, I can also use that 3900 for other things locally. Not just AI.
davnicwil 24 hours ago [-]
well to be fair that's right now, I think the question is what about in 6 months, 12 months, 2 years?
Where do these improvement curves go? Does the gap close, do they intersect for practical purposes (factoring in cost etc)? Or is the local curve always just a translation of the hosted, lagging behind, or indeed does hosted just pull ahead?
Nobody knows, but it's a very open question I feel, and it certainly appears like the answer might quite reasonably be that yes they intersect on that kind of short-ish term time horizon.
ActorNightly 23 hours ago [-]
>Where do these improvement curves go?
Nowhere.
Large models haven't seen that much improvement, just small unique tasks performance which is all special cased RLed to game metrics
For local models, its the same story. You can download Gemma 3 QAT from last year, and it will be just as good as Gemma:31b on the average. Qwen also boasts that its better, because again, they RLed it to game some metrics. Its better in coding then Gemma, but Gemma is better in more creative thinking (again, all RL)
Fundamentally, you need detail in the gradients for the models to pick up on the smaller details. If you don't have those, your output is gonna suck. No amount of clever architecture is going to fix this.
The only way to improve local models by training them to fetch context, and then their job becomes much simpler because all they need to do is reinterpret the fetched content and provide an answer. But fundamentally, if you are trying to keep things in house for advertising purposes like what all companies do with search, you want them to go to your service, which means running on your servers. And its not really that much extra per invocation (i.e excluding initial hardware costs) to instead just offer a large model as a service, which will be way better than any small models.
iwontberude 11 hours ago [-]
Just need a decent Mac Studio and they are plentiful in used condition and affordable.
wyager 22 hours ago [-]
> By the end of next year you’ll be running most of your AI on device.
I expect I'll probably keep paying for whatever badass high IQ model is running on inference servers at that point
zombiwoof 22 hours ago [-]
[dead]
JV00 18 hours ago [-]
Does it mean I can run whatever I want on ANE? Last time I tried it seemed it could only be used by first party features such as Face ID
jkman 9 hours ago [-]
You've already been able to do that once you convert your model to CoreML, it's only MLX that's never been able to use the ANE.
wahnfrieden 17 hours ago [-]
Been doing that for years with CoreML
criddell 24 hours ago [-]
Is there something like this on Linux? For example, if I’m an application developer can I assume GNU Core AI (or whatever it is or would be called) will be there if the kernel is >= some particular version?
wtallis 23 hours ago [-]
On non-Apple platforms, you generally have at least 2+(number of supported silicon vendors) different AI frameworks to worry about. I guess Apple's there now too, between Core ML, MLX, Core AI.
I haven't seen any sign that the framework fragmentation problem is going away anytime soon. NVIDIA wants everyone to do all training and inference with CUDA and to deny that NPUs have any usefulness. Everybody making an NPU has a different framework tailored to their architecture and the limitations they inherited from hardware designed before LLMs existed, and most of them have a another framework for targeting a GPU. And the OS vendor has one or two frameworks they would prefer you use rather than something hardware-specific.
nl 23 hours ago [-]
For practical purposes llama.cpp is this. You can link to it or use the network API.
halJordan 20 hours ago [-]
No there isn't. RedHat and IBM do though, for their distros
teravor 18 hours ago [-]
onnxruntime, llama.cpp (more specifically, ggml), iree.dev is also trying
connectsnk 22 hours ago [-]
Do we know what is the underlying model? Is it a custome model developed by Apple or one of gemma/deepseeks under the hood
jacobr1 21 hours ago [-]
The new siri models will be some variant of the gemini models. This framework seems to be more generalized than that though.
but i maintain https://github.com/Arthur-Ficial/apfel so i might be biased
Here's what you get when you run it... https://gist.github.com/robgough/7893602895e7580117475076198...
but i definitely feel flattered, either my little project inspired them or that I reached the same conclusion at a similar time as a team at apple that "hey, this is totally missing"
chat completion is openai's api surface name.
but only when it is actually available we will see if it's a clean drop-in vs. just "chat-completions-ish".
one of my learnings from apfel is that is is very easy to get a kinda openAI api compatible server, and a lot of work to get it really totally compatible. sometimes i wonder if even the openai implementation of openai's api is openai api compatible to the core....
Ahh! I did not know that
> sometimes i wonder if even the openai implementation of openai's api is openai api compatible to the core….
It's a similar situation with "Arca-Swiss compatible" tripod plates in photography. There is really no such thing — Arca-Swiss didn't make a standard, so they didn't have to stick to it themselves, and while most things using this "standard" fix to most things, some things just won't fit, or won't stay put. Everyone implements it, and if they don't, people complain "why didn't you just put an Arca standard foot on it?" and then you have to sit them down and tell them.
I have been fairly much pissed off at the "hype in hyperscaler" AI growth (data center environmental and other societal costs) and I support anything we can do to promote local and private AI.
I also really want to hear more about their containerisation/seatbelt strategy now that they are offering MCP support. Not seen any news about Darwin inside their containers system.
(Apfel is a cool project; it’s been the only thing tempting me to upgrade to Tahoe)
Meet Core AI - https://developer.apple.com/videos/play/wwdc2026/324/
Dive into Core AI model authoring and optimization - https://developer.apple.com/videos/play/wwdc2026/325/
Integrate on-device AI models into your app using Core AI - https://developer.apple.com/videos/play/wwdc2026/326/
Does this completely replace the previous API, CoreML? [1]
"If your app uses model types other than neural networks, such as decision trees or tabular feature engineering, see Core ML."
- https://maderix.substack.com/p/inside-the-m4-apple-neural-en...
- https://news.ycombinator.com/item?id=47257931
- Core ML narrows to classic, non-neural ML (its own docs now point you there for "decision trees or tabular feature engineering")
- Core AI takes neural nets and transformers (the new .aimodel format, the new profiler)
- MLX stays the separate bring-your-own-weights track (its WWDC sessions draw no line back to Core AI at all)
coreai-opt is the successor to coremltools on the optimization side.
- Core ML is for models designed only for Apple platforms
- MLX is for models that don't need to be fast
- Core AI is for models that run everywhere already and also need to be fast
MLX is not for end user deployment.
https://developer.apple.com/private-cloud-compute/
Apple keeps MLX (bring-your-own-weights) separate from Foundation Models / Core AI.
The AI future will be clearly... what it will be. Probably bouncing back and forth from local to centralized. However, if there are money to be made by selling things that people run locally, it seems that centralizing creates more power and hence more money.
It doesn't matter how good the model is if it doesn't have context from data sources.
I've been on claude's opus 4.5/6/7 for work for a couple months, and I finally got back to running Qwen A3B 35B... it's incredibly performant and quite capable on semi-reasonable local hardware.
I get ~150 tokens/s on dual nvidia RTX 3090s and can fit the whole 300k context into gpu on a UD-Q4-K-XL quant gguf.
Combined with Pi as a harness, and I'm surprised to find that it feels about as capable as claude did 8 months ago (their 3.x models).
It's not Opus 4.5 levels yet, but it's good enough for a LOT of basic work. I actually downgraded my personal anthropic subscription because Qwen is absolutely fine for implementation work. I still let a better model write a plan, but then I can just switch over to Qwen to implement.
I don't think we're 10 years away from opus 4.5 levels running on cheap consumer hardware. I think we're probably closer to 18 months away, and I suspect it'll be in the 30-60b range, not the 256b range.
PC manufacturers also seem to be betting on local, with a LOT of focus on 64 to 128gb unified RAM machines.
I am a fully-burned-out freelancer (in the last couple of years so severely and totally that I thought I had early onset dementia, and I am still not sure I don't). I don't really have an off-ramp to anything else yet, but the sea-change in the industry has been contributing to my feeling that I should knock it on the head.
I must get past broad understanding of AI to deep understanding, but I have to find a way to do this which sits well with freelancer ethics (sustainability, stability, control of destiny).
So I decided I would start out with that operating principle that ultimately this stuff is just going to be local: models will eventually hit some level of practicality for most tasks and technological progress guarantees that they will eventually run on desktops.
I decided to learn how to run models locally properly, see how far I get with opencode (and Pi and Zed experiments), and grow outwards from there to metered models (opencode go, openrouter etc.)
Knowledge first; what can I do that meaningfully changes my outcomes and confidence with no cost and no exposure to sudden change?
I have a secondhand M1 Max (excellent GPU bandwidth), and I am really shocked to find that arguably that level of practicality is already here.
Qwen 3.6 35B can really do a lot. And — not sure if you have tested it — but in some ways I think the Gemma 4 26B is better. Particularly for more commonplace dev tech — it is very knowledgeable about the sort of low-end web dev stack that is most common (Wordpress, PHP, MySQL).
I have been getting 75 tokens/sec with (GGUF) Gemma-4 26B QAT and MTP. (Can't get anywhere close with MLX, for some reason.)
A similar sort of speed with an MLX Qwen 3.6 35B. I have a sneaking suspicion that maybe llama.cpp is now faster than MLX on this older kit so I might try seeing what llama.cpp can do there, too.
Not blazing fast, but fast enough that there are plenty of experiments and small jobs I can do before I even get to using Big Pickle!
Without the MTP drafter, it is pretty consistently 75 tokens per second anyway, which is interesting.
With the MTP drafter it reaches well above 95 tokens per second handling the prompt and it will slowly drop to 65 or so with the output tokens as the prediction success rate slowly drops.
But with generated output it seems to me that the predictions are always going to drop dramatically over time.
I think my results here are broadly consistent with what people say about success rates with smaller and sparse models. I am going to test with n-max 4 in agentic situations at some point, and I may see whether it has much impact on the 31B model which is too slow to be practical otherwise.
I have a very unqualified feeling that MTP will matter more in agentic coding because of the larger prompts.
But my biggest issue since I installed it, I think, is that the combination is occasionally messing with markdown generation during thinking, and sometimes possibly losing the </think> at the end. I've seen it enough now to be fairly sure it is the Gemma MTP causing it. There is an open bug in the vLLM project about this and I wonder if something similar is going on in llama.cpp.
The speed without the MTP drafter is pretty solid so I am content to let more experienced people than me handle things while I learn other stuff, but I might go looking for some testing code that can prove it sometime.
https://huggingface.co/google/gemma-4-26B-A4B-it-qat-q4_0-gg...
Might see if Google has official drafters later.
Local is a pipe dream . If you can run it cheap occasionally why commercial companies can’t run it cheaper 24/7 and lower the costs ? The answer is simple. Use cases are more demanding and hence you need more from model not less .
Sure if you task is to do a narrow labeling task on 1m records small optimized model is good . If you want to do complex things , it shifts with models advancements
Your argument is actually really similar to an argument around the time Uber started kicking into gear and expanding.
It went:
---
"Why should I own a car when it's actually cheaper to just Uber for all my rides, compared to the cost of buying, maintaining, and insuring a car?"
---
And that wasn't an insane argument at that exact moment. Uber was pricing itself in the range of $5-$7 a ride, was novel and high quality.
Except take a look around today... Uber in my area went from ~$5 a ride to ~$27 a ride for the same trip. Uber's quality has also degraded quite a bit. It went from primarily high end, new cars with immaculately clean interiors to "average".
So want to make a wager on what's going to happen with cloud costs over the next decade for inference?
Because my strong hunch is they're going to follow exactly the same trend. They will stop being subsidized, providers WILL downgrade model quality to improve operating costs (and you'll have no control over this outside of enterprise contracts), and companies will start exploring "additional revenue options"... which means they'll shove ads and sponsored content into your results.
Is it worth being ~10-18 months behind the latest and greatest to avoid that entire set of shenanigans? I'd vote yes... I pay one time up front, and get usage limited by my hardware for the cost of electricity over a 10 year timeline. That's a decent deal with no surprises.
You're welcome to rent, but renting makes you subject to the whims of the owners. They're being very nice right now to attract all the flies. That's not a mistake, and it's absolutely a trap.
---
Side note - if you're only able to do labeling tasks with a local model... you're holding something very, very wrong.
I'm seeing some impressive results from folks that can afford 10k+ GPUs right now. But those GPUs will all be hand me downs in 10 years. So pipe dream? Hmmm...... that's not how this industry works.
But it doesn’t seem anywhere imminent with our current world state.
For whatever reason every generations thinks they are the peak. Naw man. You're just a blip at the bottom of the logarithmic chart.
- was the pause in model scaling a result of the benefits of RL & SFT being easier to access and quicker than scaling, or was it genuinely the result of scaling being low ROI now?
- are power densities necessary to provide high quality on device inference possible? Can the best, technically feasible, architectures accomodate T scale models and run them off batteries that fit in your hand?
- will thing slow down enough to allow edge depoloyments to realise value vs. centralised deployments.
- do edge use cases drive enough revenue to get this to happen?
- can local inference make up for model scale? Does that make sense in a latency/power race with the central infrastructure? Is there a sweet spot here?
I am not sure about any of the answers...
But it also doesn't capture the entire picture.
CPU metrics mostly stalled for two reasons.
1. There wasn't much demand for the extra capacity. Even low end cpus from a decade ago are plenty capable for just browsing the web and typing up documents. It takes a novel use-case to drive demand again (or a desire to do things like play new games).
2. The interest in CPU development shifted in response to mobile. Given point #1 and the state of battery development.... the blocker wasn't "performance". It was "performance per watt". And on that metric you couldn't be more wrong.
Since ~2005, MIPS per watt has improved 15x to 30x.
Also - fun news is that the traditional CPU pipeline really isn't the bottleneck for AI workloads. So we're going to see incredible interest in things like memory bandwidth and other inference related hardware bottlenecks, which haven't already been optimized.
It stalled before the rise of PC-as-Internet-portal.
I bought a high end PC in 2003, and 5 years later the PCs were not much faster - probably not even 2x. Around 2008-2010 was when most people started using PCs as a way to connect to the Internet.
It stalled because scaling got a lot more challenging. Not because of lack of demand.
I was building gaming machines in the early 2000s, I absolutely remember the 4ghz wall that cpus hit.
But it wasn't a real wall... because we then got one of the arguably most influential processors ever in the Core 2 duo. Which... blew the limit away by giving you two processors clocked at 2.93 GHz each.
And honestly, even then - it was lack of demand (we could go to 4+ghz, but we didn't want to pay the power bill for the rest of the system - the planned pentium 5 was 7-10ghz on paper, but they canceled the project because keeping it fed and cool was too hard for personal desktop machines).
Of Note - we did reach these speeds on consumer hardware (ex - in 2012, Andre Yang hit 8.794Ghz on an AMD FX-8350)
So it was never "impossible" to keep scaling. It just wasn't worth it compared to going multi-core.
---
And maybe it's because I was in my formative years at this time, but you're off by 5+ years with this:
> Around 2008-2010 was when most people started using PCs as a way to connect to the Internet.
Gmail was a web only email client released in 2004. Wikipedia was released in 2001. Web browsing was very much one of the "killer" apps for computers by the 2000s. What do you think the damn 2000s dot-com bubble crash was?
at the risk of aging myself - I was born in '89, and I literally do not remember a time where we didn't have DSL speeds and above (friends houses often still had dial-up until ~2005, though).
Well, Gmail was actually one of the last web based email clients people used :-) Yahoo mail, Hotmail, and so many others predate Gmail by years.
> Web browsing was very much one of the "killer" apps for computers by the 2000s.
One of them. People still used non-browser apps for all kinds of things: Media consumption (people didn't watch movies on Youtube), Office (Google Docs was very much a niche thing for many years), photo-editing (lots of pirated versions of Photoshop/Lightroom years after the iPhone release), etc.
Most non-mail, non-social media, non-shopping stuff people do on the web these days was a dedicated SW from the vendor in those days. Want to make a photobook? Download this Windows binary and set it up there. It will then communicate with the server for the order (no browser utilized).
> at the risk of aging myself - I was born in '89, and I literally do not remember a time where we didn't have DSL speeds and above (friends houses often still had dial-up until ~2005, though).
Spring chicken! My first online experience was on a 340 baud modem :-)
Qwen 3.5 was released 3/2/2026. It includes models up to a 397B-A17B model
https://huggingface.co/collections/Qwen/qwen35
A day afterwards, a high-up technical leader working on Qwen was let go
https://techcrunch.com/2026/03/03/alibabas-qwen-tech-lead-st...
The more recent Qwen 3.6 was released on 4/16
https://huggingface.co/collections/Qwen/qwen36
This does not include any particularly large models. But the models it contains (Qwen3.6 27B and Qwen3.6 35B-A3B) are the local models people have been very excited about lately. So they didn't release any larger models, and the models people praise so much are from this most recent release.
I think what people didn't realize was, just because the GPT-4.5 model didn't get better on the benchmarks, didn't mean the model wasn't different than the earlier models. It was being compared to thinking models that were being developed at the same time.
The GPT 4.5 model still has some of the most "human" like abilities in communication even though it isn't particularly good a problem solving. It hadn't under gone the same type of reinforcement training.
I still use GPT 4.5 sometimes, in creative exercises it can be surprisingly effective. The model is still available.
I use small models exclusively. They aren't a replacement for large models. You need decent hardware to run those models efficiently, as smaller parameter models plain suck and are still slow on macbooks. And affordability of higher end hardware is very limited.
Even at non VC subsidized $/token prices, its still much cheaper to run cloud based models.
On a price-per-wattage level, this is not true, people have done the math on /r/LocalLLaMA many times over[1]. Local models, while not as good as premier models (GPT 5.5, etc.), are like ~80%+ of the way there, and often converge to a similar solution after a few dead ends.
[1] https://www.reddit.com/r/LocalLLM/comments/1kshq4f/electrici...
Might be time to sell, to be honest. It's fun to have that at home, but I can't justify having $10k (with memory, mobo, cpu, etc) sitting in my basement without being fully utilized.
Where do these improvement curves go? Does the gap close, do they intersect for practical purposes (factoring in cost etc)? Or is the local curve always just a translation of the hosted, lagging behind, or indeed does hosted just pull ahead?
Nobody knows, but it's a very open question I feel, and it certainly appears like the answer might quite reasonably be that yes they intersect on that kind of short-ish term time horizon.
Nowhere.
Large models haven't seen that much improvement, just small unique tasks performance which is all special cased RLed to game metrics
For local models, its the same story. You can download Gemma 3 QAT from last year, and it will be just as good as Gemma:31b on the average. Qwen also boasts that its better, because again, they RLed it to game some metrics. Its better in coding then Gemma, but Gemma is better in more creative thinking (again, all RL)
Fundamentally, you need detail in the gradients for the models to pick up on the smaller details. If you don't have those, your output is gonna suck. No amount of clever architecture is going to fix this.
The only way to improve local models by training them to fetch context, and then their job becomes much simpler because all they need to do is reinterpret the fetched content and provide an answer. But fundamentally, if you are trying to keep things in house for advertising purposes like what all companies do with search, you want them to go to your service, which means running on your servers. And its not really that much extra per invocation (i.e excluding initial hardware costs) to instead just offer a large model as a service, which will be way better than any small models.
I expect I'll probably keep paying for whatever badass high IQ model is running on inference servers at that point
I haven't seen any sign that the framework fragmentation problem is going away anytime soon. NVIDIA wants everyone to do all training and inference with CUDA and to deny that NPUs have any usefulness. Everybody making an NPU has a different framework tailored to their architecture and the limitations they inherited from hardware designed before LLMs existed, and most of them have a another framework for targeting a GPU. And the OS vendor has one or two frameworks they would prefer you use rather than something hardware-specific.