Wouldn't scale to many users, unfortunately. Even with some optimization, 2-4 concurrent users would get at most ~20 tokens/s for most queries, which isn't great, especially since you can't branch out individual agents and are bound by the hardware constraints. Hardware costs, energy use, and maintenance would make this a money pit, I fear. Adding more nodes didn't significantly bump token rates either.
But just being able to self-host such a model unquantized would have been unthinkable 1-2 years ago, even with various hacks (like offloading to SSDs) on consumer hardware alone.
I hope we'll see small groups of people hosting private LLMs sustainably, though. Trusted circles and their oracles, basically. 🌚