If you mean distributing inference across many machines, none of which could individually handle a large model, then with today’s models it’s not viable with reasonable performance. The problem is that you need a lot of bandwidth between layers: the activations for every token have to move from node to node, and generation is sequential, so every network hop adds latency on top of the raw data volume. That’s why clusters running current models use specialized, high-bandwidth interconnects.
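To put rough numbers on that, here’s a back-of-envelope sketch in Python. The model width, hop count, prompt length, and latency are all assumptions I’m picking for illustration, not measurements of any real model or network:

```python
# Back-of-envelope for splitting a transformer's layers across machines
# ("pipeline parallelism"). Every figure here is an illustrative assumption.

hidden_size   = 8192    # assumed activation width of one layer (large-model scale)
bytes_per_val = 2       # fp16/bf16 activations
prompt_tokens = 2048    # assumed prompt length to ingest before generating
hops          = 9       # node boundaries for a 10-machine split
rtt_seconds   = 0.030   # assumed per-hop round trip on commodity links

per_token = hidden_size * bytes_per_val        # bytes crossing each boundary per token
prompt_mb = per_token * prompt_tokens / 1e6    # traffic per boundary just for the prompt

# Generation is sequential: each token's activations must cross every
# boundary before the next token can even start.
latency_floor = hops * rtt_seconds             # seconds per generated token

print(f"{per_token / 1024:.0f} KiB per token per node boundary")
print(f"{prompt_mb:.0f} MB per boundary to process the prompt")
print(f"latency floor ≈ {latency_floor * 1000:.0f} ms/token, i.e. ≤ {1 / latency_floor:.1f} tokens/sec")
```

On a datacenter interconnect those hops cost microseconds; on commodity links they dominate, and that gap is exactly what the specialized hardware exists to close.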
It might theoretically be possible to build models more amenable to this sort of thing, where small parts of the model run on separate nodes with little data interchange between them. But until such models are built, it’s hard to say.
I’d also be a little leery of how energy-efficient such a thing would be, especially if you want to use CPUs, which are probably more amenable to being run in a shared fashion than GPUs. Just using CPU time “in the background” on a system running other tasks also probably won’t work as well here as it does for traditional distributed-computing projects, because the limiting factor isn’t heavy crunching on a small amount of data (where a processor can make use of idle cores without much impact on other tasks) but bandwidth to memory, which is a bottleneck shared by the whole system. There are also some fairly substantial memory demands, unless you can also get model size way down.
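A quick sketch of why that memory-bandwidth ceiling bites; both figures below are assumptions chosen for illustration, not benchmarks of any particular machine or model:

```python
# Why memory bandwidth, not compute, caps CPU inference on a dense model:
# generating one token streams essentially every weight through the CPU once,
# so throughput can't beat (memory bandwidth) / (model size in bytes).
# Both figures are illustrative assumptions.

mem_bandwidth = 50e9       # ~50 GB/s, roughly dual-channel DDR5 territory
model_bytes   = 13e9 * 2   # a 13B-parameter model at 2 bytes/weight (fp16)

ceiling = mem_bandwidth / model_bytes
print(f"hard ceiling ≈ {ceiling:.1f} tokens/sec")
# ...and that assumes the model gets ALL of the machine's memory bandwidth;
# on a box that's also doing other work, the two throttle each other.
```

Quantizing the weights raises that ceiling and shrinks the RAM footprint in proportion, which is why getting model size down matters twice over here.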