Traditional models need the entire set of parameters for each input (a dense model), but there is a good deal of research showing that only a small fraction of the parameters is actually relevant for any given input (see for instance the TurboSparse models). A dense model needs massive bandwidth to its parameters, to the point that running from flash is not realistic. With dynamic sparsity a parameter-streaming LLM is possible in theory, but current models are still very poorly optimized for it: the attention parameters (Q/K/V) are generally still dense, and the activation predictors prevent good pipelining of parameter streaming and computation, since you can't start fetching a layer's weights until the predictor has seen that layer's input.
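A minimal sketch of the idea in NumPy, assuming a hypothetical low-rank activation predictor and a stand-in `load_rows_from_flash` for the actual flash reads (all names, shapes, and thresholds here are made up for illustration). The thing to notice is that predict → fetch → compute is a serial chain per layer, which is exactly the pipelining problem:

```python
import numpy as np

# Hypothetical shapes for one FFN block.
D_MODEL, D_FF, RANK = 1024, 4096, 64
rng = np.random.default_rng(0)

# Pretend these big weights live on flash, too large to keep in RAM.
weights_on_flash = {
    "w_up":   rng.standard_normal((D_FF, D_MODEL)) * 0.02,
    "w_down": rng.standard_normal((D_FF, D_MODEL)) * 0.02,
}

# Cheap low-rank predictor kept resident in RAM (only D_MODEL*RANK + RANK*D_FF params).
pred_a = rng.standard_normal((D_MODEL, RANK)) * 0.02
pred_b = rng.standard_normal((RANK, D_FF)) * 0.02

def predict_active_neurons(x, threshold=0.0):
    """Guess which FFN neurons will fire, without touching the big weights."""
    scores = (x @ pred_a) @ pred_b
    return np.nonzero(scores > threshold)[0]

def load_rows_from_flash(weight_name, rows):
    """Stand-in for streaming only the selected weight rows off flash.
    A real system would issue async reads here and try to overlap with compute,
    but it can't start until the predictor above has run on this layer's input."""
    return weights_on_flash[weight_name][rows]

def sparse_ffn(x):
    active = predict_active_neurons(x)                 # 1. predict (serial)
    w_up = load_rows_from_flash("w_up", active)        # 2. fetch   (serial)
    w_down = load_rows_from_flash("w_down", active)
    h = np.maximum(w_up @ x, 0.0)                      # 3. compute; ReLU keeps it sparse
    return w_down.T @ h                                # only active rows contribute

x = rng.standard_normal(D_MODEL)
y = sparse_ffn(x)
print(f"streamed {len(predict_active_neurons(x))}/{D_FF} neuron rows")
```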
As yet, nobody is designing models from the ground up for local/streaming inference; the TurboSparse Mixtral models are about the best you can get. As for research into better architectures, there is really only Pre-Gated MoE, which moves each layer's expert selection one layer earlier so the weight fetch can overlap with compute. It's not a very active field: most models aimed at embedded use just try to stay tiny rather than be dynamically sparse.
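A rough sketch of the Pre-Gated MoE scheduling idea, again with hypothetical names and shapes (this is my reconstruction of the scheme, not the paper's code): each layer's gate picks the *next* layer's experts from the incoming activation, so the expert fetch runs concurrently with the current layer's expert compute:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

rng = np.random.default_rng(0)
D, N_EXPERTS, TOP_K, N_LAYERS = 512, 8, 2, 4

# Small per-layer gates kept in RAM; big expert weights pretend to live on flash.
pre_gates = [rng.standard_normal((D, N_EXPERTS)) * 0.02 for _ in range(N_LAYERS)]
experts_on_flash = [
    [rng.standard_normal((D, D)) * 0.02 for _ in range(N_EXPERTS)]
    for _ in range(N_LAYERS)
]

def fetch_experts(layer, ids):
    """Stand-in for an async read of the chosen expert weights from flash."""
    return [experts_on_flash[layer][i] for i in ids]

def run_layer(x, expert_weights):
    """Toy expert compute: average the selected experts' outputs."""
    return sum(np.tanh(w @ x) for w in expert_weights) / len(expert_weights)

def forward(x):
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Layer 0's experts must be chosen from the raw input (nothing to overlap yet).
        ids = np.argsort(x @ pre_gates[0])[-TOP_K:]
        weights = fetch_experts(0, ids)
        for layer in range(N_LAYERS):
            pending = None
            if layer + 1 < N_LAYERS:
                # Pre-gate: pick layer+1's experts from this layer's *input* and
                # kick off the flash read before this layer's compute starts.
                next_ids = np.argsort(x @ pre_gates[layer + 1])[-TOP_K:]
                pending = pool.submit(fetch_experts, layer + 1, next_ids)
            x = run_layer(x, weights)          # compute overlaps the fetch above
            if pending is not None:
                weights = pending.result()
    return x

print(forward(rng.standard_normal(D))[:4])
```

Contrast this with the predictor-per-layer scheme above, where the fetch can't be issued until the same layer's input is already available.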