Author Topic: Flash-backed LLMs on FPGA?  (Read 884 times)


Online NiHaoMike (Topic starter)

  • Super Contributor
  • ***
  • Posts: 9225
  • Country: us
  • "Don't turn it on - Take it apart!"
    • Facebook Page
Flash-backed LLMs on FPGA?
« on: September 23, 2024, 04:59:35 am »
My understanding of LLMs is that there's a very large "lookup table" of weights that gets iterated through many times while the output is generated. It's normally stored in RAM, which gets quite expensive for the larger models. But what about storing it on Flash instead? A commonly cited objection is that SSDs are too slow and would bottleneck it, but what if you just had an FPGA with a lot of Flash chips wired to it? Those are readily available in the form of "ioDrive" SSDs, with 365GB versions going for about $32. Reprogramming one of those for LLM use would be quite a task, but would it even be technically feasible?
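
To put rough numbers on the bandwidth question (illustrative figures I made up, not benchmarks):

Code: [Select]
# Back-of-envelope: a dense model reads every parameter once per token.
params = 70e9          # assume a 70B-parameter model
bytes_per_param = 0.5  # assume 4-bit quantized weights
tokens_per_sec = 5     # modest generation speed

needed = params * bytes_per_param * tokens_per_sec
print(f"required read bandwidth: {needed / 1e9:.0f} GB/s")  # -> 175 GB/s

Even at 4 bits per weight, a dense 70B model wants something like 100x the sequential read bandwidth of a single flash card, which is presumably where the "SSDs are too slow" objection comes from.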
Cryptocurrency has taught me to love math and at the same time be baffled by it.

Cryptocurrency lesson 0: Altcoins and Bitcoin are not the same thing.
 

Offline Marco

  • Super Contributor
  • ***
  • Posts: 6961
  • Country: nl
Re: Flash-backed LLMs on FPGA?
« Reply #1 on: September 23, 2024, 06:40:34 am »
Traditional models need the entire set of parameters for each input (a dense model), but there is a lot of research showing that only a small number of parameters are relevant for a given input (see for instance the Turbosparse models). A dense model needs so much bandwidth to its parameters that flash is not realistic. With dynamic sparsity, a parameter-streaming LLM is possible in theory, but current models are still very poorly optimized for it. The attention parameters (Q/K/V) are generally still dense, and the sparsity predictors prevent good pipelining of parameter streaming and computation.
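
A minimal sketch of what parameter streaming with dynamic sparsity looks like; the sizes, names, and the predictor here are all made up for illustration (the real predictors, e.g. in the Turbosparse work, are small learned networks):

Code: [Select]
import numpy as np

d_model, d_ffn = 1024, 4096   # toy dimensions

# Stand-in for flash: in a real system this would be a memory-mapped view
# of the flash array, e.g. np.memmap("ffn_weights.bin", ...).
weights = np.random.default_rng(0).standard_normal((d_ffn, d_model)).astype(np.float16)

def predict_active_rows(x, k=256):
    # Cheap low-rank guess at which FFN rows will fire; a stand-in for a
    # learned sparsity predictor. Reads only 16 of the 1024 columns.
    scores = weights[:, :16] @ x[:16]
    return np.argpartition(np.abs(scores), -k)[-k:]

def sparse_ffn_up(x):
    rows = predict_active_rows(x)
    w_active = weights[rows]      # only ~6% of the matrix leaves "flash"
    return rows, w_active @ x     # pre-activations for the active rows only

x = np.random.default_rng(1).standard_normal(d_model).astype(np.float16)
rows, y = sparse_ffn_up(x)
print(f"streamed {len(rows)}/{d_ffn} rows ({100*len(rows)/d_ffn:.0f}% of the matrix)")

The catch is the point above: this only helps where the model is actually sparse, and the predictor sits in the critical path unless it can run far enough ahead to overlap the flash reads with compute.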

As of yet, no one is designing models from the ground up for local/streaming inference; the Turbosparse Mixtral models are the best you can get. As for research into better models, there is only Pre-Gated MoE. It's not a very active field; most models for embedded use just try to stay tiny rather than dynamically sparse.
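
For a feel of why the early gating in Pre-Gated MoE matters, here's a toy timing model (sleeps standing in for flash reads and compute; all numbers invented):

Code: [Select]
import time
from concurrent.futures import ThreadPoolExecutor

LOAD, COMPUTE, LAYERS = 0.05, 0.05, 8   # fake seconds per flash read / layer

def load_experts(i):
    time.sleep(LOAD)                    # pretend: stream layer i's experts from flash
    return f"experts[{i}]"

def run(pregated):
    io = ThreadPoolExecutor(max_workers=1)
    t0 = time.perf_counter()
    nxt = io.submit(load_experts, 0)    # first layer always blocks
    for i in range(LAYERS):
        experts = nxt.result()
        if pregated and i + 1 < LAYERS:
            nxt = io.submit(load_experts, i + 1)  # fetch during compute
        time.sleep(COMPUTE)             # layer i compute using `experts`
        if not pregated and i + 1 < LAYERS:
            nxt = io.submit(load_experts, i + 1)  # gate known only after compute
    return time.perf_counter() - t0

print(f"pre-gated:  {run(True):.2f} s")   # ~ LOAD + LAYERS*COMPUTE = 0.45 s
print(f"sequential: {run(False):.2f} s")  # ~ LAYERS*(LOAD+COMPUTE) = 0.80 s

Making the routing decision one layer early lets the flash reads hide behind compute instead of stacking on top of it.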
« Last Edit: September 23, 2024, 06:48:48 am by Marco »
 
The following users thanked this post: Someone, glenenglish

