Author Topic: Flash-backed LLMs on FPGA?  (Read 884 times)


Online NiHaoMike (Topic starter)

  • Super Contributor
  • ***
  • Posts: 9225
  • Country: us
  • "Don't turn it on - Take it apart!"
    • Facebook Page
Flash-backed LLMs on FPGA?
« on: September 23, 2024, 04:59:35 am »
My understanding of LLMs is that there's a very large "lookup table" of weights that gets iterated through many times while the output is generated. It's normally stored in RAM, which gets quite expensive for the larger models. But what about storing it on Flash instead? A commonly cited objection is that SSDs are too slow and would bottleneck it, but what if you just had an FPGA with a lot of Flash chips wired to it? Those are readily available in the form of "ioDrive" SSDs, with 365GB versions going for about $32. Reprogramming one of those for LLM use would be quite a task, but would it even be technically feasible?
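
To put rough numbers on the bandwidth question (illustrative figures I made up, not benchmarks):

Code: [Select]
# Back-of-envelope: a dense model reads every parameter once per token.
params = 70e9          # assume a 70B-parameter model
bytes_per_param = 0.5  # assume 4-bit quantized weights
tokens_per_sec = 5     # modest generation speed

needed = params * bytes_per_param * tokens_per_sec
print(f"required read bandwidth: {needed / 1e9:.0f} GB/s")  # -> 175 GB/s

Even at 4 bits per weight, a dense 70B model wants something like 100x the sequential read bandwidth of a single flash card, which is presumably where the "SSDs are too slow" objection comes from.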
Cryptocurrency has taught me to love math and at the same time be baffled by it.

Cryptocurrency lesson 0: Altcoins and Bitcoin are not the same thing.
 

Offline Marco

  • Super Contributor
  • ***
  • Posts: 6961
  • Country: nl
Re: Flash-backed LLMs on FPGA?
« Reply #1 on: September 23, 2024, 06:40:34 am »
Traditional models need the entire set of parameters for each input (a dense model), but there is a lot of research showing that only a small number of parameters are relevant for a given input (see for instance the Turbosparse models). A dense model needs so much bandwidth to its parameters that flash is not realistic. With dynamic sparsity, a parameter-streaming LLM is possible in theory, but current models are still very poorly optimized for it. The attention parameters (Q/K/V) are generally still dense, and the sparsity predictors prevent good pipelining of parameter streaming and computation.
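
A minimal sketch of what parameter streaming with dynamic sparsity looks like; the sizes, names, and the predictor here are all made up for illustration (the real predictors, e.g. in the Turbosparse work, are small learned networks):

Code: [Select]
import numpy as np

d_model, d_ffn = 1024, 4096   # toy dimensions

# Stand-in for flash: in a real system this would be a memory-mapped view
# of the flash array, e.g. np.memmap("ffn_weights.bin", ...).
weights = np.random.default_rng(0).standard_normal((d_ffn, d_model)).astype(np.float16)

def predict_active_rows(x, k=256):
    # Cheap low-rank guess at which FFN rows will fire; a stand-in for a
    # learned sparsity predictor. Reads only 16 of the 1024 columns.
    scores = weights[:, :16] @ x[:16]
    return np.argpartition(np.abs(scores), -k)[-k:]

def sparse_ffn_up(x):
    rows = predict_active_rows(x)
    w_active = weights[rows]      # only ~6% of the matrix leaves "flash"
    return rows, w_active @ x     # pre-activations for the active rows only

x = np.random.default_rng(1).standard_normal(d_model).astype(np.float16)
rows, y = sparse_ffn_up(x)
print(f"streamed {len(rows)}/{d_ffn} rows ({100*len(rows)/d_ffn:.0f}% of the matrix)")

The catch is the point above: this only helps where the model is actually sparse, and the predictor sits in the critical path unless it can run far enough ahead to overlap the flash reads with compute.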

As of yet, no one is designing models from the ground up for local/streaming inference; the Turbosparse Mixtral models are the best you can get. As for research into better models, there is only Pre-Gated MoE. It's not a very active field; most models for embedded use just try to stay tiny rather than dynamically sparse.
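
For a feel of why the early gating in Pre-Gated MoE matters, here's a toy timing model (sleeps standing in for flash reads and compute; all numbers invented):

Code: [Select]
import time
from concurrent.futures import ThreadPoolExecutor

LOAD, COMPUTE, LAYERS = 0.05, 0.05, 8   # fake seconds per flash read / layer

def load_experts(i):
    time.sleep(LOAD)                    # pretend: stream layer i's experts from flash
    return f"experts[{i}]"

def run(pregated):
    io = ThreadPoolExecutor(max_workers=1)
    t0 = time.perf_counter()
    nxt = io.submit(load_experts, 0)    # first layer always blocks
    for i in range(LAYERS):
        experts = nxt.result()
        if pregated and i + 1 < LAYERS:
            nxt = io.submit(load_experts, i + 1)  # fetch during compute
        time.sleep(COMPUTE)             # layer i compute using `experts`
        if not pregated and i + 1 < LAYERS:
            nxt = io.submit(load_experts, i + 1)  # gate known only after compute
    return time.perf_counter() - t0

print(f"pre-gated:  {run(True):.2f} s")   # ~ LOAD + LAYERS*COMPUTE = 0.45 s
print(f"sequential: {run(False):.2f} s")  # ~ LAYERS*(LOAD+COMPUTE) = 0.80 s

Making the routing decision one layer early lets the flash reads hide behind compute instead of stacking on top of it.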
« Last Edit: September 23, 2024, 06:48:48 am by Marco »
 
The following users thanked this post: Someone, glenenglish

