Developing a BLAS Library for the AMD AI Engine [pdf]

(uni.tlaan.nl)

46 points | by teleforce 2 days ago

5 comments

  • titanix88 2 days ago
    Looks like the author hasn't used software-pipelining compiler directives on the kernel loops. The AMD AIE architecture has a 5-cycle load/store latency and a 7-cycle FP unit latency. With software pipelining, they could have gotten a 5-10x speedup on long loops.
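
    A minimal sketch of what those directives look like, assuming the chess
    compiler's documented syntax (chess_prepare_for_pipelining /
    chess_loop_range). The kernel body is kept scalar for clarity, and the
    function name is illustrative, not from the thesis:

        // chess_prepare_for_pipelining requests software pipelining so the
        // 5-cycle load/store and 7-cycle FP latencies overlap across
        // iterations; chess_loop_range(16,) promises a minimum trip count
        // so the compiler can emit the pipelined prologue/epilogue safely.
        void saxpy(const float* __restrict x, float* __restrict y,
                   float a, unsigned n)
        {
            for (unsigned i = 0; i < n; ++i)
                chess_prepare_for_pipelining
                chess_loop_range(16, )
            {
                y[i] = a * x[i] + y[i];
            }
        }
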
  • fooblaster 2 days ago
    This architecture is likely going to be a dead end for AMD. It has been in the wild for several years, yet it still has no open programming model, and its multiple compiler stacks have poor software support. I find it likely that AMD drops this architecture and unifies their ML support around their GPGPU hardware.
  • kouteiheika 2 days ago
    So it's called an "AI Engine", but its performance is worse than just running the same thing on the CPU? Doesn't that make it essentially useless for anything AI-related? What's the point of this hardware, then? Better power efficiency for tiny models? Surely someone must be using it for something?
    • heavyset_go 2 days ago
      The point is offloading ML workloads to hardware that is energy efficient, not necessarily "fast" hardware.

      You want to minimize the real and energy costs at the expense of time.

      Assuming NPUs don't get pulled from consumer hardware altogether, theoretically the time/efficiency trade-off gap will become smaller and smaller as time goes on.

    • shetaye 2 days ago
      The CPU baseline seems to be the beefy host CPU. The AIE is presumably faster than what you could do with the FPGA fabric (DSPs, LUTs, etc.) alone.
  • nl 2 days ago
    Note that this is BLAS on the AMD/Xilinx VCK5000 FPGA: https://www.amd.com/en/products/adaptive-socs-and-fpgas/eval...
    • heavyset_go 2 days ago
      How does this line compare to the Ryzen AI-branded Xilinx FPGAs in newer mobile AMD APUs?
      • wmf 2 days ago
        The Ryzen AI NPU is from Xilinx but it's not an FPGA BTW.
        • heavyset_go 2 days ago
          I thought the XDNA line was related to Xilinx's Versal (or Alveo, I forget) lines that use FPGA fabric?

          Or maybe I'm misinterpreting press releases, as evidently Notebookcheck.net lied to me years ago :(

          [1] https://www.notebookcheck.net/AMD-details-4-nm-Zen-4-Ryzen-7...

          • wtallis 1 day ago
            It's an IP block that Xilinx can provide for use on their FPGAs, but as implemented on the Ryzen parts it's synthesized into a hard IP block, not an FPGA block plus bitstream.
  • imtringued 1 day ago
    I know this is a master's thesis, but I'm kind of disappointed by it. The AMD AI Engine is a GEMM and Flash Attention workhorse. Those are the primary workloads, and the non-sparse versions of them map 1:1 to the AI Engine. We don't see that in this thesis.

    Like, I'm sitting here on the sidelines thinking that someone is going to implement this stuff before I even get a chance, which is why I never mention the blatantly obvious communication pattern that the AI Engines are begging you to implement. Doing Flash Attention is slightly more difficult, but not meaningfully so.

    If you are using broadcasting to spread your A and B matrices, you're doing it wrong. You can do the thing that other architectures do inside their processor "outside", across the array of tiles (see the sketch below). Once you understand that, you will start to realize that this is actually the best possible architecture for dense GEMM and dense Flash Attention.
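
    To make "outside" concrete, here's a hedged sketch (a plain C++ host
    model, not AIE kernel or dataflow-graph code) of one non-broadcast
    pattern that fits a tile grid: a Cannon-style systolic GEMM. Each core
    keeps its C tile resident while A tiles rotate west and B tiles rotate
    north between nearest neighbors. Grid size, tile size, and all names
    are illustrative assumptions, not the thesis's implementation:

        #include <array>
        #include <cstddef>

        constexpr std::size_t T = 4;  // elements per tile edge
        constexpr std::size_t G = 4;  // cores per grid edge

        using Tile = std::array<std::array<float, T>, T>;
        using Grid = std::array<std::array<Tile, G>, G>;

        // One core's local work per step: C_tile += A_tile * B_tile.
        void tile_mac(Tile& c, const Tile& a, const Tile& b) {
            for (std::size_t i = 0; i < T; ++i)
                for (std::size_t j = 0; j < T; ++j)
                    for (std::size_t k = 0; k < T; ++k)
                        c[i][j] += a[i][k] * b[k][j];
        }

        // C += A * B over a G x G core grid. After the initial skew, every
        // step is a local multiply-accumulate followed by nearest-neighbor
        // shifts of A (west) and B (north) -- no broadcast anywhere.
        void cannon_gemm(Grid& C, const Grid& A, const Grid& B) {
            Grid As, Bs;
            for (std::size_t r = 0; r < G; ++r)
                for (std::size_t c = 0; c < G; ++c) {
                    As[r][c] = A[r][(c + r) % G];  // row r of A skewed left by r
                    Bs[r][c] = B[(r + c) % G][c];  // col c of B skewed up by c
                }
            for (std::size_t step = 0; step < G; ++step) {
                for (std::size_t r = 0; r < G; ++r)
                    for (std::size_t c = 0; c < G; ++c)
                        tile_mac(C[r][c], As[r][c], Bs[r][c]);
                Grid An, Bn;
                for (std::size_t r = 0; r < G; ++r)
                    for (std::size_t c = 0; c < G; ++c) {
                        An[r][c] = As[r][(c + 1) % G];  // A tile arrives from the east
                        Bn[r][c] = Bs[(r + 1) % G][c];  // B tile arrives from the south
                    }
                As = An;
                Bs = Bn;
            }
        }

    On real AIE hardware the shifts would presumably ride on the stream or
    cascade connections between neighboring tiles rather than being array
    copies, but the traffic pattern is the same: purely nearest-neighbor,
    with nothing broadcast.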