Hacker News

I almost tried to install AMD ROCm a while ago after discovering the simplicity of llamafile.

  sudo apt install rocm

  Summary:
    Upgrading: 0, Installing: 203, Removing: 0, Not Upgrading: 0
    Download size: 2,369 MB / 2,371 MB
    Space needed: 35.7 GB / 822 GB available
I don't understand how 36 GB can be justified for what amounts to a GPU driver.


So no doubt modern software is ridiculously bloated, but ROCm isn't just a GPU driver. It includes all sorts of tools and libraries as well.

By comparison, if you download the CUDA toolkit as a single file, it's over 4 GB, so quite a bit larger than the download size you quoted. I haven't checked how much that expands to (the ROCm install seems to have a lot of redundancy, given how well it compresses), but the point is, you get something insanely large either way.


I suspected that, but binaries that large still seem wrong; the whole thing is 35 times larger than my entire OS install.

Do you know what is included in ROCm that could be so big? Does it include training datasets or something?


Here are the big files in my /opt/rocm/lib, which is most of it:

  4.8G hipblaslt
  1.6G libdevice_conv_operations.a
  2.0G libdevice_gemm_operations.a
  1.4G libMIOpen.so.1.0.60200
  1.1G librocblas.so.4.2.60200
  1.6G librocsolver.so.0.2.60200
  1.4G librocsparse.so.1.0.60200
  1.5G llvm
  3.5G rocblas
  2.0G rocfft
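For the curious, a listing like this is just du output; the same thing in Python terms (the /opt/rocm/lib path is only the default install location, adjust as needed):

```python
import os

def entry_sizes(root):
    """Return {entry_name: total_bytes} for each direct child of root."""
    sizes = {}
    for entry in os.scandir(root):
        if entry.is_file(follow_symlinks=False):
            sizes[entry.name] = entry.stat().st_size
        elif entry.is_dir(follow_symlinks=False):
            total = 0
            for dirpath, _, filenames in os.walk(entry.path):
                for f in filenames:
                    fp = os.path.join(dirpath, f)
                    if not os.path.islink(fp):
                        total += os.path.getsize(fp)
            sizes[entry.name] = total
    return sizes

if __name__ == "__main__":
    root = "/opt/rocm/lib"  # assumption: default ROCm install location
    if os.path.isdir(root):
        for name, size in sorted(entry_sizes(root).items(), key=lambda kv: -kv[1])[:10]:
            print(f"{size / 2**30:5.1f}G  {name}")
```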
The biggest one, just to pick on one, is hipblaslt: "a library that provides general matrix-matrix operations. It has a flexible API that extends functionalities beyond a traditional BLAS library, such as adding flexibility to matrix data layouts, input types, compute types, and algorithmic implementations and heuristics." https://github.com/ROCm/hipBLASLt
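To make that "flexibility" concrete: a BLASLt-style API lets the caller pin down layouts, input types, and compute type, and then a heuristic selects among precompiled kernels. A toy sketch of that shape (names here are illustrative, not hipBLASLt's real API):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MatmulDesc:
    # Roughly the knobs a BLASLt-style library exposes:
    layout_a: str       # "row" or "col" major, possibly transposed
    layout_b: str
    in_type: str        # e.g. "f16", "bf16", "f8"
    compute_type: str   # e.g. "f32" accumulation
    arch: str           # target GPU, e.g. "gfx942"

# Tiny stand-in for the library's table of precompiled kernels.
KERNELS = {
    MatmulDesc("col", "col", "f16", "f32", "gfx942"): "TensileLibrary_..._Ailk_Bjlk_gfx942.co",
    MatmulDesc("col", "row", "f16", "f32", "gfx942"): "TensileLibrary_..._Ailk_Bljk_gfx942.co",
}

def select_kernel(desc: MatmulDesc) -> str:
    """Selection here is an exact lookup; a real library also ranks
    several candidate kernels by predicted performance."""
    try:
        return KERNELS[desc]
    except KeyError:
        raise RuntimeError(f"no precompiled kernel for {desc}")
```

Every distinct combination of those knobs needs its own compiled kernel on disk, which is where the gigabytes come from.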

These are mostly GPU kernels that by themselves aren't so big, but there's one for every single operation × every single supported graphics architecture, e.g.:

  304K TensileLibrary_SS_SS_UA_Type_SS_Contraction_l_Ailk_Bjlk_Cijk_Dijk_gfx942.co
  24K TensileLibrary_SS_SS_UA_Type_SS_Contraction_l_Ailk_Bjlk_Cijk_Dijk_gfx942.dat
  240K TensileLibrary_SS_SS_UA_Type_SS_Contraction_l_Ailk_Bljk_Cijk_Dijk_gfx942.co
  20K TensileLibrary_SS_SS_UA_Type_SS_Contraction_l_Ailk_Bljk_Cijk_Dijk_gfx942.dat
  344K TensileLibrary_SS_SS_UA_Type_SS_Contraction_l_Alik_Bljk_Cijk_Dijk_gfx942.co
  24K TensileLibrary_SS_SS_UA_Type_SS_Contraction_l_Alik_Bljk_Cijk_Dijk_gfx942.dat
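Those filenames encode the variant directly: my reading is that the Ailk/Alik and Bjlk/Bljk fields are the index orderings (transpose layouts) of A and B, and the suffix is the target architecture (the .dat siblings appear to carry tuning/selection metadata). A quick parse of the list above:

```python
import re

NAMES = [
    "TensileLibrary_SS_SS_UA_Type_SS_Contraction_l_Ailk_Bjlk_Cijk_Dijk_gfx942.co",
    "TensileLibrary_SS_SS_UA_Type_SS_Contraction_l_Ailk_Bljk_Cijk_Dijk_gfx942.co",
    "TensileLibrary_SS_SS_UA_Type_SS_Contraction_l_Alik_Bljk_Cijk_Dijk_gfx942.co",
]

def parse(name):
    """Pull the A/B index layouts and the GPU arch out of a Tensile file name."""
    m = re.search(r"_A(\w{3})_B(\w{3})_C\w{3}_D\w{3}_(gfx\d+)\.", name)
    return {"A": m.group(1), "B": m.group(2), "arch": m.group(3)}

variants = [parse(n) for n in NAMES]
```

Same operation, same architecture, three files: the only thing varying is the layout combination.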


Ok so like four of those files literally just do matrix multiplications


"just"


Ok some of them do tensor contractions too my bad


My understanding is that ROCm contains all included kernels for each supported architecture, so it would have (made up):

  -- matrix multiply 2048x2048 for Navi 31,
  -- same for Navi 32,
  -- same for Navi 33,
  -- same for Navi 21,
  -- same for Navi 22,
  -- same for Navi 23,
  -- same for Navi 24, etc.
  -- matrix multiply 4096x4096 for Navi 31,
  -- ...
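The multiplication works out quickly. With made-up but plausible numbers (the real counts live in the Tensile kernel libraries, not here):

```python
# Made-up illustrative numbers for one library's kernel inventory.
operations = 20        # GEMM variants, contractions, ...
shape_buckets = 50     # tuned problem-size classes per operation
dtypes = 6             # f32, f16, bf16, i8, f8, f64
archs = 8              # gfx900, gfx906, gfx908, gfx90a, gfx942, ...
avg_kernel_kib = 100   # rough size of one compiled kernel

total_kernels = operations * shape_buckets * dtypes * archs
total_gib = total_kernels * avg_kernel_kib / 2**20
print(total_kernels, round(total_gib, 1))  # 48000 kernels, ~4.6 GiB from one library
```

Four multiplied factors, and a single BLAS library lands in the same size class as the listing above.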


Correct. Although, you wouldn't find Navi 22, 23 or 24 in the list because those particular architectures are not supported. Instead, you'd see Vega 10, Vega 20, Arcturus, Aldebaran, Aqua Vanjaram and sometimes Polaris.

We're working on a few different strategies to reduce the binary size. It will get worse before it gets better, but I think you can expect significant improvements in the future. There are lots of ways to slim the libraries down.


You can look us up at https://github.com/zml/zml; we fix that.


Wait, looking at that link I don't see how it avoids downloading CUDA or ROCM. Do you use MLIR to compile to GPU without using the vendor provided tooling at all?


We do use ROCm and CUDA, but we sandbox them with the model and download only the needed parts, which are about 1/10th of the size.
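The pruning idea, sketched in Python (a hypothetical manifest format, not zml's actual code; the point is that only arch-independent pieces plus the one target architecture's kernels, and only for libraries the model actually uses, need to be fetched):

```python
# Hypothetical package manifest: (component, arch, size in MiB).
MANIFEST = [
    ("rocblas-kernels", "gfx942", 700), ("rocblas-kernels", "gfx90a", 650),
    ("rocblas-kernels", "gfx1100", 600), ("hipblaslt-kernels", "gfx942", 900),
    ("hipblaslt-kernels", "gfx90a", 850), ("rocfft-kernels", "gfx942", 400),
    ("runtime", "any", 150),
]

def needed(target_arch, used_components):
    """Keep arch-independent pieces plus kernels for the one GPU we run on."""
    return [(c, a, s) for (c, a, s) in MANIFEST
            if a in ("any", target_arch) and c.split("-")[0] in used_components]

subset = needed("gfx942", {"runtime", "rocblas"})
download_mib = sum(s for _, _, s in subset)  # 850 of the 4250 MiB total
```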


GPU drivers are complete OSes that run on the GPUs now.


It's not just you; AMD manages to completely shit-up the Linux kernel with their drivers: https://www.phoronix.com/news/AMD-5-Million-Lines


> Of course, much of that is auto-generated header files... A large portion of it with AMD continuing to introduce new auto-generated header files with each new generation/version of a given block. These verbose header files have been AMD's alternative to creating the exhaustive public documentation on their GPUs that they were once known for.


There have been talks about moving those headers to a separate repo and only including the needed headers upstream. [1]

[1]: https://gitlab.freedesktop.org/drm/amd/-/issues/3636


OpenBSD, too.



