Pedantic note: rust-cuda was created by https://github.com/RDambrosio016 and he is not currently involved in VectorWare. rust-gpu was created by the folks at embark software. We are the current maintainers of both.
We didn't post this or the title, we would never claim we created the projects from scratch.
`rust-GPU` and `rust-CUDA` fall, to me, in the category of "Rust is great, let's build the X ecosystem in Rust". Meanwhile, it's been in a broken and dormant state for years. There was a leadership/dev change recently (are the creators of VectorWare the creators of Rust-CUDA, or the new leaders?), and there has been more activity. I haven't tried it since.
If you have a Rust application or library and want to use the GPU, these approaches are comparatively smooth:
- WGPU: Great for 3D graphics
- Ash and other Vulkan bindings: Low-level graphics bindings
- Cudarc: Nice API for running CUDA kernels.
I am using WGPU and Cudarc for structural biology + molecular dynamics computations, and they work well.
Rust-CUDA feels like lots of PR, but not as good a toolkit as these quieter alternatives. What would be cool for them to deliver, and I think is in their objectives: cross-API abstractions, so you could, for example, write code that runs on Vulkan Compute in addition to CUDA.
Something else that would be cool: High-level bindings to cuFFT and vkFFT. You can FFI them currently, but that's not ideal. (Not too bad to impl though, if you're familiar with FFI syntax and the `cc` crate)
Yes, it is all these folks getting together and getting resources to push those projects to the next level: https://www.vectorware.com/team/
wgpu, ash, and cudarc are great. We're focusing on the actual code that runs on the GPU in Rust, and we work with those projects. We have cust in rust-cuda, but that existed before cudarc and we have been seriously discussing just killing it in favor of cudarc.
> If you look at existing GPU applications, their software implementations aren't truly GPU-native. Instead, they are architected as traditional CPU software with a GPU add-on. For example, pytorch uses the CPU by default and GPU acceleration is opt-in. Even after opting in, the CPU is in control and orchestrates work on the GPU. Furthermore, if you look at the software kernels that run on the GPU they are simplistic with low cyclomatic complexity. This is not unique to pytorch. Most software is CPU-only, a small subset is GPU-aware, an even smaller subset is GPU-only, and no software is GPU-native.
> We are building software that is GPU-native. We intend to put the GPU in control. This does not happen today due to the difficulty of programming GPUs, the immaturity of GPU software and abstractions, and the relatively few developers targeting GPUs.
Really feels like fad engineering. The CPU works better as a control structure, and the design of GPUs is not suited to orchestration the way CPUs are. What really worries me is their mention of GPU abstractions, which is completely the wrong way to think about hardware designed for HPC. Their point about PyTorch and kernels having low cyclomatic complexity is confusing to me. GPUs aren't optimized for control flow. The nature of SIMD/SIMT values throughput, and the hardware design forgoes things like branch prediction. Having many independent paths a GPU kernel could take would make it perform much worse. You could very well end up with kernels that are slower than their optimized CPU counterparts.
I'm sure the people behind this are talented and know what they're doing, but these statements don't make sense to me. GPU algorithms are harder to reason about and implement. You often need to do more work just to gain the benefit of parallelization. There aren't actually that many use cases where the GPU being the primary compute platform is a better choice. My cynical view is that people like the GPU because they compare unoptimized, slow CPU code with decent GPU/tensorized code. They never see how much a modern CPU can actually do, and how fast it can be.
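To make the divergence argument above concrete, here is a minimal sketch of a toy SIMT cost model in plain Rust (no GPU involved). It assumes the simplest possible picture: a warp of 32 lanes runs in lockstep, and any branch path taken by at least one lane is executed by the whole warp with the other lanes masked off. The function name `warp_cost` and the per-path costs are made up for illustration; real hardware scheduling is more nuanced than this.

```rust
/// Number of lanes that execute in lockstep (32 on NVIDIA hardware).
const WARP_SIZE: usize = 32;

/// Cost, in issued instruction slots, for one warp to run an if/else
/// under a toy SIMT model: every path taken by at least one lane is
/// executed by the whole warp, with the non-participating lanes masked.
fn warp_cost(takes_then: &[bool; WARP_SIZE], then_cost: u32, else_cost: u32) -> u32 {
    let any_then = takes_then.iter().any(|&t| t);
    let any_else = takes_then.iter().any(|&t| !t);
    let mut cost = 0;
    if any_then {
        cost += then_cost;
    }
    if any_else {
        cost += else_cost;
    }
    cost
}

fn main() {
    // All lanes agree: only the "then" path is issued.
    let uniform = [true; WARP_SIZE];

    // A single lane disagrees: both paths are issued for the whole warp.
    let mut divergent = [true; WARP_SIZE];
    divergent[0] = false;

    println!("uniform:   {}", warp_cost(&uniform, 10, 40)); // prints 10
    println!("divergent: {}", warp_cost(&divergent, 10, 40)); // prints 50
}
```

Under this model one stray lane makes the warp pay for both paths, which is the intuition behind "many independent paths would make it perform much worse".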
Be sure to turn on "pedantic mode" to get the footnotes that make this make more sense. Some examples of what this means by "applications" would help. I don't think the prediction here is that Excel's main event loop is going to run on the GPU, but I can see that its calculation engine might.
With current GPU architectures, this seems unlikely. Like, you would need a ton of cells with almost perfectly aligned inputs before even the DMA bus roundtrip is worth it.
We’re talking at least hundreds of thousands of cells, depending on the calculation, or at least a number that will make the UI very sad long before you’ll see a slowdown from calculation.
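The break-even reasoning above can be sketched as a back-of-the-envelope linear cost model. This is an assumption-heavy illustration, not a measurement: the function `break_even_cells` is hypothetical, and the overhead and per-cell costs in `main` are placeholder numbers chosen only to show the shape of the calculation.

```rust
/// Smallest cell count at which offloading is strictly faster, assuming
///   cpu_time(n) = n * cpu_ns_per_cell
///   gpu_time(n) = fixed_overhead_ns + n * gpu_ns_per_cell
/// Returns None if the GPU is not faster per cell, so it can never catch up.
fn break_even_cells(
    fixed_overhead_ns: f64,
    cpu_ns_per_cell: f64,
    gpu_ns_per_cell: f64,
) -> Option<u64> {
    if gpu_ns_per_cell >= cpu_ns_per_cell {
        return None;
    }
    // Offload wins when n > overhead / (cpu - gpu).
    let threshold = fixed_overhead_ns / (cpu_ns_per_cell - gpu_ns_per_cell);
    Some(threshold.floor() as u64 + 1)
}

fn main() {
    // Placeholder inputs, not measurements: 50 µs fixed round-trip cost,
    // 0.5 ns per cell on the CPU, 0.25 ns per cell on the GPU.
    match break_even_cells(50_000.0, 0.5, 0.25) {
        Some(n) => println!("offload pays off from {n} cells"),
        None => println!("offload never pays off"),
    }
}
```

The point of the model is only that the threshold scales with the fixed round-trip cost divided by the per-cell saving, so a cheap per-cell calculation pushes the break-even point far out.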
There isn't always a DMA roundtrip; unified memory is a thing. But programming for the GPU is very awkward at a systems level. Even with unified memory, there is generally no real equivalent to virtual memory or mmap(), so you have to shuffle your working set in and out of VRAM by hand anyway (i.e. backing and residency are managed explicitly, even with "sparse" allocation APIs that might otherwise be expected to ease some of the work). Better GPU drivers may be enough to mitigate this, along with broad-based standardization of some current vendor-specific extensions (it's not clear that real HW changes are needed), but this creates a very real limitation in the scale of software (including the AI kind) you can realistically run on any given device.
After reading this page, I still don't know what GPU-native software they want to work on.
> If you look at existing GPU applications, their software implementations aren't truly GPU-native. Instead, they are architected as traditional CPU software with a GPU add-on.
I feel that this is due to the current hardware architecture, not the fault of software.
We are investing in them and they form the basis of what we are doing. That being said, we are also exploring other technical avenues with different tradeoffs... we don't want to assume a solution merely because we are familiar with it.
One of the founders here, feel free to ask whatever. We purposefully didn't put much technical detail in the post as it is an announcement post (other people posted it here, we didn't).
2. Can modern GPU hardware efficiently make system calls? (if you can do this, you can eventually build just about anything, treating the CPU as just another subordinate processor).
3. At what order-of-magnitude size might being GPU-native break down? (Can CUDA dynamically load new code modules into an existing process? That used to be problematic years ago)
Thinking about what's possible, this looks like an exceptionally fun project. Congrats on working on an idea that seems crazy at first glance but seems more and more possible the more you think about it. Still, it's all a gamble whether it'll perform well enough to be worth writing applications this way.
1. The GPU owns the control loop and only sparingly kicks to the CPU when it can't do something.
2. Yes
3. We're still investigating the limitations. A lot of them are hardware dependent; obviously data center cards have higher limits and more capability than desktop cards.
Thanks! It is super fun trailblazing and realizing more of the pieces are there than everybody expects.
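As a rough illustration of the "GPU owns the control loop" idea in answer 1, here is a CPU-only analogy using two threads and a channel. It is a sketch under stated assumptions, not how VectorWare or any GPU runtime implements this: the "device" is just a thread, and `HostRequest`, `host_service`, `device_main`, and `run` are hypothetical names. The only point is the inversion: the device side drives the loop and the host side is a passive service that handles the occasional request.

```rust
use std::sync::mpsc;
use std::thread;

/// Requests the "device" side cannot serve itself (hypothetical).
enum HostRequest {
    /// Ask the host to log a line, standing in for a system call.
    Log(String),
    /// The device loop has finished.
    Shutdown,
}

/// Host side: a passive service loop. Returns how many requests it served.
fn host_service(rx: mpsc::Receiver<HostRequest>) -> usize {
    let mut served = 0;
    for req in rx {
        match req {
            HostRequest::Log(line) => {
                println!("[host] {line}");
                served += 1;
            }
            HostRequest::Shutdown => break,
        }
    }
    served
}

/// Device side: owns the control loop and only occasionally calls out.
fn device_main(tx: mpsc::Sender<HostRequest>, steps: u64) -> u64 {
    let mut acc = 0u64;
    for step in 0..steps {
        acc += step * step; // stand-in for on-device work
        if step % 4 == 3 {
            // Sparingly "kick to the CPU" for something the device can't do.
            tx.send(HostRequest::Log(format!("step {step}, acc {acc}")))
                .unwrap();
        }
    }
    tx.send(HostRequest::Shutdown).unwrap();
    acc
}

/// Wire the two sides together; returns (device result, host requests served).
fn run(steps: u64) -> (u64, usize) {
    let (tx, rx) = mpsc::channel();
    let host = thread::spawn(move || host_service(rx));
    let acc = device_main(tx, steps);
    let served = host.join().unwrap();
    (acc, served)
}

fn main() {
    let (acc, served) = run(8);
    println!("result {acc}, host requests served: {served}");
}
```

With 8 steps the device loop calls out to the host twice (at steps 3 and 7) and does everything else on its own.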
I routinely have to fix the autoformatting done by HN.
seems like embark has disembarked from Rust and support for it altogether
Databases, on the other hand…
(Don't miss the "Pedantic mode" switch on the linked page, it adds relevant and detailed footnotes to the blog post.)
But languages like Java or Python simply lack the programming constructs to target GPUs easily.
The lack of a standardised ISA on GPUs also means compilers can’t really provide a translation layer.
Let’s hope things get better over time!
https://modal.com/docs/examples/batched_whisper