The AMD tinybox is on hold until we can build and run the relevant firmware on our GPUs. The driver is still very unstable, and when it crashes or hangs we have no way of debugging it. We have no way of dumping the state of a GPU. Apparently it isn't just the MES causing these issues; it's also the Command Processor (CP). After seeing how open @tenstorrent is, it's hard to deal with this. With tenstorrent, I feel confident that if there's an issue, I can debug and fix it. With AMD, I don't. We are exploring Intel and working on adding Level Zero support to tinygrad. We also added a $400 bounty for XMX support. We are also (sadly) exploring a 6x4090 box. At least we know the software is good there. We will revisit AMD once we have an open and reproducible build process for the driver and firmware. We are willing to dive really deep into hardware to make it amazing. But without access, we can't.
I have spoken with AMD on multiple occasions. We have gotten through to top people, and they have been quite nice to us. I believe they want to be more open, and obviously they don't want their driver to have bugs. Unfortunately, this access and these responses prolonged the decision. Part of me wishes they had just said "it's a consumer card, you get what you pay for," so we could have switched earlier. We probably tried too hard to make it work. We have an amazing team at tinygrad. Someday, we are going to make our own chips, and I figure if we can make our own chips, we better be able to make the 7900XTX software great. But we can't if we don't have access. The firmware is complex, undocumented, closed source, and signed, all struggles we wouldn't have with our own hardware. If and when the firmware is open and installable, and if we aren't too far along with a different chip, we are down to put resources into writing fuzzers and rewriting whatever needs to be rewritten. The 7900XTX hardware seems great, but we aren't going to put resources into fixing a black box.
@__tinygrad__ @LisaSu The Radeon team needs to step up with a viable solution here.
@__tinygrad__ On Saturday you seemed very positive about AMD and mentioned that there’ll be an announcement from them in the next 1-2 weeks (not sure about what). What happened?
@__tinygrad__ One thing to consider is that Nvidia is the leader, good stuff, but they have the luxury of purposefully gimping the cards aimed at the GPU-poor, and that includes the highest-end GeForce cards. Intel, on the other hand, has every reason to punch up as hard and as often as possible.
@__tinygrad__ If only inference matters, why not use @nvidia Jetson AGX Orin modules as x8 PCIe EPs? Per module: 138 INT8 TOPS, 64GB (204.8GB/s), 60W, on-board NVMe. It requires a custom carrier board, but 128 lanes fit 16 modules. Fewer TOPS than GPUs, but 1TB of memory and only ~1.4kW all-in.
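For anyone checking the math in that last reply, here is a rough back-of-envelope sketch. It takes the per-module figures exactly as quoted in the reply (they are not verified here), and the function name `box_totals` is just a hypothetical helper for illustration. The ~1.4kW "all-in" figure is the reply's own estimate; the sketch only shows how much of it the modules would account for.

```python
# Back-of-envelope totals for the proposed 16-module Jetson box.
# All per-module numbers are taken as quoted in the reply, not verified.

def box_totals(lanes_total=128, lanes_per_module=8,
               mem_gb=64, int8_tops=138, watts=60, all_in_watts=1400):
    """Aggregate per-module figures across every module the lanes can host."""
    modules = lanes_total // lanes_per_module  # x8 endpoints on 128 lanes
    module_power_w = modules * watts
    return {
        "modules": modules,
        "memory_gb": modules * mem_gb,
        "int8_tops": modules * int8_tops,
        "module_power_w": module_power_w,
        # Whatever is left of the quoted ~1.4kW would go to the host,
        # carrier board, and cooling (an inference, not stated in the reply).
        "implied_overhead_w": all_in_watts - module_power_w,
    }

if __name__ == "__main__":
    for key, value in box_totals().items():
        print(f"{key}: {value}")
```

With the quoted inputs this gives 16 modules, 1024GB of memory (the "1TB" in the reply), 2208 INT8 TOPS in aggregate, and 960W for the modules alone, which leaves roughly 440W of the ~1.4kW estimate for everything else.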