Summary
Mohamed Abdelfattah discusses Makora's automated GPU kernel generation and novel sequential Monte Carlo (SMC) speculative decoding techniques. The conversation covers performance optimization, reward hacking mitigation, and hardware-specific advantages of AMD relative to Nvidia.
- Makora automates high-performance GPU kernel generation and system-level AI inference optimization.
- Sequential Monte Carlo speculative decoding achieves a 5x speedup over an SGLang baseline by maintaining multiple parallel drafts.
- Makora differentiates by selling end-to-end performance rather than just a code generation compiler.
- Research on FP4 quantization with redundant zero remapping offers FP5-level accuracy at an FP4 memory footprint.
- AMD hardware offers advantages for certain optimizations because of its shared FP6/FP4 data path.
- Makora's eval pipeline detects reward hacking and is sold as a service to other companies.
- Future plans include expanding to training and reinforcement learning with a user-friendly deployment engine.
- Open-source releases are planned for research components like SMC speculative decoding.
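
The redundant zero remapping mentioned above can be illustrated with a small sketch. It assumes the common E2M1 FP4 layout (1 sign bit, 2 exponent bits, 1 mantissa bit), in which the bit patterns for +0 and -0 decode to the same number, so only 15 of the 16 codes are distinct. The replacement value `0.25` and the helper names `fp4_codebook`, `quantize`, and `mean_squared_error` are hypothetical choices for illustration; the summary does not say which value the research remaps the spare code to, and this toy does not reproduce the FP5-level accuracy claim.

```python
import random

# Magnitudes representable in FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def fp4_codebook(remap_negative_zero_to=None):
    """Return the value decoded from each of the 16 FP4 bit patterns.

    Codes 0..7 are the positive magnitudes and codes 8..15 their negatives,
    so code 8 is -0.0 and duplicates code 0. If remap_negative_zero_to is
    given, code 8 decodes to that value instead (an illustrative assumption).
    """
    book = E2M1_MAGNITUDES + [-m for m in E2M1_MAGNITUDES]
    if remap_negative_zero_to is not None:
        book[8] = remap_negative_zero_to
    return book


def quantize(x, book):
    """Return the 4-bit code whose decoded value is nearest to x."""
    return min(range(len(book)), key=lambda code: abs(book[code] - x))


def mean_squared_error(values, book):
    """Average squared error after rounding each value to the codebook."""
    return sum((v - book[quantize(v, book)]) ** 2 for v in values) / len(values)


rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(10_000)]  # stand-in for real weights

plain = fp4_codebook()
remapped = fp4_codebook(remap_negative_zero_to=0.25)  # arbitrary example value

# In Python, -0.0 == 0.0, so the set collapses the two zero codes into one.
print("distinct values:", len(set(plain)), "vs", len(set(remapped)))
print("MSE plain:   ", mean_squared_error(weights, plain))
print("MSE remapped:", mean_squared_error(weights, remapped))
```

Both codebooks still use exactly 4 bits per value. The remapped one simply puts the otherwise wasted sixteenth code to work, so its rounding error can never be larger than that of the plain codebook on the same data.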