Although Apple prefers its own silicon for AI workloads, the company has collaborated with NVIDIA to bring 'ReDrafter,' its technique for speeding up text generation with large language models (LLMs), to NVIDIA GPUs. The collaboration highlights a shared goal of improving LLM performance, notwithstanding the complex history between the two tech giants.
'ReDrafter' Technique
ReDrafter, which Apple has open-sourced, combines beam search and tree attention to speed up text generation. The technique was then integrated into NVIDIA's TensorRT-LLM, a library designed to accelerate LLM inference on NVIDIA GPUs. According to Apple, the integration raises generation speed and reduces latency, and it can also lower power consumption.
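ReDrafter belongs to the family of speculative decoding methods, in which a cheap draft model proposes several tokens and the full LLM verifies them together. The sketch below illustrates only that draft-and-verify idea for greedy decoding. It is a toy under stated assumptions, not Apple's or NVIDIA's implementation: `target_model` and `draft_model` are hypothetical stand-ins written for this example, and beam search and tree attention are left out entirely.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # maps a token prefix to the next token


def target_model(prefix: List[Token]) -> Token:
    """Hypothetical 'large' model: counts upward modulo 10."""
    return (prefix[-1] + 1) % 10


def draft_model(prefix: List[Token]) -> Token:
    """Hypothetical 'small' model: agrees with the target except when the answer is 7."""
    nxt = (prefix[-1] + 1) % 10
    return 0 if nxt == 7 else nxt


def greedy_decode(model: NextToken, prompt: List[Token], max_new: int) -> List[Token]:
    """Baseline: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new):
        tokens.append(model(tokens))
    return tokens


def speculative_greedy_decode(target: NextToken, draft: NextToken,
                              prompt: List[Token], max_new: int,
                              k: int = 4) -> Tuple[List[Token], int]:
    """Toy draft-and-verify loop. Returns (tokens, number of verification passes)."""
    tokens = list(prompt)
    passes = 0
    while len(tokens) - len(prompt) < max_new:
        # 1) The draft model proposes k tokens, one after another.
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2) The target checks every proposed position. A real system would do this
        #    in a single batched forward pass, so it is counted as one pass here.
        passes += 1
        accepted: List[Token] = []
        for i in range(k):
            expected = target(tokens + proposal[:i])
            accepted.append(expected)      # always keep the target's own choice
            if proposal[i] != expected:    # first mismatch: discard the rest
                break
        else:
            # Every draft token matched; the same pass also yields one extra token.
            accepted.append(target(tokens + proposal))
        tokens.extend(accepted)
    return tokens[:len(prompt) + max_new], passes


if __name__ == "__main__":
    baseline = greedy_decode(target_model, [0], 12)
    fast, passes = speculative_greedy_decode(target_model, draft_model, [0], 12)
    print(fast == baseline, passes)
```

Because every accepted token is the one the target model would have picked anyway, the output matches plain greedy decoding; any gain comes from needing fewer passes through the large model when the draft guesses well.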
"This research work demonstrated strong results, but its greater impact comes from being applied in production to accelerate LLM inference... ML developers using NVIDIA GPUs can now easily benefit from ReDrafter’s accelerated token generation for their production LLM applications with TensorRT-LLM." - Apple
Integration with TensorRT-LLM
To integrate ReDrafter, NVIDIA added new operators and exposed existing ones, which considerably improves TensorRT-LLM's ability to handle complex models and decoding methods. As a result, developers using NVIDIA GPUs can apply ReDrafter to speed up token generation in their production LLM applications. In benchmarks, TensorRT-LLM with ReDrafter achieved a 2.7x speed-up in generated tokens per second for greedy decoding. A gain of that size could considerably reduce the latency users experience while also consuming less power.
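To put the 2.7x figure in perspective, the short calculation below converts a throughput speed-up into generation time. It is a back-of-the-envelope sketch: the baseline of 20 tokens per second and the 500-token response are hypothetical numbers chosen for illustration, and it assumes the speed-up applies uniformly to the whole generation phase and ignores prompt processing.

```python
def generation_time_s(num_tokens: int, tokens_per_second: float) -> float:
    """Seconds needed to generate num_tokens at a steady throughput."""
    return num_tokens / tokens_per_second


def time_saved_fraction(speedup: float) -> float:
    """Fraction of generation time removed by a given throughput speed-up."""
    return 1.0 - 1.0 / speedup


if __name__ == "__main__":
    SPEEDUP = 2.7            # figure quoted in the article
    BASELINE_TPS = 20.0      # hypothetical baseline throughput (assumption)
    RESPONSE_TOKENS = 500    # hypothetical response length (assumption)

    before = generation_time_s(RESPONSE_TOKENS, BASELINE_TPS)
    after = generation_time_s(RESPONSE_TOKENS, BASELINE_TPS * SPEEDUP)
    print(f"before: {before:.1f} s, after: {after:.1f} s, "
          f"saved: {time_saved_fraction(SPEEDUP):.0%}")
```

Under these assumptions, a 2.7x throughput gain cuts generation time by roughly 63 percent, whatever the baseline throughput happens to be.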
While this collaboration indicates a shared interest, a long-term partnership seems unlikely given the history between Apple and NVIDIA. We may see similar collaborations in the future, but a formal business relationship is not anticipated.
Source: Apple