Google has recently released RecurrentGemma, an open language model built on Google's novel Griffin architecture. Griffin combines linear recurrences with local attention to achieve excellent performance on language tasks. It maintains a fixed-size state, which reduces memory use and enables efficient inference on long sequences.
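Roughly speaking, the pattern is a stack of residual blocks where most blocks update a fixed-size recurrent state and an occasional block uses attention restricted to a sliding window. Here is a toy sketch of that idea (not the released Griffin code; the block ratio, window size, and all names below are made up for illustration):

```python
import numpy as np

WINDOW = 4  # local-attention span; a toy value, not Griffin's actual window

def linear_rnn_block(xs, a=0.9, b=0.1):
    # Simplified linear recurrence: the entire history is summarized in
    # one fixed-size state vector h (real Griffin uses gated recurrences).
    h = np.zeros(xs.shape[1])
    out = []
    for x in xs:
        h = a * h + b * x          # O(1) memory per step
        out.append(h.copy())
    return np.stack(out)

def local_attention_block(xs):
    # Each position attends only to the last WINDOW tokens, so the
    # cache never grows beyond WINDOW entries.
    out = []
    for t in range(len(xs)):
        ctx = xs[max(0, t - WINDOW + 1): t + 1]
        scores = ctx @ xs[t]
        w = np.exp(scores - scores.max())
        w /= w.sum()
        out.append(w @ ctx)
    return np.stack(out)

def toy_griffin_stack(xs):
    # Griffin-style pattern: recurrence blocks interleaved with an
    # occasional local-attention block, with residual connections.
    xs = xs + linear_rnn_block(xs)
    xs = xs + linear_rnn_block(xs)
    xs = xs + local_attention_block(xs)
    return xs

tokens = np.random.randn(16, 8)    # (sequence length, model dim)
print(toy_griffin_stack(tokens).shape)
```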
In a typical transformer, the KV cache grows linearly with sequence length. Techniques such as local attention can shrink the cache, but usually at the expense of performance. In contrast, RecurrentGemma-2B compresses the input sequence into a fixed-size state without sacrificing performance, keeping memory use bounded no matter how long the sequence gets.
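As a back-of-the-envelope illustration of why that matters (all dimensions below are hypothetical, not RecurrentGemma's actual configuration):

```python
def kv_cache_floats(seq_len, n_layers=26, n_heads=8, head_dim=256):
    # Transformer: keys and values are cached for every past token,
    # so memory grows linearly with sequence length.
    return 2 * seq_len * n_layers * n_heads * head_dim

def recurrent_state_floats(n_layers=26, state_dim=2048):
    # Fixed-size recurrent state per layer, independent of sequence length.
    return n_layers * state_dim

for T in (1_000, 10_000, 100_000):
    print(f"{T:>7} tokens: KV cache {kv_cache_floats(T):,} floats "
          f"vs fixed state {recurrent_state_floats():,} floats")
```

Even in this toy setup, the KV cache dwarfs the fixed state by several orders of magnitude at long sequence lengths, and the gap keeps widening with every token.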
RecurrentGemma-2B was evaluated across a broad range of domains using a combination of automated benchmarks and human evaluation. It achieves performance comparable to Gemma-2B, even though Gemma-2B was trained on 50% more tokens. In creative writing and coding tasks, RecurrentGemma-2B-IT achieves a 43.7% win rate against the larger Mistral 7B model. A key advantage of RecurrentGemma is its inference speed of roughly 40k tokens per second, considerably higher than typical transformer-based models.

In conclusion, RecurrentGemma-2B offers the performance of Gemma while achieving higher throughput during inference, especially on long sequences.
Paper: https://lnkd.in/g3YS_su9