The challenge in building a general-purpose post-recognition error corrector, for example one used when evaluating fine-tuned recognition models on a custom dataset, is how to train a single model effectively on datasets from diverse domains. The natural solution is to learn domain-specific features and integrate that knowledge into one model. Previous approaches, however, have used a separate correction model for each domain, which leads to a substantial increase in the number of model parameters.
To address this, researchers from NVIDIA have proposed NeKo (geNErative multi-tasK error cOrrection), a new multi-task model that boosts post-recognition results over speech, text, and visual inputs. NeKo leverages a pre-trained Mixture-of-Experts (MoE) model to handle diverse tasks and cross-domain knowledge. The key idea is to continue pretraining the MoE model on a mixture of error-correction datasets, with each expert specializing in a specific domain. This task-oriented MoE fine-tuning approach enables the experts to capture task-specific features while still allowing knowledge sharing through the router.
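To make the task-oriented routing idea concrete, here is a minimal sketch of one plausible reading of it: during fine-tuning, one of the two routing slots for each token is pinned to the expert designated for the current task, while the other slot follows the router. The function name, the exact weighting, and the tie between tasks and expert indices are assumptions for illustration, not NeKo's actual implementation.

```python
import torch
import torch.nn.functional as F

def task_oriented_top2(router_logits: torch.Tensor, task_expert: int):
    """Hypothetical task-oriented top-2 routing for one task's batch.

    router_logits: (num_tokens, num_experts) scores from the gating network.
    task_expert:   index of the expert dedicated to the current task.
    """
    probs = F.softmax(router_logits, dim=-1)

    # Slot 1: the task's designated expert (captures task-specific features).
    fixed_idx = torch.full(
        (probs.size(0),), task_expert, dtype=torch.long, device=probs.device
    )

    # Slot 2: the router's top choice among the remaining experts (knowledge sharing).
    masked = probs.clone()
    masked[:, task_expert] = float("-inf")
    shared_idx = masked.argmax(dim=-1)

    idx = torch.stack([fixed_idx, shared_idx], dim=-1)   # (num_tokens, 2)
    weights = probs.gather(-1, idx)
    weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalized mixing weights
    return idx, weights
```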
Architecturally, NeKo integrates MoE layers into a Transformer backbone. In an MoE layer, each input token is assigned to a subset of experts by a gating network (the router), and the layer's output is the weighted sum of the selected experts' outputs, with the weights determined by the gating network. During inference, NeKo does not assume knowledge of which task an input belongs to; each token is simply routed to the top-2 experts based on their router probabilities.
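The sketch below shows a generic top-2 MoE layer of this kind in PyTorch. The feed-forward expert design, layer sizes, and the per-expert loop are illustrative assumptions; they are not NeKo's exact architecture or an efficient production implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model: int, d_ff: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)  # gating network
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model) -> flatten to per-token routing decisions
        tokens = x.reshape(-1, x.size(-1))
        probs = F.softmax(self.router(tokens), dim=-1)        # (tokens, num_experts)
        top_p, top_idx = probs.topk(self.top_k, dim=-1)       # top-2 experts per token
        top_p = top_p / top_p.sum(dim=-1, keepdim=True)       # renormalize mixing weights

        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            for slot in range(self.top_k):
                mask = top_idx[:, slot] == e                  # tokens sent to expert e in this slot
                if mask.any():
                    out[mask] += top_p[mask, slot].unsqueeze(-1) * expert(tokens[mask])
        return out.reshape_as(x)
```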
NeKo is evaluated on (i) post automatic speech recognition (ASR) correction, (ii) post speech translation (ST) and machine translation (MT) correction, and (iii) post optical character recognition (OCR) correction. It sets new state-of-the-art results in zero-shot ASR correction and performs competitively as a general-purpose multi-task corrector.
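For intuition, post-ASR correction feeds the model several recognition hypotheses of the same utterance and asks it to generate a corrected transcript. The prompt wording and hypotheses below are made up for illustration and are not taken from the paper or its datasets.

```python
# Illustrative (hypothetical) post-ASR correction input/output pair.
hypotheses = [
    "the whether in new york is clear today",
    "the weather in new york is cleared today",
    "the weather in newark is clear today",
]

prompt = (
    "Below are ASR hypotheses of the same utterance. "
    "Output the corrected transcript.\n"
    + "\n".join(f"{i + 1}. {h}" for i, h in enumerate(hypotheses))
)

# Expected generation (illustrative): "the weather in new york is clear today"
print(prompt)
```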
In experiments on the Open ASR Leaderboard, NeKo sets a new state of the art with an average relative 5.0% WER reduction, along with substantial BLEU improvements on speech and text translation tasks. In zero-shot evaluation on the Hyporadise benchmark, NeKo outperforms GPT-3.5 and Claude-Opus with 15.5% to 27.6% relative WER reduction. As a multi-task model, it also performs competitively on grammatical error correction and post-OCR correction.
Paper: NEKO: Toward Post Recognition Generative Correction Large Language Models with Task-Oriented Experts