Recent advances in the capabilities of large language models (LLMs) have led to the development of language agents to address complex, multi-step tasks. However, most existing agents are based on proprietary models or designed to target specific tasks, such as mathematics or multi-hop question answering.
To address this limitation, researchers have introduced HUSKY, a holistic, open-source language agent that learns to reason over a unified action space to address a diverse set of complex tasks involving numerical, tabular, and knowledge-based reasoning. HUSKY solves these multi-step tasks by jointly predicting the next high-level step and tool with an action generator, and then executing the action with the assigned expert model. This process repeats until the agent arrives at the final answer.
HUSKY iterates between two stages. The first stage is action generation. Given the input question and the solution generated so far, the action generator jointly predicts the next high-level step to take and the associated tool. The tools forming the ontology of actions are [code], [math], [search] and [commonsense]. If the solution history already contains the final answer to the question, the action generator returns that answer instead. The second stage is action execution. HUSKY calls the tool assigned by the action generator, executes it, and optionally rewrites the tool output into natural language. Each tool is associated with an expert model: a code generator for [code], a math reasoner for [math], a query generator for [search] and a commonsense reasoner for [commonsense].
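The loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the names `Action`, `run_agent`, `toy_generator` and `toy_experts` are hypothetical, the action generator and expert models are stand-in callables rather than trained models, and the `max_steps` cap is an added safeguard that the summary does not mention.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# The four tools named in the text; everything else here is a hypothetical sketch.
TOOLS = ("code", "math", "search", "commonsense")


@dataclass
class Action:
    step: str                            # next high-level step, in natural language
    tool: Optional[str] = None           # one of TOOLS, or None when finished
    final_answer: Optional[str] = None   # set once the answer is in the history


def run_agent(
    question: str,
    action_generator: Callable[[str, List[str]], Action],
    experts: Dict[str, Callable[[str, List[str]], str]],
    max_steps: int = 10,  # assumption: a safety cap, not taken from the source
) -> str:
    """Alternate between predicting an action and executing it with an expert."""
    history: List[str] = []
    for _ in range(max_steps):
        # Stage 1: predict the next step and tool, or return the final answer.
        action = action_generator(question, history)
        if action.final_answer is not None:
            return action.final_answer
        if action.tool not in experts:
            raise ValueError(f"no expert registered for tool: {action.tool!r}")
        # Stage 2: run the expert for the assigned tool and record its output.
        output = experts[action.tool](action.step, history)
        history.append(f"[{action.tool}] {action.step} -> {output}")
    raise RuntimeError("no final answer within max_steps")


# Toy stand-ins so the sketch runs end to end without any model.
def toy_generator(question: str, history: List[str]) -> Action:
    if not history:
        return Action(step="Compute 6 * 7", tool="math")
    return Action(step="Report the result", final_answer=history[-1].split("-> ")[-1])


toy_experts = {"math": lambda step, history: "42"}

answer = run_agent("What is 6 times 7?", toy_generator, toy_experts)
```

In this sketch the solution history is a plain list of strings, so the generator sees every earlier step and tool output on each call. A real system would replace the toy callables with the trained action generator and the four expert models.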
In experiments, HUSKY generalizes across multiple tasks better than other language agents, including FIREACT and LUMOS, and outperforms other agents on the tasks they specialize in. For example, HUSKY outperforms LUMOS on GSM-8K [6] by more than 20 points and FIREACT on HotpotQA [53] by 5 points. HUSKY also outperforms FINMA [51] on FinQA [5] by 9 points and CRITIC-70B [11] on TabMWP [21] by 1.8 points. On HUSKYQA, HUSKY with a 13B action generator comes within 1 point of gpt-4o.
In conclusion, these results showcase a robust recipe for developing HUSKY, an open-source language agent that generalizes and achieves competitive performance across a wide array of multi-step reasoning tasks.
Paper: https://arxiv.org/pdf/2406.06469