Orca 2: A Small Fish in the Ocean of AI Models

January 19, 2025
Science Magazine

Microsoft describes its newest large language model (LLM), Orca 2, as the Swiss army knife of AI: a small, compact system that can effectively handle a variety of tasks. With AI's rapidly expanding applications, the race to develop LLMs continues to intensify. At their core, LLMs are designed to model the statistical patterns of language well enough to mimic a broadly accessible form of intelligence. Given human-written prompts, LLMs decode human language and produce human-like responses.

However, for LLMs to effectively understand and mimic human language, they require extensive training, much like teaching an infant to speak. Just like infants, LLMs begin with no knowledge of language and are then immersed in vast amounts of text to learn how to "speak." By analyzing examples like human dialogue or academic writing, LLMs develop a general sense of language patterns and learn to predict them. By producing candidate responses and receiving human rankings of those responses, LLMs adjust their parameters, the internal variables used to predict language patterns more accurately.
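To make the pattern-prediction idea concrete, here is a toy sketch in Python. Real LLMs learn billions of neural-network weights through gradient descent rather than simple word counts, so everything below, from the three-sentence corpus to the counting scheme, is an illustration of the principle, not of how production models work.

```python
from collections import defaultdict

# Toy "training corpus" standing in for the web-scale text real LLMs see.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
]

# The "parameters" here are simple follow-word counts learned from the data.
counts = defaultdict(lambda: defaultdict(int))
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1

def predict_next(word):
    """Return the most frequently observed word following `word`."""
    followers = counts[word]
    return max(followers, key=followers.get) if followers else None

print(predict_next("the"))  # "cat" (tied with "dog"; ties break by insertion order)
```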

In fact, LLMs like GPT-4 and Llama 3.1 rely on hundreds of billions, and in some cases trillions, of parameters. The end result of all this tuning is that the general public can enjoy the capabilities of the largest LLMs.

Training an AI model can create a major carbon footprint.

Above: Carbon dioxide emission benchmarks. Image courtesy of the College of Information and Computer Sciences at the University of Massachusetts Amherst.

But bigger isn’t always better. As of February 2023, researchers estimated that running ChatGPT, with its enormous number of parameters, costs approximately $700,000 per day. Beyond the direct financial cost, several indirect costs are often overlooked, such as the water required to cool the data centers that run these immensely taxing models. As of last year, researchers suggested that processing anywhere from five to 50 prompts consumed roughly one 16-ounce bottle of water. With hundreds of millions of weekly users, this figure has undoubtedly grown. Many nonetheless consider these costs a necessary tradeoff for the extensive capabilities of large models. Smaller LLMs with fewer than 10 billion parameters tend to lack the robust reasoning needed for capabilities such as zero-shot learning: solving problems the model was never explicitly trained on. For example, ChatGPT's comprehensive training allowed it to pass an MBA exam without task-specific preparation, demonstrating why we continue to rely on models with an extreme number of parameters despite their hefty costs.
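A rough back-of-the-envelope calculation shows how quickly those per-prompt figures scale. The user and usage numbers below are illustrative assumptions, not reported measurements; only the five-to-50-prompts-per-bottle range comes from the research cited above.

```python
# Back-of-the-envelope estimate of weekly cooling-water use.
BOTTLE_OZ = 16               # one 16-ounce water bottle
PROMPTS_PER_BOTTLE = 25      # midpoint of the reported 5-50 prompt range
WEEKLY_USERS = 200_000_000   # assumed number of weekly users (illustrative)
PROMPTS_PER_USER = 10        # assumed prompts per user per week (illustrative)

weekly_prompts = WEEKLY_USERS * PROMPTS_PER_USER
bottles = weekly_prompts / PROMPTS_PER_BOTTLE
gallons = bottles * BOTTLE_OZ / 128  # 128 fluid ounces per US gallon

print(f"~{bottles:,.0f} bottles (~{gallons:,.0f} gallons) per week")
# ~80,000,000 bottles (~10,000,000 gallons) per week under these assumptions
```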

Power in Compact AI

Orca 2 is Microsoft's answer to the challenge of balancing model size with performance: a more resource-efficient model that doesn't sacrifice capability. Available in two sizes, 7 billion or 13 billion parameters, Orca 2 consistently outperforms models five to 10 times its size, such as LLaMA-2-Chat-70B and WizardLM-70B. That includes the zero-shot reasoning tasks that typically separate large LLMs from smaller models.

Orca 2’s Redefinition of AI Training 

How does Orca 2 achieve this result? Like other smaller models, Orca 2 starts with a base model, in this case LLaMA-2. However, many smaller models run into trouble in the training that follows.

Like Orca 2, smaller models go through a preliminary tuning phase in which researchers train them to produce specific behaviors based on how tasks are worded. In this phase, researchers feed models general-purpose instructions like "think step-by-step" or "generate detailed answers." These prompts aim to elicit ideal responses by steering the model toward different problem-solving strategies, as the sketch below illustrates. However, not every combination of prompt and task produces the most desirable output. Due to their limited capacity, smaller models often struggle to perform beyond what they've learned in this phase. This limitation makes it difficult for them to respond appropriately to novel scenarios, leading to outputs that may appear stylistically correct but are conceptually flawed.
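The sketch below shows what such instruction-tuning data might look like. The example strings and the commented-out training call are invented stand-ins for the kind of data used in this phase, not Microsoft's actual pipeline.

```python
# Hypothetical instruction-tuning data: each example pairs a general-purpose
# system prompt with a task and a target response.
training_examples = [
    {
        "system": "Think step-by-step.",
        "task": "A train travels 60 miles in 1.5 hours. What is its speed?",
        "target": "Distance is 60 miles, time is 1.5 hours. "
                  "Speed = 60 / 1.5 = 40 miles per hour.",
    },
    {
        "system": "Generate detailed answers.",
        "task": "Why does ice float on water?",
        "target": "Ice floats because it is less dense than liquid water ...",
    },
]

# During tuning, the model learns to associate each system prompt with a
# response style. A small model may only reproduce the styles it saw here,
# which is why it can stumble on genuinely novel tasks.
for ex in training_examples:
    full_prompt = f"{ex['system']}\n\n{ex['task']}"
    print(full_prompt, "->", ex["target"][:40], "...")
    # model.fine_tune(full_prompt, ex["target"])  # hypothetical training call
```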

This is where Orca 2 changes the game. Microsoft employs a training process called "Cautious Reasoning" training, built on a teacher-student analogy. In this process, a teacher LLM is given a set of tasks and decides which solution strategies are appropriate for generating a valid response. Orca 2, the student LLM, is then given the tasks and the teacher's correct responses, but not the solution strategy the teacher used. This technique, known as "Prompt Erasure," is what makes Orca 2 unique: it must independently reason about when it is best to "think step-by-step" or "generate detailed answers." With this novel training approach, scientists train Orca 2 on various datasets to improve its situational reasoning abilities. These include the FLAN-v2 dataset, millions of ChatGPT interactions, and various specialized subsets, such as math problems and doctor-patient conversations.
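Here is a minimal sketch of the Prompt Erasure idea. The teacher and student objects and their methods are hypothetical placeholders; only the data flow, in which the strategy is shown to the teacher but withheld from the student, mirrors the technique described above.

```python
# A minimal sketch of "Prompt Erasure" with hypothetical teacher/student
# interfaces.

def build_training_pair(teacher, task, strategy):
    # The teacher is prompted WITH an explicit solution strategy,
    # e.g. "Think step-by-step."
    teacher_prompt = f"{strategy}\n\n{task}"
    answer = teacher.generate(teacher_prompt)  # strategy-guided response

    # Prompt Erasure: the strategy is dropped before the pair reaches the
    # student, so the student must infer HOW to reason from the task and
    # answer alone.
    return task, answer

# Hypothetical usage:
# pairs = [build_training_pair(teacher_llm, t, pick_strategy(t)) for t in tasks]
# orca2_student.fine_tune(pairs)
```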

Orca 2's versatility has been tested on a broad range of benchmarks against several LLMs. One such benchmark is GSM8K, which focuses on math word problems like "Beth bakes 4, 2 dozen batches of cookies in a week. If these cookies are shared amongst 16 people equally, how many cookies does each person consume?" Questions like these test Orca 2's ability to reason through problems rather than rely solely on recall, and let researchers compare its performance against other models.
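For the cookie problem above, the expected chain of reasoning reduces to two arithmetic steps, worked through here for reference:

```python
# Working through the GSM8K cookie problem step by step.
batches = 4
cookies_per_batch = 2 * 12                     # each batch is 2 dozen cookies
total_cookies = batches * cookies_per_batch    # 4 * 24 = 96 cookies
people = 16
cookies_per_person = total_cookies // people   # 96 / 16 = 6

print(cookies_per_person)  # each person consumes 6 cookies
```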

Across several benchmarks like GSM8K, Orca 2 performed comparably to, or even exceeded, larger models. On this basis, Microsoft claims superior reasoning capabilities for its model.


Above: Comparison between the performance of Orca 2, LLaMA-2-Chat, and WizardLM on varying benchmarks. AGI: AGIEval, a set of standardized tests like the GMAT, SAT, etc.; BBH: Big-Bench Hard, a set of 23 difficult multi-step reasoning tasks; MMLU: Massive Multitask Language Understanding, a benchmark used to test language understanding; ARC-E: science multiple-choice questions from the easy subset; ARC-C: science multiple-choice questions from the challenge subset; RACE: English reading comprehension questions given to Chinese students aged 12-18; GSM8K: a collection of multi-step mathematical word problems. Image courtesy of Awadallah et al., 2023.

These claims about Orca 2's capabilities stem from its distinctive design, which mimics elements of human reasoning. While its results may suggest that Orca 2 is capable of human-like cognition, true reasoning arguably goes beyond pattern recognition and data manipulation. This tension raises the question of whether reasoning is defined by the result or by the process used to achieve it. Orca 2 is distinctive in that its 'smaller is smarter' approach blends the two in an unusually human-like way, and it is this balance that makes the model especially fascinating. While Orca 2 might not settle the debate over whether AI can truly reason, it certainly brings us closer to models that 'think' like us.

