RAG vs Fine-Tuning: A Comprehensive Evaluation

The RAG vs Fine-Tuning debate is always interesting, so when this comprehensive evaluation study was released, I had to read it.

What They Covered

The study evaluates LLMs in three configurations: as base models, with RAG, and with fine-tuning. The work involved data collection, information extraction, assessment of data and model quality, and a series of experiments.

How They Went About It

They start by gathering domain-specific data (agriculture), extracting information, and creating Question and Answer (Q&A) pairs, then fine-tuning the models with these pairs.

Models Used

The models are GPT-4, GPT-3.5, Llama-2-13B, Llama-2-chat-13B, and Vicuna. The authors also use Facebook AI Similarity Search (FAISS) to create a database of the embeddings for retrieval.
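
To make the retrieval step concrete, here is a minimal sketch of what an embedding database does at query time. It is not the study's code and it does not use FAISS: it swaps in a brute-force cosine-similarity search written in plain Python, which returns the same kind of top-k result on a tiny scale. The document names, the 3-dimensional vectors, and the function names are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, index, k=2):
    """Return the k (doc_id, score) pairs most similar to query_vec, best first."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings" (hypothetical values); a real system would get these
# from an embedding model and store them in a vector index such as FAISS.
index = {
    "irrigation": [0.9, 0.1, 0.0],
    "soil_ph": [0.1, 0.9, 0.1],
    "pest_control": [0.0, 0.2, 0.9],
}
query = [0.8, 0.3, 0.0]

for doc_id, score in top_k(query, index, k=2):
    print(doc_id, round(score, 3))
```

The retrieved documents would then be pasted into the prompt as context. A library like FAISS exists because this brute-force scan becomes too slow once the index holds millions of vectors.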

Results

Data Extraction: JSON proved most effective for extracting data from complex hierarchical documents
Q&A Generation: GPT-4 excelled in all metrics but tended to produce verbose outputs
Retrieval: Increasing ‘k’ in Top-k improved recall, but recall decreased with larger indexes due to collisions
Fine-Tuning vs. RAG: GPT-4 consistently outperformed the other models. Performance improved cumulatively when fine-tuning and RAG were used together
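
For readers unfamiliar with the retrieval metric, the sketch below shows one common way recall at k is computed: the share of relevant documents that appear among the first k results. This is an illustration under assumptions, not the paper's evaluation code; the document ids and the `recall_at_k` helper are hypothetical.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = set(ranked_ids[:k]) & set(relevant_ids)
    return len(hits) / len(relevant_ids)


ranked = ["d3", "d7", "d1", "d9", "d4"]  # hypothetical retriever output, best first
relevant = {"d1", "d4"}                  # hypothetical ground-truth documents

for k in (1, 3, 5):
    print(k, recall_at_k(ranked, relevant, k))
```

For a fixed ranking, recall can only stay flat or rise as k grows, which matches the first half of the retrieval result. The cost is a longer prompt, since every extra retrieved document adds input tokens.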

The Tradeoff

The low initial cost of RAG can make it an attractive option; however, it is important to consider the input token cost of the prompt, which recurs on every query. Fine-tuning, on the other hand, produces precise and succinct outputs, but it comes with a high initial cost and extensive work in the fine-tuning process.
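
A rough way to reason about this tradeoff is a break-even calculation: how many queries it takes before the extra prompt tokens RAG adds cost as much as a one-time fine-tuning run. The sketch below is a simplification with made-up numbers. The prices, token counts, and function names are hypothetical, and it ignores other costs such as hosting, output tokens, and re-training when the data changes.

```python
import math


def rag_extra_cost(num_queries, context_tokens, price_per_1k_input):
    """Cost of the retrieved context added to the prompt over num_queries queries."""
    return num_queries * context_tokens / 1000 * price_per_1k_input


def break_even_queries(fine_tune_cost, context_tokens, price_per_1k_input):
    """Number of queries at which RAG's extra prompt cost reaches the fine-tuning cost."""
    per_query = context_tokens / 1000 * price_per_1k_input
    return math.ceil(fine_tune_cost / per_query)


# Hypothetical figures, for illustration only.
FINE_TUNE_COST = 500.0      # one-time cost in dollars
CONTEXT_TOKENS = 2000       # retrieved tokens added to each prompt
PRICE_PER_1K_INPUT = 0.01   # dollars per 1,000 input tokens

print(break_even_queries(FINE_TUNE_COST, CONTEXT_TOKENS, PRICE_PER_1K_INPUT))
```

Under these assumed numbers, RAG's added prompt cost stays below the fine-tuning bill for the first 25,000 queries. With real prices and traffic the crossover point will differ, so treat the calculation as a template rather than a conclusion.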

Read the paper