Strategies to Continually Pre-train Large Language Models

Continual Pre-training

While Fine-Tuning, RAG, and Prompt Engineering are covered exhaustively, not a lot of attention is paid to continuous pre-training of LLMs.

Why Continually Pre-train LLMs?

As new data becomes available, continuous pre-training is a much more efficient solution compared to starting all over again with full pre-training, saving significant compute costs.

What Are the Tradeoffs?

The distribution shift induced by new data typically results in degraded performance on previous data or poor adaptation to the new data.

What Can You Do About It?

Authors of “Simple and Scalable Strategies to Continually Pre-train Large Language Models” propose a workaround by using:

  1. Re-warming + re-decaying the learning rate
  2. Using replay — incorporating the original dataset during continued training

How Do You Do It?

  • The easiest way: using the continuous pre-training option in Amazon Bedrock
  • For more control over your training job (custom learning rate schedules, re-warmup + re-decay): use SageMaker JumpStart

While out-of-box continuous pre-training will theoretically work on most models, your options are better with open source models which publish information on their pre-training and the dataset.

Read the paper