LLM domain adaptation using continued pre-training — Part 3/4

Anastasia Tzeveleka
4 min read · May 9, 2024


Exploring domain adaptation via continued pre-training for large language models (LLMs)? This 4-part series answers the most common questions on why, how, and when to perform domain adaptation of LLMs via continued pre-training.
Written by:
Anastasia Tzeveleka, Aris Tsakpinis, and Gili Nachum

Part 1: Introduction
Part 2: Training data — sourcing, selection, curation and pre-processing
Part 3: Continued pre-training on AWS — you are here!
Part 4: Advanced: Model choice and downstream fine-tuning

Continued pre-training on AWS

In part 1, we reviewed domain adaptation and the different approaches you can use to adapt an LLM to your specific domain. We then zoomed into domain adaptation with continued pre-training on unstructured data. In part 2, we discussed sourcing, selecting, and curating training data for continued pre-training.

An easy way to get started with continued pre-training is to use AWS. AWS provides not just the infrastructure you need for your training but also services and capabilities that accelerate and simplify your domain adaptation journey.

Q. Which AWS AI/ML services can I use for continued pre-training?

You can choose from the following services:

  1. Amazon Bedrock is a fully managed, serverless service that offers a choice of foundation models (FMs), so you don’t have to manage any infrastructure. At the time of writing, Bedrock continued pre-training for text-to-text models is available in N. Virginia and Oregon for the Amazon Titan Text Express and Amazon Titan Text Lite FMs. You provide unlabeled data and start the customization process as described here. Your training set should be in JSON Lines format, where each line is a sample containing only an input field, {“input”: “<input text>”} (source). Certain quotas apply, and note that you currently cannot further fine-tune the resulting custom model. A minimal API sketch follows this list.
  2. Amazon SageMaker JumpStart is an Amazon-curated machine learning (ML) hub that contains publicly available and proprietary FMs. It is available in all regions where Amazon SageMaker is available. You can deploy these models and also fine-tune text-generation models for domain adaptation in your own AWS account using the UI or the SageMaker SDK (see the sketch after this list). JumpStart comes with AWS pre-built container images and training scripts optimized for AWS infrastructure, so you don’t have to build these yourself. Not all models available on JumpStart are fine-tunable; you can verify whether a model is fine-tunable here. Your training set should be a CSV, JSON, or TXT file.


3. Amazon SageMaker Training enables you to run continued pre-training on ephemeral clusters managed by SageMaker. SageMaker does the heavy lifting of managing the infrastructure, networking, and throughput requirements. In most cases, you’ll use SageMaker Training if you want to continue pre-training a model, for example from Hugging Face, that isn’t available in Bedrock or SageMaker JumpStart (see the estimator sketch after this list). Amazon SageMaker offers:

  • Transient compute: instances are alive only for the duration of the continued pre-training job and are billed per second, leading to extremely granular and cost-effective compute consumption.
  • Instance management: SageMaker includes automated GPU health checks and node-level failure remediation.
  • Distributed training: SageMaker offers libraries for data parallelism and model parallelism. They are optional, and you can also bring your own libraries.
  • I/O automation during training: SageMaker can read datasets from Amazon S3, FSx for Lustre, or EFS. It can also collect and export data to S3 in parallel with training via the checkpointing feature.
  • Managed metastore: SageMaker automatically creates and populates a metastore that persists training details such as inputs, outputs, Docker image details, metrics, logs, and hyperparameters.
  • Reduced compute start latency: Warm Pools keep provisioned instances alive between jobs so that subsequent jobs start faster.

4. Amazon SageMaker HyperPod: use this if you want resilient, persistent clusters for continued pre-training. It provides granular control over the cluster and flexibility; for example, you can install additional software on the nodes and schedule jobs on the cluster using Slurm. You choose the instance count and instance types for the cluster (a cluster-creation sketch follows this list).
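
To make these options more concrete, the sketches below show roughly what each looks like in code. For option 1 (Amazon Bedrock), starting a continued pre-training job with boto3 might look like the following, assuming the JSON Lines training file is already in Amazon S3 and the IAM role can read the bucket; all names, ARNs, and hyperparameter values are placeholders:

```python
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

# Each line of train.jsonl is a sample with a single "input" field,
# e.g. {"input": "<unlabeled domain text>"}
response = bedrock.create_model_customization_job(
    jobName="domain-cpt-job",                         # placeholder
    customModelName="titan-text-express-domain",      # placeholder
    roleArn="arn:aws:iam::111122223333:role/BedrockCustomizationRole",
    baseModelIdentifier="amazon.titan-text-express-v1",
    customizationType="CONTINUED_PRE_TRAINING",
    trainingDataConfig={"s3Uri": "s3://my-bucket/cpt/train.jsonl"},
    outputDataConfig={"s3Uri": "s3://my-bucket/cpt/output/"},
    hyperParameters={"epochCount": "1", "batchSize": "1", "learningRate": "0.00001"},
)
print(response["jobArn"])
```

When the job finishes, the custom model appears in your account and you purchase Provisioned Throughput for it before running inference.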
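
For option 2 (JumpStart), the SageMaker Python SDK exposes a JumpStartEstimator. A rough sketch of domain-adaptation fine-tuning is shown below; the model ID, instance type, S3 path, and hyperparameters are illustrative, and the supported hyperparameters vary by model:

```python
from sagemaker.jumpstart.estimator import JumpStartEstimator

# Illustrative fine-tunable text-generation model from JumpStart.
estimator = JumpStartEstimator(
    model_id="meta-textgeneration-llama-2-7b",
    environment={"accept_eula": "true"},   # some licensed models require accepting a EULA
    instance_type="ml.g5.12xlarge",
)
estimator.set_hyperparameters(instruction_tuned="False", epoch="1")

# The training channel points to your unlabeled domain text (CSV, JSON, or TXT files).
estimator.fit({"training": "s3://my-bucket/cpt/domain-corpus/"})
```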
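
For option 3 (SageMaker Training), a common pattern is to bring your own training script, for example one that continues pre-training a Hugging Face model with a causal language modeling objective, and hand it to an estimator. In this sketch, train.py is a hypothetical script, the framework versions and S3 paths are illustrative, and the checkpointing and Warm Pools settings correspond to the features listed above:

```python
from sagemaker.huggingface import HuggingFace

estimator = HuggingFace(
    entry_point="train.py",                  # hypothetical script with your training loop
    source_dir="scripts",
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",
    instance_type="ml.p4d.24xlarge",
    instance_count=2,
    transformers_version="4.28",
    pytorch_version="2.0",
    py_version="py310",
    # Optional: SageMaker's data parallel library for multi-node training
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
    hyperparameters={"model_id": "mistralai/Mistral-7B-v0.1", "epochs": 1},
    checkpoint_s3_uri="s3://my-bucket/cpt/checkpoints/",  # exported in parallel with training
    keep_alive_period_in_seconds=1800,                    # Warm Pools: reuse instances across jobs
)

# The channel is made available to the training container from Amazon S3.
estimator.fit({"train": "s3://my-bucket/cpt/domain-corpus/"})
```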
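
For option 4 (HyperPod), you create the cluster itself rather than a training job. A minimal sketch with boto3, assuming placeholder names, an execution role, and lifecycle scripts (for example, scripts that set up Slurm) already uploaded to S3:

```python
import boto3

sm = boto3.client("sagemaker")

response = sm.create_cluster(
    ClusterName="domain-cpt-cluster",                  # placeholder
    InstanceGroups=[
        {
            "InstanceGroupName": "worker-group-1",
            "InstanceType": "ml.p4d.24xlarge",         # you choose instance type and count
            "InstanceCount": 2,
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://my-bucket/hyperpod/lifecycle-scripts/",
                "OnCreate": "on_create.sh",            # runs when each node is provisioned
            },
            "ExecutionRole": "arn:aws:iam::111122223333:role/HyperPodExecutionRole",
            "ThresholdCount": 1,
        }
    ],
)
print(response["ClusterArn"])
```

Once the cluster is up, you log in to the head node and submit jobs with Slurm as you would on any HPC cluster.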

Q. What SageMaker capabilities can I use to process the data for continued pre-training?

  • SageMaker Training gives you access to ephemeral, AWS-managed homogeneous and heterogeneous clusters and Spot Instances, with native integration with Amazon S3, EFS, and FSx for Lustre. You can use frameworks such as Apache Spark or Ray with SageMaker Training. SageMaker also allows you to extend an existing pre-built Docker image or bring your own custom container to make additional libraries and frameworks available in your training environment.
  • SageMaker Processing gives you access to ephemeral, AWS-managed homogeneous clusters and offers native integration with Amazon S3, Redshift, and Athena. You can use SageMaker Processing to perform distributed data processing with Apache Spark or Ray. For Spark, Amazon SageMaker provides prebuilt Docker images that include Apache Spark and the other dependencies needed to run distributed data processing jobs (see the sketch after this list). You can also bring additional frameworks by using a custom processing container.
  • Amazon SageMaker HyperPod provides a resilient training environment that you can also use for data processing.
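
As an illustration of the Processing option, a distributed Spark pre-processing job can be launched with the prebuilt Spark image; the preprocess.py script, S3 paths, and instance sizes below are placeholders:

```python
from sagemaker.spark.processing import PySparkProcessor

# Prebuilt image with Apache Spark and its dependencies; you only supply the PySpark script.
processor = PySparkProcessor(
    base_job_name="cpt-data-prep",
    framework_version="3.3",                 # Spark version of the prebuilt container
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",
    instance_type="ml.m5.4xlarge",
    instance_count=4,                        # ephemeral, AWS-managed cluster
)

processor.run(
    submit_app="preprocess.py",              # hypothetical cleaning/deduplication script
    arguments=[
        "--input", "s3://my-bucket/raw-domain-docs/",
        "--output", "s3://my-bucket/cpt/domain-corpus/",
    ],
)
```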

Next:

Part 1: Introduction
Part 2: Training data — sourcing, selection, curation and pre-processing
Part 3: Continued pre-training on AWS — you are here!
Part 4: Advanced: Model choice and downstream fine-tuning


Anastasia Tzeveleka

GenAI and Machine Learning Solutions Architect at AWS. All opinions are my own and do not represent the views of my employers.