GPTeacher: Replacing hand-labelled datasets for just $10
Harnessing large language models for efficient dataset creation
Training small, highly accurate language models for news classification using datasets generated by larger language models. Match the fine-tuning performance of hand-labelled data with $10 of synthetic data!
In the field of machine learning, one of the major challenges is obtaining high-quality labeled datasets for training models. This is particularly true for natural language processing (NLP) tasks, where finding labeled data can be a daunting task. However, with the rise of Large Language Models (LLMs), such as GPT-3/4, there is now a novel approach: generating datasets for fine-tuning smaller language models.
In this article, we explore a unique use case of building classifiers by leveraging LLMs. Specifically, we focus on the task of news topic classification, where we demonstrate how to create a curated dataset using a combination of hand-labeled examples and few-shot generation techniques. This approach allows for the efficient and cost-effective creation of small, fine-tuned language models that can be easily deployed at scale.
Check out the colab
The pain of data collection
One of the biggest bottlenecks in machine learning projects is the collection, cleaning, and curation of datasets suitable for training models. This is especially true in natural language processing (NLP): documents and text may be everywhere, but it is a case of data, data everywhere, and not a drop of it labelled.
Commercially oriented teams work around this in a number of ways: they can opt for services like Amazon’s Mechanical Turk to crowd-source labelled data; collate relevant open-source and liberally licensed datasets; turn to the dark arts of self-supervision; or get unfortunate interns and graduates to roll their sleeves up and start labelling.
Recently, with the advent of Large Language Models (LLMs), there is another option for teams to create datasets to fine-tune on. LLMs can generate text themselves, which is then used to guide smaller, specialised models.
This has the benefit of being fast (much faster than human labelling), in-domain (unlike open-source datasets), and auditable (the generated dataset is readily available for review to check for bias, sensitive data, etc.).
Many ways of using large(r) language models to generate fine-tuning data for smaller models have already been reported. They have been used to generate queries over large document corpora for in-domain retrieval, to create dialogue to help align and fine-tune small chat models, and to pseudo-label large chunks of unlabelled data.
Here we explore a slightly different use case: building classifiers. We are going to turn a small amount of hand-labelled data and an effective few-shot classifier into a fine-tuned small language model that is easy to deploy at scale.
News topic classification
We use the example of classifying news stories into one of several categories, depending on the topic of the story. For example, you may want to filter out any news stories related to entertainment and sport, and keep only those related to business and finance. Although this task may not be of great value on its own, it may be one of many steps in a complex document-processing pipeline used to hunt down adverse news in financial portfolios, run compliance checks, or extract information from large document corpora.
Given that you know what classifications you are looking for, you can label a corpus of news stories, or you can search for natural sources of grouped news stories. One place you could find this is in the groupings of articles on news sites themselves; e.g. the BBC news page as of today has categories for Cost of Living, Climate, Tech, Health, and many others. This method was used to compile the AG News topic-classification dataset, which has short news stories grouped into one of four categories:
- World News
- Sport
- Business
- Science / Tech
What if you are interested in a very specific category that doesn’t appear in AG News, like Legal-Tech, Lab-Grown Meat, or Space Travel? Here we demonstrate a way to quickly and cheaply create a small, curated dataset that can train a small language model (like BERT or similar) to classify large-scale news databases, even for very specific news topics.
The process
Here we use OpenAI’s text-curie-001 model for all experiments for the sake of demonstration. It provides a good mix of speed, cost-effectiveness, and performance for dataset generation. It can be replaced by any of a number of freely available open-source models, like GPT-NeoX or Flan-T5, to reduce cost even further and avoid commercial licensing issues with OpenAI-generated data.
What does dataset generation with an LLM look like? We use a three-step process:
- Creating a few-shot prompt by hand-labelling examples for few-shot generation and classification.
- Creating an initial GPT-generated dataset
- Curating the dataset by asking GPT to ‘check’ its own generations are consistent.
We will aim to generate a small, synthetic version of the AG News dataset, so we can have a real-world test set to compare against.
Step 1: The prompt
This isn’t an article on prompt-engineering, so I am just going to show the prompt we used.
A list of news articles organised by topic.
Topic = Business
Text = UK trade deficit grows The UK trade deficit widened to 5.3bn in October after imports hit their highest level since records began, official data showed today.
Topic = Business
Text = Crude Oil Falls a Third Day in London as China May Slow Demand Crude oil fell in London, leading to its biggest three-day drop in a month, on rising US stockpiles and concern China is taking steps to slow demand in the world's second-largest oil consuming nation.
Topic = Business
Text = Fund insiders: Lavish gifts abound Mutual fund companies in Boston say they have strict prohibitions against employees getting the kind of wining and dining and luxe entertainment that government regulators are investigating at Fidelity Investments, but industry officials and attorneys in the field contend such extravagance is nonetheless common in the investment world.
Topic = Business
Text =
This gets sent to the OpenAI API, and the response will (hopefully) be a coherently written, news-sounding article that could reasonably be classified into the Business topic. We have three examples for each of the 4 topics in AG News, with a separate prompt for each. This means that, to recreate these prompts for your own topics not covered by AG News, you only need to hand-label 3 examples per topic, which is certainly easier than the several thousand you might need to label a whole dataset from scratch… Here is an example completion for the Business prompt:
Topic = Business
Text = Google to release new search engine Google is releasing a new search engine called "Google Search Plus." The new engine will give users more detailed results, as well as the ability to see how other users have searched for specific terms
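The few-shot prompt above can be assembled programmatically from a topic name and a handful of hand-labelled examples. A minimal sketch (the function name and signature are my own, not from the original code):

```python
def build_generation_prompt(topic: str, examples: list[str]) -> str:
    """Assemble a few-shot generation prompt for a single topic.

    Each hand-labelled example becomes a Topic/Text pair, and the
    prompt ends with a dangling "Text =" for the model to complete.
    """
    lines = ["A list of news articles organised by topic.", ""]
    for text in examples:
        lines.append(f"Topic = {topic}")
        lines.append(f"Text = {text}")
        lines.append("")
    # Trailing stub that the model is asked to complete
    lines.append(f"Topic = {topic}")
    lines.append("Text =")
    return "\n".join(lines)
```

With three examples per topic, this reproduces the shape of the Business prompt shown above; one such prompt is built per topic.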
Step 2: Initial dataset
We generate a sample of 10,000 texts, 2,500 for each topic. This dataset comes with no guarantee of quality, as generative language models are known to hallucinate.
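The generation step amounts to repeatedly sampling completions, one prompt per topic, and recording the topic as the label. A sketch of that loop, where `complete` is a stand-in for the real API call (e.g. OpenAI's completions endpoint with text-curie-001) and all names are illustrative:

```python
import random

def complete(prompt: str) -> str:
    # Placeholder for the real API call, e.g.
    # openai.Completion.create(model="text-curie-001", prompt=prompt, ...)
    # Here we just return a canned string so the sketch runs.
    return random.choice([
        "Markets rally as rates hold steady ...",
        "Chip maker unveils new processor ...",
    ])

def generate_dataset(prompts: dict[str, str], per_topic: int) -> list[dict]:
    """Draw `per_topic` completions for each topic's few-shot prompt."""
    dataset = []
    for topic, prompt in prompts.items():
        for _ in range(per_topic):
            text = complete(prompt).strip()
            # The topic whose prompt produced the text becomes its label
            dataset.append({"text": text, "label": topic})
    return dataset
```

In the real run, `per_topic` was 2,500 across the four AG News topics, giving the 10,000 raw samples.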
Step 3: Filtering
The initial dataset contains examples that are ambiguous and not obviously part of any one of the categories. To weed out bad examples and poorly formatted texts, we go through the dataset again, this time getting the LLM to few-shot classify the very same texts that it generated, with the gold examples in the prompt. Here is the classification prompt, using the generated text given above:
A list of news articles organised by topic. Possible topics are Business, Sport, World News, Science/Tech
Text = UK trade deficit grows The UK trade deficit widened to 5.3bn in October after imports hit their highest level since records began, official data showed today.
Topic = Business
Text = New Orleans Bowl Teams: North Texas (7-4) vs. Southern Mississippi (6-5) When: Tuesday, 7:30 p.m. Where: New Orleans TV: ESPN At a glance: The Mean Green are really mean, and not green, other than freshman running back Jamario Thomas, who has rushed for 1,709 yards and scored 17 TDs. Thomas is worth watching and is the best running back not involved in the ...
Topic = Sport
Text = Coronation Begins for Cambodia's New King (AP) AP - Cambodia began three days of celebrations Thursday as its new king, former ballet dancer Norodom Sihamoni, prepared to take over the throne from his father, who was adored by his people for more than 60 years.
Topic = World News
Text = Report: Amount of fine-particle pollution drops significantly LOS ANGELES A new Environmental Protection Agency report says concentrations of dangerous air pollutants have declined in Southern California in the last five years.
Topic = Science/Tech
Text = Google to release new search engine Google is releasing a new search engine called "Google Search Plus." The new engine will give users more detailed results, as well as the ability to see how other users have searched for specific terms
Topic =
The response will (again, hopefully) be one of the 4 topics specified at the top of the prompt. This produces a pseudo-label for each of the 10,000 previously generated samples. We keep a sample only if the label used to generate it and the GPT pseudo-label are the same. This gives us some confidence that the dataset contains only relatively good, clean examples of each of the 4 categories.
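The keep-if-consistent rule itself is simple to express. A sketch, assuming each record carries both the label used at generation time and the pseudo-label returned by the classification prompt (the field names are illustrative):

```python
def filter_consistent(records: list[dict]) -> list[dict]:
    """Keep only samples where the generation-time label and the
    model's own few-shot pseudo-label agree (case-insensitively)."""
    return [
        r for r in records
        if r["label"].strip().lower() == r["pseudo_label"].strip().lower()
    ]
```

Applied to the 10,000 raw samples, this is the step that discards roughly 44% of them.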
After this step of filtering, we are left with ~5600 samples. The total API cost of this processing came to around $10.
We train two separate BERT models, one on the cleaned, synthetic dataset, and one on a randomly sampled subset of the AG News training set such that each label is equally weighted. We compare their performance on the AG News labelled test set to see how good the synthetic dataset is compared to carefully constructed natural data.
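The baseline's balanced subset of AG News can be produced with a simple class-wise sampling step. A minimal sketch of that balancing (not the original training code; record structure is assumed):

```python
import random
from collections import defaultdict

def balanced_subsample(records: list[dict], per_label: int, seed: int = 0) -> list[dict]:
    """Randomly pick `per_label` examples from each class so that
    every label is equally represented in the training subset."""
    by_label = defaultdict(list)
    for r in records:
        by_label[r["label"]].append(r)
    rng = random.Random(seed)
    subset = []
    for label, items in by_label.items():
        subset.extend(rng.sample(items, per_label))
    return subset
```

Both models are then trained identically, so any gap on the AG News test set reflects the data, not the training recipe.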
Results
Below are the accuracies of the BERT model trained on the (truncated) AG News dataset and of the model trained on the synthetic, few-shot-generated news-topic dataset. We also show two other methods for utilising a small number of hand-labelled samples: few-shot classification using generative models, and SetFit, an innovative method using contrastive learning to create negative and positive examples out of all pairs of samples. The pairs are used to learn useful embeddings that serve as the input to a classifier.
The results for 3 of the 4 categories are remarkably good with the synthetic BERT model. With just a handful of labelled examples and $10 worth of API calls, we get within 5–6% of the performance of a model trained on the same number of gold-standard examples. With the investment of more money, these results are likely to improve even further.
The BERT base model has a paltry 110M parameters, and can be deployed cheaply at massive scale on consumer GPUs, CPUs, and edge devices with a little inference optimisation. This is a potent form of compression that simultaneously slashes cost and latency while significantly improving performance.
Curie few-shot heavily skews towards classifying almost all articles as World News: 68% of all articles were classified as World News, when only 25% of the data is in that category. The self-supervised dataset filtering is an important step in generating consistent data, and helps Curie 'bootstrap' itself to better-performing models trained on its outputs. The Curie performance could certainly be improved with more prompt engineering and more examples, but that would increase the long-run cost of using the system in production.
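That kind of skew is easy to quantify from the predictions themselves. A small sketch (the list of predicted labels is assumed to be available):

```python
from collections import Counter

def label_distribution(predictions: list[str]) -> dict[str, float]:
    """Fraction of predictions assigned to each label."""
    counts = Counter(predictions)
    total = len(predictions)
    return {label: n / total for label, n in counts.items()}
```

Comparing this distribution against the known test-set proportions (25% per class for AG News) immediately exposes a classifier that collapses onto one label.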
Digging into the regression:
The most notable drop for the synthetic BERT model is in the Science/Tech category, where it drastically under-performs. Let's look at a few of the misclassified examples and see whether there are any patterns we can identify.
Firstly, the synthetic dataset seems to be missing any notion of video games entirely. The following texts, all about video games, were classified as Sport when they should have gone into Science/Tech:
Dare you fight the possessed tomatoes? Quirky, stick-figure "Kingdom of Loathing" shows continued promise of independent game-writing.
Fable; Nascar 2005: Chase For The Cup Fable comes with a big reputation behind it -- it was developed by Peter Molyneux, creator of such involved, engrossing games as Populous and Black and White.
Rocky Legends; Tony Hawk's Underground 2; Nisus Writer Express 2.0; Surfsaver 6 This is the second Rocky video game in two years -- even though it's been 14 years since the last "Rocky" flick.
Similarly, some stories are ambiguously assigned a single label but could reasonably belong to multiple categories. The following story is (pretty reasonably) classified by the synthetic BERT model as Sport, while the gold label says Science/Tech:
How to take the perfect penalty A sports psychologist says how footballers should prepare themselves for the high-pressure penalties.
Finally, there seem to be a large number of stories about the business of “Tech” companies, which are labelled Science/Tech but could just as well be classified as Business stories:
IBM to hire even more new workers By the end of the year, the computing giant plans to have its biggest headcount since 1991.
Vodafone hires Citi for Cesky bid (TheDeal.com) TheDeal.com - The U.K. mobile giant wants to find a way to disentangle the Czech wireless and fixed-line businesses.
Oracle Overhauls Sales-Side Apps for CRM Suite (NewsFactor) NewsFactor - Oracle (Nasdaq: ORCL) has revamped its sales-side CRM applications in version 11i.10 of its sales, marketing, partner relationship management and e-commerce application.
IBM Buys Two Danish Services Firms IBM said Tuesday it has acquired a pair of Danish IT services firms as part of its effort to broaden its presence in Scandinavia. As a result of the moves, IBM will add about 3,700 IT staffers to its global head count. Financial terms of ...
These are all incorrectly classified by the synthetic BERT model. The great thing about this approach is that, armed with this knowledge, performance can easily be improved by generating more synthetic text while emphasising that Science/Tech ought to include tech companies. And if that were not suitable for the downstream use case, the LLM could be prompted to do the opposite. This is a really flexible way to generate data faster, cheaper, and more transparently than ever before.
Conclusion
The process of collecting, cleaning, and curating labelled datasets for machine learning projects, particularly in the realm of natural language processing, can be a significant bottleneck. However, with the emergence of Large Language Models (LLMs), there are new possibilities for generating datasets more efficiently and cost-effectively.
In this article, we showcased a unique approach to news topic classification using LLMs. By combining hand-labelled examples with few-shot generation techniques, we were able to create a curated dataset and train small, fine-tuned language models. The results demonstrated that with just a handful of labelled examples and minimal API calls, we achieved performance comparable to models trained on much larger, naturally labelled datasets. This approach provides a powerful solution for quickly and inexpensively generating specialized language models that can be deployed at scale, opening up new avenues for complex document processing, compliance checks, and information extraction tasks. As the field continues to advance, leveraging LLMs for dataset generation will undoubtedly play a crucial role in accelerating machine learning projects and overcoming the pain points of data collection.
About TitanML
TitanML enables machine learning teams to effortlessly and efficiently deploy large language models (LLMs). Their flagship product Takeoff Inference Server is already supercharging the deployments of a number of ML teams.
Founded by Dr. James Dborin, Dr. Fergus Finn and Meryem Arik, and backed by key industry partners including AWS and Intel, TitanML is a team of dedicated deep learning engineers on a mission to supercharge the adoption of enterprise AI.
Check out the Discord, Colab, and Website