# Fine-tuning GPT-3 to sound like a podcast host

## Executive summary

I developed four fine-tuned versions of OpenAI’s GPT-3 models to sound like Russ Roberts, the host of the podcast EconTalk, using more than 10 years of episode transcripts. I calibrated my experiments to a \$40 USD budget (\$10 per model) and tested the fine-tuned models on 12 quintessential Russ/EconTalk questions.

This process led to seven key insights:

1. ChatGPT gives decent answers to EconTalk questions without any fine-tuning (~50% accuracy).
2. A fine-tuned text-davinci-003 model gives even better results (80% accuracy), and answers questions with impressive Russ-like diction when trained on only 83K tokens (see examples).
3. However, all models require prompt engineering and repeated querying to get the “best” answer.
4. The curie model has poor zero-shot accuracy, and doesn’t do much better after fine-tuning, although its way of speaking has a podcast-like feel.
5. The ada and babbage models give nonsensical answers to most prompts even with training on the entire corpus (4-5 million tokens).
6. With OpenAI’s pricing, the scaling laws clearly favor model size over the amount of data (e.g. the curie model fine-tuned on ~10x the data does much worse than the davinci model, which is ~10x larger).
7. Overall, a fine-tuned davinci model with more resources (the one used in this experiment was trained on <2% of the corpus for cost reasons) would likely provide a good impersonation of the style and type of answers a podcast host would give.

The rest of this post provides more details on the development of this model and the experimental results. Interested readers can create their own versions of EconChatR by cloning this repo and running the main pipeline.

## (1) Background

People are very excited about ChatGPT.[1] ChatGPT is built on the third generation of large language models (LLMs) from OpenAI (GPT-3.5), which has produced human-like performance on a variety of natural language tasks. LLMs use many multi-headed attention layers (a type of architecture first proposed by Google researchers in 2017), which are more amenable to parallelization and hence GPU acceleration. The davinci version of GPT-3.5 (which powers ChatGPT) is very large,[2] and would be prohibitively expensive to run on a local machine.[3]

From what OpenAI has published, GPT-3 was trained on a large corpus of mainly English sources using a very simple training framework: predict the next token given a context. What’s fascinating is how such a simple self-supervised training process yields such impressive results on a variety of tasks, from language translation to summarization to logical reasoning. If you wanted to re-create ChatGPT from scratch, you would likely need a supercomputer cluster with ~10,000 GPUs and ~285,000 CPU cores (~\$1 billion USD to rent), spend months training the model, and then spend months/years labeling the output of the model to further hone its abilities. One of OpenAI’s “secret ingredients” is an unspecified amount of resources spent doing Reinforcement Learning from Human Feedback (RLHF). Basically, this means having humans interact with the chatbot, label its output as either correct or incorrect, and then further refine its parameters using this supervised learning framework. ChatGPT has shown amazing performance in its ability to write code, write academic papers, answer most homework questions, write entire syllabi, and come up with decent rhyming poems. ChatGPT's cherry tomato poem is pretty good!

## (2) Introduction to fine-tuning

Even with unprecedented scale, training, and RLHF refinement, ChatGPT still shows a tendency to “hallucinate”: giving nonsensical answers, made-up facts, and repeated sentences. One way to address this issue is to fine-tune a version of GPT-3 on a custom dataset, updating the model weights in a way that makes the model more likely to give answers consistent with the task you have in mind. For example, we may want the model to give medically specific and accurate answers to questions related to depression (I feel compelled to mention this is ethically & legally risky!). This can be done with a new collection of prompt/completion data points:

```
{"prompt": "<prompt text #1>", "completion": "<ideal generated text #1>"}
{"prompt": "<prompt text #2>", "completion": "<ideal generated text #2>"}
```

This process can be generalized to having the model give answers which “sound” like a person, as long as there are enough examples (“completion” points) of that person responding to different questions. My goal with EconChatR was to use OpenAI’s native fine-tuning functionality to get answers that would sound more like Russ Roberts, the host of EconTalk. For example, below are two prompt/completion examples where the prompt is a (truncated) statement from an EconTalk guest (with “Russ Roberts responds:” added as the suffix), followed by Russ’ subsequent comment as a completion.

```
{"prompt":"... So, here I have one for the city of Shanghai, where they claim that they have eliminated 1,367,440 sparrows. You think, 'Really, was there somebody counting these sparrows? I doubt that very much. Russ Roberts responds:", "completion":"Yeah. Well, it was probably 443 at the end. But, yeah, there's a terrible rounding and inaccuracy. I'm sure long-time listeners will remember my joke about macroeconomics: 'How do you know an economist has a sense of humor? They use decimal points. So, here's the same kind of ridiculous--tragic, really. There's a comic element to it. But it's tragicomic."}
{"prompt":"... It's like, 'Why are things done in a certain way? It's like, 'Well, try doing them some other way and you'll quickly find out why they're done in that way. This is a very classic phenomenon that people run into. Russ Roberts responds:", "completion":"It's the Chesterton fence argument, and you think you know why the fence is there, but it evolved for some reason that you don't know of and you should start--your default should be, 'There's a reason it's there and I don't understand it.\n"}
```

After we’ve created a fine-tuned model, we should be able to ask it questions like: “You often say that economists have a sense of humor because they use decimal points, why is that? Russ Roberts responds:”, with the prompt suffix “reminding” the model of the sort of dialogue we expect to receive back.

### Baseline results

Before spending money on fine-tuning models, I subjectively evaluated how well ChatGPT does “out of the box” (i.e. zero-shot) on 12 quintessential Russ’isms and EconTalk facts (inspired by the EconTalk drinking game):

1. Does Russ Roberts, the host of EconTalk, mention the “Bootlegger and Baptist” theory often?
2. As a proponent of free markets, is Russ Roberts attracted to the “Bootlegger and Baptist” theory?
3. How skeptical is Russ Roberts, the host of EconTalk, of empirical research published in academic journals?
4. Does Russ Roberts, the host of EconTalk, believe that “data can speak for itself” as a general principle?
5. Finish the following sentence: “Man desires not only to be loved
6. Finish the following sentence: “The curious task of economics
7. Why is “skin in the game” a deep insight?
8. In what way does understanding the challenge around “creating a prairie” align with F. A. Hayek’s famous maxim about the curious task of economics?
9. Why is it harder to “create a prairie” than it seems?
10. As a free market economist, why is Russ Roberts drawn to the idea of the “Chesterton Fence”?
11. Russ Roberts often says that “macroeconomists have a sense of humor”; why is that?
12. Who are some guests from EconTalk from the year 2014?

As Table 1 shows below, vanilla ChatGPT does well on half of the questions: 1, 2, 6, 7, 8, and 10. I would give this model a score of 6/12 (50%). Questions 5 and 11 show that the model is hesitant to give answers on behalf of specific people, and has a surprisingly hard time remembering the Adam Smith quote. The model also provides inaccurate answers for question 12, listing some real EconTalk guests, but not necessarily those who appeared in 2014 (for example, Ben Bernanke has not been a guest, and Tyler Cowen was on EconTalk in 2013 and 2017 but not 2014), and strangely adding that Russ was a guest on his own podcast. All this suggests that while the model is able to recapitulate certain concepts and ideas discussed on EconTalk, it doesn’t always give the right answer and it doesn’t sound like Russ (partly because it refuses to speak on his behalf). This means there could be room for improvement with a fine-tuned model.

Table 1: Completions to baseline prompts

## (3) Data and processing

The first step in any data science project is to gather, clean, and properly format a dataset. A full list of EconTalk episodes was obtained from the XML file hosted on EconLib. A simple web-scraper was then used to extract the transcript text at the XPath identified by the SelectorGadget tool, available as a Chrome plugin (see example below).

EconTalk episode transcripts can be obtained from the episode page
After the raw transcripts were downloaded, it became clear that the earlier episodes did not have verbatim transcripts and hence could not be used for the prompt/completion style formatting. The data was therefore subset to start at the oldest episode with a full transcript, which was from early 2012. A total of 573 episode transcripts remained, ranging from January 11th, 2012 to December 20th, 2022 (as of January 4th, 2023).

The second step of the pipeline was to ensure that the dialogue was aggregated into alternating text segments between Russ and his guest (see example below), which would later form the basis of the prompt/completion pairs.

Data was structured to alternate between Russ & the guest

The final processing step of the pipeline carried out some basic string cleaning procedures, including:

1. Removing repeated words
2. Making sure there was the right number of spaces around punctuation
3. Removing transcript artifacts
4. Removing excess white space
5. Removing special characters

At this point there was a total of 31,046 unique Russ/guest dialogue pairs. However, not all 31K prompt/completion values were informative. For example, many dialogue exchanges amounted to a handful of words that carried no information without the broader context, as the two examples below show:

```
{"prompt":"Okay.","completion":"But, Mike might be untouchable. We'll see."}
{"prompt":"Yeah, I always hated that.","completion":"Which part?"}
```

For a prompt/completion pair to be kept, I imposed the following token requirements (see the token-counting sketch at the end of this section):

1. Russ’ statements had to have at least 20 tokens
2. The guest’s statements had to have at least 60 tokens
3. The combined token length of Russ and the guest had to be less than 2,000 tokens

While this reduced the number of dialogue pairs by two-thirds to 10,807, slightly less than 20% of tokens were removed.

## (4) Model training and cost

### Training set to match budget

OpenAI has model-specific costs for both training and inference. For example, the simplest model (text-ada-001) costs \$0.0004 per 1,000 tokens, whilst the most sophisticated (text-davinci-003) costs 75x more at \$0.03 per 1,000 tokens. While this seems like a small price to pay, consider that the ~10K cleaned prompt/completions have a total of 4,818,101 tokens. This means that a single epoch on all prompt/completion pairs would cost \$145 USD for the most expensive option. While this would be a trivial amount for a commercial endeavor, EconChatR is a research project on a shoestring budget.
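Both the filtering rules above and the cost arithmetic depend on counting tokens. Here is a minimal sketch; the choice of tiktoken's r50k_base encoding (the one used by the base GPT-3 models) is my assumption, and exact counts depend on the encoding:

```python
import tiktoken  # pip install tiktoken

# Assumption: r50k_base is the encoding used by the base GPT-3 models.
enc = tiktoken.get_encoding("r50k_base")

def n_tokens(text: str) -> int:
    return len(enc.encode(text))

def keep_pair(pair: dict) -> bool:
    """Apply the three token requirements listed above."""
    guest = n_tokens(pair["prompt"])     # guest's (truncated) statement
    russ = n_tokens(pair["completion"])  # Russ' reply
    return russ >= 20 and guest >= 60 and (guest + russ) < 2000

# The two uninformative exchanges shown earlier are correctly dropped:
examples = [
    {"prompt": "Okay.", "completion": "But, Mike might be untouchable. We'll see."},
    {"prompt": "Yeah, I always hated that.", "completion": "Which part?"},
]
print([keep_pair(p) for p in examples])  # [False, False]
```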

Instead, for each model I calculated the number of tokens that would support a \$10 USD training cost for 4 epochs. This amounted to a data reduction of 15%, 83%, and 98% for the babbage, curie, and davinci models (with the ada model being cheap enough to train for <\$10 without a data reduction). For example, as Table 2 shows below, the davinci model could only be trained on 83K tokens.
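This budget arithmetic is easy to reproduce. A sketch, using the ada and davinci prices quoted above (the babbage and curie rates are my assumptions, based on OpenAI's contemporaneous fine-tuning price list):

```python
# USD per 1,000 training tokens: ada/davinci figures from the text above;
# babbage/curie are assumed contemporaneous fine-tuning rates.
PRICE_PER_1K = {"ada": 0.0004, "babbage": 0.0006, "curie": 0.0030, "davinci": 0.0300}
CORPUS_TOKENS = 4_818_101  # total tokens in the cleaned prompt/completion pairs

def affordable_tokens(model: str, budget_usd: float = 10.0, epochs: int = 4) -> int:
    """Tokens trainable within budget when each token is billed once per epoch."""
    return int(budget_usd / (PRICE_PER_1K[model] * epochs) * 1000)

for model in PRICE_PER_1K:
    tokens = min(affordable_tokens(model), CORPUS_TOKENS)
    print(f"{model}: {tokens:,} tokens ({tokens / CORPUS_TOKENS:.0%} of corpus)")
# davinci: 10 / (0.03 * 4) * 1000 ≈ 83,333 tokens -- the ~83K figure above
```

Under these assumed rates, the implied corpus shares reproduce the reductions quoted above (none for ada, ~15% for babbage, ~83% for curie, ~98% for davinci).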

Table 2: Data reduction needed for a \$10 USD training cost

### Curated data choices

Because I knew the prompts I would be giving in the experiment phase, I wanted to look for dialogue pairs containing key words that I thought would most closely align with a Russ-like response. I came up with 12 different string matches to find (a matching sketch follows the list):

1. bootlegger
2. baptist
3. empirical
4. skeptical
5. prairie
6. deep
7. Chesterton fence
8. decimal point
9. macroeconomist
10. macroeconomic
11. regression
12. speak/data
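A sketch of how these lists of matching dialogue pairs might be assembled, assuming kept_pairs holds the ~10.8K pairs that survived the Section (3) filter; treating "speak/data" as requiring both words is my assumption:

```python
KEYWORDS = ["bootlegger", "baptist", "empirical", "skeptical", "prairie", "deep",
            "chesterton fence", "decimal point", "macroeconomist", "macroeconomic",
            "regression"]

def match_types(pair: dict) -> list[str]:
    """Return every string-match type found in a dialogue pair."""
    text = (pair["prompt"] + " " + pair["completion"]).lower()
    found = [k for k in KEYWORDS if k in text]
    if "speak" in text and "data" in text:  # assumed reading of "speak/data"
        found.append("speak/data")
    return found

# One candidate list per string-match type.
matches = {k: [] for k in KEYWORDS + ["speak/data"]}
for pair in kept_pairs:  # the ~10.8K filtered pairs from Section (3)
    for k in match_types(pair):
        matches[k].append(pair)
```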

Each string match was thus associated with a list of conversation pairs, which were then ordered by number of tokens from smallest to largest. I wanted each string match to be as equally represented as possible in each model's training set. To do this, I used two techniques, sketched in code below.

First, conversation pairs were ordered and iteratively selected from smallest to largest number of tokens to get as many pairs as possible before the token limit was reached.

Second, before a string match’s conversation pair was selected (e.g. the 10th conversation pair of “speak/data”), all other string match types had to have the same number of conversation pairs already selected (e.g. “bootlegger” through “regression” each had to have 10 pairs). This meant that no string match type could ever be more than one pair ahead of another, unless the other had run out of matches.

This also ensured that each larger model trained on a subset of the smaller models' data. In other words, all data points found in davinci's training set were also found in curie's, all data points found in curie's were also in babbage's training set, and so on.
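A minimal sketch of this two-part selection procedure, continuing from the matches dict above (pair_tokens reuses the n_tokens helper from the earlier tokenizer sketch):

```python
def pair_tokens(pair: dict) -> int:
    return n_tokens(pair["prompt"]) + n_tokens(pair["completion"])

def select_pairs(matches: dict, token_budget: int) -> list:
    """Round-robin across string-match types, always taking a type's smallest
    remaining pair, so no type ever gets more than one pair ahead of another."""
    queues = {k: sorted(v, key=pair_tokens) for k, v in matches.items()}
    selected, used = [], 0
    seen = set()  # a pair can match several types; select it only once
    while any(queues.values()):
        for queue in queues.values():
            while queue and id(queue[0]) in seen:
                queue.pop(0)  # skip pairs already selected via another type
            if not queue:
                continue  # this match type has run out of pairs
            cost = pair_tokens(queue[0])
            if used + cost > token_budget:
                return selected  # token budget reached
            pair = queue.pop(0)
            seen.add(id(pair))
            selected.append(pair)
            used += cost
    return selected
```

Because the selection order is deterministic, a smaller token budget simply truncates the sequence produced by a larger one, which is what makes each larger model's training set a subset of the smaller models' sets.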

Training samples were chosen to maximize coverage for 12 string match types

Overall, this left a total of 10,807 (ada), 9,186 (babbage), 1,514 (curie), and 204 (davinci) prompt/completion pairs for the models to be trained on for 4 epochs.[4]

### Fine-tuning API

Training the custom GPT-3 model was fairly simple:

1. The model-specific dataset was uploaded to OpenAI using the openai.File.create command.
2. A fine-tuning call was made for each of the models and their associated dataset using the openai.FineTune.create command.
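A minimal sketch of these two calls using the legacy (pre-1.0) openai-python interface; the JSONL filename is a placeholder:

```python
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

# 1. Upload the model-specific JSONL dataset ("davinci.jsonl" is a placeholder).
upload = openai.File.create(file=open("davinci.jsonl", "rb"), purpose="fine-tune")

# 2. Kick off the fine-tuning job against the chosen base model for 4 epochs.
job = openai.FineTune.create(training_file=upload.id, model="davinci", n_epochs=4)

# Poll the job until it finishes.
print(openai.FineTune.retrieve(job.id).status)
```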

### Footnotes

1. Here are some fairly emblematic headlines that have come out on social and traditional media platforms: “Google is done. Here’s why OpenAI’s ChatGPT Will Be a Game Changer”, “ChatGPT is a game changer for artificial intelligence”, “ChatGPT is a GAME CHANGER!”.

2. The model has ~175B parameters and was trained on about 300 billion tokens (where five tokens average about four words).

3. It would likely take 8 high-end GPUs (each costing ~\$15K) to run a model of this size on a local machine. Furthermore, it has been estimated that GPT-3 would have cost \$5 million USD to train, and one popular estimate has pegged the cost of running ChatGPT at \$3 million USD a month.

4. Note that these numbers are slightly different from those shown above because the number of pairs (n_pairs) column is based on an estimate rather than an actual value, since some pairs have more or fewer tokens than others. The final numbers quoted are based on the model-specific training sets generated by rank-ordering the 12 string match types from the smallest to the largest number of tokens.

Written on January 6, 2023