The spelled-out intro to language modeling: building makemore

Andrej Karpathy
7 Sept 2022 · 117:45

TLDR: This video script introduces 'makemore', a GitHub repository designed to generate new data entries based on existing datasets. Using a character-level language model, 'makemore' is trained on a large set of names to produce unique, name-like outputs, which can be useful for applications like baby naming. The script explains the fundamentals of building and training a bigram language model, both through direct counting of character sequences and via a neural network approach using gradient-based optimization. The goal is to minimize the negative log likelihood loss to improve the model's predictive accuracy. The script also touches on model smoothing, regularization, and how to sample new names from the trained model.


  • 🌟 The 'makemore' project follows 'micrograd' in the same series and aims to build a repository that generates new data instances resembling the examples it is given.
  • 📝 'makemore' uses a character-level language model to generate new names or text that resembles the input data, which can be useful for tasks like generating baby names.
  • 🔤 It operates on a large dataset of names, 'names.txt', containing about 32,000 names sourced from a government website, to learn patterns and generate new, unique names.
  • 🤖 The model is trained to predict the next character in a sequence, treating each line as a sequence of individual characters and learning from these patterns.
  • 📚 The script discusses implementing various character-level language models, from simple bi-gram models to modern transformer models like GPT-2.
  • 🔢 A bi-gram language model is introduced, which predicts the next character based on the previous one, creating a simple yet foundational model for text generation.
  • 📈 The process of counting bi-grams in the dataset and storing them in a two-dimensional array is explained, which is crucial for the model to learn and predict character sequences.
  • 📊 The use of visualization tools like matplotlib is highlighted to better understand the structure of the bi-gram counts array.
  • 🔧 The script details the process of normalizing the counts into probabilities, which is essential for sampling from the model to generate new text.
  • 🔬 The concept of negative log likelihood is introduced as a measure of the model's quality, with lower values indicating a better fit to the training data.
  • 🚀 The script concludes with a discussion on training the model using gradient-based optimization, highlighting the flexibility and scalability of the neural network approach compared to the explicit counting method.

Q & A

  • What is the purpose of the 'makemore' repository mentioned in the transcript?

    -The 'makemore' repository is designed to generate more of anything it is trained on, such as names, using a character-level language model. It can be used to create unique names, potentially for new babies, by learning from a dataset and generating new, name-like sequences of characters.

  • How does the 'makemore' model generate new names?

    -The 'makemore' model generates new names by training on a dataset of names and learning the patterns and sequences of characters. It then uses this knowledge to predict and create new sequences of characters that sound like names but are unique.

  • What is a character-level language model, and how does it differ from other types of language models?

    -A character-level language model operates on the level of individual characters in a sequence, predicting the next character based on the previous ones. It differs from word-level or sentence-level models, which may consider entire words or larger linguistic structures.

  • What is the significance of using a special start and end token in the 'makemore' model?

    -The special start and end tokens (written `<S>` for start and `<E>` for end, and later replaced by a single `.` token) signal the beginning and end of a sequence to the model. They let the model learn which characters tend to start and end a name, so that generation knows when to stop.

  • Can you explain the concept of 'bi-gram' in the context of language modeling?

    -A bi-gram is a pair of characters that appear consecutively in a given text. In language modeling, bi-gram models predict the likelihood of a character following another based on the frequency of their occurrence in the training data.

  • What is the role of the counts matrix 'N' in the 'makemore' model?

    -The counts matrix 'N' stores the frequency of bi-gram occurrences in the training data. Each row is normalized into the probabilities of characters following a given character, which is essential for generating new sequences.

  • How is the 'makemore' model trained using the negative log likelihood loss?

    -The model is trained by minimizing the negative log likelihood loss, which measures the model's ability to predict the training data. Lower loss values indicate a better fit of the model to the data.

  • What is meant by 'model smoothing' in the context of language models?

    -Model smoothing is a technique used to prevent the model from assigning zero probability to certain character sequences. It involves adding a small constant to all counts, making the model more robust and preventing infinite loss values for unlikely but possible sequences.

  • How does the neural network approach differ from the explicit counting approach in training the 'makemore' model?

    -The neural network approach uses a gradient-based optimization method to adjust the weights of the network in order to minimize the loss function. In contrast, the explicit counting approach directly calculates probabilities from the frequency of bi-grams in the data without the need for optimization.

  • What is the importance of the softmax function in the context of the 'makemore' model?

    -The softmax function is used to convert the logits (log counts) output by the neural network into a probability distribution. It ensures that the output values are positive and sum up to one, representing the probabilities of each possible next character.



🚀 Introduction to the 'makemore' Project

The speaker introduces the 'makemore' project, an endeavor to create a repository that generates more of anything given a dataset. The project's focus is on generating unique names from a dataset of about 32,000 names sourced from a government website. The aim is to produce unique, name-like outputs, which could be useful for naming babies or other entities. The speaker plans to develop the project step by step, starting with character-level language modeling and eventually moving to word and image generation.


🧠 Building a Bi-gram Language Model

The speaker discusses the creation of a bi-gram language model, which predicts the next character in a sequence given the previous one. The model is trained on a dataset of names, and the speaker explains the process of extracting bi-grams from the dataset. Special start and end tokens are introduced to handle the beginning and end of words. The speaker also demonstrates how to iterate over consecutive character pairs using Python's zip function and emphasizes the importance of capturing the statistical structure within words for accurate modeling.
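The bigram extraction described above can be sketched in a few lines. This is a minimal illustration, assuming a toy three-name list in place of names.txt; the helper name `bigrams` is hypothetical.

```python
# Toy stand-in for the names dataset (assumption: illustrative names only).
words = ["emma", "olivia", "ava"]

def bigrams(word):
    """Return consecutive character pairs, framed by start and end tokens."""
    chars = ["<S>"] + list(word) + ["<E>"]
    # zip stops at the shorter input, so each character is paired with its successor
    return list(zip(chars, chars[1:]))

for ch1, ch2 in bigrams(words[0]):
    print(ch1, ch2)
```

Zipping a sequence with itself shifted by one is what yields the sliding window of pairs; the start and end tokens make the first and last characters of a name visible to the model as bigrams too.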


📊 Counting and Storing Bi-gram Frequencies

The speaker explains how to count the occurrences of each bi-gram in the dataset using a dictionary to store the frequencies. The counts are then transferred into a two-dimensional array, or tensor, using PyTorch for efficient manipulation. The process of sorting and visualizing the bi-gram frequencies is also covered, with the goal of understanding the most and least common character sequences in the data.
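A dictionary-based count might look like the following. It is a sketch under the same assumption of a toy three-name list; the variable names `counts` and `ranked` are illustrative.

```python
words = ["emma", "olivia", "ava"]  # assumption: toy data, not the real dataset

counts = {}
for w in words:
    chars = ["<S>"] + list(w) + ["<E>"]
    for pair in zip(chars, chars[1:]):
        # dict.get supplies 0 the first time a bigram is seen
        counts[pair] = counts.get(pair, 0) + 1

# Sort by count, most frequent bigram first
ranked = sorted(counts.items(), key=lambda kv: -kv[1])
print(ranked[:3])
```

On this toy list the pair ('a', '<E>') comes out on top because all three names end in 'a'.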


📚 Transitioning to a 27x27 Count Array

The speaker transitions from using two special tokens to a single special token in the bi-gram model, resulting in a 27x27 count array. The array is visualized and analyzed to show the distribution of character sequences. The speaker also discusses the inefficiency of using two tokens and the decision to use a single token at position zero, offsetting the alphabetic characters to start at index one.
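The index layout can be sketched as below. This assumes '.' as the single boundary token and uses a nested Python list rather than a PyTorch tensor; the lookup-table names `stoi` and `itos` and the toy word list are illustrative.

```python
import string

words = ["emma", "olivia", "ava"]  # assumption: toy data

# '.' takes index 0; 'a'..'z' are offset to indices 1..26
stoi = {ch: i + 1 for i, ch in enumerate(string.ascii_lowercase)}
stoi["."] = 0
itos = {i: ch for ch, i in stoi.items()}

# N[i][j] counts how often character j follows character i
N = [[0] * 27 for _ in range(27)]
for w in words:
    chars = ["."] + list(w) + ["."]
    for ch1, ch2 in zip(chars, chars[1:]):
        N[stoi[ch1]][stoi[ch2]] += 1

print(N[0][stoi["e"]])   # names starting with 'e'
print(N[stoi["a"]][0])   # names ending with 'a'
```

With one shared token, row 0 holds the counts for first characters and column 0 holds the counts for last characters, so no row or column of the table is wasted on an impossible transition.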


🔄 Sampling from the Bi-gram Model

The speaker describes the process of sampling from the bi-gram model to generate new names. This involves starting with the special start token and iteratively sampling the next character from the current character's probability distribution until the end token is drawn. The use of PyTorch's `multinomial` function for sampling and the importance of a seeded generator for reproducible results are highlighted.
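The sampling loop can be sketched without PyTorch. This is a stand-in, not the video's code: it assumes a toy three-name list and swaps in `random.choices` with a seeded `random.Random` where the video uses `multinomial` with a seeded generator. The function name `sample_name` and the `max_len` cap are illustrative.

```python
import random
import string

words = ["emma", "olivia", "ava"]  # assumption: toy data
chars = "." + string.ascii_lowercase
stoi = {ch: i for i, ch in enumerate(chars)}

N = [[0] * 27 for _ in range(27)]
for w in words:
    seq = "." + w + "."
    for ch1, ch2 in zip(seq, seq[1:]):
        N[stoi[ch1]][stoi[ch2]] += 1

def sample_name(rng, max_len=50):
    """Walk the bigram chain from '.' until '.' is drawn again."""
    ix, out = 0, []
    while len(out) < max_len:
        # random.choices normalizes the weights itself, so raw counts work
        ix = rng.choices(range(27), weights=N[ix], k=1)[0]
        if ix == 0:
            break
        out.append(chars[ix])
    return "".join(out)

rng = random.Random(2147483647)  # fixed seed so reruns give the same names
names = [sample_name(rng) for _ in range(5)]
print(names)
```

Every character that can be reached has at least one recorded successor, so no row with all-zero weights is ever sampled from.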


🤖 Implementing the Bi-gram Model in Code

The speaker provides a detailed code implementation of the bi-gram model, explaining each step from initializing the counts matrix to sampling new names. The code includes the use of a generator for deterministic results, the creation of a probability distribution from the counts, and a loop for iterative sampling of characters to form new names.


🔍 Evaluating the Bi-gram Model's Quality

The speaker discusses the evaluation of the bi-gram model's quality using the negative log likelihood loss function. The process involves calculating the probability assigned by the model to each bi-gram in the training set and then computing the log likelihood and its negative. The goal is to minimize this loss function to improve the model's predictions.
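As a rough sketch of that evaluation, the average negative log likelihood of a count-based model can be computed as follows. It assumes a toy three-name list and only scores bigrams that occur in training, since an unseen pair would raise a KeyError here; the function name `avg_nll` is illustrative.

```python
import math

words = ["emma", "olivia", "ava"]  # assumption: toy data

# Bigram counts with '.' as the single boundary token
counts = {}
for w in words:
    seq = "." + w + "."
    for pair in zip(seq, seq[1:]):
        counts[pair] = counts.get(pair, 0) + 1

# How often each character appears as the first element of a bigram
row_total = {}
for (ch1, _), c in counts.items():
    row_total[ch1] = row_total.get(ch1, 0) + c

def avg_nll(word_list):
    """Average negative log likelihood per bigram under the count model."""
    log_likelihood, n = 0.0, 0
    for w in word_list:
        seq = "." + w + "."
        for ch1, ch2 in zip(seq, seq[1:]):
            p = counts[(ch1, ch2)] / row_total[ch1]
            log_likelihood += math.log(p)
            n += 1
    return -log_likelihood / n

loss = avg_nll(words)
print(round(loss, 4))
```

For comparison, a model that assigns every one of the 27 characters equal probability would score log(27), about 3.30, so any loss below that reflects structure learned from the data.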


📈 Training a Neural Network for Language Modeling

The speaker introduces an alternative approach to language modeling using a neural network framework. The focus is on training a neural network to predict the next character in a sequence given the current character. The process involves creating a training set of bigrams, one-hot encoding the inputs, and using the neural network to output logits, which are then transformed into probability distributions for evaluation against the training labels.
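The forward pass described here can be illustrated in plain Python. This is a sketch under assumptions: a single 27x27 weight matrix with random initial values, nested lists instead of PyTorch tensors, and illustrative names (`one_hot`, `softmax`, `forward`).

```python
import math
import random

VOCAB = 27  # '.' plus 26 letters

def one_hot(ix, size=VOCAB):
    v = [0.0] * size
    v[ix] = 1.0
    return v

def softmax(logits):
    m = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
# One row of weights per input character
W = [[rng.gauss(0.0, 1.0) for _ in range(VOCAB)] for _ in range(VOCAB)]

def forward(ix):
    x = one_hot(ix)
    # Vector-matrix product x @ W; with a one-hot x this selects row ix of W
    logits = [sum(x[i] * W[i][j] for i in range(VOCAB)) for j in range(VOCAB)]
    return softmax(logits)

probs = forward(5)  # index 5 is 'e' when '.' is 0 and 'a' is 1
print(sum(probs))
```

Because the input is one-hot, the multiplication only picks out one row of the weights, which is why this network ends up expressing the same kind of lookup table as the count matrix.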


🔧 Optimizing the Neural Network with Gradient Descent

The speaker explains the process of optimizing the neural network's weights using gradient descent. This involves running a forward pass to calculate the loss, performing a backward pass to calculate the gradients, and then updating the weights in the opposite direction of the gradients to minimize the loss. The speaker emphasizes that every operation in the network must be differentiable for the gradients to flow back to the weights.
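The loop of forward pass, backward pass, and update can be sketched in plain Python. This is not the video's implementation: it assumes a toy three-name list, and it replaces the automatic backward pass with the hand-derived gradient of softmax plus negative log likelihood (predicted probabilities minus the one-hot target). The learning rate, step count, and function names are illustrative choices.

```python
import math
import random

words = ["emma", "olivia", "ava"]  # assumption: toy data
chars = ".abcdefghijklmnopqrstuvwxyz"
stoi = {ch: i for i, ch in enumerate(chars)}

# Training pairs: current character index -> next character index
xs, ys = [], []
for w in words:
    seq = "." + w + "."
    for ch1, ch2 in zip(seq, seq[1:]):
        xs.append(stoi[ch1])
        ys.append(stoi[ch2])

rng = random.Random(0)
W = [[rng.gauss(0.0, 1.0) for _ in range(27)] for _ in range(27)]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def step(lr):
    """One forward pass, manual backward pass, and weight update."""
    n = len(xs)
    grad = [[0.0] * 27 for _ in range(27)]
    loss = 0.0
    for x, y in zip(xs, ys):
        probs = softmax(W[x])  # a one-hot input selects row x of W
        loss += -math.log(probs[y])
        for j in range(27):
            # d(loss)/d(logit_j) is probs_j minus 1 for the target class
            grad[x][j] += (probs[j] - (1.0 if j == y else 0.0)) / n
    for i in range(27):
        for j in range(27):
            W[i][j] -= lr * grad[i][j]  # move against the gradient
    return loss / n  # loss measured before this update

losses = [step(lr=5.0) for _ in range(200)]
print(losses[0], losses[-1])
```

On this toy data the loss should fall toward, but stay above, the value that direct counting achieves on the same names, which illustrates why the two approaches arrive at essentially the same model.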


🔄 Efficient Sampling from the Neural Network Model

The speaker demonstrates how to sample from the trained neural network model to generate new sequences of characters. The process involves encoding the input character into a one-hot vector, passing it through the neural network to get logits, normalizing the logits to get a probability distribution, and then sampling from this distribution to predict the next character.


🎯 Conclusion and Future Outlook

In conclusion, the speaker summarizes the process of training a bigram character-level language model using both a count-based approach and a neural network approach, showing that both methods can yield the same results. The speaker also discusses the flexibility and scalability of the neural network approach, hinting at future expansions to more complex models including transformers.



💡Language Modeling

Language modeling is the task of assigning probabilities to sequences of tokens, such as words or characters. In the context of the video, the focus is on character-level language modeling, where the model learns to predict the next character in a sequence given the previous characters. This is foundational for generating text that resembles natural language, as showcased by the 'makemore' repository's ability to generate new, unique names.


💡Dataset

A dataset in the video refers to a collection of data used for training the language model. Specifically, 'names.txt' is a dataset comprising 32,000 names that the model uses to learn patterns and generate new name-like sequences. The dataset is crucial as it provides the examples from which the model extracts statistical information.


💡Training

Training in the video script pertains to the process of teaching the model to understand and replicate patterns found in the dataset. By training on 'names.txt', the model learns to generate new names that have not been seen before, which demonstrates the model's capability to create novel outputs based on learned patterns.

💡Neural Network

A neural network is a set of algorithms modeled loosely after the human brain that is designed to recognize patterns. In the video, the neural network is trained to be a character-level language model, learning to generate new names. The script mentions that various types of neural networks, from simple bi-gram models to modern transformers, will be implemented to improve the model's predictive capabilities.

💡Bi-gram Model

A bi-gram model, as explained in the script, is a type of language model that predicts the next character based on the current character. It considers only the immediately preceding character to make its prediction. The script uses the bi-gram model as a starting point to build a simple language model before moving on to more complex models.


💡Character-level

Character-level refers to the granularity at which the language model operates. In the video, the model is character-level because it treats each individual character as a unit and learns to predict the next character in a sequence one at a time. This is contrasted with word-level models that would predict entire words or phrases.


💡Transformer

The transformer is an advanced type of neural network architecture mentioned in the script, known for its effectiveness in handling sequential data. The video mentions building a transformer equivalent to GPT-2, which is a significant step up in complexity and capability from the initial bi-gram model, indicating the series' progression towards more sophisticated language modeling.

💡Jupyter Notebook

A Jupyter Notebook is an open-source web application that allows creation and sharing of documents that contain live code, equations, visualizations, and narrative text. In the video, the presenter starts with a blank Jupyter Notebook page to demonstrate the step-by-step process of building the 'makemore' language model.

💡Special Tokens

Special tokens in the context of the video are unique characters used to signify the start and end of a sequence. The script mentions using a special start token and an end token to help the model understand the boundaries of the sequences it is generating, which is an important aspect of sequence generation tasks.

💡Negative Log Likelihood

Negative log likelihood is a loss function used to evaluate the performance of a model in the video. It measures how well the model's predictions match the actual data. The script explains that a lower negative log likelihood indicates a better model, as it means the model assigns higher probabilities to the correct next characters in the training set.

💡Model Smoothing

Model smoothing is a technique mentioned in the script to prevent the model from assigning zero probability to certain character sequences. It involves adding a small constant to all counts before normalizing; the larger the constant, the more uniform the resulting probability distribution. This avoids an infinite loss when the model is evaluated on a sequence containing a bi-gram that never occurred in the training data.
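The effect of smoothing can be shown on a single unseen bigram. This is a minimal sketch, assuming a toy three-name list in which 'e' is never followed by 'a'; the function names and the parameter `k` for the added constant are illustrative.

```python
import math

words = ["emma", "olivia", "ava"]  # assumption: toy data
chars = ".abcdefghijklmnopqrstuvwxyz"
stoi = {ch: i for i, ch in enumerate(chars)}

N = [[0] * 27 for _ in range(27)]
for w in words:
    seq = "." + w + "."
    for ch1, ch2 in zip(seq, seq[1:]):
        N[stoi[ch1]][stoi[ch2]] += 1

def bigram_prob(ch1, ch2, k=0):
    """P(ch2 | ch1) after adding a constant k to every count."""
    row = N[stoi[ch1]]
    total = sum(row) + k * 27
    if total == 0:
        return 0.0  # first character never seen and no smoothing applied
    return (row[stoi[ch2]] + k) / total

def nll(ch1, ch2, k=0):
    p = bigram_prob(ch1, ch2, k)
    return math.inf if p == 0 else -math.log(p)

unsmoothed = nll("e", "a")      # zero count, so the loss is infinite
smoothed = nll("e", "a", k=1)   # add-one smoothing gives a finite loss
print(unsmoothed, smoothed)
```

Raising `k` pushes every row toward the uniform value of 1/27, so the constant trades fidelity to the observed counts for robustness on unseen pairs.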


Introduction to building 'makemore', a language model repository on GitHub.

Makemore generates new data entries similar to the input, such as unique name suggestions.

The use of a large dataset of 32,000 names for training the model to produce name-like outputs.

Explanation of character-level language modeling and its approach to sequence prediction.

Implementation of various neural network models for language modeling, from bi-grams to transformers.

Building a bi-gram language model as a starting point for understanding language model mechanics.

The process of counting bi-gram occurrences to establish character sequence probabilities.

Utilization of Python's 'zip' function to iterate over pairs of characters in a word.

Introduction of special start and end tokens to frame the bi-gram modeling context.

Conversion of character sequences into a numerical format for processing with PyTorch tensors.

Use of 'torch.tensor' for creating and manipulating multi-dimensional arrays of counts.

The importance of understanding broadcasting in PyTorch for efficient tensor operations.

Transformation of raw counts into probabilities for language model predictions.

Sampling from the probability distribution using 'torch.multinomial' for name generation.

Discussion on the limitations of a simple bi-gram model and the need for more complex models.

Introduction of a neural network approach to language modeling as an alternative to counting.

Conversion of bigram data into a training set for a neural network, using one-hot encoding.

Explanation of the softmax function and its role in producing probability distributions from logits.

The use of negative log likelihood as a loss function for training the neural network.

Demonstration of gradient-based optimization to minimize loss and improve model predictions.

Comparison between the explicit counting method and the implicit neural network optimization.

Outlook on scaling up the model to handle more complex language modeling tasks.

Final thoughts on the flexibility and scalability of the neural network approach for language modeling.