The Most Popular Large Transformer Models for Different Tasks

25 September, 2023

The AI revolution is in full swing, and artificial neural networks are turning the once-impossible into reality. Machine learning models are now capable of tasks like conversational AI, generating media content, coding, and more. Transformers, a powerful architecture for working with sequences, have played a significant role in these AI advancements. In this article, we'll take a closer look at the world of Transformers and the wide range of models that have emerged in recent years.

The Transformer Impact

Transformers brought a game-changing "self-attention" mechanism, propelling AI forward. They also ushered in the era of transfer learning, where pretrained models became the norm for both industry and academia. Over the past 4-5 years, the AI community has focused on:

Training new models on fresh data.

Enhancing Transformer architecture.

Optimizing self-attention.

Combining Transformers with other architectures.

Handling longer sequences.

Innovating fine-tuning techniques.

Applying Transformers to non-text data.

Building multimodal models.

Navigating the AI Universe

Keeping up with the flood of papers, models, and approaches is tough even for experts. 2022 was a year of innovation, and 2023 continues the trend. To make sense of it all, I have structured this article by model types.

Popular Large Language Transformer Models

BERT-like Models for Text

In this section, we'll explore models based on the Transformer encoder architecture, designed for various text-related tasks like vectorization, classification, sequence labeling, QA (Question Answering), and NER (Named Entity Recognition). These models are versatile, working across different languages and settings.

1. BERT

Model Info: Google (2018)

Article: "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding"

Architecture: Transformer Encoder

Parameters: 110M, 340M

BERT uses WordPiece tokenization with a 30K-token vocabulary. Its input embedding is the sum of three vectors: a token vector, a trainable positional vector, and a segment vector (1st or 2nd text). The input sequence consists of the CLS token, the tokens of the 1st text, and the tokens of the 2nd text, with a SEP token closing each text.

BERT is trained on two tasks: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). In MLM, 15% of tokens are selected for prediction; of these, 80% are replaced with [MASK], 10% are replaced with random tokens, and 10% remain unchanged. The model predicts the original token, with loss calculated only on the selected 15% of tokens. In NSP, the model predicts whether the 2nd text follows the 1st text. The prediction is made on the CLS token's output vector.
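
The 80/10/10 corruption rule can be sketched in a few lines. This is a simplified illustration, not BERT's actual data pipeline: it selects each token independently with probability 0.15 instead of picking an exact 15% of positions, and the names `mask_tokens`, `select_prob`, and the toy vocabulary are assumptions made for this example.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, select_prob=0.15, rng=None):
    """Return (corrupted_tokens, labels); a label is None where no loss is computed."""
    rng = rng or random.Random()
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < select_prob:
            labels.append(tok)                       # loss only at selected positions
            r = rng.random()
            if r < 0.8:
                corrupted.append(MASK)               # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                corrupted.append(tok)                # 10%: keep the token unchanged
        else:
            labels.append(None)
            corrupted.append(tok)
    return corrupted, labels

tokens = "the cat sat on the mat".split()
corrupted, labels = mask_tokens(tokens, vocab=tokens, rng=random.Random(0))
```

Positions whose label is `None` contribute nothing to the loss, which matches the statement that loss is calculated only on the selected tokens.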

To speed up training, 90% of the training steps use sequences of 128 tokens. The remaining 10% use 512-token sequences so that the model learns valid positional embeddings for longer inputs. The training data consists of 16GB of text.

2. RoBERTa

Model Info: Facebook (2019)

Article: "RoBERTa: A Robustly Optimized BERT Pretraining Approach"

Architecture: Transformer Encoder

Parameters: 125M, 355M

RoBERTa is an improved version of BERT with no drastic changes. It trains only on MLM (NSP is deemed less useful), and training sequences are longer but still capped at 512 tokens. Dynamic masking is used, so different tokens are masked on repeated passes over the same data, and the training dataset is significantly larger. Hyperparameters are carefully tuned for better performance.

3. ALBERT

Model Info: Google (2019)

Article: "ALBERT: A Lite BERT for Self-supervised Learning of Language Representations"

Architecture: Transformer Encoder

Parameters: 12M, 18M, 60M, 235M

ALBERT's goal is to make BERT lighter without sacrificing quality. It shares parameters in different encoder blocks, demonstrating that layer-wise attention weights can be shared. It employs smaller input embeddings and larger hidden layer vectors compared to BERT. This is achieved by using an additional projection matrix at the network's input, allowing the decoupling of embedding size from hidden representation size.
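
To see why the extra projection matrix saves parameters, it helps to count them. The sketch below compares a direct V x H embedding table with a factorized V x E table followed by an E x H projection. The sizes used (V = 30,000, H = 768, E = 128) are illustrative assumptions for this example, and `embedding_params` is a hypothetical helper.

```python
def embedding_params(vocab_size, hidden_size, embed_size=None):
    """Parameter count of the input embedding, optionally factorized as V x E plus E x H."""
    if embed_size is None:
        return vocab_size * hidden_size
    return vocab_size * embed_size + embed_size * hidden_size

V, H, E = 30_000, 768, 128  # illustrative sizes, assumed for this example
direct = embedding_params(V, H)          # 23,040,000
factorized = embedding_params(V, H, E)   # 3,938,304
print(direct, factorized)
```

With these numbers the factorized embedding is almost six times smaller, and the gap widens as the hidden size grows.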

As a result, an ALBERT configuration comparable to BERT-large has about 18 times fewer parameters and trains about 1.7 times faster. It is trained on MLM and Sentence Order Prediction (SOP) tasks. In MLM, not only individual tokens but also N-grams are masked, addressing a limitation of BERT. ALBERT uses the XLNet and RoBERTa datasets for training.

These models represent a glimpse into the world of Transformer-based models for text, each with its unique approach and strengths.

GPT-like and T5-like Models for Text

In this section, we delve into the realm of generative models based on the full Transformer architecture or its decoder. These models have a wide range of applications, including dialogue generation, machine translation, logical and mathematical reasoning, code analysis, and text generation from text. The most substantial and sophisticated models often stem from the decoder architecture and excel in few-shot and zero-shot settings.

1. GPT-2

Model Info: OpenAI (2019)

Article: "Language Models are Unsupervised Multitask Learners"

Architecture: Transformer Decoder

Parameters: 117M, 345M, 762M, 1.5B

GPT-2 is a Transformer decoder that learns from the Causal LM task, predicting the next token based on the left context. It features slight architectural changes: each decoder block removes the cross-attention layer, and pre-LayerNorm is employed, placed at the input of each block and additionally at the output of the last self-attention layer.

Byte-level BPE tokenization with a 50K-token vocabulary is used. When the vocabulary is built, merges that would produce near-duplicate entries ("dog," "dog!", "dog.") are avoided. The maximum sequence length is 1024 tokens. During generation, layer outputs are cached for all previously generated tokens.
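
The Causal LM objective can be illustrated with two small helpers: one builds the attention mask that restricts each position to its left context, and the other lists the (context, next token) pairs the model is trained on. Both function names are hypothetical, and this is a sketch of the idea, not GPT-2's implementation.

```python
def causal_mask(seq_len):
    """mask[i][j] is True when position i may attend to position j (only j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def next_token_pairs(tokens):
    """(left context, target token) pairs used by the causal LM objective."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

mask = causal_mask(3)
pairs = next_token_pairs(["the", "cat", "sat"])
```

Because each position only sees earlier positions, the outputs computed for already generated tokens never change, which is what makes caching them during generation possible.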

2. T5

Model Info: Google (2019)

Article: "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer"

Architecture: Full Transformer

Parameters: 60M, 220M, 770M, 3B, 11B

T5 is a full Transformer pretrained on an MLM-style span-corruption task (15% of tokens are masked). Each masked span is replaced with a unique sentinel token ("<X>", "<Y>", ...), and the model predicts the sequence "<X>span<Y>span..." on the output.
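
A minimal sketch of this span-masking objective is shown below. The span positions are passed in by hand rather than sampled, the sentinel strings `<extra_id_0>`, `<extra_id_1>` are illustrative names chosen for this example, and `span_corrupt` is a hypothetical helper, so this should be read as an illustration of the input/target format only.

```python
def span_corrupt(tokens, spans):
    """spans: sorted, non-overlapping (start, end) index pairs, end exclusive."""
    inputs, targets = [], []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"        # illustrative sentinel naming
        inputs.extend(tokens[cursor:start])
        inputs.append(sentinel)             # the whole span collapses to one sentinel
        targets.append(sentinel)            # the target repeats the sentinel ...
        targets.extend(tokens[start:end])   # ... followed by the hidden span
        cursor = end
    inputs.extend(tokens[cursor:])
    return inputs, targets

tokens = "Thank you for inviting me to your party last week".split()
inputs, targets = span_corrupt(tokens, [(2, 4), (8, 9)])
```

Here the encoder would see "Thank you <extra_id_0> me to your party <extra_id_1> week" and the decoder would be trained to produce "<extra_id_0> for inviting <extra_id_1> last".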

LayerNorm is applied before the input of self-attention and fully connected layers. Relative positional encoding is used. Positions are encoded with trainable embeddings, with each "embedding" being a scalar added to the corresponding attention weight logits (B is the matrix of these scalars).

On a single layer, the model distinguishes relative distances of up to 128 tokens and treats all larger distances identically, which enables inference on longer sequences than those seen in training. SentencePiece tokenization with a 32K-token vocabulary is used. After pretraining, the model is fine-tuned on various NLP tasks with appropriate prompts ("translate," "TL;DR," etc.).

3. BART

Model Info: Facebook (2019)

Article: "BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension"

Architecture: Full Transformer

Parameters: 140M, 406M

BART is another full Transformer, but it replaces ReLU with GeLU activation throughout. It learns to predict the original text from a noisy version (denoising autoencoder). The types of noise include token masking, token deletion, text infilling (sampling token spans, possibly of length 0, and replacing each with a single MASK token), sentence permutation (shuffling the order of sentences), and document rotation (randomly choosing a token as the new start of the sequence).

Byte-level BPE tokenization with a 50K-word vocabulary is used.

These models showcase the versatility and capabilities of Transformer-based generative models for text, offering unique features and applications.

Instruct Models for Text

In this section, we explore modernized versions of decoder models from the previous section. They have undergone fine-tuning for instruction-following tasks and, optionally, additional output correction methods like RLHF to improve response quality during dialogues and task completion.

RLHF Simplified

Basic Concept: RLHF trains large language models (LLMs) to follow instructions effectively. It starts with fine-tuning the LLM on datasets containing prompts and human-rated responses. The rated responses are then used to train a reward model (RLM), which scores the quality of a text with a scalar value.

RLHF Process Overview:

Two copies of LLM: Trainable (A) and Fixed Reference (B).

Policy - LLM (A): Takes prompts as input and generates text.

Action space: All tokens in the vocabulary.

Observation space: All possible sequences of tokens as input.

Reward function: RLM's evaluation + penalty for significant deviation from (B).

For a given prompt, both models generate responses.

The response from (A) is scored by the reward model, and the parameters of (A) are updated.

The token probability distributions of (A) and (B) are compared using KL-divergence.

Model (A) is penalized for deviating significantly from (B).

Updates are made using a specified algorithm (e.g., PPO or A2C).
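
The reward function from the overview above can be sketched as follows. This is a simplified illustration and not the procedure of any particular paper: the KL term is estimated from the log-probabilities of the tokens that (A) actually generated, and the coefficient `beta` and the function name `penalized_reward` are assumptions made for this example.

```python
def penalized_reward(rm_score, logp_policy, logp_reference, beta=0.1):
    """Reward for one response: reward-model score minus a KL-style penalty.

    logp_policy / logp_reference are the per-token log-probabilities that
    models (A) and (B) assign to the tokens generated by (A).
    """
    # Sample-based estimate of KL(A || B), summed over the generated tokens.
    kl_estimate = sum(a - b for a, b in zip(logp_policy, logp_reference))
    return rm_score - beta * kl_estimate

same = penalized_reward(1.0, [-1.0, -2.0], [-1.0, -2.0])     # no drift, no penalty
drifted = penalized_reward(1.0, [-0.5, -1.0], [-1.0, -2.0])  # (A) moved away from (B)
```

The scalar returned here is what an RL algorithm such as PPO would then maximize when updating (A).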

1. InstructGPT

Model Info: OpenAI (2022)

Article: "Training language models to follow instructions with human feedback"

Architecture: Transformer Decoder

Parameters: 1.3B, 6B, 175B

InstructGPT adapts GPT-3 to perform high-quality instruction-following tasks. It starts with supervised fine-tuning on datasets of prompts paired with human-written demonstration responses. This forms the base model, which is then enhanced using RLHF with a reward model trained on human rankings of model outputs. The reward model ("reward") is GPT-3 6B. InstructGPT's success led to the creation of ChatGPT.

2. Alpaca

Model Info: Stanford University (2023)

Article: "Alpaca: A Strong, Replicable Instruction-Following Model"

Architecture: Transformer Decoder

Parameters: 7B, 13B

Alpaca is LLaMA fine-tuned to follow instructions. A significant aspect is the process of generating its dataset using GPT-3:

175 task prompts with answers are written by humans and fed into GPT-3, which produces new tasks. Generation is iterative, and each round's prompt mixes human-written and previously generated tasks. GPT-3 classifies the generated tasks into classification and non-classification tasks, which use different strategies for generating inputs and outputs. The resulting (task, input, output) triplets are filtered based on quality and dissimilarity with existing data.

This process resulted in 52K unique triplets, used to fine-tune LLaMA 7B.

3. Vicuna

Model Info: UC Berkeley, Carnegie Mellon University, Stanford University, UC San Diego (2023)

Article: "Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality"

Architecture: Transformer Decoder

Parameters: 7B, 13B

Vicuna is one of the top-performing models obtained by fine-tuning LLaMA. It is trained on 70K dialogues collected from the ShareGPT service, which are lightly filtered after parsing, with long dialogues split into shorter ones. The maximum sequence length is increased to 2048, and gradient checkpointing and Flash Attention are used to manage memory usage. During fine-tuning, loss is computed only for model responses.
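
Computing loss only for model responses amounts to masking out the user-turn tokens before averaging. A minimal sketch, assuming per-token log-probabilities are already available (the helper `masked_nll` and the toy numbers are hypothetical):

```python
import math

def masked_nll(token_logprobs, is_response):
    """Average negative log-likelihood over assistant-response tokens only."""
    picked = [-lp for lp, keep in zip(token_logprobs, is_response) if keep]
    return sum(picked) / len(picked)

logprobs = [math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.125)]
is_response = [False, False, True, True]   # the first two tokens belong to the user turn
loss = masked_nll(logprobs, is_response)
```

Only the last two tokens contribute, so the user's text conditions the model without being something the model is trained to reproduce.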

Vicuna's quality is evaluated automatically compared to competitors using GPT-4. In the 13B version, Vicuna outperforms Alpaca with a higher rating (92% vs. 76%).

The paper also introduces a training manager that uses cloud spot instances to reduce training costs.

Models Oriented Towards Code

While large language models can analyze and generate code, there is a class of smaller models designed specifically for these tasks. They tackle challenges related to code generation from textual descriptions and vice versa, as well as code continuation and insertion.

1. PLBART

Model Info: University of California, Columbia University (2021)

Architecture: Full Transformer

Parameters: 140M

PLBART, based on BART-base, is trained with three denoising tasks (token masking, token deletion, and token infilling), where the goal is to recover the entire text.

The pre-training data includes English text descriptions and Python and Java code. Fine-tuning tasks include code generation, summarization, code translation, duplicate classification, and vulnerability detection. SentencePiece is used for tokenization with a 50K vocabulary.

2. CODEX

Model Info: OpenAI (2021)

Architecture: Transformer Decoder

Parameters: 12M, 25M, 42M, 85M, 300M, 679M, 2.5B, 12B

CODEX fine-tunes GPT-3 to improve Python code generation quality. It is evaluated by whether generated code passes unit tests, using a metric called pass@k: k samples are generated per task, a task counts as solved if at least one sample passes all of its unit tests, and the metric is the fraction of solved tasks. Besides code from repositories, the model was further fine-tuned on individual functions (signature + docstring + code). It can also generate docstrings from code. To manage sequence length, the tokenizer includes tokens for whitespace runs of varying lengths.
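
In practice pass@k is usually estimated from more than k samples. The Codex paper describes an unbiased estimator: generate n >= k samples per task, count the c samples that pass, and compute 1 - C(n-c, k) / C(n, k). The sketch below implements that formula for a single task; the function name and the example numbers are chosen for illustration.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one task: n samples drawn, c of them pass the tests."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

one_try = pass_at_k(n=10, c=3, k=1)    # chance that a single sample passes
five_tries = pass_at_k(n=10, c=3, k=5)
```

The benchmark score is the mean of this value over all tasks.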

3. CodeT5

Model Info: Salesforce (2021)

Architecture: Full Transformer

Parameters: 60M, 220M

CodeT5 is fine-tuned from T5 for three equally important code-related tasks. The input format is "CLS, text tokens, SEP, code tokens," and the text part can be empty. Each code token has a mask indicating whether it's an identifier (function name, variable) or not, determined through code syntactic parsing and tree analysis.

Pre-training tasks include:

Span prediction (1-5 tokens) like in T5, predicting MASK number and answer.

Identifier label prediction for all tokens.

Replacing all identifiers with MASK and predicting their names.

The model is then trained to generate code from docstrings and docstrings from code concurrently. It uses its own tokenizer, which produces sequences on code data that are 30-45% shorter than those of the original T5 tokenizer. The pretrained model is fine-tuned on summarization, code generation, and translation tasks, and quality is evaluated using the BLEU metric.

Models for Computer Vision (CV)

The success of transformers in processing text sequences led to successful attempts to apply them to image processing tasks as well. Here are some models that incorporate Transformer architecture elements and tackle various computer vision tasks, including image vectorization, classification, segmentation, object detection, and more.

1. Image GPT

Model Info: OpenAI (2020)

Architecture: Transformer Decoder

Parameters: 76M, 455M, 1.4B, 6.8B

Image GPT is built upon the GPT-2 architecture and aims to generate high-quality features for computer vision tasks. Pixels are used as tokens, and to reduce the massive context size, the input is limited to a 32x32 resolution. The three eight-bit color channels (3x256) are approximated using a single nine-bit (1x512) channel. After pretraining as a Causal LM, fine-tuning or training a head on top of the model can be performed.

2. ViT (Vision Transformer)

Model Info: Google (2020)

Architecture: Transformer Encoder

Parameters: 86M, 307M, 632M

ViT is a popular and widely used model based on the transformer encoder. It divides an image into square patches that become tokens. Pre-LayerNorm is used, and there is a CLS token for image class prediction during pretraining. The head is replaced with a new one during fine-tuning. Lower-resolution images are used during pretraining, and higher-resolution images during fine-tuning. Because the patch size stays fixed, higher resolution leads to longer sequences.

The model employs learnable positional embeddings, and for longer sequences, 2D interpolation of embeddings is used, taking square position into account.
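
The relationship between resolution, square size, and sequence length can be made concrete with a short sketch. The resolutions 224 and 384 and the 16-pixel square size are illustrative assumptions, and `num_patches` and `patchify` are hypothetical helpers operating on nested lists instead of real image tensors.

```python
def num_patches(image_size, patch_size):
    """Number of tokens when a square image is cut into square patches."""
    assert image_size % patch_size == 0
    side = image_size // patch_size
    return side * side

def patchify(image, patch_size):
    """image: H x W nested list; returns flattened patches in row-major order."""
    h, w = len(image), len(image[0])
    patches = []
    for top in range(0, h, patch_size):
        for left in range(0, w, patch_size):
            patches.append([image[r][c]
                            for r in range(top, top + patch_size)
                            for c in range(left, left + patch_size)])
    return patches

image = [[r * 4 + c for c in range(4)] for r in range(4)]  # toy 4x4 "image"
patches = patchify(image, 2)
pretrain_len = num_patches(224, 16)   # 196 tokens
finetune_len = num_patches(384, 16)   # 576 tokens
```

Going from 196 to 576 tokens is why the learned positional embeddings have to be interpolated for the higher resolution.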

3. BEiT (BERT Pre-Training of Image Transformers)

Model Info: Microsoft, Harbin Institute of Technology (2021)

Architecture: Transformer Encoder

Parameters: 86M, 307M

BEiT is another encoder-based transformer model for computer vision. It divides the image into a 14x14 grid of squares, and for each square, two representations are created:

One based on pixels (the square is transformed into a vector, similar to text).

Another one based on internal representations from a VAE (a vector from a pretrained VAE).

The VAE has a vocabulary of 8K vector tokens, and it can be pretrained or borrowed from another model like DALL-E. BEiT is pretrained with Masked Image Modeling (MIM), where pixel vectors are the input and the model predicts the VAE representation vectors for masked tokens.

Subsequently, fine-tuning can be performed on the target task or using adapter networks.

Models for Generating Images from Text and Images

This section discusses models that gained significant attention in 2022 for generating images from textual descriptions. Currently, diffusion models combined with transformers are dominant in this field, allowing not only image generation but also content manipulation and resolution enhancement.

Diffusion Models Explained

The core idea is to iteratively add Gaussian noise to an image and train a model to predict that noise so it can be removed.

The model learns to predict noise based on the image and the noise iteration step.

During training, pairs of "noisy image - noise step" are presented, and the model is trained to predict the noise with a Mean Squared Error (MSE) loss. The noise step, a value from 0 to T, is fed to the model as a vector.

The reparameterization trick and a variational lower bound (ELBO) yield a closed-form expression that applies the noise for step t in a single operation instead of sequentially. This speeds up training.

During inference, the input consists of the noisy image and the noise step T. The model iteratively removes noise from the input, resulting in the final output image.

The process can be conditioned on text prompts or other input factors.
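
The "noise applied immediately for step t" idea can be sketched as below. This follows the common DDPM-style closed form x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps, which is an assumption about the formulation meant here. The linear beta schedule, T = 1000, and all function names are illustrative choices, and no actual model is trained.

```python
import math
import random

def alpha_bar(betas, t):
    """Cumulative product of (1 - beta_s) over the first t steps (t = 0 gives 1.0)."""
    prod = 1.0
    for beta in betas[:t]:
        prod *= 1.0 - beta
    return prod

def noise_to_step(x0, t, betas, rng):
    """Jump straight to step t: x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps."""
    a_bar = alpha_bar(betas, t)
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e for x, e in zip(x0, eps)]
    return x_t, eps  # eps is the regression target for the MSE loss

def mse(pred, target):
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]  # assumed linear schedule
x0 = [0.5, -0.25, 1.0]                                          # toy "image"
x_t, eps = noise_to_step(x0, 500, betas, random.Random(0))
```

A training step would pass `x_t` and the step index to the network and minimize `mse(predicted_eps, eps)`.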

1. DALL-E

Model Info: OpenAI (2021)

Architecture: VAE + Transformer Decoder

Parameters: 12B

DALL-E's work is divided into two stages: first, tokens for images are trained, and then a joint generative model for text and images is learned.

In the first stage, a discrete Variational Autoencoder (dVAE) is trained to map images from 256x256x3 space to 32x32x'dim' space and back, where 'dim' represents the dimensionality of the hidden representation. A total of 8192 such token vectors are used in the model.

The primary model employs a Sparse Transformer decoder. It takes text tokens and image tokens as input, learns a joint distribution (Causal LM), and can generate image tokens based on text. DALL-E generates images using these tokens via dVAE. The loss weight for text tokens is 1/8, and for image tokens, it's 7/8.

Text tokens use standard and positional embeddings, while image tokens use standard embeddings and positional embeddings for columns and rows. The maximum sequence length for text tokens is 256, and BPE tokenization is used (vocabulary of 16K).

Several forms of self-attention are used:

Text tokens can attend to all input tokens.

Image tokens can attend to all text tokens.

Image tokens can attend to each other using two types of attention: vertical and horizontal neighbors.

BPE dropout is used for text tokens during training for regularization. Most of the weights and activations are in float16, with occasional transitions to float32 for stability (e.g., in residual connections).

2. Latent Diffusion (Stable Diffusion)

Model Info: CompVis [Stability AI] (2021, 2022)

Architecture: U-Net + VAE + Transformer Encoder

Parameters: 170M-400M, 1.45B

The key idea here is that a diffusion model operating in pixel space first learns to transform everything into a lower-dimensional latent representation before working with it. This means it makes sense to train the model in that space from the start, rather than on pixels.

Two models are trained:

A variational autoencoder (VAE) for dimensionality reduction and latent space generation.

A diffusion model (DM) on the internal representations.

The VAE is trained in a GAN-style setup with a discriminator on its outputs, and additional regularization enforces proximity of representations to a standard normal distribution.

Conditional generation can be based on text, images, or semantic maps. Each of these inputs is encoded with its own encoder model into one or more vectors. The result goes into each diffusion step in the latent space: if the condition is a vector, it's concatenated with the input latent vector for that step; if it's a sequence of vectors, it's used in cross-attention across different layers of the U-Net. For text prompts, a text encoder from CLIP is used.

The same model is trained for various tasks, including text-to-image, coloring, inpainting, and super-resolution.

3. Imagen

Model Info: Google (2022)

Architecture: 3 x U-Net + Full Transformer

Parameters: 2B + 600M + 400M + 11B

Imagen is based on the idea that expanding the size of the text encoder can bring more benefits to the generative model than expanding the DM size. Therefore, CLIP is replaced with a standard T5-XXL, which outputs a set of embeddings that are used in the cross-attention of all trainable DMs. Three DMs are used: one primary and two resolution-increasing DMs (64 -> 256 -> 1024).

The base image is generated from noise conditioned on a text vector, and the resolution-increasing models are conditioned on images corrupted with Gaussian blur and noise.

To control the generation, classifier-free guidance is used, where text vectors are zeroed in 10% of cases. The base U-Net architecture in the work was optimized for speed and memory consumption.
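
At sampling time, classifier-free guidance combines two noise predictions from the same model, one with the text condition and one with it zeroed out. A minimal sketch of the usual combination rule follows; `guided_noise` is a hypothetical helper and the vectors are toy values.

```python
def guided_noise(eps_uncond, eps_cond, guidance_scale):
    """Move from the unconditional prediction toward (and past) the conditional one."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_uncond = [0.0, 1.0]   # prediction with the text vector zeroed
eps_cond = [1.0, 3.0]     # prediction with the text vector present
plain = guided_noise(eps_uncond, eps_cond, 1.0)    # equals the conditional prediction
strong = guided_noise(eps_uncond, eps_cond, 2.0)   # pushes further toward the prompt
```

Zeroing the text vectors for a fraction of training examples is what lets a single network produce both predictions.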

Models for Generating Text from Text and Images

This section discusses models commonly referred to as multimodal models. These models generate text but can also analyze and process data of different types, such as text and 2D images. They can produce textual outputs or sets of commands, e.g., for robots.

1. SimVLM

Model Info: Google (2021)

Architecture: Full Transformer

Parameters: 86M, 307M, 632M

In SimVLM, the encoder receives tokens from both an image and the prefix text description, and the decoder generates tokens for the suffix of the text description (Prefix LM).

The image is divided into squares and vectorized using CNN (the first three blocks of ResNet). Both image and text tokens have their own learnable positional embeddings. Ordinary self-attention is applied to all tokens. Image tokens additionally have 2D relative attention in the encoder block.

The model is pretrained on 800GB of text and 1.8 billion pairs of images and their descriptions. Fine-tuning involves six tasks with adjustments to all model parameters.

The vocabulary size is 32K, and the maximum sequence length for both the encoder and decoder is 256. Images have a resolution of 224x224 and are divided into 14x14 squares.

2. CoCa

Model Info: Google (2022)

Architecture: Transformer Encoder + Transformer Decoder

Parameters: Varies (multiple model sizes)

CoCa employs a separate encoder for images (ViT or CNN) and a shared decoder, where the first half processes text, and the second half processes both text and the output from the image encoder.

Images of size 288x288 are divided into 18x18 squares, which the encoder transforms into vectors. An overall vector is created using attention pooling based on all these vectors.

The first half of the decoder outputs text vectors and the CLS token vector at the end of the sequence. SentencePiece tokenization with a 64K vocabulary is used. Text and image vectors are combined in the second half of the decoder through cross-attention.

Two loss functions with weights of 1 and 2 are used:

The similarity between the vector from the image's attention pooling and the text's CLS token vector for the image-description pair.

An autoregressive loss on the decoder's overall output (conditioned on the image).

During fine-tuning, it is possible to freeze the image encoder and only fine-tune the attention pooling.

3. GPT-4

Model Info: OpenAI (2023)

Architecture: Transformer Decoder + (?)

GPT-4 is a closed model with limited publicly available details. It is believed to be a decoder with sparse attention and multimodal input capabilities. The training includes autoregressive learning and reinforcement learning from human feedback (RLHF). The sequence length in GPT-4 models ranges from 8K to 32K.

GPT-4 has been tested on human exams in zero-shot and few-shot settings, performing at or above human level. It can solve tasks based on images (including mathematical tasks), both directly and step by step. It understands and explains images, analyzes and generates code, and operates in various languages, including low-resource languages.

Models for Sound Analysis and ASR

In this final section, models for ASR are discussed. These models are typically based on pretraining on unlabeled data to generate effective features, followed by fine-tuning on labeled data for specific tasks like ASR, speech translation, or language identification. Sound processing often involves a combination of CNN and transformer encoder to extract local features and perform global processing across the entire sequence of audio frames.

1. Conformer

Model Info: Google (2020)

Architecture: Conformer (Transformer Encoder + CNN)

Parameters: 10M, 30M, 118M

Conformer introduces a block that combines a transformer block with an embedded convolutional block. The components of the block include:

FF-module: pre-LayerNorm + 2 fully connected layers with Swish activation + dropout.

Attention-module: pre-LayerNorm + multi-head self-attention + dropout.

CNN-module: 2 pointwise Convs + 1D Depthwise Conv + Layer/Batch norm + activations (Swish, Glu) + dropout.

The Conformer block consists of the following sequence: 0.5 * FF-module + Attention-module + CNN-module + 0.5 * FF-module, all with residual connections, topped with LayerNorm.
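
The order of operations in the block can be written down directly. In this sketch each module is passed in as a plain callable on a vector, so it shows only the residual wiring and the 0.5 scaling of the two FF-modules, not real attention or convolution layers. All names are hypothetical.

```python
def conformer_block(x, ff1, attention, conv, ff2, layer_norm):
    """x: list of floats; each module is a callable mapping a vector to a vector."""
    def add(a, b, scale=1.0):
        return [ai + scale * bi for ai, bi in zip(a, b)]

    x = add(x, ff1(x), 0.5)     # half-step feed-forward residual
    x = add(x, attention(x))    # self-attention residual
    x = add(x, conv(x))         # convolution residual
    x = add(x, ff2(x), 0.5)     # second half-step feed-forward residual
    return layer_norm(x)        # final LayerNorm on top

identity = lambda v: v
zero = lambda v: [0.0] * len(v)
unchanged = conformer_block([1.0, 2.0], zero, zero, zero, zero, identity)
scaled = conformer_block([1.0, 2.0], identity, zero, zero, identity, identity)
```

With identity feed-forward modules the input is scaled by 1.5 twice, which makes the two half-step residuals easy to verify by hand.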

The final model architecture (from bottom to top) includes SpecAug, Conv subsampling, a fully connected layer + dropout, N Conformer blocks, and an LSTM layer for decoding.

2. T-T (Streaming Transformer Transducer)

Model Info: Microsoft (2020)

Architecture: Conformer (Transformer Encoder + CNN) + LSTM

Parameters: 80M

T-T is an evolution of Conformer. The input signal (acoustic features) is segmented into frames and fed into the encoder with Conformer blocks, with slight modifications.

Positional encoding is relative (lookup table), and vectors derived from indices i and j are added to key vectors in self-attention. Due to the length of sequences, attention scope needs to be reduced:

Attention masks are used and shared across all layers.

Frames are grouped into non-overlapping blocks.

Within each block, frames attend to each other.

Frames from the left block cannot attend to frames from the right block.

Frames from the right block can attend to frames from the left block if they fall within a fixed-size window.
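
The masking rules above can be turned into a small mask builder. This sketch assumes the left window is measured in frames and that a frame always sees its whole block; the parameter names `block_size` and `left_window` and the function `streaming_mask` are made up for this example and may not match the original implementation.

```python
def streaming_mask(num_frames, block_size, left_window):
    """mask[i][j] is True when frame i may attend to frame j."""
    mask = []
    for i in range(num_frames):
        row = []
        for j in range(num_frames):
            same_block = i // block_size == j // block_size  # full attention inside a block
            in_left_window = j < i and i - j <= left_window  # limited look-back to the left
            row.append(same_block or in_left_window)         # nothing to the right of the block
        mask.append(row)
    return mask

mask = streaming_mask(num_frames=6, block_size=2, left_window=2)
```

Because the mask depends only on frame indices, the same mask can be shared across all layers.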

Above the encoder outputs, an LSTM predictor predicts the response vector for the current frame based on the previous response. This vector is summed with the encoder output. The resulting vector is passed through a fully connected layer and softmax to predict the response for the current frame. For efficiency, keys and values are cached, and blocks of frames are processed at once with a slight delay for individual frame predictions.

3. HuBERT

Model Info: Facebook (2021)

Architecture: CNN + Transformer Encoder

Parameters: 95M, 317M, 964M

HuBERT is conceptually similar to Wav2Vec2. It also learns to predict representations for masked frames, but it obtains the prediction targets differently, by clustering frames with K-means.

The architecture includes:

A CNN encoder for audio signals (7 layers with 512 channels).

A transformer encoder.

A fully connected projection layer and code vector layer.

Training occurs in two stages: K-means and the main encoder.

First Stage (K-means):

Audio signals (WAV) are segmented into audio frames, which are converted into MFCC feature vectors (dim=39).

K-means (K=100) is trained on these vectors, generating code vectors.

Each input frame can be associated with a code vector.

Second Stage (Encoder):

Similar to Wav2Vec2, frames are masked (8% of length, span length=10).

The CNN produces one feature vector per 20 ms frame.

Masked frames are replaced with a learned mask vector.

The input is fed into the encoder, and the output is projected to the dimension of code vectors.

Cosine similarity between the output vector and all code vectors is computed, and they are passed through a softmax with temperature.

Cross-entropy loss is computed on this distribution for masked tokens.

These two stages are then repeated. Starting from the second iteration, K-means is trained on representations from an intermediate layer of the latest version of the encoder.

This two-stage approach simplifies and stabilizes training, eliminating the need for complex losses like in Wav2Vec2. The pretrained model is fine-tuned (with the CNN frozen) on labeled recordings.
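
The loss for a single masked frame in the second stage can be sketched as follows: cosine similarity to every code vector, softmax with temperature, then cross-entropy against the frame's K-means code. The temperature value 0.1, the toy codebook, and the function names are assumptions made for this illustration.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def masked_frame_loss(output, codebook, target_index, temperature=0.1):
    """Cross-entropy of softmax(cosine / temperature) against the frame's K-means code."""
    logits = [cosine(output, code) / temperature for code in codebook]
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_norm - logits[target_index]

codebook = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # toy code vectors
good = masked_frame_loss([0.9, 0.1], codebook, target_index=0)
bad = masked_frame_loss([0.9, 0.1], codebook, target_index=2)
```

The output vector points almost exactly at the first code, so the loss is near zero for target 0 and large for target 2.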

Conclusion: The Future of AI Models

AI models are advancing rapidly. They're getting bigger, not just by adding more layers and growing datasets, but also by incorporating alternative techniques. These advancements aim to enhance model quality, from working with external data and tools to refining network structures and adopting new fine-tuning methods. However, these improvements complement rather than replace the growth in the size of industrial solutions.

Open Source Models on the Rise

While industrial giants continue to produce larger models, open-source models are quickly catching up in terms of quality. Remarkably, they achieve this with significantly fewer parameters, potentially changing the landscape.

OpenAI's Strategy

OpenAI's strategy seems to be moving toward closed-source models. Yet, past attempts to withhold GPT-2's weights were unsuccessful. Recent trends in optimizing the cost of fine-tuning and inference speed for open-source models are diminishing the value of large private models as products.

Quality Over Quantity

Quality in training data is increasingly outweighing quantity. Proper selection and curation of datasets can shorten training times and improve results significantly.

Promising Models to Watch

Here's a quick look at models worth exploring:

Multilingual Encoder Models: XLM-RoBERTa and LaBSE are reliable options.

Generative Models: LLaMA, EleutherAI models (with various fine-tuned versions), Dolly-2, and BLOOM (especially its instruction-tuned variants) stand out.

Russian Language Models: Consider Sber's open-source models (Russian GPT-3, FRED-T5), and again, LLaMA with suitable fine-tuning.

Code-Related Models: SantaCoder is interesting, although ChatGPT and GPT-4 outperform it in quality.

Innovative Approaches: Transformer-XL and Sparse Transformer introduce techniques that were later adopted in other works and are worth a closer look.

Notable Papers

Don't miss significant papers marked in italics at the end, including LaMDA, Megatron-Turing NLG, and Sparrow.

Thank you for your attention, and best of luck in your endeavors!

Alina Khay

Contact Me

Linkedin

https://www.linkedin.com/in/alinakhay

Github

https://github.com/alinakhay