Natural Language Processing (NLP)

An artificial intelligence that can understand natural language in context and communicate in any language: this sounds like science fiction and is not yet possible, but recent progress in machine learning (ML) and natural language processing (NLP) indicates that we are getting very close to these capabilities.

Giving a machine human-level language ability is very hard, given how complex it is to understand language. However, recent progress in NLP shows impressive results, like the one we will discuss in this article.

Transfer learning is the process of training a model on a large-scale data set and then using that pre-trained model as the starting point for learning a subsequent task (i.e. the target task). Transfer learning became popular in the field of computer vision thanks to the ImageNet data set. In this article we will focus on how these concepts apply to the field of natural language processing.

We have become very good at training models that predict accurately. But most of the tasks these models carry out are not general at all; on the contrary, they are specific to a single objective or domain. The real world is not confined to the data set we train on; it is far more extensive and messy. A model trained this way will therefore almost certainly lose much of its effectiveness when it is applied to a more general objective.

Transfer learning is the application of knowledge obtained in one context to another context. Reusing the parameters of an existing model can reduce training time and ease typical deep learning problems, such as having only a “small” data set for the target task.
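The effect of starting from existing parameters can be illustrated with a toy example. The sketch below is plain Python, not a real NLP model: it fits a line by gradient descent, once from scratch and once starting from parameters learned on a related task. The data, learning rate and tolerance are made-up values chosen only for illustration.

```python
# Toy illustration: fit y = w*x + b by gradient descent, starting either
# from scratch or from parameters "pretrained" on a related task.

def make_data(w, b, xs):
    """Build noiseless (x, y) pairs for the line y = w*x + b."""
    return [(x, w * x + b) for x in xs]

def mse(params, data):
    """Mean squared error of the line described by params on data."""
    w, b = params
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def steps_to_fit(params, data, lr=0.05, tol=1e-4, max_steps=10_000):
    """Run gradient descent until the error drops below tol.

    Returns (steps_used, final_params).
    """
    w, b = params
    n = len(data)
    for step in range(max_steps):
        if mse((w, b), data) < tol:
            return step, (w, b)
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return max_steps, (w, b)

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
source_task = make_data(2.0, 1.0, xs)  # hypothetical "pretraining" task
target_task = make_data(2.2, 0.8, xs)  # hypothetical related target task

# Step 1: learn parameters on the source task.
_, pretrained = steps_to_fit((0.0, 0.0), source_task)

# Step 2: solve the target task from scratch and from the pretrained start.
cold_steps, cold_params = steps_to_fit((0.0, 0.0), target_task)
warm_steps, warm_params = steps_to_fit(pretrained, target_task)

print("steps from scratch:   ", cold_steps)
print("steps from pretrained:", warm_steps)
```

On this toy problem the pretrained start reaches the tolerance in fewer steps than the start from scratch. It is the same effect, at a much smaller scale, that makes pre-trained language models cheaper to adapt.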

What are its advantages?

  • Simpler training requirements, because a previously trained model is reused
  • Much smaller memory requirements
  • Significantly shorter training of the target model: seconds instead of days in favorable cases
  • Higher performance with less data
  • Easier to use than traditional deep learning models, so they do not necessarily require a data scientist specialized in NLP

A very important aspect of transfer learning is choosing the pre-trained model for our specific task, so here are the most popular models used for this purpose. They were created by large organizations such as Google, Facebook and OpenAI, and they require large amounts of data and, above all, a lot of computing power and time to train.

Some of these models are:


ELMo (Embeddings from Language Models)

ELMo is a novel way to represent words as vectors, or embeddings. These word embeddings are useful for achieving strong results in various NLP tasks. ELMo word vectors are computed on top of a two-layer bidirectional language model (biLM) built from recurrent LSTM (Long Short-Term Memory) networks. The biLM has two stacked layers, and each layer has two passes, forward and backward, which is how the contextual meaning of each word is learned.
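To make the layer structure concrete, here is a minimal sketch in plain Python of how the states of the two layers can be turned into one vector per word. It follows the scheme described in the ELMo paper: the forward and backward states of each layer are concatenated, and the layers are then mixed with softmax-normalized weights and a scale factor. All numbers are made up by hand; in the real model the hidden states come from the LSTMs, the mixing weights are learned together with the downstream task, and the input token representation is usually mixed in as an additional layer.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def bilm_layer_state(forward_state, backward_state):
    """Concatenate the forward and backward hidden states of one layer."""
    return forward_state + backward_state

def elmo_vector(layer_states, layer_scores, gamma=1.0):
    """Softmax-weighted sum of the per-layer states for a single token."""
    weights = softmax(layer_scores)
    dim = len(layer_states[0])
    return [
        gamma * sum(w * state[d] for w, state in zip(weights, layer_states))
        for d in range(dim)
    ]

# Made-up hidden states for one token (2 dimensions per direction).
layer1 = bilm_layer_state([1.0, 0.0], [0.0, 1.0])
layer2 = bilm_layer_state([3.0, 2.0], [2.0, 3.0])

# Equal scores give equal weights, i.e. the plain average of both layers.
vector = elmo_vector([layer1, layer2], layer_scores=[0.0, 0.0])
print(vector)  # [2.0, 1.0, 1.0, 2.0]
```

Raising the score of one layer shifts the resulting vector toward that layer, which is how a downstream task can choose which level of the biLM it relies on.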



OpenAI Transformers

GPT-3 is a language model powered by neural networks.

Like most language models, GPT-3 is trained on an unlabeled text dataset (in this case, the training data includes, but is not limited to, Common Crawl and Wikipedia). The model reads a passage of text and must learn to predict the next word using only the preceding words as context. It is a simple training task that results in a powerful and generalizable model.

The architecture of the GPT-3 model itself is a neural network based on Transformers. This architecture was introduced in 2017, quickly became popular, and is the basis for the popular NLP model BERT and for GPT-3's predecessor, GPT-2. At the time of its release, with 175 billion parameters, GPT-3 was the largest language model ever created (an order of magnitude larger than its closest competitor!), and it was trained on the largest data set of any language model. This, it seems, is the main reason why GPT-3 is so impressively “smart”, and it is what makes GPT-3 so exciting for machine learning professionals.


BERT (Bidirectional Encoder Representations from Transformers)

BERT's key technical innovation is applying the bidirectional training of the Transformer, a popular attention model, to language modeling. This is in contrast to previous efforts, which looked at a text sequence either from left to right or with combined left-to-right and right-to-left training. The BERT results show that a language model trained bidirectionally can have a deeper sense of language context and flow than single-direction language models. BERT uses the Transformer, an attention mechanism that learns the contextual relationships between words (or subwords) in a text. In its basic form, the Transformer includes two separate mechanisms: an encoder that reads the text input and a decoder that produces a prediction for the task. Since the goal of BERT is to generate a language model, only the encoder mechanism is required. The detailed operation of the Transformer is described in a paper by Google.

Unlike directional models, which read text input sequentially (left to right or right to left), the Transformer encoder reads the entire sequence of words at once.
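The difference between reading sequentially and reading the whole sequence at once can be expressed as an attention mask, that is, a table saying which positions each word is allowed to look at. The following is a minimal sketch in plain Python; the function names are hypothetical and the sentence is an arbitrary example.

```python
def causal_mask(n):
    """Left-to-right model: position i may look only at positions 0..i."""
    return [[j <= i for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    """Encoder-style model: every position may look at every position."""
    return [[True] * n for _ in range(n)]

def visible_tokens(tokens, mask, position):
    """Tokens that the given position is allowed to use as context."""
    return [tok for tok, allowed in zip(tokens, mask[position]) if allowed]

tokens = ["the", "bank", "of", "the", "river"]
n = len(tokens)

# Context available when building the representation of "bank" (position 1).
print(visible_tokens(tokens, causal_mask(n), 1))
# ['the', 'bank']
print(visible_tokens(tokens, bidirectional_mask(n), 1))
# ['the', 'bank', 'of', 'the', 'river']
```

With the left-to-right mask, "bank" is represented without ever seeing "river"; with the bidirectional mask, the words on both sides are available, which is what lets the encoder tell a river bank from a financial one.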


There are three ways we can transfer learning from pretrained models:

  • Feature extraction: the pretrained layers are used only to extract features, optionally followed by a normalization step such as BatchNormalization, which rescales activations toward a mean of 0 and unit variance. In this method, the pretrained weights are not updated during backpropagation; they are the parameters reported as non-trainable in the model summary.
  • Fine-tuning: this is what this article is about. BERT is well suited to it, because it can be adjusted to answer questions with relatively little additional training, so we only have to adapt the model to our purpose. Technically, fine-tuning is the process of making small, careful adjustments to the parameters of an already trained model while validating it on a small data set that does not belong to the original training set.
  • Layer extraction: we extract only the layers needed for the task. For example, we might want to use only the lower-level layers of BERT for tasks like part-of-speech (POS) tagging or sentiment analysis, where only word-level features are needed.
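The difference between feature extraction and fine-tuning comes down to which parameters the optimizer is allowed to change. The sketch below shows this with a deliberately tiny model in plain Python: one "pretrained" weight called base and one new task-specific weight called head. The names, starting values and data are hypothetical. In PyTorch, the same idea is usually expressed by setting requires_grad = False on the parameters that should stay frozen.

```python
def train(params, trainable, data, lr=0.01, epochs=200):
    """SGD on pred = head * (base * x); only names in `trainable` change."""
    p = dict(params)  # copy, so the caller's pretrained weights stay intact
    for _ in range(epochs):
        for x, y in data:
            feature = p["base"] * x           # output of the pretrained part
            error = p["head"] * feature - y   # prediction minus target
            grads = {
                "base": 2 * error * p["head"] * x,
                "head": 2 * error * feature,
            }
            for name in trainable:            # frozen weights are skipped
                p[name] -= lr * grads[name]
    return p

pretrained = {"base": 1.5, "head": 0.0}        # hypothetical starting weights
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]    # small target task: y = 3x

# Feature extraction: the pretrained weight is frozen, only the head learns.
feature_extraction = train(pretrained, trainable={"head"}, data=data)

# Fine-tuning: the pretrained weight is adjusted together with the head.
fine_tuning = train(pretrained, trainable={"base", "head"}, data=data)

print(feature_extraction)
print(fine_tuning)
```

In both cases the model ends up fitting the target data, but only in the fine-tuning run does the base weight move away from its pretrained value.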

Applying transfer learning

Transfer learning generally has two steps: first we pre-train a model on data set A, and then we use that pre-trained model to carry its knowledge over to solving a task on data set B. In this case, BERT has been pre-trained on BookCorpus and Wikipedia, and it is the model we will use in our practice on a specific text, about which we can ask questions in natural language and expect correct answers.

For this we will use the Python programming language and the machine learning library PyTorch. We will load a publicly available pre-trained BERT model and pass it a text on a specific topic, for example the biography of Miguel de Cervantes, the first chapter of Moby Dick, or any article that we can find on the Internet. Our model will use BERT to understand the context of the text we have passed to it, so that it can answer the questions we ask about it.

For example, we can ask it where Cervantes was born, who his siblings were, or where he enlisted when he was young. The idea is that questions can be asked in natural language, without having to phrase them in a particular way or use the same words as the text the AI has read.
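A BERT model adapted for this kind of question answering typically does not write an answer; it scores every word of the text as a possible start and as a possible end of the answer, and the best-scoring span is returned. Below is a minimal sketch of that final step in plain Python. The scores are made-up numbers standing in for what a real model would produce for the question "Where was Cervantes born?", and the function name and the word-level split are simplifications chosen for illustration.

```python
def best_span(start_scores, end_scores, max_answer_len=5):
    """Return (start, end) maximizing start_scores[start] + end_scores[end].

    Only spans with start <= end and at most max_answer_len words count.
    """
    best, best_score = (0, 0), float("-inf")
    for start, s_score in enumerate(start_scores):
        last = min(start + max_answer_len, len(end_scores))
        for end in range(start, last):
            score = s_score + end_scores[end]
            if score > best_score:
                best, best_score = (start, end), score
    return best

context = "Cervantes was born in Alcalá de Henares in 1547"
tokens = context.split()

# Hypothetical per-word scores; a real model would compute these.
start_scores = [0.1, 0.0, 0.2, 0.5, 4.0, 0.3, 0.9, 0.1, 0.6]
end_scores = [0.0, 0.1, 0.1, 0.2, 1.0, 0.4, 3.5, 0.2, 1.2]

start, end = best_span(start_scores, end_scores)
answer = " ".join(tokens[start:end + 1])
print(answer)  # Alcalá de Henares
```

Because the answer is always a span copied from the text, the model can only answer questions whose answer literally appears in the passage it was given.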

Original video of the practice

Texts used for practice: Cervantes and Moby Dick


We are facing a new paradigm in Natural Language Processing (NLP). Recent advances in research have been dominated by the combination of transfer learning methods with large-scale Transformer language models.

Building these general-purpose models is still a costly and time-consuming process, which restricts their creation to a small subset of the NLP community. With Transformers, a paradigm shift occurred: instead of training a blank, task-specific model from scratch, a downstream task now starts from a general-purpose pre-trained architecture.

As NLP becomes a key aspect of AI, the democratization of Transformers will open more doors for emerging researchers. Being able to access state-of-the-art pre-trained models like BERT without having to build them from scratch gives professionals an edge, so they can focus on their goal rather than on reinventing big models.