Introduction

As we write this, deep learning has firmly made its way into natural language processing, complementing its dominance in image recognition. Already two years ago, Sebastian Ruder proclaimed that NLP's ImageNet moment had arrived. The analogy here is not about the dataset per se, but about the availability of models, trained on a rich collection of data, that can generalize to a variety of tasks in the overall domain. Just as in image recognition a model trained to classify ImageNet is assumed to have learned a lot about features on different scales – edges, corners, shapes, components of objects, objects – so in NLP a language model, trained to predict words from their surroundings, is assumed to have learned a lot about syntax, semantics, and – perhaps – even about the world. This general knowledge is then useful in tasks such as machine translation, question answering, and natural language inference.

In the previous section on image recognition, we jumped directly into transfer learning, reflecting our belief that you often won’t need to train your own convolutional network from scratch, and that (straightforward) convnets must be among the most extensively documented deep learning topics on the net.

With NLP, it’s different. One of the most important recent concepts in deep learning – attention – originally emerged in the area of machine translation, but has since been, and continues to be, applied in many other areas. Thus, we will first develop an attention-based translation model from scratch, to get a feeling for how this works. At the same time, this model features recurrent neural networks (RNNs), the family of neural network layers that historically claimed the best performance on sequence data.

Our second application then illustrates the use of “historically” in the preceding sentence. The famous Transformer architecture made clear that recurrent layers are not a necessity when dealing with sequential input – as long as attention mechanisms are present. Our example model will address the same machine translation task as the previous one (just with a different target language). This time, we won’t start from scratch, but will make use of intermediate-to-high-level modules conveniently provided by torch.