Sequence-to-sequence (seq2seq) models have revolutionized the field of machine translation and have become the tool of choice for various text-generation tasks, such as summarization, sentence fusion and grammatical error correction. Improvements in model architecture (e.g., Transformer) and the ability to leverage large corpora of unannotated text via unsupervised pre-training have enabled the quality gains in neural network approaches we have seen in recent years.
Yet, the use of seq2seq models for text generation can come with a number of substantial drawbacks depending on the use case, such as producing outputs that are not supported by the input text and requiring large amounts of training data to reach good performance. Furthermore, seq2seq models are inherently slow at inference time, since they typically generate the output word-by-word.
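As a toy illustration (not code from LaserTagger or any particular seq2seq library), the latency gap comes down to the number of model calls: an autoregressive decoder is invoked once per output word, whereas a tagging model can score every input position in a single forward pass.

```python
class ToyModel:
    """Stand-in for a neural model that only counts how often it is called."""
    def __init__(self):
        self.calls = 0

    def predict_next_word(self, source, prefix):
        # Autoregressive step: conditioned on everything generated so far.
        self.calls += 1
        return source[len(prefix)] if len(prefix) < len(source) else "</s>"

    def predict_all_tags(self, source):
        # Tagging step: one pass predicts an edit tag for every input word.
        self.calls += 1
        return ["KEEP"] * len(source)

source = "this sentence has exactly six words".split()

seq2seq = ToyModel()
prefix = []
while (word := seq2seq.predict_next_word(source, prefix)) != "</s>":
    prefix.append(word)
print(seq2seq.calls)  # 7 sequential calls: one per output word plus the end-of-sequence step

tagger = ToyModel()
tags = tagger.predict_all_tags(source)
print(tagger.calls)   # 1 call: all edit tags come out of a single forward pass
```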
LaserTagger, one of Google's NextGen NLP engines, is an alternative approach that is often preferable owing to its speed and precision. Instead of generating the output text from scratch, LaserTagger produces output by tagging words with predicted edit operations that are then applied to the input words in a separate realization step. This is a less error-prone way of tackling text generation, and it can be handled by a model architecture that is easier to train and faster to execute.
Design and Functionality of LaserTagger
A distinct characteristic of many text-generation tasks is that there is often a high overlap between the input and the output. For instance, when detecting and fixing grammatical mistakes or when fusing sentences, most of the input text can remain unchanged, and only a small fraction of the words needs to be modified. For this reason, LaserTagger produces a sequence of edit operations instead of actual words. The four types of edit operations we use are: Keep (copies a word to the output), Delete (removes a word), and Keep-AddX / Delete-AddX (adds phrase X before the tagged word and optionally deletes the tagged word). This process is illustrated in the figure, which shows an application of LaserTagger to sentence fusion.
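To make the realization step concrete, here is a minimal sketch of how predicted tags could be applied to the input words. The tag spellings ("KEEP", "DELETE", "KEEP|X", "DELETE|X") and the realize helper are illustrative assumptions, not the exact labels or API of the open-sourced LaserTagger code.

```python
def realize(words, tags):
    """Build the output text from the input words and one edit tag per word."""
    output = []
    for word, tag in zip(words, tags):
        base, _, added_phrase = tag.partition("|")
        if added_phrase:       # Keep-AddX / Delete-AddX: insert phrase X before the word
            output.append(added_phrase)
        if base == "KEEP":     # Keep: copy the input word to the output
            output.append(word)
        # base == "DELETE": the input word is dropped
    return " ".join(output)

# Example: fusing two sentences by deleting the first period and the repeated
# subject, and inserting the connective "and".
words = ["Turing", "was", "born", "in", "1912", ".",
         "Turing", "died", "in", "1954", "."]
tags = ["KEEP", "KEEP", "KEEP", "KEEP", "KEEP", "DELETE",
        "DELETE|and", "KEEP", "KEEP", "KEEP", "KEEP"]
print(realize(words, tags))  # -> "Turing was born in 1912 and died in 1954 ."
```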
All added phrases come from a restricted vocabulary. This vocabulary is the result of an optimization process with two goals: (1) minimizing the vocabulary size and (2) maximizing the number of training examples covered, i.e., examples where all of the words that need to be added to the target text come from the vocabulary alone. Having a restricted phrase vocabulary makes the space of output decisions smaller and prevents the model from adding arbitrary words, hence mitigating the problem of hallucination.

A corollary of the high-overlap property of input and output texts is that the required modifications tend to be local and independent of one another. This means that the edit operations can be predicted in parallel with high accuracy, enabling a significant end-to-end speed-up compared to autoregressive seq2seq models, which make their predictions sequentially, conditioning on the previous predictions.
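One simple way to approximate this vocabulary optimization is sketched below, under the assumption (mine, not from the source) that the phrases an example needs are extracted as maximal runs of target words absent from the source; the real alignment used by LaserTagger is more careful. The idea is to count the phrases each training pair would require and greedily keep the most frequent ones.

```python
from collections import Counter

def added_phrases(source_tokens, target_tokens):
    """Hypothetical helper: phrases that must be inserted to turn source into
    target, taken here as maximal runs of target tokens not in the source."""
    source_set = set(source_tokens)
    phrases, current = [], []
    for tok in target_tokens:
        if tok in source_set:
            if current:
                phrases.append(" ".join(current))
                current = []
        else:
            current.append(tok)
    if current:
        phrases.append(" ".join(current))
    return phrases

def build_phrase_vocabulary(examples, vocab_size):
    """Greedy sketch: keep the added phrases that occur most often across the
    training examples, so as many examples as possible are covered by tags alone."""
    counts = Counter()
    for source, target in examples:
        counts.update(added_phrases(source.split(), target.split()))
    return [phrase for phrase, _ in counts.most_common(vocab_size)]

examples = [
    ("Turing was born in 1912 . Turing died in 1954 .",
     "Turing was born in 1912 and died in 1954 ."),
    ("Dylan won the Nobel Prize . Dylan is a songwriter .",
     "Dylan , who is a songwriter , won the Nobel Prize ."),
]
print(build_phrase_vocabulary(examples, vocab_size=500))  # e.g. ['and', ', who', ',']
```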
Results
We evaluated LaserTagger on four tasks: sentence fusion, split and rephrase, abstractive summarization, and grammar correction. Across these tasks, LaserTagger performs comparably to a strong BERT-based seq2seq baseline that uses a large number of training examples, and it clearly outperforms that baseline when the number of training examples is limited. Below we show the results on the WikiSplit dataset, where the task is to rephrase a long sentence into two coherent short sentences.
Key Advantages of LaserTagger
Compared to traditional seq2seq methods, LaserTagger has the following advantages:
1. Control: By controlling the output phrase vocabulary, which we can also manually edit or curate, LaserTagger is less susceptible to hallucination than the seq2seq baseline.
2. Inference speed: LaserTagger computes predictions up to 100 times faster than the seq2seq baseline, making it suitable for real-time applications.
3. Data efficiency: LaserTagger produces reasonable outputs, even when trained using only a few hundred or a few thousand training examples. In our experiments, a competitive seq2seq baseline required tens of thousands of examples to obtain comparable performance.
Why This Matters
The advantages of LaserTagger become even more pronounced when applied at large scale, such as improving the formulation of voice answers in some services by reducing the length of the responses and making them less repetitive. The high inference speed allows the model to be plugged into an existing technology stack without adding any noticeable latency on the user side, while the improved data efficiency enables the collection of training data for many languages, thus benefiting users from different language backgrounds.