Attention and Transformers

(By Aleksa Sotirov)

Transformers are a type of deep learning model characterised above all by their use of attention mechanisms. Attention in machine learning is a technique that weights different parts of the input data according to their significance - in essence, mimicking human attention. Transformers in particular are specialised in using self-attention to process sequential data, so their main applications lie in fields such as natural language processing (NLP) and computer vision (CV). They are distinguished from earlier models, including recurrent neural networks and long short-term memory models, by their greater parallelisation and efficiency, which is largely owed to the attention mechanism.

Background

Before transformers, the most widely used natural language processing models were all based on recurrence. This means not only that each word or token was processed in a separate step, but also that the implementation relied on a series of hidden states, each dependent on the one before it. For example, one of the most basic problems in NLP is next-word prediction. A recurrent model does a good job of this in some cases: if the input is "Ruth hit a long fly ___", the model only needs to look at the last two words to predict the next word as "ball". However, in a sentence like "Check the battery to find out whether it ran ___", it would have to look eight words back to get enough context from the word "battery" to determine that the next word should be "down". Capturing that dependence directly would require an eighth-order model, an impractically expensive requirement in hardware and computation.
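To make the hidden-state dependence concrete, here is a minimal sketch of a recurrent next-word predictor in Python with NumPy. The sizes, weights, and names (rnn_step, W_xh, W_hh, W_hy) are made up for illustration and the weights are untrained random values, so the prediction itself is meaningless; the point is only that each hidden state is computed from the previous one, so context must be carried forward one step at a time.

import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim, hidden_dim = 50, 16, 32

# Illustrative (untrained) parameters of a simple recurrent cell.
W_xh = rng.normal(scale=0.1, size=(embed_dim, hidden_dim))   # input -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # previous hidden -> hidden
W_hy = rng.normal(scale=0.1, size=(hidden_dim, vocab_size))  # hidden -> next-word scores
embeddings = rng.normal(scale=0.1, size=(vocab_size, embed_dim))

def rnn_step(h_prev, token_id):
    """One recurrent update: the new hidden state mixes the current token
    with everything carried over from earlier steps."""
    x = embeddings[token_id]
    h = np.tanh(x @ W_xh + h_prev @ W_hh)
    logits = h @ W_hy          # a score for every candidate next word
    return h, logits

# Process a toy token sequence strictly one step at a time.
h = np.zeros(hidden_dim)
for token_id in [3, 17, 42, 8]:        # stand-ins for the words of a sentence
    h, logits = rnn_step(h, token_id)
predicted_next_token = int(np.argmax(logits))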

So, this kind of modelling works for shorter sentences, but processing longer strings in this manner requires enormous computational power, making it less than desirable. This is where the introduction of attention really helped: instead of looking a fixed number of words back, the model can pick out the most significant relationships between words and focus only on those. State-of-the-art systems built on Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks were accordingly designed so that, on top of their recurrent structure, they also implemented an attention mechanism.
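The core of most attention mechanisms is scaled dot-product attention: each query is scored against every key, the scores are softmax-normalised into weights, and those weights say how strongly each word attends to every other word, regardless of how far apart they are. The sketch below is a generic NumPy illustration of that idea, not code from any particular system; the token vectors are random stand-ins.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compare every query with every key, turn the scores into softmax
    weights, and return the weighted average of the values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the keys
    return weights @ V, weights

# Toy self-attention: queries, keys and values are all the same 5 token
# vectors, so row i of `weights` shows how much word i attends to each word.
rng = np.random.default_rng(1)
tokens = rng.normal(size=(5, 8))
output, weights = scaled_dot_product_attention(tokens, tokens, tokens)
print(weights.round(2))   # each row sums to 1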

However, in 2017, a team at Google Brain proposed a new model, introduced in the paper "Attention Is All You Need", which discarded the recurrent element entirely and relied solely on attention. It turns out that processing tokens strictly in sequence is not essential for learning: information about each token's position can be added directly to its representation, so these systems work just as well as the recurrent ones while taking far less time and computational power to train, making them far more practical. These attention-only systems are known as transformers.
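One way order information is preserved without recurrence is the sinusoidal positional encoding described in "Attention Is All You Need": a position-dependent pattern of sines and cosines is added to each token embedding, after which the whole sequence can be processed in parallel. Below is a minimal NumPy sketch; the sequence length, model width, and random embeddings are arbitrary choices for illustration.

import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000**(2i/d_model)),
    PE[pos, 2i+1] = cos(pos / 10000**(2i/d_model))  (Vaswani et al., 2017)."""
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]           # even dimension indices
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# All 10 token embeddings are handled at once; adding the positional encoding
# restores the order information that a recurrent hidden-state chain would
# otherwise have carried implicitly.
rng = np.random.default_rng(2)
token_embeddings = rng.normal(size=(10, 16))            # 10 tokens, width 16
inputs = token_embeddings + sinusoidal_positional_encoding(10, 16)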

Attention

Transformers

Applications

References