Artificial Intelligence Music Creation with Neural Networks
by Sid Singh
Artificial Intelligence (AI) music creation is the use of generative artificial intelligence to produce music, often with the aim of resembling human-made music. [1]
To replicate human-created music compositions, a deep learning model is used. Deep learning models, a type of artificial neural network, are loosely modeled on the human brain and perform tasks such as classification and representation, for example finding a common note or pattern in music. [2] The models start from randomized weights and use training data, such as human-made music, together with hidden ("deep") layers, to learn to map an input into the desired output -- artificially created music that resembles a human creation. The use and popularity of music created by generative AI has skyrocketed in recent years [3], and its reception has been mixed, drawing both critics and believers. [4]
History of AI Music
Artists and engineers have aimed to transcribe and recreate music since the 19th century, when music rolls were first sold commercially. Music rolls, typically made for player pianos, are sheets of paper with punched holes that encode the notes and rhythm of a piece [5]. When a roll is drawn over a tracker bar, the instrument can recreate the piece without a human player.
The modern-day, digital version of a music roll is the Musical Instrument Digital Interface (MIDI) file, introduced in the late 20th century by Dave Smith [6]. Now, MIDI files are often the input from which deep learning models learn music. Even before MIDI, though, AI music was already being generated; in 1957, composers Lejaren Hiller and Leonard Isaacson created The Illiac Suite, the first known computer-composed piece of music. [7] Since then, the scope and capability of AI-generated music has greatly expanded. Recent projects such as Google's Magenta Project [8] and OpenAI's MuseNet [9] let users give the computer certain requirements, such as instruments and lyrics to include, to create unique artificial arrangements. Artificial "deepfakes" of popular artists are another recent development in AI music; the song "Heart on My Sleeve", which imitates the voices of the popular artists Drake and The Weeknd, gained over 15 million views on the social media platform TikTok and over 600,000 streams on the streaming platform Spotify. [10]
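As a brief illustration of what that input looks like, the following sketch (not from the source; mido is one common Python MIDI library, and "example.mid" is a placeholder file name) extracts the pitch sequence a model might train on:

    import mido  # third-party MIDI library (install with: pip install mido)

    notes = []
    for msg in mido.MidiFile("example.mid"):      # placeholder path, not a real file
        if msg.type == "note_on" and msg.velocity > 0:
            notes.append(msg.note)                # MIDI pitch numbers, 0-127
    print(notes[:16])                             # the note sequence fed to a model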
Recurrent Neural Networks (RNN)
A recurrent neural network is a type of deep-learning model that takes a sequential set of inputs, passes it through hidden layers with initially randomized weights, and trains those weights to minimize the loss between the actual output and the desired output (here, the desired music). A recurrent neural network is recurrent because the output of each step is fed back in as part of the next step's input; the weights themselves are trained with back-propagation (through time). Since music is fundamentally a sequence of notes, rhythms, and pitches, the recurrent neural network is a natural tool for generating an artificial musical output. Moreover, compared to a feed-forward or Convolutional Neural Network (CNN), a recurrent neural network accounts for the ordering of data points. [11] With music, the long-term sequence and order of notes are vital for predicting every ensuing note, so a feed-forward network would not suffice.
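To make the recurrence concrete, here is a minimal NumPy sketch (an illustration, not any particular published model; the sizes and names such as W_hh are assumptions) in which each hidden state depends on the current note and the previous hidden state:

    import numpy as np

    rng = np.random.default_rng(0)
    n_notes, n_hidden = 128, 64                       # 128 MIDI pitches (assumed sizes)
    W_xh = rng.normal(0, 0.1, (n_hidden, n_notes))    # input -> hidden (randomized weights)
    W_hh = rng.normal(0, 0.1, (n_hidden, n_hidden))   # hidden -> hidden (the recurrence)
    W_hy = rng.normal(0, 0.1, (n_notes, n_hidden))    # hidden -> output scores

    def forward(note_sequence):
        """Return a score vector over the next note at each time step."""
        h = np.zeros(n_hidden)
        logits = []
        for note in note_sequence:
            x = np.zeros(n_notes)
            x[note] = 1.0                             # one-hot encode the current note
            h = np.tanh(W_xh @ x + W_hh @ h)          # the state carries past context forward
            logits.append(W_hy @ h)
        return np.array(logits)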
A recurrent neural network works by receiving a sequential input, such as a MIDI file, which stores a set of musical notes, rhythms, and pitches. In a supervised setting, the model is also given a target output it should reach, a different sequence of musical notes, rhythms, and pitches. To reach this output, the model trains by passing data through layers of nodes, each connection carrying a weight. The model takes the hidden state produced at time t, along with the given input, and uses them as the input at time t + 1. As the model trains through its hidden layers, the weights adjust to minimize the loss. [12] So, given the input, the model calculates a probability vector over each possible ensuing note/chord, rhythm, and pitch, conditioned on the previous inputs. Next, the model computes the loss on each note and averages them to get an overall loss. The loss function can be implemented in various ways, but is commonly the cross-entropy loss, which measures how far the predicted probability distribution is from the target output. Finally, the model iterates, generating new predictions with the updated weights, until the loss converges and the output resembles the target.
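Continuing the NumPy sketch above (the melody values are illustrative), the average cross-entropy described here can be computed like this:

    def cross_entropy_loss(logits, target_notes):
        """Average next-note cross-entropy across the sequence."""
        probs = np.exp(logits - logits.max(axis=1, keepdims=True))
        probs /= probs.sum(axis=1, keepdims=True)     # softmax -> probability vectors
        # per-step loss: -log(probability the model gave the correct next note)
        per_note = -np.log(probs[np.arange(len(target_notes)), target_notes])
        return per_note.mean()

    melody = [60, 62, 64, 65, 67]                     # toy MIDI pitches (C D E F G)
    logits = forward(melody[:-1])                     # predict each following note
    loss = cross_entropy_loss(logits, np.array(melody[1:]))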
Limitations of the RNN
However, recurrent neural networks have a few problems that limit their capability. First, there is the vanishing gradient problem: information stored in the earlier hidden layers of a network is lost because most of the back-propagated signal comes from the later hidden layers. The further back the error signal travels through the network, the harder the early layers are to train, because the gradients (or slopes) shrink and the weight updates reaching the starting layers become tiny. The vanishing gradient problem occurs when the model's weights are small, so each back-propagation step multiplies the gradient by a factor less than 1. Given that the notes and rhythms at the start of a musical piece are important for predicting future notes and rhythms, the vanishing gradient problem limits the RNN's accuracy. [13]
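A quick numeric illustration (the per-step factor and step count are assumed values): the gradient reaching a note k steps back is scaled by roughly a product of k per-step factors, so factors below 1 shrink it exponentially:

    # gradient signal surviving after back-propagating through 100 steps,
    # if each step scales it by ~0.9 (a factor below 1)
    factor, steps = 0.9, 100
    print(factor ** steps)    # ~2.7e-05: the opening notes barely influence training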
Another challenge with RNNs, essentially the opposite of the vanishing gradient problem, is the exploding gradient. In this case, during back-propagation, the gradients (or slopes) of each hidden layer grow exponentially as we move backward. The exploding gradient problem occurs when the initial weights are too high, so each step multiplies the gradient by a factor greater than 1. Exploding gradients make it difficult for a model to converge, as the parameters become so large that they overflow. [14]
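The mirror-image case, under the same assumed setup, shows the explosion; the snippet also sketches gradient clipping, a common mitigation in practice (mentioned as general background, not from the source):

    # per-step factors above 1 blow the gradient up instead
    factor, steps = 1.1, 100
    print(factor ** steps)    # ~13780.6: updates this large make training diverge

    # gradient clipping caps the update size before it is applied
    import numpy as np
    grad = np.full(10, 1e6)               # a pathologically large gradient
    max_norm = 5.0
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)   # rescale so the norm is at most max_norm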
Long Short-Term Memory (LSTM) Models
To address the vanishing and exploding gradient problems in AI music generation, a Long Short-Term Memory (LSTM) model can be used: a variation of the RNN that preserves information better across the time steps of the network. [15] An LSTM does this by adding a forget gate, alongside input and output gates, that controls the self-recurrent link of the memory cell, letting the model remember or forget previous states as required. [16] Numerically, this lets the gradient along the memory cell stay close to 1 instead of repeatedly shrinking toward 0 or blowing up.
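Continuing the NumPy sketch, here is a minimal, simplified LSTM step showing the gates (the combined weight matrix W and bias b covering all four gates are assumptions of this sketch):

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    W = rng.normal(0, 0.1, (4 * n_hidden, n_notes + n_hidden))  # all four gates at once
    b = np.zeros(4 * n_hidden)

    def lstm_step(x, h_prev, c_prev):
        """One LSTM step given input x, previous hidden state h_prev, and cell c_prev."""
        z = W @ np.concatenate([x, h_prev]) + b
        f, i, o, g = np.split(z, 4)
        f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)  # forget, input, and output gates
        c = f * c_prev + i * np.tanh(g)   # additive cell update: gradients flow through f
                                          # rather than through repeated weight products
        h = o * np.tanh(c)                # hidden state passed on to the next time step
        return h, c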