To facilitate comprehension by LSTMs, the notes and chords
are encoded as unique numerical values using a dictionary. A
sliding window approach is employed to create sequences of
100 notes, serving as inputs to train the LSTM. The LSTM
predicts the next note and compares it to the actual subsequent
note for evaluation.
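A minimal sketch of this preparation step, assuming the note-to-integer dictionary is called note_to_int and the corpus is a flat list of note strings (both names are hypothetical):

import numpy as np

SEQ_LEN = 100  # window size used as LSTM input

def make_sequences(notes, note_to_int, seq_len=SEQ_LEN):
    """Slide a window over the encoded corpus to build (input, target) pairs."""
    encoded = [note_to_int[n] for n in notes]
    X, y = [], []
    for i in range(len(encoded) - seq_len):
        X.append(encoded[i:i + seq_len])   # 100-note input window
        y.append(encoded[i + seq_len])     # the note to be predicted
    return np.array(X), np.array(y)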
To pre-process waveform files using the ‘librosa’ library, the
audio data is loaded and converted into a time-domain signal.
The signal is then divided into frames, either overlapping or
non-overlapping, for further analysis. The frames undergo a
Short-Time Fourier Transform (STFT) to obtain the
magnitude and phase spectra; the magnitude spectrum is then
mapped to the mel scale to produce a mel-spectrogram.
Additional representations such as mel-frequency
cepstral coefficients (MFCCs), spectral contrast, chroma
features and tonal centroid can be extracted from the
spectrogram. Normalization techniques such as mean
normalization or z-score normalization are commonly
applied to the extracted features to ensure consistent scales
across different audio files.
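The feature-extraction pipeline described above can be sketched with librosa as follows; the frame parameters and feature sizes are illustrative assumptions, not the values used in this work:

import numpy as np
import librosa

def extract_features(path, sr=22050, n_fft=2048, hop_length=512):
    """Load a waveform and compute the spectral features described above."""
    y, sr = librosa.load(path, sr=sr)                        # time-domain signal
    stft = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)
    mag, phase = np.abs(stft), np.angle(stft)                # magnitude / phase spectra
    mel = librosa.feature.melspectrogram(S=mag**2, sr=sr)    # mel-spectrogram
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    contrast = librosa.feature.spectral_contrast(S=mag, sr=sr)
    chroma = librosa.feature.chroma_stft(S=mag**2, sr=sr)
    tonnetz = librosa.feature.tonnetz(y=y, sr=sr)            # tonal centroid
    feats = np.vstack([mfcc, contrast, chroma, tonnetz])
    # z-score normalization so features share a consistent scale across files
    return (feats - feats.mean(axis=1, keepdims=True)) / (feats.std(axis=1, keepdims=True) + 1e-8)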
All models were trained on an NVIDIA A100 GPU through a
Google Colab Pro subscription.
4. Methodology
Long Short-Term Memory (LSTM), a variant of recurrent
neural networks, overcomes the vanishing gradient problem.
It is a powerful tool for sequential data analysis, such as in
natural language processing (NLP) and music generation. It
employs gates and memory cells to selectively retain or
discard information, enabling the capture of long-term
dependencies while handling noisy input. The gates, using
sigmoid and tanh functions, control information flow, while
memory cells store and transmit data across time steps.
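For reference, the gating just described follows the standard LSTM formulation (a reference sketch, not reproduced from the original text; W, U and b are learned weights and biases, \sigma is the sigmoid function, \odot denotes element-wise multiplication, and the symbols are local to this paragraph):

f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)
i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)
o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)
\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
h_t = o_t \odot \tanh(c_t)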
The Bidirectional LSTM (Bi-LSTM) consists of two LSTM
layers that handle input sequences in both forward and
backward directions. Each layer's output is combined at each
time step before passing through the output layer. This
approach captures dependencies in both directions,
addressing limitations of a unidirectional LSTM and
improving overall performance.
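A minimal Keras sketch of a Bi-LSTM next-note predictor in this spirit (the layer sizes and the vocabulary size n_vocab are illustrative assumptions, not the paper's configuration):

from tensorflow.keras import layers, models

def build_bilstm(n_vocab, seq_len=100, embed_dim=64, units=256):
    """Bidirectional LSTM that predicts the next note from a 100-note window."""
    model = models.Sequential([
        layers.Input(shape=(seq_len,)),
        layers.Embedding(n_vocab, embed_dim),
        # forward and backward passes over the window; their outputs are concatenated
        layers.Bidirectional(layers.LSTM(units)),
        layers.Dense(n_vocab, activation="softmax"),  # distribution over the next note
    ])
    model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
    return model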
When training on long input sequences, encoder-decoder
networks can lose vital information because the entire input
must be compressed into a fixed-length context vector,
resulting in suboptimal performance. To
address this, an attention layer is employed to enhance the
model's capabilities by focusing on crucial segments of the
input sequence when predicting an output. The attention
mechanism enables the model to learn associations and
extract information from each encoder hidden state,
significantly influencing the development of transformers [9].
This mechanism can be seamlessly integrated into neural
networks built with different layers. It was initially
introduced by Bahdanau et al. in 2014 and has since become
an integral part of various architectures.
The alignment scores e_{t,i} are computed from the encoded
hidden states h_i and the previous decoder output s_{t-1}, as
denoted in equation (1). These scores represent the alignment
between the input sequence elements and the current output
at position t. A feedforward neural network, represented by a
function a(\cdot), can be employed to implement the alignment
model.

e_{t,i} = a(s_{t-1}, h_i)    (1)
The alignment scores obtained earlier are then used to
compute the weights through a softmax operation, as
represented in equation (2):

\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{k=1}^{T} \exp(e_{t,k})}    (2)
The context vector c_t for the output at position t is
calculated as the weighted sum over all T hidden states, as
denoted by equation (3):

c_t = \sum_{i=1}^{T} \alpha_{t,i} h_i    (3)
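A minimal NumPy sketch of equations (1)-(3), assuming a single-layer feedforward alignment model with learned matrices W_s, W_h and vector v (all parameter names are hypothetical):

import numpy as np

def additive_attention(s_prev, H, W_s, W_h, v):
    """Bahdanau-style attention following equations (1)-(3).

    s_prev : previous decoder state, shape (d_s,)
    H      : encoder hidden states, shape (T, d_h)
    W_s, W_h, v : learned alignment parameters (illustrative)
    """
    # (1) alignment scores e_{t,i} = a(s_{t-1}, h_i)
    e = np.tanh(s_prev @ W_s + H @ W_h) @ v    # shape (T,)
    # (2) softmax weights alpha_{t,i}
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()
    # (3) context vector c_t = sum_i alpha_{t,i} h_i
    c = alpha @ H                              # shape (d_h,)
    return c, alpha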
The transformer architecture, introduced in the 2017 paper
“Attention is all you need” by Vaswani et al., incorporates the
self-attention mechanism as its core component. This
mechanism enables the model to assign importance weights
to different words within a sequence [10]. By calculating
attention scores between word pairs, the model learns their
relevance to each other. The transformer comprises an
encoder and a decoder. The encoder processes the input
sequence through multiple stacked layers of self-attention and
feed-forward neural networks. The decoder generates an
output sequence step by step based on the encoded
representation.
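The attention-score computation at the heart of the transformer can be sketched as scaled dot-product self-attention; the following NumPy illustration is a minimal sketch, not the paper's implementation:

import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a sequence X of shape (T, d_model)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # pairwise relevance between positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted combination of values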
Because the transformer does not inherently capture word
order, positional encoding is introduced.
It conveys positional information to the model, enabling it to
comprehend the sequence's sequential arrangement. The
transformer employs multiple attention heads to learn diverse
relationships and aspects of the input sequence. The outputs
from these heads are combined by concatenation and
transformation to yield the final attention representation.
Alongside the self-attention mechanism, feed-forward neural
networks within each layer further enhance the model's
ability to capture complex patterns in the input sequence
through non-linear transformations.
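A short sketch of the sinusoidal positional encoding proposed by Vaswani et al., which is added to the token embeddings (max_len and d_model are illustrative parameters):

import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal position embeddings of shape (max_len, d_model)."""
    pos = np.arange(max_len)[:, None]                         # positions 0..max_len-1
    i = np.arange(d_model)[None, :]
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)   # frequency per dimension
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])                      # even dimensions: sine
    pe[:, 1::2] = np.cos(angle[:, 1::2])                      # odd dimensions: cosine
    return pe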
To facilitate training of deep networks, the transformer
incorporates residual connections, allowing previous layer
information to be preserved and aiding learning. Layer
normalization is employed after each sub-layer to normalize
input and enhance training stability. Masking is applied