A Novel State-of-the-Art Approach for Rāga Detection
Rāga is a fundamental melodic concept in Indian Art Music (IAM), characterized by complex melodic patterns; all performances and compositions are based on the rāga framework. Rāga and tonic detection have been long-standing research problems in the field of Music Information Retrieval. In this paper, we attempt to detect the rāga using a novel feature that extracts sequential or temporal information from an audio sample. We call these Sequential Pitch Distributions (SPD): distributions taken over the pitch values occurring between two given pitch values over time. We achieve state-of-the-art results on both Hindustani and Carnatic music rāga data sets, with accuracies of 99% and 88.13%, respectively. SPD also gives a substantial boost in accuracy over a standard pitch distribution. The main goal of this paper, however, is to present an alternative approach to modeling the temporal aspects of the melody and thereby deducing the rāga.
Rāga is the central framework of Indian Art Music (IAM). Compositions and melodies are built on a given rāga. Each rāga is associated with a set of rules that define the flow, structure, and salience of notes (svaras), and the melodies based on it are expected to adhere to those rules. One rāga differs from another in one or more of the following: the set of notes, the ascending and descending patterns, the salience of notes, and the melodic motifs.
Pitch estimation is the first step towards melody extraction and plays a very important role in musical signal processing. Rāga and tonic estimation typically require monophonic pitch tracking of the lead artist's melody, although tonic estimation is usually enhanced by multi-track pitch estimation.
Another important step in rāga detection is estimating the tonic, as it sets the baseline for all the melody and notes [1]. Unlike several genres of music, notably Western classical music, that use absolute notes, IAM is based on relative notes: a rāga can be performed in any given tonic, and all the notes are defined with reference to this tonic [2]. The tonic is a perceptual quantity [3], which makes tonic detection subjective across different groups of listeners. The tonic corresponds to the note 'Sa' (the first of the Indian solfège syllables [4][5]), similar to the note 'Do' in Western classical music.
Rāga forms the basic framework for melodies and motifs in IAM [6]. A rāga can be thought of as a collection of melodic phrases and is typically performed in one tonic, chosen by the lead artist and kept throughout the performance. IAM typically involves improvisation on a rāga throughout a performance. A rāga can also involve continuous pitch variations in the form of Meend (a gradual ascent or descent from one note to another), Gamaks (embellishments placed on a note or between two notes) [7], articulations, different ascending and descending patterns [8], and variations in the importance of each note. Unlike a pitch, a rāga is not a local feature: a piece of music has to be listened to for several seconds or minutes, depending on the rāga progression and the listener's expertise, in order to deduce the rāga. There have been various studies in the field of rāga detection, which is one of the most researched topics in IAM. Rāga modeling can be thought of as modeling a sequence of pitch values. Several rāga detection techniques are based on pitch histograms, and only a handful of them model temporal structure. The temporal characteristics of a rāga are especially important for modeling the asymmetry between note transitions, which helps distinguish certain rāgas that share the same set of notes. As there is no standard test data set available, approaches to rāga recognition are evaluated on different datasets; a majority of them use a reduced number of classes, take a memory-based approach, or use a simple set of rāgas that is easy to classify.
Several computational approaches to pitch tracking have been studied, and some of the best results have been obtained by approaches that work in the frequency domain and utilize spectrograms and the cepstrum [9]. PRAAT [10], based on a normalized cross-correlation function, has been a popular choice. YIN [11] and its probabilistic counterpart pYIN [12], which uses Hidden Markov Models, have also been shown to give some of the best results. A more recent approach, CREPE [13], which we use, is a highly efficient data-driven monophonic pitch estimation model based on Convolutional Neural Networks that works on raw audio, as opposed to candidate-generating functions such as the cepstrum or auto-correlation. Being one of the best-performing models with an open-source Python implementation, CREPE [13] is readily and easily available, motivating us to choose it over other pitch estimation approaches for real-time inference.
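For reference, a minimal sketch of obtaining a monophonic pitch track with CREPE's open-source Python package is shown below; the input file name is hypothetical, and the 30 ms hop matches the setting we describe later for run-time inference.

```python
# Minimal sketch: monophonic pitch tracking with CREPE (file name is hypothetical).
from scipy.io import wavfile
import crepe

sr, audio = wavfile.read("performance.wav")
# step_size is the hop in milliseconds; viterbi=True smooths the pitch track.
time, frequency, confidence, activation = crepe.predict(
    audio, sr, viterbi=True, step_size=30
)
# frequency: estimated pitch in Hz per frame; confidence: voicing confidence per frame.
```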
Tonic estimation is quite important, as tasks such as rāga recognition [14][7][15][16], intonation analysis [17][18], and melodic motif analysis [19] depend on the correct selection of the tonic. In the case of IAM, [20] and [21] use melodic characteristics, where information about the amount of oscillation around each note (the Gamak) is considered for tonic estimation; this is especially useful in Carnatic music, where the usage of Gamaks is quite strong. In a typical concert, a drone instrument, usually the 'Tanpura', a violin, or other strings, reinforces the tonic for the performer and the listener; [22] and [23] use this to enhance tonic estimation. [21] uses rāga information as a way of backtracking to find the tonic. Of all of these, the multi-pitch approaches [22] and [23] generally have the best accuracy.
As one can imagine, a basic feature would be the set of notes that describes the rāga; [20] does this by explicitly extracting the set of notes for each rāga. However, this is done for only a handful of rāgas, and the procedure does not extend to the melodic and temporal aspects of the melody. As mentioned earlier, the salience of notes is also an important aspect of a rāga. A majority of studies have focused on extracting pitch distributions or similar variations of them [7][24][25]. This is usually done by taking the 12-bin Pitch Class Distribution (PCD) or the fine-grained Pitch Distribution (PD) of each sample and comparing them with one another. This also captures the salience of pitches, which is important in rāga detection, since one of the distinguishing features is the Vadi and Samvadi (the first and second most important notes in a rāga). These approaches have been shown to be fairly good at rāga classification; a notable approach by [7] evaluates this on 23 rāgas.
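As an illustration of this family of features, the sketch below computes a 12-bin PCD from a frame-level pitch track relative to a known tonic; the exact bin edges, smoothing, and weighting vary across the cited works, so this is only an assumed, simplified variant.

```python
import numpy as np

def pitch_class_distribution(f0_hz, tonic_hz, n_bins=12):
    """Simplified 12-bin pitch class distribution relative to the tonic."""
    f0 = np.asarray(f0_hz, dtype=float)
    f0 = f0[f0 > 0]                                     # drop unvoiced frames
    cents = 1200.0 * np.log2(f0 / tonic_hz)             # cents above the tonic
    bins = np.round(cents / (1200.0 / n_bins)).astype(int) % n_bins
    pcd = np.bincount(bins, minlength=n_bins).astype(float)
    return pcd / pcd.sum()                              # normalize to a distribution
```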
Evaluating our approach against all of the above is rather difficult due to differing evaluation techniques, data sets, and environments. We nevertheless compare our accuracy with TDMS [16], using the same evaluation setup as in TDMS for a fair comparison.
In this paper, we present a novel approach to extracting features that explain the sequential behavior of pitches and a K-Nearest Neighbours model that is trained on these features for rāga classification. The project, implemented in Python, along with the pre-trained models for Hindustani and Carnatic, is open source 1 and includes functions for easy evaluation and reproducibility.
As rāga detection requires pitch and tonic estimation as prerequisites, we first obtain the pitch values from the annotated pitch files available in CompMusic 2 [29]; these files are used for training and testing. The pitch frequencies are converted to a 720-dimensional vector spanning the 6 octaves from C2 to B6, i.e., 120 pitch values per octave at a granularity of 10 cents. During run-time inference, however, we use CREPE [13], where the raw audio is sampled with a frame width of 1024 and a hop size of 30 ms; we compute the pitch values and apply a Gaussian blur to each pitch value as described in CREPE [13]. Since rāga detection does not explicitly depend on octaves, we make the representation octave-independent by summing the 720-dimensional vector over the 6 octaves, resulting in a 120-dimensional vector.
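A minimal sketch of this binning and octave folding is shown below, assuming a frame-level pitch track in Hz and anchoring the 10-cent bins at C2 (≈ 65.41 Hz); the handling of unvoiced frames and the Gaussian blur are omitted.

```python
import numpy as np

C2_HZ = 65.406  # reference frequency of the lowest bin (C2)

def pitch_vector(f0_hz):
    """Fold a pitch track in Hz into a 120-d, octave-independent pitch vector."""
    f0 = np.asarray(f0_hz, dtype=float)
    f0 = f0[f0 > 0]                                     # keep voiced frames only
    cents = 1200.0 * np.log2(f0 / C2_HZ)                # cents above C2
    bins = np.round(cents / 10.0).astype(int)           # 10-cent bins
    bins = bins[(bins >= 0) & (bins < 720)]             # keep the 6 octaves C2..B6
    vec720 = np.bincount(bins, minlength=720).astype(float)
    return vec720.reshape(6, 120).sum(axis=0)           # sum over octaves -> 120-d
```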
The tonic frequencies are also obtained from the annotations available in CompMusic. Each tonic frequency is converted to a 120-dimensional vector in the same way as the pitch frequencies, and the index of the maximum value in this vector is taken as the tonic pitch value; it is later used to reorder the pitch values relative to the tonic.
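The sketch below derives the tonic bin in the same 120-d space and rotates the pitch vector so that the tonic sits at index 0; interpreting "reorder relative to the tonic" as a rotation is our assumption, and in the actual pipeline the index comes from the argmax of the blurred tonic vector.

```python
import numpy as np

C2_HZ = 65.406  # same reference as in the previous sketch

def tonic_index(tonic_hz):
    """Tonic bin in the 120-d octave-folded space (argmax of the tonic vector in practice)."""
    cents = 1200.0 * np.log2(tonic_hz / C2_HZ)
    return int(round(cents / 10.0)) % 120

def tonic_relative(vec120, tonic_hz):
    """Rotate the 120-d pitch vector so that the tonic lands at index 0 (assumed convention)."""
    return np.roll(vec120, -tonic_index(tonic_hz))
```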
Generally, for a new piece of audio containing a rāga (not already analyzed by CompMusic [29]), any pitch and tonic estimation system could be plugged into our rāga model to get rāga prediction.
Sequential Pitch distributions are obtained by taking the distributions of pitch values between any two given pitch values, namely
The following set of rules describes the positive-direction feature of the SPD.
A histogram of a sequence is taken when the sequence starts at index
The pitch value at
The pitch value at
All pitch values between
Here
Once valid
Since there can be cases with no valid sequences, a simple pitch distribution is extracted for the given audio sample instead. This proved to be extremely helpful for shorter audio samples, improving their accuracy by
With the relaxations in place, it turned out that it is quite sufficient if
We define a modulo function
and a function
The split
The split
Where,
The tensors
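Since the formal definition is given by the equations above, the following is only a rough, illustrative sketch of the underlying idea: for an ordered pair of pitch values (p1, p2), the positive-direction SPD accumulates a histogram over the pitch values traversed in segments that start at p1 and end at p2. The validity rule assumed here (all intermediate pitch values stay between p1 and p2) is a simplification, the relaxation R and the final tensor layout follow the definitions above, and the no-valid-sequence case falls back to the simple pitch distribution as noted earlier.

```python
import numpy as np

def spd_pair(pitch_bins, p1, p2, n_bins=120):
    """Rough sketch of one positive-direction SPD slice for the ordered pair (p1, p2).

    pitch_bins: frame-wise sequence of tonic-relative pitch bin indices (0..119).
    A segment is assumed valid when it starts at p1, ends at p2, and every frame in
    between stays within [min(p1, p2), max(p1, p2)]; the paper's relaxation R of
    these rules is omitted here for brevity.
    """
    hist = np.zeros(n_bins)
    lo, hi = min(p1, p2), max(p1, p2)
    start = None
    for t, b in enumerate(pitch_bins):
        if start is None:
            if b == p1:
                start = t                        # a candidate segment begins
        elif b == p2:                            # candidate segment ends: accumulate it
            hist += np.bincount(pitch_bins[start:t + 1], minlength=n_bins)
            start = None
        elif not (lo <= b <= hi):                # left the allowed range: discard
            start = None
    total = hist.sum()
    return hist / total if total > 0 else hist   # empty slice handled by the PD fallback
```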
Tradition | TDMS (%) | SPD-KNN (%) | PD-KNN (%) |
---|---|---|---|
Hindustani | 97.67 | 99 | 90.66 |
Carnatic | 86.7 | 88.13 | 72.33 |
Table 1. Accuracy comparison between TDMS, SPD, and Simple Pitch Distribution (PD) with the KNN model
An example of extracting a subset of the SPD, together with the pitch values (obtained from CREPE) for a popular song, Bhaje Sargam, is shown in Figure 4. Due to space constraints, we show the pitch values of the main chorus re-sampled at 1 sample per second (played at 60 beats per minute) and rounded to the nearest of the 12 standard chromatic pitch values for simplicity; the pitch values for this song have also been verified by experts in the field. The SPD subset for the transition from M to n in the positive direction is obtained by taking the distribution from time index 12 to 14 in Figure 4; similarly, the SPD subset for the transition from n to M in the negative direction is obtained by taking the distribution from time index 21 to 23. This SPD subset illustrates an important characteristic of the rāga, namely the dissimilarity between the two transitions. The distribution is not as smooth as those shown in Figure 3 because the audio sample shown here has very few pitch values. Another example, of the positive- and negative-direction features for the transitions between Ma' (60) and Ni (110) for the rāgas Puriyā dhanaśrī and Śrī, is shown in Figure 3 (top row). Although both rāgas share the same set of notes, the features clearly show dissimilar positive-direction features and very similar negative-direction features, as expected. A further example, for the rāgas Kāpi and Ānandabhairavi and the transitions between Pa (70) and Sa (0), is shown in Figure 3 (bottom row); the positive- and negative-direction features show important differences between the rāgas even though both share the same notes. Also note the positions of the prominent peaks, which correspond to the salient pitch values.
11 features of shape
Where
The accuracy for different values of
The rāga datasets were requested from CompMusic and comprise separate Hindustani and Carnatic datasets. The Hindustani dataset consists of 300 files covering 30 rāgas, with 10 files per rāga, spanning 130 hours. The Carnatic dataset consists of 480 files covering 40 rāgas, with 12 files per rāga, spanning 124 hours. The corresponding pitch files have a sampling period of 4.44 ms, and the datasets also include the tonic frequency for each file. The datasets are well balanced in terms of rāgas, artists, and compositions and consist entirely of vocal performances.
During training, the SPD is calculated for each audio sample. Each of the 25 KNN models is trained in a leave-one-out cross-validation (LOOCV) manner. The predictions of the individual KNN models are combined by a weighted average, whose weights are learned using a single-layer neural network. The SPD is cached for all the files, and we also cache the indexes
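A minimal sketch of this ensembling step is given below, under the assumption that each feature subset yields leave-one-out class probabilities from its own KNN; the softmax-normalized weight vector stands in for the single-layer network, and the metric, K, and weight training are simplifications.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

def loo_probabilities(features, labels, k=5):
    """Leave-one-out class probabilities for one KNN model (simplified sketch)."""
    knn = KNeighborsClassifier(n_neighbors=k)
    return cross_val_predict(knn, features, labels,
                             cv=LeaveOneOut(), method="predict_proba")

def combine(prob_list, weights):
    """Weighted average of per-model class probabilities (stand-in for the single-layer net)."""
    w = np.exp(weights) / np.exp(weights).sum()   # softmax-normalized model weights
    return sum(wi * p for wi, p in zip(w, prob_list))
```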
| | Nearest Neighbours (K) | | | | Distance (D) | | Relaxation (R) | | |
|---|---|---|---|---|---|---|---|---|---|
| Tradition | 1 | 3 | 5 | 7 | | | 0 | 2 | 4 |
| Hindustani | 97.66 | 98.67 | 99 | 98.33 | 97.67 | 99 | 91.33 | 94.67 | 99 |
| Carnatic | 84.16 | 86.04 | 88.13 | 84.76 | 84.38 | 88.13 | 76.25 | 80.02 | 88.13 |
Table 2. Accuracy (%) for different nearest neighbors in the KNN model (with
The accuracies for the different approaches are shown in Table 1. The PD-KNN column gives the accuracy of the simple Pitch Distribution with KNN (without SPD). The poorer performance on the Carnatic dataset is likely due to its relatively shorter audio files and larger number of classes compared to the Hindustani dataset.
We also calculate the Bhattacharyya distance between the positive- and negative-direction features of each rāga, as shown in Figure 5 for the Hindustani dataset. Some rāgas that have asymmetrical positive and negative direction patterns are Des, Basant, and Miyan Malhar, to name a few, and we observe higher distances for such rāgas, as expected.
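For reference, the Bhattacharyya distance between two normalized distributions p and q used for this comparison follows the standard definition below; implementation details in our code may differ slightly.

```python
import numpy as np

def bhattacharyya_distance(p, q, eps=1e-12):
    """D_B(p, q) = -ln( sum_i sqrt(p_i * q_i) ) for normalized distributions."""
    p = np.asarray(p, dtype=float) / np.sum(p)
    q = np.asarray(q, dtype=float) / np.sum(q)
    return float(-np.log(np.sum(np.sqrt(p * q)) + eps))
```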
We also analyze the predictions produced by our models. The confusion matrix of the predicted rāga labels for the Carnatic dataset is shown in Figure 6; there are 57 incorrectly classified recordings. In general, we find that the confusions are among the rāgas in the sets {Kāmbhōji, Harikāmbhōji}, {Hussēnī, Mukhāri}, and {Madhyamāvati, Śrī, Atāna}; these rāgas share a common set of notes and similar phrases [30]. For the Hindustani dataset, only 3 recordings are incorrectly classified. In 2 of these cases, the confusion is among the rāgas in the sets {Darbāri, Mārvā} and {Bāgēśrī, Khamāj}, where the error stems from the estimation of the tonic.
The method we have illustrated is not limited to Indian music; it can be applied to any music genre that follows a tonic-rāga scheme. SPD could also be used for analyzing melodic phrases, motifs, and patterns to gain a deeper understanding of a rāga.
One major area for improvement is the accuracy for Carnatic-like traditions, where the usage of Gamaks is quite strong, as noted above. We have considered
I would like to express my gratitude and respect to my father, Shri Narasinh Hegde, who passed away due to COVID-19. He inspired me, taught me music, and helped me in several ways while I was working on this project. I also thank Sankalp Gulati for mentoring me and helping me improve this paper, and CompMusic for sharing the datasets.