# Statistical evaluation of abc-formatted music at the levels of items and corpora

# Abstract

# Introduction

# Methodology

## Possible distances between items and collections

## Comparing collections

## Comparing an item to a reference collection

# Application and evaluation

## Pairwise distances within and between collections

## Comparing a collection to a reference collection

## Comparing an item to a reference collection

## Visualizing the collections

# Discussion and future work

# Acknowledgments

Published on Aug 29, 2023


This paper explores three distance measures and three statistical tests for the comparison of music expressed in abc format. We propose a methodology that allows for an analysis at the level of corpora (is the “style” represented in one corpus the same as that in another corpus?) as well as at the level of items (is the “style” of an item that of the “style” represented in a corpus?). We estimate distributions of distances between item pairs within and between corpora, and test hypotheses that the distributions are identical. We empirically test the impact of distance measure and statistical test using a corpus of Irish traditional dance music and a collection of tunes generated by a machine learning model trained on the same. The proposed methodology has a variety of applications, from computational musicology to evaluating machine-generated music.

Musical style imitation (SI) systems analyze a collection of music exhibiting a particular style and then generate new music in that same style. The engineering of such systems has been an active area of research in artificial intelligence and music since the 1950s [1]. Early systems, like David Cope’s Experiments in Musical Intelligence (EMI) [2] and Kemal Ebcioğlu’s CHORAL [3], rely on the recombination of pre-composed material, or hand-crafted rules, to capture the style of composers such as Bach and Mozart. More recent systems apply data-driven approaches, such as deep learning techniques [4][5][6]. The development and use of such systems are also attracting a significant amount of investment to meet the need for royalty-free and adaptive music in games and online videos for social networks.1

The output of SI systems can be evaluated in various ways, from comparing statistics [7][8][9][10][11], to music analysis and in situ testing [12][13][14], to listening tests [15][16]. Since human labor is expensive, and since the scale of material to be assessed can be orders of magnitude larger than possible for human scales of memory and attention, *and* since the development timeline of SI systems can be rapid, automating the “critique” of SI system output becomes necessary. One approach is *CAEMSI* [17], which compares two music corpora via permutation testing with a domain-independent distance metric, i.e., the Normalized Compression Distance. Yang and Lerch [8] propose probabilistic measures of a variety of musically motivated features for comparing collections of MIDI-formatted music. While all of these methods focus on comparing collections, *StyleRank* [10] is able to assess the stylistic similarity of a particular item to a collection expressed in MIDI via machine learning models trained on features describing melodies and chords.

Other works related to symbolic folk music similarity include the study by Carvalho et al. [18], where they introduce a novel approach to encoding, analyzing, and modeling Iberian folk music. They propose a similarity-based interface that allows for an intuitive exploration of a database by visualizing songs on a 2-D plot using a dimensionality reduction algorithm. The similarity measures they use follow musical criteria such as melody, rhythm, and structure. Janssen et al. [19] conduct a study comparing various similarity measures appearing in music research across different domains. Their aim is to accurately identify melodic segments in folk songs. The measures they examine include correlation distance, city block distance, Euclidean distance, local alignment, wavelet transform, and structure induction. To evaluate the measures, they compare the generated phrase annotations against annotated phrase occurrences in a corpus of Dutch folk songs, using a majority vote from three annotators to determine agreement. Additionally, they investigate how the choice of music representation influences the success of these measures and assess the robustness of the most effective ones when applied to subsets of the data.

In contrast to the above, we wish to design a methodology for determining which items of a music collection can be considered outliers, or in some sense uncharacteristic, of the style exhibited by that collection or another one. In our approach, we adapt the domain-agnostic approach of CAEMSI to answer questions about items in collections. We focus in particular on the 365 double jigs of O’Neill’s *The Dance Music of Ireland: 1001 Gems* (1907) [20], and several thousand “imitations” generated by the SI system folk-rnn (v2) [21][22]. The features used in [10] are not relevant to this kind of music, which does not explicitly feature chords or harmony. We instead stay in the domain of representations used by folk-rnn, i.e., either abc notation2 or a 411-dimensional vectorization of the internal representation of the folk-rnn (v2) model for an output. In this paper, our exploration is limited to comparing abc-notated strings or the internal representation of folk-rnn (v2). However, it is important to note that our proposed method is not limited to these specific representations. It is designed to be domain-agnostic and can be applied to other popular representations such as MIDI or musicXML. We examine three distance measures and three statistical tests for comparing collections and their items. We present a variety of results, and discuss future directions of this work, both in the domain of music generation and in computational musicology.

Let us consider one item from O’Neill’s collection, “The Jolly Joker”:

```
M:6/8 K:Cmaj C > c c c E F | G A F E F D | C > c c c E G | F A G F E D | C > c c E > c c | D > c c C > c c | e c A G F E | D A G F E D :| |: E D E C z C | C E G G F D | E D E C 3 | e c A A B c | E D E C z C | C E G G F E | F > A F E > G E | D > A G F E D :|
```

and one from the collection generated by folk-rnn (v2), tune No. 8091:

```
M:6/8 K:Cmix |: G c c G 2 F | G c B G 2 F | D E F D E F | G F D F 3 | G c c G 2 F | G c B G 2 F | D E F B G F | D C B, C 3 :| |: c 2 d e c A | G A B G 3 | F D D B 2 G | F D D F D B, |c 2 d e c A | G A B G 3 | F D D B 2 D | D C B, C 3 :|
```

Both tunes were assessed by the judges of *The Ai Music Generation Challenge 2020* [14]. In particular, “The Jolly Joker” is deemed a very poor tune from O’Neill’s 1001 – a collection that “has plenty of poor and dull tunes”, said one of the judges. And the folk-rnn (v2) tune No. 8091 won the first place award, with one judge saying, “If you heard [the tune] in a session, it wouldn’t stick out; it would feel comfortable with all the old tunes that are played”. Can such judgments be reflected statistically? Is each of these items characteristic of the style exhibited by O’Neill’s collection?

To answer these questions, we first define a distance metric between any pair of items within and between these collections, and then consider statistical methods for testing membership. Of course, the relationship between the musical content of these strings and any distance computed between them is debatable, but this is not necessarily a problem in our case since the folk-rnn model is merely modeling the syntax of tokenized symbolic transcriptions found in O’Neill’s and similar music, and it is the syntax exhibited by these collections and their items that we wish to test.

The normalized compression distance (NCD) [23] can be used to compare the similarity between two strings. Essentially, it measures the length of the shortest binary program that can compute one string from another and vice versa. This is defined by

$N(x, y)=\frac{K(x+y)-\min (K(x), K(y))}{\max (K(x), K(y))} \tag{1}$

where $K(\cdot)$ denotes the Kolmogorov complexity of a string and $x+y$ denotes the concatenation of the two strings. Since $K(\cdot)$ is not computable, in practice it is approximated by the length of the output of a standard compression algorithm; we use zlib and LZMA.
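As a concrete illustration, Eq. (1) can be approximated with off-the-shelf compressors. The sketch below is an illustration under that approximation, not the authors' exact implementation; it uses Python's zlib and lzma modules, corresponding to the ncd_zlib and ncd_lzma rows in the tables later in the paper:

```python
import lzma
import zlib

def ncd(x: str, y: str, compress=zlib.compress) -> float:
    """Approximate the NCD of Eq. (1), replacing the uncomputable K(.)
    with the length of a compressor's output."""
    cx = len(compress(x.encode()))
    cy = len(compress(y.encode()))
    cxy = len(compress((x + y).encode()))
    return (cxy - min(cx, cy)) / max(cx, cy)

# LZMA variant:
# ncd(x, y, compress=lzma.compress)
```

In practice the result can slightly exceed 1 because real compressors are imperfect approximations of $K(\cdot)$.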

The Levenshtein distance (also known as the edit distance) is a measure of the difference between two sequences [24]. It is the minimum number of single-character insertions, deletions, and substitutions required to transform one string into the other, where each operation has a cost (in our case, we set each cost to one). The normalized Levenshtein distance divides this minimum number of operations by the length of the longer of the two strings, yielding a value in $[0, 1]$.
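A minimal dynamic-programming sketch of the (normalized) Levenshtein distance; the normalization by the longer string's length is a common convention and an assumption here, not necessarily the paper's exact choice:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions (unit costs), via the classic DP recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def normalized_levenshtein(a: str, b: str) -> float:
    """Scale the edit distance into [0, 1] by the longer string's length."""
    return levenshtein(a, b) / max(len(a), len(b), 1)
```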

The cosine distance measures the angle between two non-zero-length vectors of the same dimensionality. The cosine distance between two vectors $\mathbf{x}$ and $\mathbf{y}$ is

$C(\mathbf{x}, \mathbf{y})=1-\frac{\mathbf{x}^T\mathbf{y}}
{\|\mathbf{x}\|_2\|\mathbf{y}\|_2} \tag{2}$

where $\|\cdot\|_2$ denotes the Euclidean norm.
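Eq. (2) translates directly into code; a small stdlib-only sketch:

```python
import math

def cosine_distance(x, y):
    """Eq. (2): one minus the cosine of the angle between x and y.
    Assumes both vectors are non-zero and of equal dimensionality."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (nx * ny)
```

Collinear vectors give a distance of 0 and orthogonal vectors a distance of 1.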

We now propose ways to compare two collections, using the distributions of pairwise distances computed within and between them.

The Mann-Whitney U test (MW-test), also known as the Mann-Whitney-Wilcoxon or the Wilcoxon rank-sum test, is a non-parametric test used to compare two independent groups of samples to determine if they come from the same population [25]. Specifically, the test is used to examine if the two groups have the same median or if one group tends to have larger values than the other. An in-depth formulation of the MW-test is found in [26]. The MW-test makes four assumptions: the data must be ordinal, the two groups must be independent, the observations within each group must be mutually independent, and the two distributions should have the same shape (although this last assumption is not a strict requirement). The null hypothesis is that the two groups come from the same population, which is equivalent to saying that the medians of the two groups are equal. This null hypothesis implies that the two groups have the same probability distribution, without specifying the common distribution.
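For intuition, the U statistic can be computed by brute force as the number of sample pairs in which the first group's value exceeds the second's, counting ties as one half; in practice a library routine such as scipy.stats.mannwhitneyu also returns the p-value. A sketch:

```python
def mann_whitney_u(a, b):
    """Brute-force U statistic for sample a: count pairs (x, y) with
    x from a and y from b where x > y, counting ties as 1/2. O(n*m)."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0
               for x in a for y in b)
```

Under the null hypothesis U is close to len(a) * len(b) / 2; values near 0 or len(a) * len(b) indicate one group tends to be smaller or larger than the other.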

The Kolmogorov-Smirnov test (KS-test), on the other hand, is used as a goodness-of-fit test to check if a certain set of continuous values follows a certain theoretical distribution or to compare if samples from two groups have the same distribution [27]. This test relies on the maximum absolute difference between the cumulative distribution functions of samples between the two groups. Unlike the Mann-Whitney U test, the KS-test can tell if a sample follows a certain distribution or not. When comparing two samples, the KS-test is sensitive to both the shape and location of the two samples, making it robust for comparing if the two distributions are the same. Under the null hypothesis the maximum absolute difference between the cumulative distribution functions follows a Kolmogorov-Smirnov distribution.
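The two-sample KS statistic has a simple direct computation, sketched below (in practice scipy.stats.ks_2samp returns the statistic together with its p-value):

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum absolute
    difference between the two empirical cumulative distribution functions."""
    a, b = sorted(sample_a), sorted(sample_b)
    na, nb = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < na and j < nb:
        x = min(a[i], b[j])
        # advance both ECDFs past all observations equal to x
        while i < na and a[i] == x:
            i += 1
        while j < nb and b[j] == x:
            j += 1
        d = max(d, abs(i / na - j / nb))
    return d
```

The statistic is 0 for identical samples and 1 for samples whose supports do not overlap.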

The key difference between the MW-test and the KS-test is that the MW-test compares the two groups on the basis of a measure of central tendency (usually the median), while the KS-test compares on the basis of statistical distance and is usually used as a goodness-of-fit test to check if the data follows a certain distribution. While the MW-test is applicable when the data is ordinal and independent, the KS-test can be used when the data is continuous and can follow any distribution.

Finally, the Kruskal-Wallis H test (KW-test), considered an extension of the Mann-Whitney U test, is a non-parametric statistical test used to compare the medians of two or more independent samples [28]. It ranks the samples in all groups combined and calculates a test statistic that measures the difference between the ranked medians of the groups, without requiring the assumption of normality or equal variances.

We now examine how an item compares to a collection. We estimate a *reference* distribution of the pairwise distances between elements of the collection, and, for the item in question, a *calculated* distribution of its distances to those elements, using the same kernel density estimation technique. To determine whether the item is characteristic of the collection, we measure the area of intersection between the two estimated densities.

We can also compare a whole collection to a reference collection in the same way, by comparing the distribution of distances within the reference collection to the distribution of distances between the two collections.

We now evaluate the distance metrics above and the statistical tests for the collection of O’Neill’s 365 double jigs and a collection of tunes generated by folk-rnn (v2).

We calculate the pairwise distances within each collection and between the two collections.

Immediately clear from these images are the differences within the collections.

We now perform a variety of statistical tests on the distributions of the distances for a random 80% sampling of each collection, holding out the remainder as a test set.

Tables 1 and 2 present descriptive statistics of the computed distances, giving a view of how the distance metrics differ. At first glance, the least reliable distance appears to be the cosine distance, as its higher standard deviation indicates more variability and less consistency in the measurements, making it harder to predict.

Distance | N | Mean | Std. Dev. | Minimum | Maximum |
---|---|---|---|---|---|
ncd_zlib | 42486 | 0.674 | 0.062 | 0.148 | 0.905 |
ncd_lzma | 42486 | 0.486 | 0.067 | 0.051 | 0.758 |
Levenshtein | 42486 | 0.418 | 0.077 | 0.023 | 0.764 |
cosine | 42486 | 0.202 | 0.126 | <0.001 | 0.636 |

Table 1. Descriptive statistics of the pairwise distances within the sampled O’Neill collection.

Distance | N | Mean | Std. Dev. | Minimum | Maximum |
---|---|---|---|---|---|
ncd_zlib | 85264 | 0.664 | 0.060 | 0.467 | 0.904 |
ncd_lzma | 85264 | 0.496 | 0.053 | 0.311 | 0.770 |
Levenshtein | 85264 | 0.573 | 0.090 | 0.286 | 0.865 |
cosine | 85264 | 0.205 | 0.119 | 0.006 | 0.643 |

Table 2. Descriptive statistics of the pairwise distances between the sampled O’Neill collection and the sampled folk-rnn collection.

For the KW-test and the KS-test we use a two-tailed alternative; that is, we test only whether the distributions underlying the two samples differ. The results in Table 3 indicate that most of the distance measures are useful for differentiating between the two sets of distances.

However, when the alternative hypothesis is one-sided, i.e., that the distribution underlying one sample is stochastically greater than that underlying the other, the conclusions can differ.

Distance | ncd_zlib | ncd_lzma | Levenshtein | cosine |
---|---|---|---|---|
Kruskal-Wallis statistic | 819.170 | 1241.844 | 58043.036 | 96.965 |
Asymp. Sig. (p-value) | <0.001 | 0.115 | <0.001 | <0.001 |
Kolmogorov-Smirnov statistic | 15.794 | 27.607 | 117.831 | 11.542 |
Asymp. Sig. (p-value) | <0.001 | <0.001 | <0.001 | <0.001 |

Table 3. Results of the two-tailed Kruskal-Wallis test and the Kolmogorov-Smirnov test, comparing the two samples of distances for each distance measure.

If we look at the test statistics, normalized Levenshtein has the highest value, which indicates that it is the distance metric best at differentiating between the two samples.

In order to compare an element to the collections, we compute the area of intersection between its calculated distribution and each reference distribution; Table 4 reports the mean of this area over elements, for each combination of reference and calculated distribution.

Distance | ncd_zlib | ncd_lzma | Levenshtein | cosine |
---|---|---|---|---|
| 0.779 | 0.756 | 0.787 | 0.804 |
| 0.692 | 0.669 | 0.626 | 0.735 |
| 0.799 | 0.796 | 0.861 | 0.804 |
| 0.765 | 0.768 | 0.770 | 0.794 |

Table 4. Mean-area-of-intersection for each distance, for all combinations of reference distributions and calculated distributions.

We can then classify unseen elements by assigning each one the label of the collection whose mean area of intersection is closest to the element’s estimated area of intersection. We evaluate this procedure by calculating several metrics on the test set, including accuracy, precision, recall, and F1 score. The results are consistent with the statistical tests, as normalized Levenshtein outperforms the other distances.
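The labeling rule above is a nearest-mean decision and can be sketched in a few lines (the mean areas used below are hypothetical values in the spirit of Table 4, not the paper's exact figures):

```python
def classify(area, mean_areas):
    """Label an unseen element with the class whose mean area of
    intersection lies closest to the element's estimated area."""
    return min(mean_areas, key=lambda label: abs(mean_areas[label] - area))

# Hypothetical per-class means, e.g. estimated on a training split:
means = {"real": 0.787, "generated": 0.626}
```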

Distance | ncd_zlib | ncd_lzma | Levenshtein | cosine |
---|---|---|---|---|
Acc | 0.582 | | 0.582 | 0.596 |
Prec | 0.561 | | 0.552 | 0.565 |
Recall | 0.753 | 0.767 | | 0.836 |
F1 | 0.643 | 0.655 | | 0.674 |

Table 5. Evaluation of the classifier based on the mean-area-of-intersection on the test set.

To visually compare the two collections of tunes, we use the t-distributed Stochastic Neighbor Embedding (t-SNE) [29] and Uniform Manifold Approximation and Projection (UMAP) [30] algorithms. We can represent the items in each collection as points in two dimensions.

We can use the visualizations to identify “typical” tunes and “outliers” in both collections. For instance, the following tune, no. 1923 from the folk-rnn collection,

```
M:6/8 K:Cmaj e f g e c c | e c g e c c | f a f e f d | e f g a g f | e f g e c c | e c g e c c | f e f d B d | c e c c 2 g | c' 2 g g a b | c' g g g f e | f 2 f a b f | f g a b a g | c' g g g f e | a f a g e c | e f g a g f | e f d c 2 a :|
```

is missing the repeat signs, which are characteristic of jigs. If we take a look at Image 8, it becomes more evident that tune no. 1923 is an outlier, lying away from the main cluster of its collection.

Much more work can be done in this direction, for instance, isolating portions of a collection that are overlapping with a reference corpus in order to select stylistically similar items.

Our paper examines various distance measures and statistical tests to compare items in corpora expressed with the abc notation format. These comparison methods combine elements of other methods, such as CAEMSI [17] and StyleRank [10], but we aim to do more than just compare collections. One objective is to provide a method for curation, filtering one collection in reference to another. Another is to identify “typical” elements and “outliers” within a collection. These approaches could be used in the machine learning pipeline as a proxy evaluation of a sequence model at points in its training. They could also augment musicological studies of tune collections done by hand, as done on O’Neill’s “1001” by Doherty [31] for instance. By using these studies as a benchmark, we can further evaluate the results obtained from the same collection in a more musically meaningful manner.

At the moment the distance measures we are using focus only on sequences of tokens, or on comparing statistics of vectors. One problem with the Levenshtein distance could be a sensitivity to musically irrelevant transformations, such as implicit repeat signs, first and second endings, and an anacrusis. In the future, we would like to explore domain-dependent distance measures, and considerations of the form of these kinds of tunes: they have parts, and the relationship between the parts could provide information about membership in a collection. Other future work includes exploring in depth the impact of the choices made when comparing distributions, such as the choice of kernel density estimate or the area of intersection. Furthermore, we intend to expand the scope of our study to include other music “styles”. We also plan to investigate BERT representations and develop a Fréchet Symbolic music distance. The goal would be to develop a robust method for comparing items in a collection to a reference corpus of real tunes in a symbolic format. Another possible extension is to apply principles of explainability to identify what it is about a tune that makes it an outlier. Finally, our research may be employed to enhance the effectiveness of machine learning algorithms that analyze and compare items in a collection to a corpus.

This work was supported by the grant ERC-2019-COG No. 864189 MUSAiC: Music at the Frontiers of Artificial Creativity and Criticism.