
Building a Nature Soundscape Generator for the Post-Biodiversity Future

A pilot machine learning model for generating infinite pseudo-natural soundscapes

Published on Aug 30, 2023


Exposure to nature has been linked to physiological and psychological wellness. As global biodiversity continues to decline, our opportunity to build a meaningful connection to nature follows. The loss of complexity in natural soundscapes is a bellwether for such biodiversity loss. Sounds of nature have been shown to calm our minds, ease our bodies, and connect us to our environments. Thus, we must preserve them for future generations. Here, I present a pilot generative model for the creation of pseudo-natural soundscapes to provide a pseudo-connection to nature for the post-biodiversity world. Trained on thousands of hours of nature soundscapes from across the world, the model demonstrates the ability to generate completely novel digital soundscapes as hybrids of currently existing natural ones. Further work in this area will ensure unlimited pseudo-nature sounds for the future despite biodiversity collapse.


Biodiversity loss is occurring across the planet and across species. A 2019 Intergovernmental Science-Policy Platform on Biodiversity and Ecosystem Services (IPBES) report found that the average abundance of native species has fallen by at least 20 percent since 1900, due in large part to habitat loss from human development, including the loss of tropical and boreal forests and over 85 percent of wetlands [1]. Intergovernmental Panel on Climate Change (IPCC) reports have also identified climate change processes that drive biodiversity collapse, including increasing wildfires and shifts in seasonal timings [2].

Soundscapes reveal much about the natural, geological, and anthropological processes at play in a given area. For example, they convey information about vocalizing species such as birds, insects, and frogs, which are crucial indicators of environmental health [3]. Biophony, the sound produced by organisms in a given environment, has decreased in parallel with biodiversity, reducing the complexity of soundscapes globally. A 2021 study by Morrison reconstructed historical soundscapes from North American and European bird survey data to study these changes. From the 1990s to 2018, the study predicted decreases in several acoustic indices correlated with biodiversity, as well as an increase in Acoustic Evenness, which is correlated with a loss of biodiversity [4].

Nature and its biodiversity deserve protection for their own sake. However, our relationship with nature is also critical in supporting human health, purpose, and community. Many studies have demonstrated the benefits of exposure to nature, including reduced stress [5], improved attention [5], and lower violent crime rates [6]. Others have found positive correlations between nature exposure and personal reports of meaning in life [7] and sense of community identity [8]. Soundscapes are a critical way in which we connect to our surroundings, and humans, unsurprisingly, prefer natural soundscapes to anthropogenic ones such as those in cities [9].

If the ecological and climatic trends presented by the IPBES and IPCC continue, biodiversity and the complexity of natural soundscapes will continue to decline. However, our innate need to connect to nature will remain. While virtual natural soundscapes, such as those created by Morrison, could be viewed as an "extension of nature" and not nature in themselves [10], it may come to be that we rely on them as a replacement for the real thing. Virtual natural soundscapes have shown some of the same stress-reducing effects as real nature [10][11]. Medvedev, for example, found that the skin conductance (a measure of stress) of subjects listening to natural sounds recovered to baseline levels faster than that of subjects listening to busy city recordings [11].

The motivation of this study is to continue providing natural soundscapes to humanity despite biodiversity collapse, and to generate completely novel soundscapes. For this purpose, a convolutional variational autoencoder (CVAE) was trained on approximately 1700 hours of natural soundscapes recorded across the world: Norway, France, Ukraine, Borneo, the United States, and China. The model encodes approximately four-second soundscapes as mel spectrograms with alternating convolutional and max pooling layers, followed by a single linear layer before the model's latent space. The decoder mirrors the encoder, using alternating convolutional transpose and upsampling layers to return a [128,128] mel spectrogram.

By exploring the latent space of the CVAE, one can generate novel soundscapes that are combinations of the input data. For example, the model can slowly morph between two geographic regions and/or two time periods; several examples of this are illustrated and sonified in this pilot study. This ability to generate a wide array of new nature soundscapes will become increasingly useful if current trends continue, as future generations will have less complex and more homogenous soundscapes globally [4].

A variety of machine learning techniques have been applied in previous works for generation of audio, including variational autoencoders [12], audio waveform diffusion [13], spectrogram diffusion [14], and generative adversarial networks [15]. Methods vary based on application, the amount of available data, and the length and complexity of the audio samples. For example, training on raw audio waveforms can create high-resolution outputs, but may struggle with long-term dependencies. In contrast, training on spectrogram images can create somewhat noisy outputs, but may more easily identify longer-term patterns [16].




Recordings from passive acoustic monitoring performed under the Sound of Norway project, in which the author is involved. It includes dawn recordings from June 2023 with many bird vocalizations as well as a prevalence of rain. It consists of 12,000 5-second files after processing.


A dataset of 20,000 10-second samples recorded from autonomous recorders near Ithaca, New York in the fall of 2015. The data were filtered to those labeled as having bird detections by the dataset authors, about half of the files. A total of 11,684 5-second samples were utilized from this source. Downloaded from Zenodo.


A dataset of recordings in the rainforest of Borneo during afternoons in June 2019. Soundscapes are lush and busy and include bird calls and insect drones. Provided by Sarab Sethi and the SAFE project. Consists of 11,760 5-second files after processing.


Recordings from passive acoustic monitoring inside the Chernobyl Exclusion Zone that includes bird, insect, and mammal vocalizations. This data set served as an evaluation set for the 2018 DCASE bird audio detection task. It consists of 10,558 5-second files after processing. Downloaded from Zenodo.


Recordings by Ray Tsu (Xeno Canto ID: FROVFAFTMA) in the Tianlin Community of Shanghai, China. All recordings are from Spring months from 2020-2022. They include multiple species including the Chinese Blackbird, Pale Thrush, and the Yellow-browed Warbler. Consists of 6,224 5-second files after processing. Files downloaded from Xeno-Canto.


Recordings by Stanislas Wroza (Xeno Canto ID: SDPCHKOHRH) in the Grande-Rivière Château commune within the Jura department and Bourgogne-Franche-Comté region of France. All recordings are from evenings in March 2021 and capture the call of the Boreal Owl. Consists of 12,953 5-second files after processing. Files downloaded from Xeno-Canto.


Recordings from a stationary microphone outside the Cornell Lab of Ornithology in Ithaca, New York. Provided by Holger Klinck and Chris R. Pelkie. Two sub-classes are included, a summer class including recordings from dawn hours in July 2018 and a winter class including dawn recordings from January 2016. The summer class includes 16,865 5-second files after processing and the winter class includes 15,685 5-second files.


All data sets were converted to WAV files with a sample rate of 16000 Hz and 16-bit depth. Frequencies below 300 Hz were removed to eliminate low-end noise and sharpen model focus on the mid and high frequencies and the biophony present in the samples. Using Python, all files were then sliced into 5-second segments. Using the Librosa library, mel spectrograms were generated from each WAV file using an FFT window length of 2048 and a hop length of 512. Spectrograms were converted to the decibel scale, then normalized from 0 to 1. The tail of each spectrogram was then cut to give a final square shape of [128,128] for simpler convolutional operations in the CVAE model.

Further preprocessing steps were applied to achieve a normal distribution of input data and ease model training, due to the normal distribution of data in CVAE latent space. First, the data was clamped to a max value of 0.5. Data was then split into a training set (80 percent), validation set (15 percent), and test set (5 percent). Then a min-max scaler with a range of (0,1) was fit to the training set, and applied to transform each of the three sets. The final training, validation, and test sets had 78107, 9763, and 9764 samples, respectively.
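The clamp-split-scale sequence can be sketched as below. This is an illustrative NumPy version using a single global min-max scaler fit on the training set; the original pipeline may have used a per-feature scaler (e.g., scikit-learn's MinMaxScaler), and the random seed is an assumption:

```python
import numpy as np

def prepare_sets(spectrograms, clamp_max=0.5, seed=0):
    """Clamp to 0.5, split 80/15/5, then min-max scale using training-set statistics only."""
    X = np.minimum(spectrograms, clamp_max)      # clamp to a max value of 0.5
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train, n_val = int(0.80 * len(X)), int(0.15 * len(X))
    train = X[idx[:n_train]]
    val = X[idx[n_train:n_train + n_val]]
    test = X[idx[n_train + n_val:]]
    # Fit the (0, 1) min-max scaler on the training set and apply it to all three splits
    lo, hi = train.min(), train.max()
    scale = lambda a: (a - lo) / (hi - lo)
    return scale(train), scale(val), scale(test)
```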


The soundscape generator model is a convolutional variational autoencoder built in PyTorch Lightning (SI Fig. 1) with 9.4 million trainable parameters. It was built upon a basic framework of a PyTorch variational autoencoder provided by Medium author and data scientist Reo Neo.

The encoder consists of 5 segments, each with a convolutional layer, a ReLU activation, and a max pool layer. Stacking convolutional layers in this fashion in theory allows shallow layers to capture lower-level features like borders and edges and deeper layers to capture intricacies and patterns. Additionally, as the input spectrogram images are quite large for a network of this size, the max pool layers are used both to compress the spectrograms and to highlight frequency-time regions with greater energy. A single linear layer of size [2048] connects the encoder to the latent space. The decoder is a reversed version of the encoder, with upsampling layers instead of max pool layers to increase image size.

The latent space consists of two linear layers of size 1024, representing the mean and log variance of the latent distribution. Variational autoencoders, in contrast to standard autoencoders, have a latent space consisting of a high-dimensional normal distribution, which these layers encode. This is intended to create a smoother latent space for sampling and generating new outputs lying between input classes.
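The architecture described above can be sketched in plain PyTorch as follows. The channel widths are illustrative assumptions, not the exact values behind the 9.4-million-parameter model; the segment structure (5× Conv-ReLU-MaxPool, a 2048-unit linear layer, 1024-dimensional mean and log-variance layers, and a mirrored decoder with upsampling) follows the text:

```python
import torch
import torch.nn as nn

class CVAE(nn.Module):
    """Sketch of the soundscape CVAE; channel widths are illustrative, not the paper's exact values."""
    def __init__(self, latent_dim=1024):
        super().__init__()
        chans = [1, 16, 32, 64, 128, 256]
        # Encoder: 5 segments of Conv -> ReLU -> MaxPool, halving 128x128 down to 4x4
        enc = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            enc += [nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)]
        self.encoder = nn.Sequential(*enc, nn.Flatten(), nn.Linear(256 * 4 * 4, 2048))
        # Latent space: two linear layers for the mean and log variance
        self.fc_mu = nn.Linear(2048, latent_dim)
        self.fc_logvar = nn.Linear(2048, latent_dim)
        # Decoder mirrors the encoder, with Upsample in place of MaxPool
        self.fc_dec = nn.Linear(latent_dim, 256 * 4 * 4)
        dec = []
        for c_in, c_out in zip(chans[:0:-1], chans[-2::-1]):
            dec += [nn.Upsample(scale_factor=2), nn.ConvTranspose2d(c_in, c_out, 3, padding=1), nn.ReLU()]
        self.decoder = nn.Sequential(*dec[:-1], nn.Sigmoid())  # final Sigmoid keeps outputs in [0, 1]

    def reparameterize(self, mu, logvar):
        # Sample z = mu + sigma * eps so gradients flow through mu and logvar
        return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = self.reparameterize(mu, logvar)
        out = self.fc_dec(z).view(-1, 256, 4, 4)
        return self.decoder(out), mu, logvar
```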

The model was trained for 51 epochs using a learning rate of 0.00001, a batch size of 4, and the Adam optimizer. The loss function was the per-image sum of squared errors, which was found to improve temporal resolution on the validation set compared to the per-image mean squared error (MSE).
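The summed reconstruction loss can be written as a one-liner, summing squared error over each image and averaging over the batch:

```python
import torch

def summed_mse(recon, x):
    """Sum squared error over each image's pixels, then average over the batch."""
    return ((recon - x) ** 2).sum(dim=(1, 2, 3)).mean()
```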

Results and Discussion

The results indicate that a simple CVAE model can extract the general characteristics of soundscapes across the diverse data sets. Frequency-axis information, such as droning background sounds (Figs. 1d, 1e) or long animal calls (Fig. 1f), was reproduced well. Temporally, the model blurred together the staccato patterns (Fig. 1b) and fine details (Figs. 1a, 1h) characteristic of animal vocalizations. As the model is trained on raw soundscape data rather than isolated animal call samples, it may be converging on a solution that reproduces droning background sounds, morphing any captured animal calls into the soundscape in the process. The resulting sounds have an uncanny quality.

Figure 1: Reconstructions of test set input spectrograms for each data set.

This effect can be heard in Soundscapes 1a and 1b, a sample from the China data set and its CVAE reconstruction, respectively. The essence of the original can be heard in the reconstruction, though the details of individual calls merge together. The model performs well in recreating an owl call from the French data set (Soundscapes 1c and 1d, Fig. 1f), possibly because the longer notes already closely resemble a droning sound.

The sound clips below were generated from spectrogram images using the Griffin-Lim algorithm, which can create noisy artifacts, some of which were removed using the Adobe Audition Hiss Reduction tool.

Soundscape 1a: China Spring Dawn test set sample (Fig. 1g)

Soundscape 1b: China Spring Dawn reconstructed sample (Fig. 1g)

Soundscape 1c: France March Evening test set sample (Fig. 1f)

Soundscape 1d: France March Evening sample reconstruction (Fig. 1f)

Novel soundscapes were then generated by performing linear interpolation between pairs of test set samples. Figures 2a and 2b illustrate how the characteristics of two distinct soundscapes blend into each other, merging, for example, a calm winter soundscape in Upstate New York with a spring dawn chorus from Shanghai, China. The transition between these two soundscapes can be heard in Soundscape 2b, which plays through each of the five spectrograms shown in Figure 2b. Equalization and light reverb were added in Ableton Live.
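Latent-space interpolation of this kind can be sketched as follows. The `encode`/`decode` method names are a hypothetical model API (encode returning the latent mean and log variance, decode mapping a latent vector back to a spectrogram), not the exact interface of the model in this study:

```python
import torch

def interpolate_latents(model, x_a, x_b, steps=5):
    """Linearly interpolate between the latent means of two samples and decode each point."""
    model.eval()
    with torch.no_grad():
        mu_a, _ = model.encode(x_a)   # hypothetical API: encode -> (mu, logvar)
        mu_b, _ = model.encode(x_b)
        frames = []
        for t in torch.linspace(0.0, 1.0, steps):
            z = (1 - t) * mu_a + t * mu_b   # straight line through the latent space
            frames.append(model.decode(z))  # hypothetical API: decode -> spectrogram
    return frames
```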

Figure 2: Latent space soundscape interpolations between two encoded test set samples for a) the France March Evening and Ithaca Summer Morning data sets, b) the Ithaca Winter Morning and China Spring Dawn data sets, and c) between two samples from the Borneo June Afternoon data set.

Soundscape 2a: France to Ithaca interpolations, audio for five spectrograms from Fig. 2a

Soundscape 2b: Ithaca to China interpolations, audio for five spectrograms from Fig. 2b

Soundscape 2c: Borneo to Borneo interpolations, audio for five spectrograms from Fig. 2c

Using the dimensionality reduction algorithm UMAP, the 1024-dimensional latent space embeddings for the test set were reduced to two dimensions for visualization. Figure 3 illustrates how the data sets cluster in this high-dimensional space.

Smooth soundscape interpolations were found between test set samples within the same data set (Fig. 2c) and between data sets that overlap in the UMAP plot (Fig. 2b, Fig. 3). However, noise is introduced when interpolating between the France and Ithaca Summer Morning data sets, which appear not to overlap in the latent space (Fig. 3). In the future, incorporating a Kullback-Leibler (KL) divergence term alongside the summed MSE loss could improve interpolations by forcing the model to adopt a more normally distributed latent space. However, in the current architecture, implementing KL loss caused further blurring of features in generated spectrograms.

Figure 3: UMAP 2-D clustering of 1024-D model latent space for all test set samples.


The pilot CVAE model presented here demonstrates the ability to reconstruct samples across a broad data set of soundscapes and shows that new soundscapes can be generated by interpolating between those derived from currently existing ecosystems. However, it struggles with capturing high-temporal-resolution information and blurs animal calls into their surroundings. Application of more complex deep learning models, higher audio sample rates, and tuning of the soundscape data set could lead to more convincing and high-fidelity pseudo-natural soundscapes for future generations to enjoy.


Thanks to the owners of all data sets used to train this model for providing them directly or allowing for open non-commercial usage.

Supporting Information

SI Figure 1: Convolutional variational autoencoder architecture
