
YouTube Mirror

Published on Aug 29, 2023

Figure 1. YouTube Mirror installation

Description

YouTube Mirror is an interactive, audiovisual AI installation that generates images and sounds in real time in response to images of the audience captured through a camera, acting as a kind of data mirror. YouTube Mirror uses a cross-modal generative machine learning model that was trained in an unsupervised way to learn the association between images and sounds obtained from the YouTube videos I watched. The machine sees the world only through this audio-visual relationship, and the audience can see themselves through the machine-generated images and sounds. YouTube Mirror is an artistic attempt to simulate my unconscious, implicit understanding of audio-visual relationships, one that can be found in, and is limited by, the videos I watched. YouTube Mirror also attempts to represent the possibility of machine bias caused by implicit bias inherent in video recommendation algorithms as well as by a small set of personal data.

We try to understand the relationships between images and sounds when we watch videos. With the popularity and impact of video-based social media platforms such as YouTube, we watch a plethora of videos, and our video consumption affects how we see other people and the world. Which videos we watch is determined not only by our own choices but also, to a large extent, by the platforms' recommendation algorithms, which are designed to keep users on the platform longer. The "watch next" videos suggested by the algorithms are, in general, based on the user's previous watch history and other metadata related to the videos. Since these data can be implicitly biased or misrepresent the user's behavior or preferences, recommendation models can create a feedback loop that narrows the range of videos the user encounters [1][2]. This feedback loop affects our understanding of the audio-visual relationships we unconsciously find in the videos we watch. This project tries to make a machine simulate these audio-visual associations and represent the world through the relationships the machine has learned.

The YouTube Mirror project lies at the intersection of personal data, machine vision, and data art, exploring how data-driven audiovisual art can represent a machine's understanding of audio-visual relationships that is limited by a small set of personal data and shaped by a social media platform.

Method

The model architecture of YouTube Mirror is a set of cross-modal Variational Autoencoders (VAEs) with associators, based on an approach by Jo et al. [3]. Training the YouTube Mirror model involves two steps: intra-modal association and cross-modal association. During the intra-modal training phase, an image VAE and a sound VAE are trained on image data and sound data, respectively, extracted from the same videos. In the cross-modal training phase, an associator is trained on pairs of images and corresponding sounds, utilizing the two VAEs trained in the previous phase. The associator is also a VAE, but its input and output are the latent representations of the original data: its goal is to map the latent space of the image VAE into the latent space of the sound VAE, or vice versa. For example, an associator trained with the encoder of the image VAE and the decoder of the sound VAE can generate a sound from a given input image. For this project, only an image-to-sound associator was trained and used.
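
To make the two-stage training concrete, here is a minimal PyTorch sketch. The layer sizes, latent dimensions, and loss weights are illustrative assumptions rather than the values used in YouTube Mirror; the actual model follows the approach of Jo et al. [3].

import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    """A simple MLP VAE over flattened inputs (images, mel-spectrograms, or latents)."""
    def __init__(self, in_dim, latent_dim, hidden=512):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent_dim)
        self.logvar = nn.Linear(hidden, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, in_dim))

    def encode(self, x):
        h = self.enc(x)
        return self.mu(h), self.logvar(h)

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return self.dec(z), mu, logvar

def vae_loss(recon, target, mu, logvar, beta=1.0):
    rec = F.mse_loss(recon, target)
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + beta * kld

# Step 1: intra-modal association -- each VAE learns its own modality.
image_vae = VAE(in_dim=64 * 64, latent_dim=64)    # flattened grayscale frame
sound_vae = VAE(in_dim=128 * 64, latent_dim=64)   # flattened mel-spectrogram

# Step 2: cross-modal association -- a third VAE maps the (frozen) image
# latent space into the (frozen) sound latent space, i.e. image-to-sound.
associator = VAE(in_dim=64, latent_dim=16)

def associator_loss(image_batch, sound_batch):
    with torch.no_grad():                          # pretrained VAEs stay fixed
        z_img, _ = image_vae.encode(image_batch)
        z_snd, _ = sound_vae.encode(sound_batch)
    z_pred, mu, logvar = associator(z_img)
    return vae_loss(z_pred, z_snd, mu, logvar)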

The reconstruction process involves several steps. First, the image encoder outputs a latent vector for the input image. Next, the associator maps this vector into the latent space of the sound VAE. The sound decoder then reconstructs a mel-spectrogram from the mapped latent vector. Finally, the mel-spectrogram is transformed into a time-domain waveform that can be heard. In parallel, the image VAE's decoder uses the latent vector of the input image to produce a reconstructed image.
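
Continuing the sketch above, the image-to-sound reconstruction path could look roughly like the following. The use of librosa's Griffin-Lim-based mel_to_audio for the final inversion step, and the frame and spectrogram sizes, are assumptions for illustration; the installation's actual inversion method may differ.

import numpy as np
import librosa
import torch

@torch.no_grad()
def image_to_sound(frame):                      # frame: (1, 64*64) tensor
    z_img, _ = image_vae.encode(frame)          # 1. image -> image latent (mean)
    z_snd, _, _ = associator(z_img)             # 2. image latent -> sound latent
    mel = sound_vae.dec(z_snd)                  # 3. sound latent -> mel-spectrogram
    mel = np.maximum(mel.reshape(128, 64).numpy(), 0.0)  # keep power non-negative
    # 4. mel-spectrogram -> audible waveform (Griffin-Lim phase reconstruction)
    audio = librosa.feature.inverse.mel_to_audio(mel, sr=22050,
                                                 n_fft=1024, hop_length=256)
    # The image VAE's decoder reconstructs the image shown to the audience.
    image = image_vae.dec(z_img).reshape(64, 64).numpy()
    return image, audio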

Further information about the implementation can be found in a separate paper [4].

Video Documentation

YouTube Mirror was exhibited as an installation at the Media Arts and Technology Program 2022 End of Year Show at the University of California, Santa Barbara. Below is the video documentation of the installation.

Video 1. YouTube Mirror installation at MAT EoYS 2022, UCSB

Space Requirements and Floor Plan

Figure 2. YouTube Mirror floor plan

Figure 2 illustrates the floor plan for the YouTube Mirror installation. The installation is designed to let the audience interact with the work in real time via a webcam. The machine-generated images will be displayed on a portrait-mode monitor, and the generated sounds will be played through 2-channel stereo speakers. The plan also includes a video grid, a visualization of the video data used to train the model, shown as a background projection; the grid will loop on the wall screen with no sound.

Equipment Requirements

The author can bring:

  • A PC

  • A webcam

  • A portrait-mode monitor with a monitor arm

  • An audio interface

The conference needs to supply:

  • 2-ch stereo audio speakers (with cables)

  • A projector

  • A wall screen

  • A pedestal

  • A box to hide the PC and the audio interface

Ethics Statement

The development of YouTube Mirror was funded by the Media Arts and Technology graduate program and the Interdisciplinary Humanities Center at the University of California, Santa Barbara. There are no observed conflicts of interest.

Artist Biography

Sihwa Park is a sound interaction designer, media artist, and Assistant Professor in the Department of Computational Arts at York University. His art practice focuses on representing the relationship between humans and machines entwined with data and algorithms, employing interdisciplinary approaches spanning data visualization/sonification, generative art, and machine learning. His work has been presented at international venues, including ICMC, NIME, ISEA, IEEE VISAP, SIGGRAPH Asia, CHI, Ars Electronica, and NeurIPS.
