AI in Audiovisual Archives: What Kinds of Analysis Are Possible?

AI tools are entering the world of audiovisual archives in great variety, serving different needs and reflecting different approaches. This blog post looks primarily at the AI tools being used for metadata creation. It includes a basic overview of algorithms, data sets, and the options for open-source vs. proprietary technologies. It also explores technologies used by archives to analyze audio, image, and language.

Blog

28 May 2021

Randi Cecchine

Student Preservation and Presentation of the Moving Image

Themes

Metadata

Image: “Visualisation of keypoint detection on a still from the Desmet collection” from Automatic Annotations and Enrichments for Audiovisual Archives.

A brief note about language

Given how new these technologies are, there has not been time to develop standards around them in archival settings, including standard language to describe them. In the Netherlands, the CLARIAH team refers to them as AVP (Audiovisual Processing) technologies. The European Broadcasting Union speaks about Automatic Metadata Extraction (AME). In the US, the team at AMPPD (Audiovisual Metadata Platform) refers to them as MGM (Metadata Generation Mechanisms), and the Library of Congress refers to them as Machine Learning Applications for Visual/Audio Data Annotation. For the sake of simplicity I will refer to them as “technologies” or “tools,” while welcoming further discussion from the archival and AI/ML community about best practices around such language.

Algorithms, Data Sets, and Open Source vs. Proprietary Technologies

An algorithm is designed to recognize specific features, such as the shape of a face or the sounds of a language. Then, through machine learning, it is trained on data sets containing enough examples for it to recognize patterns, make inferences, and label data. In Distant Viewing: Analyzing Large Visual Corpora, Taylor Arnold and Lauren Tilton describe computer vision:

Algorithms allow for the creation of an automatable and trainable code system to view visual materials…. Algorithms for image detection are typically built by showing a computer example images from a set of predefined classes (typically, containing several thousand object types). The computer detects patterns within the images and uses these to find similar objects in new images.

All training data consists of a limited number of examples, and it is always bound by the limitations and biases in that data. Many of the most robust tools have been created and trained by tech giants like Amazon, Google, and Facebook, which have access to massive amounts of globally-produced, well-labelled data. Archives may choose to use these pre-trained models, or use their own collections as training data and then train open-source tools. In deciding what technologies to use, archives may find it easiest to use a commercially available technology, or might be dissuaded because of cost or because the tools don’t do exactly what they need.

In an interview, Jim Duran, Director of the Vanderbilt Television News Archive and Curator of Born-Digital Collections, explained that tools built by companies like Amazon and Facebook to serve business needs such as automating customer feedback or commoditization of social networks may not be well suited for archives.

We have access to pre-built tools, intended for a different type of use than archives and libraries. We are not trying to sell objects, we are trying to describe them to make them accessible. A lot of times the tools don’t really match what we need; they give you results that don’t quite fit. To get the most out of pre-built tools, we have to transform our data to fit their algorithm.

Other technologies are created by computer scientists, often working through academic research groups, and are free for download and use via repositories such as GitHub. Using these open-source tools avoids the cost of commercial tools and allows archives to maintain control over their data, but also requires a high level of expertise and infrastructure. Each technology may also have unique limitations that need to be considered. For example, Karen Cariani, Executive Director of the WGBH Media Library and partner with Brandeis University’s Department of Computational Linguistics in the Computational Linguistics Applications for Multimedia Services (CLAMS) project, described her team’s experience with the Kaldi Automatic Speech Recognition toolkit. While the toolkit might be accurate at recognizing speech in recordings of one person sitting at a desk, on varied materials that included cross-talk, musical performances, and external noises, “Kaldi tried to treat everything as speech – a helicopter sound – Kaldi tries to turn it into poetry – taking audio waves and turning it into words.”

What kinds of analysis can these technologies perform?

There are countless tools available for analyzing audiovisual materials, from both open-source and commercial providers. Each tool specializes in a single task, and is brought together with other tools in pipelines. The output of the tools could be a simplified audio or video stream, or a textual description of what is in the content, along with a timecode indicator. This information can be added to a database, but it can also be used as the source for further AI analysis. In this way, technologies can be layered upon one another in order to more meaningfully annotate content. This section describes some of the tools that are currently being used by audiovisual archives to analyze audio, image, and language.

Audio analysis

The development of technology to analyze audio, especially speech, is significantly ahead of visual tools. Some audio analysis technologies that are of interest to audiovisual archives include:

Segmentation technologies, such as the INA Segmenter, a CNN (convolutional neural network)-based audio segmentation toolkit from the French Institut national de l’audiovisuel (INA). The tool is described on INA’s GitHub repository:

Allows to detect speech, music and speaker gender. Has been designed for large scale gender equality studies based on speech time per gender.

The INA Segmenter is a unique support for archives as it addresses the specific problem mentioned above when Automatic Speech Recognition tools transcribed non-speech sound into words. The AMP (Audiovisual Metadata Platform) team’s recommended workflow places the Segmenter before other tools, creating timestamps for each segment of audio, and recommending the use of the FFmpeg command line video and audio converter to create a new file that would be analyzed by other tools. The Segmenter tool has the potential to assist in research projects looking at gender in media, such as INA’s own study of Speaking time of men and women on television and radio.
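The segment-then-cut step described above can be sketched roughly as follows. This is a minimal illustration, not the AMP team's actual workflow: the segment labels ("speech", "music", "noise"), the file names, and the helper functions are hypothetical, and the sketch only builds FFmpeg command lines rather than running them.

```python
from typing import List, Tuple

# Hypothetical segmenter output: (label, start_seconds, end_seconds).
Segment = Tuple[str, float, float]


def speech_spans(segments: List[Segment], min_duration: float = 0.5) -> List[Tuple[float, float]]:
    """Keep only spans labelled as speech that are long enough to be worth transcribing."""
    return [
        (start, end)
        for label, start, end in segments
        if label == "speech" and end - start >= min_duration
    ]


def ffmpeg_cut_command(source: str, start: float, end: float, target: str) -> List[str]:
    """Build an FFmpeg command that extracts one audio-only span as 16-bit PCM."""
    return [
        "ffmpeg", "-i", source,
        "-ss", f"{start:.3f}",   # start of the span, in seconds
        "-to", f"{end:.3f}",     # end of the span, in seconds
        "-vn",                   # drop the video stream
        "-c:a", "pcm_s16le",     # uncompressed audio for downstream tools
        target,
    ]


def build_commands(source: str, segments: List[Segment]) -> List[List[str]]:
    """One command per speech span, with numbered (assumed) output file names."""
    return [
        ffmpeg_cut_command(source, start, end, f"segment_{i:03d}.wav")
        for i, (start, end) in enumerate(speech_spans(segments))
    ]


# Illustrative data only: a short programme with music, speech, and a noise burst.
example_segments = [
    ("music", 0.0, 12.5),
    ("speech", 12.5, 47.0),
    ("noise", 47.0, 47.2),
    ("speech", 47.2, 47.4),
]
commands = build_commands("programme.mp4", example_segments)
```

Each command could then be passed to `subprocess.run`, so that only the speech spans reach a speech recognition tool and a helicopter or a musical performance is never sent for transcription.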

Image: Segmentation workflow example from AMPPD

Speaker Labelling (also known as Speaker Voice Recognition, Speaker Diarization, or Speaker ID) matches voices (such as those of politicians or well-known celebrities) with “personsnames” and marks where their voices appear in a program. The Netherlands Institute for Sound & Vision has used this technology in collaboration with the Dutch company SpraakLab since 2015, and as of 2021 it labels more than 3000 known speakers that are present in the Common Thesaurus for Audiovisual Archives (GTAA). Tim Manders, Senior Media Manager Optimization at Sound & Vision, explains that with such techniques there is always a trade-off between accuracy (precision) and recall: the more labels you allow into your catalog by accepting lower confidence scores (the confidence score being the estimated probability that a label is accurate), the more of those labels will be incorrect. Since Sound & Vision doesn’t want to introduce too many errors into the catalog for its users, it evaluated the confidence score threshold that SpraakLab should implement in order to ensure that at least 90-95% of the ingested labels are correct. In this way, the archive accepts that it may have gaps in metadata where the confidence score of a labelled segment was below the threshold, and chooses these gaps over errors.
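The threshold decision described above can be illustrated with a small sketch. It assumes a hand-checked evaluation sample in which each automatic label has a confidence score and a human judgement of whether it was correct. The data and function names are invented for illustration and do not reflect how Sound & Vision or SpraakLab actually ran their evaluation.

```python
from typing import List, Optional, Tuple

# Hypothetical evaluation sample: (confidence score, was the label correct?).
Scored = Tuple[float, bool]


def precision_recall_at(threshold: float, scored: List[Scored]) -> Tuple[float, float]:
    """Precision and recall if only labels at or above the threshold are ingested."""
    accepted = [correct for confidence, correct in scored if confidence >= threshold]
    total_correct = sum(1 for _, correct in scored if correct)
    correct_accepted = sum(accepted)
    precision = correct_accepted / len(accepted) if accepted else 1.0
    recall = correct_accepted / total_correct if total_correct else 0.0
    return precision, recall


def lowest_threshold_meeting(target_precision: float, scored: List[Scored]) -> Optional[float]:
    """Lowest observed confidence value whose accepted labels reach the target precision."""
    for threshold in sorted({confidence for confidence, _ in scored}):
        precision, _ = precision_recall_at(threshold, scored)
        if precision >= target_precision:
            return threshold
    return None  # no threshold in the sample reaches the target


# Illustrative data only.
sample = [
    (0.95, True), (0.90, True), (0.85, True), (0.80, False),
    (0.75, True), (0.60, False), (0.55, True), (0.40, False),
]
threshold = lowest_threshold_meeting(0.90, sample)
precision, recall = precision_recall_at(threshold, sample)
```

In this toy sample the chosen threshold keeps every ingested label correct but drops two correct labels that scored lower, which is the "gaps over errors" choice in miniature.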

Automatic Speech Recognition (ASR), also known as Speech To Text

The accuracy of ASR has improved dramatically in recent years, and people who speak high-resourced languages such as English are accustomed to ASR technology supporting a variety of daily personal and professional activities, including speech dictation and YouTube subtitling. Archives with collection holdings in high-resourced languages can choose from a variety of ASR options for the creation of transcripts and subtitles of their moving image collections. Languages may be well-served by these technologies because of population numbers, or because their economies are of sufficient interest to developers. At the 2020 IASA - FIAT/IFTA Joint Conference, in the presentation NHK's diversification of search methods using AI, Jun Asana, engineer at NHK (Japan Broadcasting Corporation), explained that they chose to work with the Microsoft Azure product because after testing it against offerings from local Japanese companies, they found that it had the greatest accuracy.

But even with languages that have many tools available, with high accuracy scores, there are still problems with ASR accuracy when encountering regional and cultural linguistic differences. This point was made by many of the people I interviewed in archives in the US and the UK, who pointed out that technologies were primarily trained on white men with regional speaking styles dominant on TV, and didn’t work well on speech and accents of people of other cultures or regions (the Southeast of the US, for example) or children. In the Netherlands, there has been a great deal of industry and government investment in AI technologies, and ASR technology is quite mature. But when the staff at meemoo, the Flemish Institute for Archives, tested the solutions provided by two Dutch companies in a benchmarking project, they learned that the tools were trained to understand the standard form of Dutch found in the Netherlands (ABN, or AN) and were not accurate when it came to understanding the regional accents and dialects of Flemish Dutch, and even less accurate with the speech of immigrants, for example of Moroccan background. Machines are trained to understand standard patterns, and this does not match the diverse reality of today’s societies.
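Benchmarks like the ones described above need a way to score a machine transcript against a human one. A common measure is word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the ASR output into the reference transcript, divided by the length of the reference. The sketch below is a minimal version of that calculation. The interviews do not say which metric any of the archives used, so treat WER here as an assumption, and the sentences are invented.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between two transcripts, divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Dynamic programming over one row at a time: previous[j] holds the edit
    # distance between the reference words seen so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # match or substitution
            ))
        previous = current
    return previous[-1] / len(ref)


# Illustrative comparison of one reference sentence with two ASR outputs.
reference = "the archive holds many tapes"
wer_clean = word_error_rate(reference, "the archive holds many tapes")
wer_accented = word_error_rate(reference, "the archive hold many tape")
```

Computing the score separately for each group of speakers (for example by region, dialect, or age) is one way to surface the kind of uneven performance the interviewees describe, which a single overall figure would hide.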

Low-resourced languages may have no ASR technologies available, or the technology that does exist may not have had enough training data and time to be sufficiently accurate for archives’ needs. A report from the European Parliament’s Committee on Culture and Education, The use of Artificial Intelligence in the Audiovisual Sector, explains the importance of addressing this disparity:

Europe is a multilingual society with 24 official EU Member State languages and approx. 60 additional languages (unofficial languages, minority languages, languages of immigrants and important trade partners). According to the Treaty of the European Union, all languages are considered equal. Yet, support for the different European languages through technologies is highly imbalanced with English being the best supported language by far, as the large-scale study “Europe’s languages in the digital age”, published by the EU Network of Excellence META-NET in 2012, has found. Since then, the situation regarding technology support for Europe’s languages has improved but it is still considerably imbalanced.

This topic is of great importance to audiovisual archives, as it presents a large problem as well as an opportunity to become participants in the development of technologies. Virginia Bazan-Gil from Radiotelevisión Española (RTVE), who wrote about her archive’s research into AI in a FIAT IFTA Archival Reads document titled Artificial Intelligence: an Object of Desire, explained to me in an interview that while RTVE’s main language is Spanish, the broadcaster also has content in the co-official languages of Catalan, Basque, and Galician, which have had very little ASR technology development. She explained that “we are working on automatic subtitling of this content and translation. We are just beginning with this, and we are working with researchers at the university.” Similarly, Lauri Saarikoski, Development Manager at YLE, the Finnish Broadcasting Company, explained that one of the reasons that languages may not be well-resourced is “the fact that not that many samples of a language are lying around waiting for researchers on the internet.” He explained that with only five million Finnish language speakers, YLE’s archives can be really valuable to technology developers who may not have access to enough training data elsewhere. And in Luxembourg, with a language spoken by only six hundred thousand people worldwide, the archives are working with Parliament to provide recordings of Parliament proceedings as training data to develop ASR technology. Alessandra Luciano, Head of film and television collections at the Centre National de l’audiovisuel, Luxembourg (CNA), explained that for the archive and government, being active participants in developing technology is part of the important work of Futures Literacy, and that their role should not be “feeding the Google or Amazon machine.” Similarly, Miel Vander Sande, Data Architect at meemoo, the Flemish Institute for Archives, shared:

I think big tech will never be accurate enough

They have cannon and we are a mosquito

They are so generalized – we are niche and that’s why we need specific partners willing to invest in a niche market.

Image Analysis

Current image analysis (Computer Vision) technology works by analyzing selected still frames from a video file. While this can generate detailed annotations about what appears in individual frames, the technology does not yet analyze the time-based aspects of film or cinematic language. Some tools for image analysis that audiovisual archives are currently using include:

Facial Recognition and Object Recognition technologies are good examples to help explain how artificial intelligence/machine learning works, because many of us have participated in creating the ground-truth datasets for this technology as we upload and tag photos on social media, or complete a CAPTCHA test marking what is a bus or a taxi or a street sign. Some archives are using facial recognition while others have decided that they will not, due to privacy concerns. Some archives are also participating in their own R&D projects to train facial recognition algorithms on unique local or historical collections, such as the Visual information retrieval in video archives project (VIVA), which is working on technology to automatically analyze television footage from the former German Democratic Republic (GDR).

Object recognition certainly offers a lot of promise, especially for archives that license out footage for re-use. But Jean Carrive, Deputy Head of Research & Innovation Department at the French Institut national de l’audiovisuel (INA), explained that this technology can suffer from lack of context. He shared an example of some historical images that were automatically annotated as “people walking down a street,” but when the images were shown to historians they were identified not as people walking down a street, but as a particular political protest. Carrive also mentioned that in a collection of films from the 1950s there were multiple annotations for cell phones. In our interview he wondered about “how to give context to the machines? To give common sense? It’s a problem of artificial intelligence from the beginning. We are far from being able to do that.”

There are a host of other image technologies, including Video Optical Character Recognition (OCR), that can read text from things like lower-thirds or storefronts; Google’s Reverse Image Search, which can trace the history of an image as documented on the web and validate origin; and Shot Detection technologies such as Azure Video Indexer or PySceneDetect, which can detect scene changes. The Sensory Moving Image Archive (SEMIA) project, a collaboration between the Eye Filmmuseum, University of Amsterdam, The Amsterdam University of Applied Sciences, Studio Louter, and the Netherlands Institute for Sound & Vision, has also developed tools to analyze images in a quite different way – rather than identifying known objects or faces, the technology aims to describe sensory information such as color, shape, and texture. The project also aims to bring the results of the analysis to users/audiences through a “generous interface” that inspires creative re-use and exploratory browsing. The SEMIA project is an excellent example of archives partnering in multidisciplinary collaborations to pose unique questions that drive technology development in new directions.
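To give a feel for how shot detection can work in principle, here is a deliberately simple sketch that flags a cut wherever the brightness histogram of one frame differs sharply from the previous one. It is not how Azure Video Indexer or PySceneDetect are implemented. Frames are represented as flat lists of grayscale pixel values (0-255), and the bin count and threshold are arbitrary assumptions.

```python
from typing import List

Frame = List[int]  # grayscale pixel values, 0-255; assumed non-empty


def histogram(frame: Frame, bins: int = 8) -> List[float]:
    """Share of pixels falling into each brightness bin."""
    counts = [0] * bins
    for pixel in frame:
        counts[min(pixel * bins // 256, bins - 1)] += 1
    return [count / len(frame) for count in counts]


def histogram_distance(a: List[float], b: List[float]) -> float:
    """Half the summed absolute difference: 0.0 for identical, 1.0 for disjoint histograms."""
    return sum(abs(x - y) for x, y in zip(a, b)) / 2


def detect_cuts(frames: List[Frame], threshold: float = 0.5) -> List[int]:
    """Indices of frames whose histogram differs sharply from the preceding frame."""
    histograms = [histogram(frame) for frame in frames]
    return [
        index
        for index in range(1, len(histograms))
        if histogram_distance(histograms[index - 1], histograms[index]) > threshold
    ]


# Illustrative data only: two dark frames followed by two bright frames.
dark_a = [10, 20, 30, 15] * 4
dark_b = [12, 22, 28, 18] * 4
bright = [200, 220, 240, 210] * 4
cuts = detect_cuts([dark_a, dark_b, bright, bright])
```

A comparison this crude would be fooled by camera flashes, fades, and dissolves, which is part of why archives tend to rely on dedicated tools for this step.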

Natural Language Processing (NLP)

NLP is the area of AI and linguistics concerned with training machines to decipher meaning in human language. NLP tools such as Entity Extraction, also known as Named Entity Recognition or Term Extraction, are especially valuable in archival settings because they can help link descriptive data, whether created by humans or automated tools, to controlled vocabularies. Controlled vocabularies can help formalize the use of descriptive labels, for example by describing an image with only one word when there are multiple options such as car and auto, or Antwerp and Antwerpen. The article Evaluating unsupervised thesaurus-based labeling of audiovisual content in an archive production environment describes a term-extraction pipeline at the Netherlands Institute for Sound & Vision that analyzes manually created subtitle files to generate a list of terms that map to the Common Thesaurus for Audiovisual Archives (GTAA). The authors describe the priorities of the project:

For the implementation of unsupervised labeling in the archive’s metadata enrichment pipeline, the balance between Precision and Recall, and the matching of candidate terms with the thesaurus have the main focus of attention. (p. 191.)
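The idea of matching candidate terms against a controlled vocabulary can be shown with a toy example. The tiny thesaurus below is invented (it is not the GTAA), the function names are hypothetical, and the exact-match lookup is far simpler than the matching evaluated in the article. It only illustrates how variant labels can collapse into one preferred term.

```python
from typing import Dict, List, Tuple

# Toy controlled vocabulary: preferred term -> accepted variants (illustrative only).
TOY_THESAURUS: Dict[str, List[str]] = {
    "car": ["car", "auto", "automobile"],
    "Antwerp": ["Antwerp", "Antwerpen"],
}


def build_lookup(thesaurus: Dict[str, List[str]]) -> Dict[str, str]:
    """Map every lower-cased variant to its preferred term."""
    return {
        variant.lower(): preferred
        for preferred, variants in thesaurus.items()
        for variant in variants
    }


def map_terms(candidates: List[str], lookup: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Split candidate terms into matched preferred terms (deduplicated) and unmatched terms."""
    matched: List[str] = []
    unmatched: List[str] = []
    for candidate in candidates:
        preferred = lookup.get(candidate.strip().lower())
        if preferred is None:
            unmatched.append(candidate)
        elif preferred not in matched:
            matched.append(preferred)
    return matched, unmatched


# Candidate terms as they might come out of a subtitle file.
lookup = build_lookup(TOY_THESAURUS)
matched, unmatched = map_terms(["Auto", "Antwerpen", "car", "bicycle"], lookup)
```

Here "Auto" and "car" both resolve to the single label "car", while "bicycle" is left unmatched. Deciding what to do with unmatched or loosely matched candidates is where the balance between precision and recall mentioned in the quote comes in.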

Partnerships between archives and NLP researchers such as the Computational Linguistics Applications for Multimedia Services (CLAMS) project have the potential to build solutions that address the very specific needs of archives. The multidisciplinary CLAMS team is working on using OCR to read slates, and then using NLP tools to map the extracted information (such as director, broadcast date, title) to PBCore cataloging standards. NLP technologies will be increasingly important to help sort out the large amount of new descriptive data being created by AI technologies.

Conclusion

This blog post has given an overview of some of the AI technologies used in audiovisual archives. It described how algorithms are trained on data sets, and how archives have decisions to make in the use of proprietary or open-source algorithms. It looked at various technologies used to analyze audio, image, and language, and the potential for blending these technologies. The next blog post will explore how audiovisual archives are implementing AI technologies.

Context of the Research

The research for this blog series was conducted by Randi Cecchine in the context of a Master’s study in the Preservation and Presentation of the Moving Image program at the University of Amsterdam, and an internship placement at the Netherlands Institute for Sound & Vision. Randi has a background in documentary filmmaking, education, and media literacy. Sound & Vision is a leading archive in implementing AI in its workflows, and is also involved with numerous collaborative projects that touch on these topics, including the CLARIAH Media Suite; NL AI Coalition and its Culture and Media Workgroup; Cultural AI Lab; AI4media; and ReTV/Data TV, amongst others. In many of these partnerships, Sound & Vision plays a strong role in communications and information dissemination. Sound & Vision’s commitment to open culture and the sharing of knowledge makes it a highly fertile environment from which to conduct research.

Research

Research was conducted through online video interviews, attendance at professional conferences and webinars, and literature review. Fifteen online video interviews were conducted with professionals working in archives or related projects. Interviewees represented institutions in the United States, the United Kingdom, and Europe. The interviews were conducted between September and December 2020, during a period when people became quite accustomed to interacting through online video conferencing platforms due to the COVID-19 pandemic. The interviewees include:

Alessandra Luciano, Centre national de l’audiovisuel (Luxembourg)
Athina Livanos-Propst, PBS Educational
Matt Eaton, GrayMeta
Raphael Leung, Nesta
Virginia Bazan, Radiotelevisión Española
James David Duran, Vanderbilt University
Mari Wigham, Netherlands Institute for Sound & Vision
Shawn Averkamp, AVP/AMP
Stephen McConnachie, BFI
Lauri Saarikoski, YLE
Casey David Kaufman & Karen Cariani, GBH
Marco Rendina, Istituto Luce Cinecittà
Jake Berger, BBC Archive
Kalev Leetaru, GDELT project
Jean Carrive, Institut national de l'audiovisuel (Ina)
Debbie Esmans, Matthias Priem, Miel Vander Sande, Rony Vissers, meemoo - Vlaams instituut voor het archief

Additional data was gathered through attendance at 50 online professional conference sessions and webinars including:

ACM Multimedia 2019 Nice, France
ACM Multimedia 2020
Association of Moving Image Archivists (AMIA) 2020
Creative Commons Global Summit 2020
DataTV 2020 webinar
Digital Asset Symposium (AMIA) 2020
Digital Storage Futures 2020
International Association of Sound and Audiovisual Archives (IASA) 2019, Hilversum, Netherlands
2020 IASA - FIAT/IFTA Joint Conference
