Our neuroethological research aims to reveal clues about the meaning of vocalizations in their broader behavioral context and to infer the strategies animals use to modify their immature vocalizations toward a sensory target provided by their parents, akin to infants learning to speak. In the past, we have learned much from analyzing animal vocalizations in controlled environments, such as single birds housed in isolation (1-3). However, the isolated-animal setting is overly impoverished compared to the natural setting, in which animals grow up amid social partners. The scientific impact of such research is therefore intrinsically limited, because data from this setting cannot account for the social influences on vocal learning that arise in the natural setting of a colony.
The information contained in animal vocalizations can only be understood when relevant multimodal information is considered. In particular, the environment and general behavior of the vocalizing animal and its social partners are of central importance, both while vocalizations are produced and immediately before and after. For example, the directed songs a male produces toward a female differ from, and are subserved by different brain mechanisms than, the undirected songs he produces alone (4). Moreover, song learning success in juveniles seems to depend on ongoing social interactions with non-singing adult females (5). For these reasons, we acquire extensive high-quality audio and video data of vocalizing, freely behaving animals in complex social settings (cf. Computational Ethology), and we want to segment and categorize the behavioral actions of individuals and their social interactions in these recordings.
Annotating such massive datasets is, however, particularly challenging: manual annotation is often too labor-intensive to be feasible. We therefore plan to adopt and develop machine learning methods for automated action recognition from video and audio recordings, as well as from wireless sensor nodes that we mount on birds, mainly to selectively record the vocalizations of the sensor-wearing bird.
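As a first automated annotation step, candidate vocalization intervals can be located in a sensor-node audio track before any classification. The sketch below is a minimal illustration of this idea, assuming a simple windowed-RMS energy threshold; the function name, window size, and threshold are illustrative assumptions, not the lab's actual pipeline.

```python
# Hypothetical sketch: energy-based segmentation of a mono audio signal
# into candidate vocalization intervals. Window size and threshold are
# illustrative assumptions and would need tuning on real recordings.

def segment_vocalizations(signal, sr, win_s=0.01, thresh=0.1):
    """Return (start_s, end_s) intervals where windowed RMS energy >= thresh."""
    win = max(1, int(win_s * sr))
    intervals, start = [], None
    for i in range(0, len(signal) - win + 1, win):
        frame = signal[i:i + win]
        rms = (sum(x * x for x in frame) / win) ** 0.5
        if rms >= thresh and start is None:
            start = i / sr          # onset of a loud region
        elif rms < thresh and start is not None:
            intervals.append((start, i / sr))  # offset reached
            start = None
    if start is not None:           # signal ends while still loud
        intervals.append((start, (len(signal) // win) * win / sr))
    return intervals
```

In practice one would compute RMS on band-pass-filtered audio and merge intervals separated by short gaps, but the thresholding logic stays the same.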
There is a rapidly growing number of methods for action recognition from video and audio recordings, and from recordings made with animal-borne sensors (6-11). Many methods rely on posture tracking (6,7,11) to avoid training classifiers directly on high-dimensional video data. Action recognition has also been facilitated by the recent introduction of deep-learning-based animal posture tracking tools (12-16).
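The appeal of posture-based action recognition is dimensionality reduction: instead of raw frames, each time step is a handful of tracked keypoints, and actions become patterns in short keypoint trajectories. The following toy sketch shows that reduction, assuming keypoints are already tracked; the feature (mean keypoint speed), the threshold, and the two labels are illustrative assumptions rather than any published method.

```python
# Hypothetical sketch: classifying windows of tracked posture keypoints.
# Each frame is a list of (x, y) keypoints (e.g. beak, body center);
# a fixed-length window of frames is reduced to one feature and labeled.

def mean_speed(window):
    """Mean per-frame keypoint displacement, averaged over all keypoints."""
    total, steps = 0.0, 0
    for prev, curr in zip(window, window[1:]):
        for (x0, y0), (x1, y1) in zip(prev, curr):
            total += ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5
            steps += 1
    return total / steps if steps else 0.0

def classify_window(window, speed_thresh=1.0):
    """Toy two-class decision; real pipelines would use a trained classifier
    on many such trajectory features (angles, distances, spectra)."""
    return "moving" if mean_speed(window) > speed_thresh else "resting"
```

A real system would feed such trajectory features into a trained classifier with many more action classes, but the structure (tracking, then windowed features, then classification) is the same.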
Action recognition from video recordings requires good visibility of the animal in the camera image. When an individual is poorly visible in the camera image, a successful approach may benefit from multimodal action recognition, in which the video data are combined with microphone or accelerometer recordings from the wireless sensor nodes.
We are especially interested in courtship behaviors, both because of their importance for sexual selection and because of the extensive attention that one particular vocal courtship signal, male birdsong, has received from researchers. In songbirds, selecting a partner for copulation and subsequent offspring rearing involves complex courtship displays that include varied vocalizations and coordinated body movements.
If you are interested in this project for an MSc Thesis or Semester Project, please get in touch with Linus Rüttimann.