Mental states, such as cognition, emotion and action, can be analysed and predicted from eye images acquired by a close-up infrared-sensitive camera.


Tutorial: Introduction to eye and audio behaviour computing for affect analysis in wearable contexts

Duration: Half-day (3h)
Venue: Room E14-240, MIT Media Lab, Cambridge, MA, USA
Date: 13:30-15:30, 10 Sep, 2023

Multimodal processing for affect analysis is instrumental in enabling natural human-computer interaction, facilitating health and wellbeing, and enhancing overall quality of life. While many modalities, including facial expression, brain waves, speech, skin conductance and blood volume, offer valuable insights, eye behaviour and audio provide exceptionally rich information and can be collected easily and non-invasively in mobile contexts without restricting physical movement. Such information is highly correlated with cognitive and affective states, and is reflected not only in conventional eye and speech behaviour such as gaze, pupil size, blink, linguistics and paralinguistics, but also in newly developed behaviour descriptors such as eyelid movement, the interaction between the eyelid, iris and pupil, eye action units, heart and breathing sensing through in-ear microphones, abdominal sound sensing via custom belt-shaped wearables, and the sequence and coordination of multimodal behaviour events. The high-dimensional nature of this information makes eye and audio sensing ideal for multimodal affect analysis.

However, fundamental and state-of-the-art eye and audio behaviour computing has not been widely introduced to the community in tutorial form. Meanwhile, advances in wearables and head-mounted devices such as the Apple Vision Pro, smart glasses and VR headsets make them the likely next generation of computing devices, providing novel opportunities to explore new types of eye behaviour and new methods of body-sound sensing for affect analysis and modelling. This tutorial will therefore focus on eye and audio modality computing, using an eye camera and a microphone as examples, and on multimodal wearable computing approaches, using the eye, speech and head-movement modalities as examples, aiming to propel the development of future multimodal affective computing systems in diverse domains.

This tutorial contains four parts, with the full program shown in the table below.
Overview: Background on current sensing modalities and technologies, and the motivation for eye and audio processing in affect analysis
Part 1: Introduction to camera-based eye behaviour computing for affect
• Wearable devices to sense eye information.
• Eye behaviour types (gaze, pupil size, blink, saccade, eyelid shape etc.) and their relationships with affect.
• Computational methods for eye behaviour analysis.
• Issues in experiment design, including data collection, feature extraction and selection, machine learning pipeline, in-the-wild data, bias.
• Available datasets, off-the-shelf tools and how to get started.
• Future directions and challenges.
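To give a flavour of the computational methods covered in Part 1, the sketch below detects blink events from an eyelid-openness time series with simple thresholding and a minimum-duration filter. It is a minimal illustration, not a tutorial artefact: the function name, the openness signal (0 = fully closed, 1 = fully open), the sampling rate and all thresholds are hypothetical, and NumPy is assumed.

```python
import numpy as np

def detect_blinks(openness, fs=200, threshold=0.2, min_dur_s=0.05):
    """Detect blinks in an eyelid-openness signal (0 = closed, 1 = open).

    Returns (start, end) sample spans where openness stays below
    `threshold` for at least `min_dur_s` seconds. All parameters are
    illustrative defaults, not values from the tutorial.
    """
    closed = openness < threshold
    # Rising/falling edges of the "eye closed" mask.
    edges = np.diff(closed.astype(int))
    starts = np.flatnonzero(edges == 1) + 1
    ends = np.flatnonzero(edges == -1) + 1
    if closed[0]:
        starts = np.r_[0, starts]
    if closed[-1]:
        ends = np.r_[ends, len(closed)]
    # Discard closures too brief to be a blink (tracking dropouts, noise).
    min_len = int(min_dur_s * fs)
    return [(int(s), int(e)) for s, e in zip(starts, ends) if e - s >= min_len]
```

In practice the openness signal would come from an eyelid-shape or pupil-visibility estimator applied to the close-up eye video; blink rate and duration statistics derived from these spans are then usable as affect features.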
Part 2: Audio analysis for affect computing using a single microphone
• Introduction to wearable audio for affect (sensors and wearable audio devices; audio types; relevance to affective computing, applications in healthcare, etc.).
• Exploration of innovative body sound audio sensing for affect analysis.
• Speech and audio processing and analysis, and machine learning pipelines for affect computing.
• Future directions and challenges (emerging trends and technologies, e.g., augmented reality and personalized audio; ethical considerations, e.g. privacy and security, etc.).
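As a minimal sketch of the classic audio pipeline touched on in Part 2, the code below frames a waveform, computes two frame-level descriptors (log energy and zero-crossing rate), and summarises them with statistical functionals (mean, standard deviation) into a small utterance-level feature vector. The frame sizes, feature choices and function names are illustrative assumptions; real affect pipelines use much larger descriptor sets, but the frame-then-functional structure is the same. NumPy is assumed.

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Split a 1-D signal into overlapping frames
    (e.g. 25 ms frames with a 10 ms hop at 16 kHz)."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n)[:, None]
    return x[idx]

def paralinguistic_features(x):
    """Frame-level log energy and zero-crossing rate, summarised by
    mean/std functionals into one utterance-level feature vector."""
    frames = frame_signal(x)
    log_energy = np.log(np.sum(frames ** 2, axis=1) + 1e-10)
    # Fraction of sample-to-sample sign changes per frame.
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    return np.array([log_energy.mean(), log_energy.std(),
                     zcr.mean(), zcr.std()])
```

The resulting fixed-length vector can be fed to any standard classifier or regressor for affect prediction, independent of utterance duration.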
Coffee Break
Part 3: Multimodality (focus on eye camera, microphone and IMU sensors)
• Motivation for multimodal approaches (performance increase, redundancy, different types of information, context).
• What multimodal approaches can contribute to assessing affect and cognition (benefits of multimodal specifically in the context of affect/cognition).
• Approaches for multimodal analysis, modelling and system design (fusion, statistical features vs. event feature based, analysis methods). Examples of multimodal system designs and their benefits.
• Applications of multimodal systems and use case considerations.
• Future directions and challenges.
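The fusion approaches discussed in Part 3 can be illustrated with the simplest case, feature-level (early) fusion: per-modality feature blocks are z-normalised so no modality dominates by scale, then concatenated into one vector per sample. The modality names and function signature below are hypothetical, and NumPy is assumed.

```python
import numpy as np

def fuse_features(modality_feats):
    """Feature-level (early) fusion of per-modality feature matrices.

    `modality_feats` maps modality names (e.g. 'eye', 'audio', 'imu' --
    illustrative keys) to (n_samples, n_feats) arrays. Each block is
    z-normalised, then the blocks are concatenated column-wise.
    """
    blocks = []
    for name in sorted(modality_feats):  # fixed order for reproducibility
        X = np.asarray(modality_feats[name], dtype=float)
        mu, sd = X.mean(axis=0), X.std(axis=0)
        # Guard against constant features (zero standard deviation).
        blocks.append((X - mu) / np.where(sd > 0, sd, 1.0))
    return np.concatenate(blocks, axis=1)
```

Late (decision-level) fusion would instead train one model per modality and combine their predictions; the trade-offs between the two are part of the system-design discussion in this part.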
Part 4: Interactive research design activity
Discussion about processing eye and speech/audio behaviour, applications and challenges in practice. Students/researchers will:
• present their own related projects,
• share experience on using different modalities/approaches in their applications,
• discuss future research plans or directions on the modalities, approaches and applications they would like to adopt.

Key references:
D. W. Hansen and Q. Ji. “In the Eye of the Beholder: A Survey of Models for Eyes and Gaze”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, pp. 478-500, 2010.
K. Holmqvist, et al., “Eye tracking: empirical foundations for a minimal reporting guideline”, Behavior Research Methods, vol. 55, no. 1, pp. 364-416, 2023.
R. A. Khalil, et al., “Speech emotion recognition using deep learning techniques: A review”, IEEE Access, vol. 7, pp. 117327-117345, 2019.
Y. Wang, et al., “A systematic review on affective computing: Emotion models, databases, and recent advances”, Information Fusion, vol. 83, pp.19-52, 2022.

Other references:
M. Kassner et al., "Pupil: an open source platform for pervasive eye tracking and mobile gaze-based interaction" In Proceedings of the 2014 ACM international joint conference on pervasive and ubiquitous computing: Adjunct publication, pp. 1151-1160. 2014.
V. Skaramagkas et al., “Review of eye tracking metrics involved in emotional and cognitive processes”, IEEE Reviews in Biomedical Engineering, 2021.
L. Itti, “New Eye Tracking Techniques May Revolutionize Mental Health Screening”, Neuron, 88(3), pp. 442-43, 2015.
B. Laeng et al., “Pupillometry: A window to the preconscious?”. Perspectives on psychological science, 7(1), pp.18-27, 2012.

Tutorial presenters:
Dr. Siyuan Chen, University of New South Wales (Siyuan.chen(at)
Siyuan Chen is a lecturer at the University of New South Wales (UNSW). Her work focuses on using “big data” from close-up eye videos, speech and head movement to understand human internal state, such as emotion, cognition and action. She received her PhD in Electrical Engineering from UNSW. Before joining UNSW, she worked as a Research Intern at NII, Tokyo, Japan, a Research Fellow in the Department of Computer Science and Information Systems at the University of Melbourne, and a visiting researcher with the STARS team, INRIA, Sophia Antipolis, France. Dr. Siyuan Chen is a recipient of the NICTA Postgraduate Scholarship and top-up Project Scholarship, the Commercialization Training Scheme Scholarship, and an Australian Endeavour Fellowship in 2015. She has published over 30 papers in high-quality peer-reviewed venues and filed two patents. She led a special session at SMC 2021 and a special issue in Frontiers in Computer Science in 2021. She also served as a session chair at WCCI 2020 and SMC 2021, and was a Programme Committee member of several conferences, such as ACII, IEEE CBMS, and the Social AI for Healthcare 2021 workshop. She is a member of the Women in Signal Processing Committee. Her work has been supported by US-based funding sources multiple times. She was also a recipient of UNSW Faculty of Engineering Early Career Academics funding in 2021.

Dr. Ting Dang, Nokia Bell Labs/ University of Cambridge (ting.dang(at)
Ting Dang is currently a Senior Research Scientist at Nokia Bell Labs and a visiting researcher in the Department of Computer Science and Technology, University of Cambridge. Prior to this, she worked as a Senior Research Associate at the University of Cambridge. She received her Ph.D. from the University of New South Wales, Australia. Her primary research interests are in human-centric sensing and machine learning for mobile health monitoring and delivery, specifically exploring the potential of audio signals (e.g., speech, cough) via mobile and wearable sensing for automatic mental state (e.g., emotion, depression) prediction and disease (e.g., COVID-19) detection and monitoring. Further, her work aims to develop generalized, interpretable, and robust machine learning models to improve healthcare delivery. She has served as a (senior) program committee member and reviewer for more than 30 conferences and top-tier journals, such as NeurIPS, AAAI, IJCAI, IEEE TAC, IEEE TASLP, JMIR, ICASSP, and INTERSPEECH. She was shortlisted and invited to attend the Asian Dean’s Forum Rising Star 2022 and won the IEEE Early Career Writing Retreat Grant 2019 and ISCA Travel Grant 2017. She has experience with the successful bid for INTERSPEECH 2026 (social media co-chair) and is organizing scientific meetings such as UbiComp WellComp 2023 (co-organizer).

Prof. Julien Epps, University of New South Wales (j.epps(at)
Julien Epps received the BE and PhD degrees from the University of New South Wales, Sydney, Australia, in 1997 and 2001, respectively. From 2002 to 2004, he was a Senior Research Engineer with Motorola Labs, where he was engaged in speech recognition. From 2004 to 2006, he was a Senior Researcher and Project Leader with National ICT Australia, Sydney, where he worked on multimodal interface design. He then joined the UNSW School of Electrical Engineering and Telecommunications, Australia, in 2007 as a Senior Lecturer, and is currently a Professor and Head of School. He is also a Co-Director of the NSW Smart Sensing Network, a Contributed Researcher with Data61, CSIRO, and a Scientific Advisor for Sonde Health (Boston, MA). He has authored or co-authored more than 270 publications and serves as an Associate Editor for the IEEE Transactions on Affective Computing. His current research interests include characterisation, modelling, and classification of mental state from behavioral signals, such as speech, eye activity, and head movement.