Using AI to Remotely Assess Mood, Emotions & Mental Health


How we communicate and conduct business will likely forever be changed as a result of the COVID-19 pandemic. For example, video conferencing, once seen as a convenience, has become a vital part of our existence and economy. Applications such as Zoom, WebEx, Hangouts and Skype have facilitated real-time communication between friends, loved ones, classmates, teachers, colleagues, health care providers and employers.

While telecommunication technologies have helped ease our burden during the COVID-19 pandemic, we simultaneously face a looming mental health crisis.

Recently, a United Nations report stated that the pandemic has caused psychological distress worldwide; it called on all countries to make mental health support a key part of their virus response and outlined action points for policy-makers “to reduce immense suffering among hundreds of millions of people and mitigate long-term social and economic costs to society” [1]. Dévora Kestel, Director of the Department of Mental Health and Substance Use at the World Health Organization (WHO), recently stated that there is an immediate need to put measures in place to protect and promote mental health care, and noted that medical professionals are also experiencing significant mental health problems linked to the pandemic [2]. Kestel stated:

This means developing and funding national plans that shift care away from institutions to community services, ensuring coverage for mental health conditions in health insurance packages and building the human resource capacity to deliver quality mental health and social care in the community.

With a burdened healthcare system and a large population at risk in isolation, innovative solutions are required to help those in need and to facilitate more informed communication. One solution may be to

leverage artificial intelligence to infer state of mental health, mood and perceived level of interest through multimodal continuous passive monitoring of video conferencing sessions. 

Such a system would sit in between people communicating via video conferencing, passively monitoring one or more subjects during the session and generating an emotional inference report for one or more of the parties involved. Figure 1 shows a high-level illustration of this configuration in a doctor-patient scenario. Text, speech and video can be displayed on both ends of the conversation. Such a platform would serve as an intermediate monitoring tool, gathering relevant information and applying contextual models to infer the emotional state of the patient, which would help guide the doctor in adapting and personalizing messaging and interactions.

Figure 1: Data flow in a sample session between a doctor and a patient via a virtual meeting.

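The intermediary described above can be sketched as a minimal pipeline. Everything here is hypothetical: the per-modality models, the `EmotionReport` fields and the aggregation rule (majority vote on mood, mean on interest) are illustrative placeholders for whatever learned fusion a real system would use.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class EmotionReport:
    """One entry in the inference report shown only to the consented party."""
    mood: str        # majority-vote mood label across modalities
    interest: float  # mean interest score in [0, 1]

class SessionMonitor:
    """Hypothetical intermediary: each interval it receives decoded frames,
    an audio window and chat text, queries one model per modality, and
    aggregates the outputs into an EmotionReport."""

    def __init__(self, video_model, audio_model, text_model):
        # Each model maps its raw modality to (mood_label, interest_score).
        self.video_model = video_model
        self.audio_model = audio_model
        self.text_model = text_model
        self.reports = []  # running report for the whole session

    def process_interval(self, frames, audio, text):
        outputs = [self.video_model(frames),
                   self.audio_model(audio),
                   self.text_model(text)]
        moods = [mood for mood, _ in outputs]
        # Majority vote on mood, average on interest; a real system
        # would learn this fusion rather than hard-code it.
        report = EmotionReport(
            mood=max(set(moods), key=moods.count),
            interest=mean(score for _, score in outputs))
        self.reports.append(report)
        return report
```

Stub models (lambdas returning fixed labels) are enough to exercise the flow; the doctor's dashboard would simply render `monitor.reports` as it grows.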

A sample inference report, represented as a real-time dashboard, is shown in Fig. 2 for the doctor-patient scenario. Highlighted are the patient’s perceived level of interest and mood estimates as the meeting progresses.

Figure 2: Simplified sample dashboard viewable only by the doctor.

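One way such a dashboard might keep its interest and mood traces readable as the meeting progresses is to smooth the raw per-interval model outputs before plotting. A minimal sketch, assuming scores in [0, 1] and using a simple exponential moving average (the smoothing choice is an assumption here, not something the system above prescribes):

```python
def smooth(scores, alpha=0.3):
    """Exponential moving average over raw per-interval interest scores,
    so the dashboard trace doesn't jitter with every frame-level prediction.
    alpha in (0, 1]: higher values track the raw signal more closely."""
    trace, ema = [], None
    for s in scores:
        ema = s if ema is None else alpha * s + (1 - alpha) * ema
        trace.append(round(ema, 3))
    return trace
```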

Beyond the one-to-one telemedicine scenario, such technology could be extended to one-to-many scenarios such as virtual classrooms or many-to-one scenarios such as loved ones checking in on an isolated parent. In these additional scenarios, the emotional inference models may predict everything from the benign (boredom, not paying attention, confused) to more worrisome emotional response signals such as depression, lethargy, anxiety and stress. Care should be taken in each of these scenarios to address privacy regulations or concerns, e.g., checks put in place to prevent passive monitoring unless all parties agree a priori to usage terms [3]. 
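The a priori consent requirement above might be enforced as a gate the monitor consults before any processing begins. The session structure used here is a hypothetical illustration, not a real API:

```python
def monitoring_allowed(session):
    """Consent gate: passive monitoring may run only if every participant
    accepted the usage terms before the session started. `session` is a
    hypothetical dict with a 'participants' list of
    {'name': ..., 'consented': bool} entries."""
    participants = session.get("participants", [])
    # An empty participant list means nobody has consented yet.
    return bool(participants) and all(
        p.get("consented", False) for p in participants)
```

Defaulting missing `consented` flags to `False` keeps the gate fail-closed, which matches the spirit of the regulations cited above.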

A potential criticism of this system is that it is superfluous, given that the participants can judge for themselves the emotional state of the other party. There are situations, however, where a non-expert may value a second opinion, particularly if the AI has been shown to be accurate in predicting states of mental stress. Additionally, such a system may give a presenter, or the person doing the monitoring, insight into how they themselves are being perceived, albeit through the eyes of an AI system. In short, an AI emotional inferencer would serve a role complementary to human judgement rather than replacing or overriding it. The usefulness of such a system may be better understood in the context of the usage scenarios described below:

  • virtual classrooms: the teacher is focused on delivering content to the students, and the interaction time that can be dedicated solely to interpreting mood or interest in the subject is limited. In addition, the session may be a one-to-many scenario, where a single teacher addresses multiple students simultaneously. Gauging the mood of the classroom might prove helpful in pacing or steering the material being presented, keeping the students more engaged, or signaling when they need a break.

  • checking in on loved ones: multiple grandchildren are excited to say hi to grandma and talk about their day. The AI interpreter could help summarize grandma’s emotional state.

  • telemedicine: doctors, nurses and medical practitioners are typically very busy, with little time to waste in treating their patients, and are sometimes under immense pressure. Their primary concern is that the message they are looking to convey is understood and that the patient’s needs and concerns, say surrounding an upcoming surgery or questions about medication, have been fully addressed. In dealing with a wide variety of people, and thus a wide variety of personalities, a doctor may need to adapt their messaging, e.g., tone of voice or interaction style, in a way that works best for the patient. That can be tricky when the doctor is simultaneously looking at lab results or an X-ray. A quick glance at a dashboard like the one shown in Fig. 2, however, would let the doctor know how the patient is receiving the information provided. Are they worried? Confused? Such cues are perhaps readily assessed in face-to-face meetings, but in the virtual world of video conferencing these subtleties may be more challenging to pick up.

TECHNICAL CHALLENGES 

While human beings are naturally adept at interpreting body language and nuanced facial expressions, is it possible for AI to accurately do so via passive monitoring?

Inspired by lofty goals, such as detecting and thwarting would-be terrorists boarding a flight, the perception in the business world seems to be a resounding yes: emotion recognition via AI has become a $20 billion industry. Examples in this space include Amazon, which is working to enable Alexa to infer emotion solely through audio conversations [4]. Microsoft and IBM now advertise “emotion analysis” as one of their facial recognition products, and smaller firms, such as Affectiva, Kairos and Eyeris, offer competing emotional inference services [5]. However, the promise of using AI to accurately infer emotional state from a single modality has recently come under fire. In 2019, a distinguished panel of experts reviewed over 1,000 studies that used facial expressions in labeled images to infer anger, disgust, fear, happiness, sadness and surprise, and concluded that the results were biased and not generalizable [6, 7]. The lead author subsequently stated in a New York Times article that,

… faces do not “speak for themselves,” how do we manage to “read” other people? The answer is that we don’t passively recognize emotions but actively perceive them, drawing heavily (if unwittingly) on a wide variety of contextual clues — a body position, a hand gesture, a vocalization, the social setting and so on.

In other words, an accurate reading of a human being’s emotional state is deduced from multiple modalities, e.g., voice, body motion, gesticulation, eye movement, gait, environment and conversation context. Exogenous variables such as a person’s ethnicity, nationality, age and gender may also help increase the accuracy of the emotional inference. For example, people from India typically nod their head in ways that Americans might interpret as negative or as saying “no”, when in fact it can mean the opposite [8]. Such differences in cultural communication norms pose serious technical challenges in inferring perceived emotion.
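The multimodal point above is often implemented as late fusion: each modality’s model emits a probability distribution over emotion labels, and the distributions are combined with context-dependent weights, so that, for example, a noisy audio channel or a culturally ambiguous gesture cue can be down-weighted. A minimal sketch with a hypothetical label set and weights:

```python
# Hypothetical shared label set for all per-modality models.
LABELS = ("anger", "fear", "happiness", "sadness", "surprise")

def late_fusion(modality_probs, weights):
    """Weighted late fusion: combine per-modality probability distributions
    over LABELS into one distribution. `modality_probs` maps modality name
    to a probability tuple; `weights` maps modality name to a non-negative
    weight (they need not sum to 1; the result is renormalized)."""
    fused = [0.0] * len(LABELS)
    for name, probs in modality_probs.items():
        w = weights.get(name, 0.0)
        for i, p in enumerate(probs):
            fused[i] += w * p
    total = sum(fused)
    return [f / total for f in fused]
```

In a real system the weights themselves would be learned from context (lighting, audio quality, cultural priors) rather than fixed by hand.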

SUMMARY

Recent advances in telecommunications have allowed us to stay in contact with one another in the midst of the COVID-19 pandemic. Yet the mental strains of isolation, loss of a job or income, fear of infection, loss of a loved one and general anxiety about the future are taking a serious toll on people’s mental health worldwide. Novel solutions are required to reach those in need of support. AI-based emotional inference networks may prove helpful in remotely assessing a person’s state of mental health by way of passive monitoring in video conference settings. The communication modalities embedded in video conferencing applications (audio, video and text) provide rich data that can collectively inform deep learning networks. Such technology could serve as a complement to health experts’ own assessments, which in turn could be used to refine faulty predictions, yielding a system ripe for iterative improvement over time. As AI emotional inference models become more accurate, their usefulness to non-experts may increase, perhaps serving as the tipping point that encourages those in need to reach out to mental health professionals.