
Using GenAI avatars to assess empathy: how it works in practice
For all the talk about artificial intelligence transforming assessment, most applications still gravitate toward what is easiest to automate: marking written work, proctoring exams or generating feedback at scale. Far less attention has been paid to one of the most difficult things to assess well in education and training – human communication. Educational technology does have a tendency to emphasise knowledge and written outputs over human connection and interpersonal skills.
Recently, my colleagues and I trialled an AI‑enabled assessment designed to evaluate empathetic communication and motivational skills in a context where these qualities are not optional but essential: befriending lonely older adults living alone in the community. The trial was part of a wider effort to improve how we fairly and clearly assess skills in roles that work closely with the community.
Designing an assessment for human connection
The assessment focused on the skills required to engage socially isolated elderly people in their homes – listening attentively, identifying needs, motivating participation and responding empathetically, even when interactions are difficult. We mapped these skills to defined behavioural indicators to support consistent judgement.
Instead of using role-players or standardised patients, we programmed an AI avatar to assume the persona of an elderly man: grumpy, resistant to conversation and pessimistic about life. He expresses little interest in activities, rebuffs suggestions and challenges the test candidate’s patience and assumptions. The goal was not to “win him over” quickly, but to see how candidates navigated emotional resistance, built rapport and adapted their communication strategies over time.
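To illustrate, a persona like this is typically specified as a persistent system instruction to a chat-style language model. The sketch below is illustrative only: the name "Mr Tan", the prompt wording and the message-dictionary format are assumptions, not the trial's actual configuration.

```python
# Illustrative persona specification for a chat-style LLM.
# "Mr Tan" and the prompt text are hypothetical, not the prompt used in the trial.
PERSONA_PROMPT = (
    "You are Mr Tan, a 78-year-old man living alone. You are grumpy, "
    "resistant to conversation and pessimistic about life. You express "
    "little interest in activities, rebuff suggestions, and warm up only "
    "slowly if the visitor listens patiently and responds with empathy."
)

def build_messages(history, candidate_turn):
    """Assemble the message list for a chat-completion call, keeping the
    persona as a persistent system instruction ahead of the dialogue."""
    return (
        [{"role": "system", "content": PERSONA_PROMPT}]
        + history
        + [{"role": "user", "content": candidate_turn}]
    )
```

Keeping the persona in the system role, rather than in individual turns, is what lets the avatar stay consistently resistant across a long conversation.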
Candidates engaged in a live conversation with the avatar. We assessed their performance against a rubric covering empathetic communication, active listening, needs identification and ability to motivate.
What mattered most was not what candidates said, but how they responded – to silences, to negativity and to cues embedded in the conversation.
Authenticity matters more than sophistication
One of our earliest lessons was that building the persona mattered more than building the technology. The avatar did not need to sound flawless or emotionally rich in a cinematic sense; what it needed was behavioural credibility.
We spent significant time studying common interaction patterns with lonely elderly people: deflection, repetition, cynicism and reluctance to accept help. These patterns were deliberately embedded into the avatar’s responses. In doing so, we aimed to replicate the emotional texture of real-world encounters rather than an idealised dialogue.
This attention to realism paid off. Many candidates reported that the interaction felt “surprisingly real”, particularly when the avatar appeared to listen to their responses and, at times, push back or rebuke them. The discomfort this generated was intentional. After all, in real settings, empathetic communication rarely unfolds smoothly.
From ephemeral conversations to assessable evidence
One of the most significant advantages of the GenAI‑enabled approach was the ability to capture the entire conversation.
Unlike live role‑plays, which rely heavily on assessor observation and note‑taking, the system produced full transcripts and AI‑generated summaries of each interaction. This allowed assessors to review conversations in detail, revisit specific moments and evaluate performance against criteria more consistently.
The GenAI system also conducted a first‑round evaluation using the assessment rubric. However, this was never intended to replace human judgement. Instead, it functioned as a triage layer – flagging potential strengths, weaknesses and anomalies for closer review.
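A triage layer of this kind can be very simple. The sketch below assumes a 0–5 rubric scale and illustrative thresholds; it flags borderline criteria and unusually short interactions for human review, and deliberately never produces a final grade.

```python
def triage(result, min_turns=6, borderline=2.5):
    """First-pass triage over AI-generated rubric scores (0-5 scale assumed).
    Returns human-review flags rather than a verdict: the thresholds and
    field names here are illustrative, not the trial's actual settings."""
    flags = []
    for criterion, score in result["rubric"].items():
        if score <= borderline:
            flags.append(f"review: low score on {criterion}")
    if result["candidate_turns"] < min_turns:
        flags.append("review: unusually short interaction")
    return flags
```

The design choice is that the output is a list of prompts for an assessor, so an empty list means "nothing unusual", not "pass".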
Having humans firmly “in the loop” proved essential.
Why human oversight still matters
The transcripts revealed limitations that would have been easy to miss in a fully automated system.
In one case, the tool rated a candidate as incompetent. Upon review, assessors discovered that the avatar had become stuck in a repetitive conversational loop, responding with the same phrase regardless of the candidate’s input. The candidate had, in fact, demonstrated reasonable strategies – but the interaction itself was flawed. Our decision was to invalidate the test rather than fail the individual.
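Failures like this repetitive loop can also be caught automatically before scores reach an assessor. A minimal sketch, assuming transcripts are stored as speaker-tagged turns (a format we assume here for illustration):

```python
from collections import Counter

def avatar_stuck_in_loop(transcript, threshold=3):
    """Return True if the avatar repeated an identical response `threshold`
    or more times - a sign the test should be invalidated rather than the
    candidate failed. The transcript format and threshold are illustrative."""
    avatar_turns = [
        turn["text"].strip().lower()
        for turn in transcript
        if turn["speaker"] == "avatar"
    ]
    counts = Counter(avatar_turns)
    return any(n >= threshold for n in counts.values())
```

A check like this would route the transcript to "invalid interaction" review rather than letting the first-round evaluation score it.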
In another case, a conversation appeared unusually short. Initial suspicion fell on a system error. However, when assessors reviewed the transcript and followed up with the candidate, it became clear that the system had captured the interaction accurately. The candidate had disengaged halfway through, struggling to formulate open‑ended questions once they’d used their initial prompts. In this instance, the assessment revealed a genuine skills gap.
There were also cultural nuances. It was common for candidates to attempt to build rapport by referencing local food or asking about favourite hawker dishes – a conversational strategy typical in Singapore, where food is an important connector across communities. While human assessors viewed these efforts as positive and culturally sensitive, the AI often did not identify them as rapport-building strategies. These instances served as valuable training data, enabling us to further refine the model.
Rethinking assessment infrastructure
Perhaps the most pragmatic outcome of the trial was the realisation that AI avatars could meaningfully reduce reliance on standardised patients for certain communication assessments. Traditional role‑plays are resource‑intensive, difficult to standardise and challenging to scale. They also introduce variability that is not always pedagogically useful.
AI does not eliminate complexity, but it shifts it – from logistics to design, from scheduling to calibration. Once built and validated, the system can offer consistent scenarios while still producing rich, qualitative evidence for assessment and feedback.
What we learned
First, AI can be meaningfully embraced in assessment – but only when it is aligned with clear pedagogical intent, not with novelty.
Second, human judgement remains indispensable. AI can surface patterns and provide efficiency, but accountability and interpretation must stay with educators.
Third, for communication‑heavy competencies, AI avatars can serve as a viable alternative to standardised patients, significantly reducing time and resource demands.
Finally, experimentation matters. The system is imperfect and will remain so. But its capacity to improve, scale and support more authentic assessment is substantial, provided institutions are willing to trial, reflect and iterate.
In the end, the most interesting outcome was not that AI could assess communication, but that it prompted us to look more closely at how we do so – and how much evidence we usually leave unheard.
May Lim is assistant provost (applied learning) and Caleb Or is assistant director (centralised skills assessment and validation initiative), both at the Singapore Institute of Technology.




