Large Language Models (LLMs) have demonstrated high IQ, scoring in the top percentiles on common standardized tests. But what about their emotional intelligence?
Emotional intelligence is a deeply human trait, nuanced and integral to our daily problem-solving. Can LLMs grasp this type of intelligence and leverage it in the same way we can?
That’s what researchers from Microsoft and other institutions set out to discover. In their paper, Large Language Models Understand and Can Be Enhanced by Emotional Stimuli, they explore how LLMs react to emotionally appealing prompts, introducing a method they aptly call “EmotionPrompt”.
What is EmotionPrompt?
EmotionPrompt is really simple. All it does is append an emotional stimulus to the end of a prompt in an attempt to enhance performance.
The researchers began by creating a list of emotional stimuli to use in their experiments.
These stimuli were grounded in three well-established psychological theories: Self-Monitoring, Social Cognitive Theory, and Cognitive Emotion Regulation Theory.
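To make that concrete, here is a minimal sketch of what appending a stimulus looks like in code. The helper name and the example stimulus strings are illustrative, written in the spirit of the phrases the paper uses, not a verbatim copy of its list.

```python
# Minimal sketch of the EmotionPrompt idea (not the authors' code):
# take an ordinary task prompt and append an emotional stimulus to the end.
# The stimulus strings below are representative examples, not the paper's exact list.

EMOTIONAL_STIMULI = [
    "This is very important to my career.",                  # self-monitoring style
    "You'd better be sure.",                                  # self-monitoring style
    "Believe in your abilities and strive for excellence.",   # social cognitive style
]

def emotion_prompt(base_prompt: str, stimulus: str = EMOTIONAL_STIMULI[0]) -> str:
    """Append an emotional stimulus to the end of a base prompt."""
    return f"{base_prompt.rstrip()} {stimulus}"

if __name__ == "__main__":
    base = "Determine whether the sentiment of this movie review is positive or negative."
    print(emotion_prompt(base))
    # -> "Determine whether the sentiment of this movie review is positive or negative.
    #     This is very important to my career."
```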
We put together a (very) simple template so you can try out this method easily in PromptHub.
If you don't have PromptHub access but want to try it out, reply to the email that gets sent when you join the waitlist and I'll share an access code with you.
Automated Experiments
Setup
The researchers evaluated EmotionPrompt across a wide range of tasks, using two datasets.
They assessed its performance in both zero-shot and few-shot prompting scenarios.
Datasets:
- Instruction Induction [22]
- BIG-Bench [23]
Models:
- Flan-T5-Large [24]
- Vicuna [25]
- Llama 2 13B [26]
- BLOOM [27]
- ChatGPT [28]
- GPT-4
Baseline: the same prompts without any emotional stimuli.
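If you want to run a tiny version of this baseline-vs-EmotionPrompt comparison yourself, the sketch below shows one way to A/B the two prompt variants. It is not the authors' evaluation harness; the model name, task prompt, and temperature are placeholders, and emotion_prompt() refers to the helper from the earlier sketch.

```python
# Rough do-it-yourself comparison of a plain prompt vs. its EmotionPrompt variant,
# using the OpenAI Python client. Not the authors' evaluation code.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def ask(prompt: str, temperature: float = 0.7) -> str:
    """Send a single-turn prompt and return the model's reply."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return response.choices[0].message.content

base = "List three steps to debug a failing unit test."
plain_answer = ask(base)                                              # baseline: no stimulus
emotion_answer = ask(base + " This is very important to my career.")  # EmotionPrompt variant

print("Baseline:\n", plain_answer)
print("EmotionPrompt:\n", emotion_answer)
```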
Results
What stands out:
- EmotionPrompt significantly improves performance, with relative gains of 8.00% on Instruction Induction and 115% on BIG-Bench
- The large gap between the two datasets is mostly down to the nature of the BIG-Bench tasks, which are designed to be more challenging and diverse and potentially require more nuanced, human-like responses.
- EmotionPrompt mostly outperforms other prompt engineering methods like Chain-of-Thought (CoT) and APE.
Human Study
Setup
The researchers engaged 106 participants for a hands-on study using GPT-4. They crafted 30 questions, each generating two distinct responses: one from a standard prompt, one using EmotionPrompt.
Evaluation
Participants were tasked with assessing the model's outputs, focusing on performance, truthfulness, and responsibility. Each response was rated on a scale of 1 to 5.
What stands out:
- Participants consistently rated EmotionPrompt responses higher, across all three metrics.
- Specifically, for performance, EmotionPrompt achieved a relative gain of 1.0 or greater (a 20% increase!) in nearly one-third of the tasks.
- Only in two instances did EmotionPrompt exhibit shortcomings.
- In a poem composition comparison, EmotionPrompt's poem was perceived as more creative
- EmotionPrompt yielded a 19% improvement in truthfulness
- The human study reinforces the quantitative findings, emphasizing EmotionPrompt's real-world applicability and resonance with users.
Wrapping up
A few final takeaways from the paper:
- Combining multiple emotional stimuli brings little or no additional gains.
- The effectiveness of stimuli varies with the task.
- The larger the LLM, the greater the benefit from EmotionPrompt.
- As the temperature setting rises, so does the relative gain.
The next time you’re working on a prompt or using ChatGPT, try adding some of the emotional stimuli used in these experiments. You may well get better results!