Large Language Models (LLMs) have demonstrated high IQ, scoring in the top percentiles on common standardized tests. But what about their emotional intelligence?
Emotional intelligence is a deeply human trait, nuanced and integral to our daily problem-solving. Can LLMs grasp this type of intelligence and leverage it in the same way we can?
That’s what researchers from Microsoft and other institutions set out to discover. In their paper, Large Language Models Understand and Can Be Enhanced by Emotional Stimuli, they explore how LLMs react to emotionally appealing prompts, introducing a method they aptly call “EmotionPrompt”.
What is EmotionPrompt?
EmotionPrompt is really simple. All it does is append an emotional stimulus to the end of a prompt in an attempt to enhance performance.
The researchers began by creating a list of emotional stimuli to use in their experiments.
These stimuli were grounded in three well-established psychological theories: Self-Monitoring, Social Cognitive Theory, and Cognitive Emotion Regulation Theory.
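To make that concrete, here is a minimal sketch of what appending a stimulus looks like in code. The helper name and the example stimulus strings are illustrative, written in the spirit of the phrases the paper uses, not a verbatim copy of its list.

```python
# Minimal sketch of the EmotionPrompt idea (not the authors' code):
# take an ordinary task prompt and append an emotional stimulus to the end.
# The stimulus strings below are representative examples, not the paper's exact list.

EMOTIONAL_STIMULI = [
    "This is very important to my career.",                  # self-monitoring style
    "You'd better be sure.",                                  # self-monitoring style
    "Believe in your abilities and strive for excellence.",   # social cognitive style
]

def emotion_prompt(base_prompt: str, stimulus: str = EMOTIONAL_STIMULI[0]) -> str:
    """Append an emotional stimulus to the end of a base prompt."""
    return f"{base_prompt.rstrip()} {stimulus}"

if __name__ == "__main__":
    base = "Determine whether the sentiment of this movie review is positive or negative."
    print(emotion_prompt(base))
    # -> "Determine whether the sentiment of this movie review is positive or negative.
    #     This is very important to my career."
```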
We put together a (very) simple template so you can try out this method easily in PromptHub.
If you don't have PromptHub access but want to try it out, reply to the email that gets sent when you join the waitlist and I'll share an access code with you.
Automated Experiments
Setup
The researchers evaluated EmotionPrompt across a wide range of tasks, using two datasets.
They assessed its performance in both zero-shot and few-shot prompting scenarios.
Datasets:
- Instruction Induction [22]
- BIG-Bench [23]
Models:
- Flan-T5-Large [24]
- Vicuna [25]
- Llama 2 13B [26]
- BLOOM [27]
- ChatGPT [28]
- GPT-4
Baseline: the same prompts without any emotional stimuli.
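If you want to run a tiny version of this baseline-vs-EmotionPrompt comparison yourself, the sketch below shows one way to A/B the two prompt variants. It is not the authors' evaluation harness; the model name, task prompt, and temperature are placeholders, and emotion_prompt() refers to the helper from the earlier sketch.

```python
# Rough do-it-yourself comparison of a plain prompt vs. its EmotionPrompt variant,
# using the OpenAI Python client. Not the authors' evaluation code.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def ask(prompt: str, temperature: float = 0.7) -> str:
    """Send a single-turn prompt and return the model's reply."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return response.choices[0].message.content

base = "List three steps to debug a failing unit test."
plain_answer = ask(base)                                              # baseline: no stimulus
emotion_answer = ask(base + " This is very important to my career.")  # EmotionPrompt variant

print("Baseline:\n", plain_answer)
print("EmotionPrompt:\n", emotion_answer)
```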
Results
What stands out:
- EmotionPrompt significantly improves performance, with relative gains of 8.00% on Instruction Induction and 115% on BIG-Bench
- The large gap between the two datasets is mostly down to the nature of the BIG-Bench tasks, which are designed to be more challenging and diverse and potentially require more nuanced, human-like responses.
- EmotionPrompt mostly outperforms other prompt engineering methods like Chain-of-Thought (CoT) and APE.
Human Study
Setup
The researchers engaged 106 participants for a hands-on study using GPT-4. They crafted 30 questions, each generating two distinct responses: one from a standard prompt, one using EmotionPrompt.
Evaluation
Participants were tasked with assessing the model's outputs, focusing on performance, truthfulness, and responsibility. Each response was rated on a scale of 1 to 5.
What stands out:
- Participants consistently rated EmotionPrompt responses higher, across all three metrics.
- Specifically, for performance, EmotionPrompt achieved a relative gain of 1.0 or greater (a 20% increase!) in nearly one-third of the tasks.
- Only in two instances did EmotionPrompt exhibit shortcomings.
- In a poem composition comparison, EmotionPrompt's poem was perceived as more creative
- EmotionPrompt yielded a 19% improvement in truthfulness
- The human study reinforces the quantitative findings, emphasizing EmotionPrompt's real-world applicability and resonance with users.
Wrapping up
A few final takeaways from the paper:
- Combining multiple emotional stimuli brings little or no additional gains.
- The effectiveness of stimuli varies with the task.
- The larger the LLM, the greater the benefit from EmotionPrompt.
- As the temperature setting rises, so does the relative gain.
The next time you’re working on a prompt or using ChatGPT, try adding some of the emotional stimuli used in these experiments. You may well get better results!