FSPO: Few-Shot Preference Optimization of Synthetic Preference Data in LLMs Elicits Effective Personalization to Real Users

¹Stanford University   ²Google DeepMind   ³OpenAI
Equal Contribution · Equal Advising

As language models increasingly interact with a diverse user base, it becomes important for models to generate responses that align with individual user preferences. Few-Shot Preference Optimization (FSPO) is a meta-learning framework that leverages the strong in-context learning capabilities of an LLM to capture the diversity of human preferences. We additionally study crucial design decisions that enable effective transfer from synthetic preference data to real users.

Contributions

  • Preference Optimization Framework: We propose Few-Shot Preference Optimization (FSPO), a meta-learning framework that leverages the strong in-context learning capabilities of LLMs to capture the diversity of human preferences
  • Sim2Real (Synthetic Preference Data): We study crucial design decisions to allow for effective transfer to real human participants through synthetic preference data in a controlled human study
  • Personalization Benchmark: We propose a benchmark consisting of 3 domains where personalization can be studied: (1) Reviews, (2) Explain Like I'm X (ELIX), and (3) Roleplay, with effective transfer to a real human study

Abstract

Effective personalization of LLMs is critical for a broad range of user-interfacing applications such as virtual assistants and content curation. Inspired by the strong in-context learning capabilities of LLMs, we propose Few-Shot Preference Optimization (FSPO), which reframes reward modeling as a meta-learning problem. Under this framework, an LLM learns to quickly adapt to a user via a few labeled preferences from that user, constructing a personalized reward function for them. Additionally, since real-world preference data is scarce and challenging to collect at scale, we propose careful design choices to construct synthetic preference datasets for personalization, generating over 1M synthetic personalized preferences using publicly available LLMs. In particular, to successfully transfer from synthetic data to real users, we find it crucial for the data to exhibit both high diversity and coherent, self-consistent structure. We evaluate FSPO on personalized open-ended generation for up to 1,500 synthetic users across three domains: movie reviews, pedagogical adaptation based on educational background, and general question answering, along with a controlled human study. Overall, FSPO achieves an 87% Alpaca Eval winrate on average in generating responses that are personalized to synthetic users and a 72% winrate with real human users in open-ended question answering.


Method Overview

Generally, preference fine-tuning algorithms aim to optimize a reward function that captures the preferences of a user. To do so, these approaches aggregate preferences from a user into a single preference dataset, which is then used to fine-tune a language model through RLHF or Preference Optimization. Can we instead learn a personalized reward function directly from a few preferences of a user?

Meta-learning motivation and objective.


To instantiate this idea, we propose Few-Shot Preference Optimization (FSPO), where, with only the weak additional requirement of a scorer ID to differentiate users, we can learn a personalized reward function for each user. In FSPO, we treat each user as a new task instance in meta-learning, where each user has a preference dataset comprising prompts, preferred responses, and dispreferred responses. Then, leveraging the strong in-context learning capabilities of LLMs, we condition the model on N preference examples from the user and fine-tune it to maximize the likelihood of the preferred responses and minimize the likelihood of the dispreferred responses using a preference optimization loss such as DPO or IPO. This allows us to learn a personalized reward function for each user.
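To make the conditioning concrete, here is a minimal sketch (in PyTorch) of a DPO-style loss applied to responses that are scored in the context of a user's few-shot preferences. The helper names (`fspo_dpo_loss`, `build_user_context`) and the serialization format are illustrative assumptions, not the released implementation.

```python
import torch.nn.functional as F

def fspo_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                  ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Standard DPO loss; what makes it FSPO is that every log-probability
    # is computed for a response conditioned on the user's few-shot
    # preference context in addition to the prompt.
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (pi_logratios - ref_logratios)).mean()

def build_user_context(few_shot_prefs, prompt):
    # Serialize N (prompt, preferred, dispreferred) triples ahead of the
    # query so the model can infer the user's preferences in-context.
    shots = "\n".join(
        f"Prompt: {p}\nPreferred: {yw}\nDispreferred: {yl}"
        for p, yw, yl in few_shot_prefs
    )
    return f"{shots}\n\nPrompt: {prompt}\nResponse:"
```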

User description chain-of-thought (COT).

(Figure: two-step prediction with a user-description chain of thought.)

When a user description is provided (even if synthetically generated), FSPO can be reframed as a two-step prediction process, as illustrated above. In the first stage, the system generates a user description based on the user’s few-shot preferences. In the subsequent stage, it produces a response conditioned on the original prompt, the few-shot preferences, and the generated user description. This intermediate step not only offers an interpretable summary of the few-shot preferences but also serves as a better representation for reward modeling and response generation.
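As a rough illustration of this two-step inference, the sketch below chains the two stages; `generate` stands for any text-in/text-out call to the model, and the prompt wording is hypothetical rather than the exact template from the paper.

```python
def cot_personalized_response(generate, few_shot_prefs, prompt):
    # Stage 1: infer a natural-language user description from the few-shot
    # preferences. Stage 2: answer the prompt conditioned on both the
    # preferences and the generated description.
    shots = "\n".join(
        f"Prompt: {p}\nPreferred: {yw}\nDispreferred: {yl}"
        for p, yw, yl in few_shot_prefs
    )
    user_desc = generate(
        f"{shots}\n\nDescribe this user's preferences in a few sentences:"
    )
    response = generate(
        f"{shots}\n\nUser description: {user_desc}\n\n"
        f"Prompt: {prompt}\nResponse:"
    )
    return user_desc, response
```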

Algorithm Overview.

In all, FSPO is easy to instantiate. First, a minibatch of training users is sampled. For each user, we sample a few-shot preference dataset and (potentially) a user description. We then fine-tune the model on the few-shot preferences and the user description using a preference optimization loss such as DPO or IPO (equations 5 + 6). This allows us to learn a personalized reward function for each user. For COT, we train with teacher forcing: we first compute the loss on the user description and then, conditioned on the ground-truth user description and the few-shot preferences, compute the loss on the responses. We then update with gradient descent. We summarize FSPO in Algorithm 1.

(Algorithm 1: FSPO.)
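Below is a minimal sketch of one such training step, reusing `fspo_dpo_loss` from the earlier snippet; `logp_of` and `nll_of` are assumed helpers that score a completion under a model given a context string, and the data layout is our own illustrative choice.

```python
import random

def fspo_training_step(model, ref_model, optimizer, users,
                       batch_size=4, num_shots=8, beta=0.1, use_cot=False):
    # One meta-training step in the spirit of Algorithm 1. `users` maps a
    # user id to {"prefs": [(prompt, chosen, rejected), ...], "description": str}.
    total_loss = 0.0
    for user in random.sample(list(users.values()), batch_size):
        # Split this user's preferences into a few-shot context and one held-out pair.
        sampled = random.sample(user["prefs"], num_shots + 1)
        context_prefs, (prompt, chosen, rejected) = sampled[:-1], sampled[-1]
        shots = "\n".join(
            f"Prompt: {p}\nPreferred: {yw}\nDispreferred: {yl}"
            for p, yw, yl in context_prefs
        )

        if use_cot:
            # Teacher forcing: pay a likelihood cost on the ground-truth user
            # description, then condition the response loss on that description.
            total_loss = total_loss + nll_of(model, shots, user["description"])
            shots = f"{shots}\nUser description: {user['description']}"

        context = f"{shots}\n\nPrompt: {prompt}\nResponse:"
        total_loss = total_loss + fspo_dpo_loss(
            logp_of(model, context, chosen), logp_of(model, context, rejected),
            logp_of(ref_model, context, chosen), logp_of(ref_model, context, rejected),
            beta=beta,
        )

    optimizer.zero_grad()
    (total_loss / batch_size).backward()
    optimizer.step()
```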

User representation through preference labels.

From an information-theoretic perspective, the few-shot binary preferences can be seen as an N-bit representation of the user, representing up to 2^N different personas or reward functions. There are several ways to represent users: surveys, chat histories, or other forms of interaction that reveal hidden preferences. We restrict our study to this N-bit user representation, as such a constrained representation can improve performance when transferring reward models learned on synthetic personalities to real users. We defer the study of less constrained representations to future work.
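As a tiny worked example of this view, the snippet below packs N binary labels into a bit string: with N = 8 answered comparisons, a user maps to one of 2^8 = 256 possible codes.

```python
def user_bit_code(preference_labels):
    # N binary preference labels act as an N-bit code for the user, so N
    # labeled comparisons distinguish at most 2**N personas / reward functions.
    return "".join("1" if preferred_a else "0" for preferred_a in preference_labels)

# 8 answered comparisons -> one of 2**8 = 256 possible user codes, e.g. "10110011"
code = user_bit_code([True, False, True, True, False, False, True, True])
```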

Takeaway: FSPO offers an effective approach to personalizing open-ended question answering by framing personalization as a meta-learning problem conditioned on few-shot preferences from a user. Additionally, FSPO can be converted into a two-step prediction problem, predicting a user description conditioned on the preferences and then a response, leveraging additional inference compute and the model's instruction-tuned prior for better performance. We summarize the algorithm framework in Algorithm 1.


Benchmark Tasks

To evaluate the performance of FSPO and baselines in generating personalized responses, we require benchmark tasks that capture a wide range of user preferences. We propose a benchmark consisting of 3 domains where personalization can be studied: (1) Reviews: studying models' ability to generate reviews of movies, TV shows, and books that are consistent with a user's writing style, (2) Explain Like I'm X (ELIX): studying models' ability to generate responses that are consistent with a user's education level, and (3) Roleplay: studying models' ability to generate responses that are consistent with a user's description, with effective transfer to a real human study. We open-source the benchmark tasks, along with the synthetic data and evaluation scripts, for the community to use.


Sim2Real: Synthetic Preference Data Transfers to Real Users

Collecting personalized data at scale presents significant challenges, primarily due to the high cost and inherent unreliability of human annotation. Curating a diverse set of users to capture the full spectrum of real-world variability further complicates the process, often limiting the scope and representativeness of the data. Synthetically generating data using an LLM is a promising alternative, since it can both reduce costly human data generation and annotation and streamline the data curation process. But how do we go about generating such preference data with language models in a way that transfers to real people?


Following work on task construction in meta-learning, we propose two design decisions to effectively encourage transfer from synthetic to real users: (1) Encouraging Diversity and (2) Structured Task Construction. Encouraging diversity in the synthetic data allows us to capture a wide range of user preferences, while structured task construction allows us to capture coherent, self-consistent user preferences. For diversity, we sample a wide range of prompts from human participants and data sources, and augment these prompts by few-shot prompting an LLM. For responses, we use either Persona-Steering or View Conditioning to encourage diversity. For structured task construction, we label preferences in a consistent manner using a modified version of the Alpaca Eval prompt, and reduce underspecification by iteratively refining the user description. We summarize the task construction process in the flowchart below.

(Flowchart: the synthetic preference data construction process.)
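A minimal sketch of this recipe is below, assuming `llm` is any text-in/text-out callable; the persona-steering and judging prompts are hypothetical stand-ins for the paper's templates, and a production pipeline would additionally refine the user description iteratively.

```python
def synthesize_user_preferences(llm, persona, seed_prompts, n_prompts=32):
    # (1) Diversity: augment seed prompts by few-shot prompting an LLM.
    examples = "\n".join(f"- {p}" for p in seed_prompts)
    prompts = [
        llm(f"Here are some example questions:\n{examples}\n"
            f"Write one new question in a similar style:")
        for _ in range(n_prompts)
    ]

    prefs = []
    for prompt in prompts:
        # Persona-steered vs. generic responses give a contrast to label.
        resp_a = llm(f"You are {persona}. Answer:\n{prompt}")
        resp_b = llm(f"Answer:\n{prompt}")
        # (2) Structure: label every pair with the same judge prompt so the
        # resulting preferences stay coherent and self-consistent.
        verdict = llm(
            f"A user is described as: {persona}\nPrompt: {prompt}\n"
            f"Response A: {resp_a}\nResponse B: {resp_b}\n"
            f"Which response would this user prefer? Answer 'A' or 'B'."
        )
        if verdict.strip().upper().startswith("A"):
            prefs.append((prompt, resp_a, resp_b))  # (prompt, chosen, rejected)
        else:
            prefs.append((prompt, resp_b, resp_a))
    return prefs
```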

Takeaway: Since collecting personalized data at scale is challenging, we propose instead to generate diverse synthetic preference datasets that can be transferred to real humans. We study two design decisions that effectively encourage this transfer, (1) Encouraging Diversity and (2) Structured Task Construction, and discuss approaches to instantiate these design choices.


Results

Below, we evaluate FSPO against 4 baselines: (1) a base model generating responses without any personalization, (2) few-shot prompting the base model, (3) few-shot supervised fine-tuning (Pref-FT) based on GPO, and (4) prompting with the ground-truth persona. We evaluate FSPO on the 3 benchmark tasks and find that it outperforms all baselines in generating personalized responses. We also find that COT closes the gap to the oracle method, where we prompt with the ground-truth persona. For Reviews, we have 4 trained users (concise, verbose, positive, negative) and 4 interpolated users (concise + positive, verbose + positive, concise + negative, verbose + negative) to evaluate how FSPO generalizes to unseen users. For ELIX, we have 2 tasks: ELIX-easy, a simple task with 5 educational levels, and ELIX-hard, a more challenging task with specialized educational levels such as a PhD in Physics. For Roleplay, we have 1,500 total users and evaluate FSPO on held-out users and prompts for generating personalized responses. Additionally, we run a preliminary, controlled human study using Prolific and a data collection app built to collect preferences from real human participants.
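For reference, a winrate of the kind reported in the tables below can be computed with an Alpaca-Eval-style pairwise judge; the sketch here is illustrative (`judge`, `candidate_fn`, and `baseline_fn` are assumed callables), and a careful evaluation would also swap the A/B order to control for position bias.

```python
def head_to_head_winrate(judge, eval_set, candidate_fn, baseline_fn):
    # Fraction of held-out (user, prompt) pairs where the judge prefers the
    # candidate system's response over the baseline's, as a percentage.
    wins = 0
    for user, prompt in eval_set:
        a, b = candidate_fn(user, prompt), baseline_fn(user, prompt)
        verdict = judge(
            f"User description: {user['description']}\nPrompt: {prompt}\n"
            f"Response A: {a}\nResponse B: {b}\n"
            f"Which response better suits this user? Answer 'A' or 'B'."
        )
        wins += verdict.strip().upper().startswith("A")
    return 100.0 * wins / len(eval_set)
```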

Method                             Winrate (%)
Base (Llama 3.2 3B Instruct)       50.0
IPO                                72.4
Few-shot Prompting                 63.2
Few-shot Pref-FT                   62.8
FSPO (ours)                        82.6
FSPO + COT (ours)                  90.3
Oracle (prompt w/ g.t. persona)    90.9

Table 1: Automatic Winrates on Roleplay (1500 users)

Method                 ELIX-easy    ELIX-hard
Base                   50.0         50.0
Few-shot Prompted      92.4         81.4
Few-shot Pref-FT       91.2         82.9
FSPO (Ours)            97.8         91.8

Table 2: GPT-4o Winrates on ELIX-easy and ELIX-hard

Method                            Trained    Interpolated
Base (Llama 3.2 3B Instruct)      50.0       50.0
Few-shot Prompted (4-shot)        66.6       61.9
Few-shot Pref-FT (4-shot)         66.5       66.1
FSPO (4-shot, Ours)               78.4       71.3
Few-shot Prompted (8-shot)        69.1       59.1
Few-shot Pref-FT (8-shot)         65.6       70.7
FSPO (8-shot, Ours)               80.4       73.6

Table 3: Review Winrates - Trained and Interpolated Users

Comparison       Winrate (%)
FSPO vs Base     71.2
FSPO vs SFT      72.3

Table 4: Roleplay: Human Eval Winrates

Takeaway: We evaluate FSPO on the 3 tasks discussed and find an 87% Alpaca Eval winrate on average in generating responses that are personalized to synthetic users. COT also enables us to close the gap to the oracle method, where we prompt with the ground truth persona. Additionally, we run a preliminary, controlled human study, where we find a 72% winrate with real human users for open-ended question answering.