FSPO: Few-Shot Optimization of Synthetic Preferences Personalizes to Real Users

Stanford University · Google DeepMind · OpenAI
Equal contribution; equal advising.
Overview of few-shot preference optimization for personalized language model responses.

FSPO learns to personalize language model responses from a small set of user preference labels. The framework uses in-context preference examples to infer user-specific reward functions, and studies how synthetic preference data can be designed to transfer to real people.

Abstract

Effective personalization of LLMs is critical for a broad range of user-interfacing applications such as virtual assistants and content curation. Inspired by the strong in-context capabilities of LLMs, we propose few-shot preference optimization (FSPO), an algorithm for LLM personalization that reframes reward modeling as a meta-learning problem. Under FSPO, an LLM learns to quickly infer a personalized reward function for a user via a few labeled preferences. FSPO also utilizes user description rationalization (RAT) to encourage better reward modeling and instruction following, recovering performance comparable to conditioning on an oracle user description. Since real-world preference data is challenging to collect at scale, we propose careful design choices to construct synthetic preference datasets for personalization, generating over 1M synthetic personalized preferences using publicly available LLMs. To successfully transfer from synthetic data to real users, we find it crucial for the data to exhibit both high diversity and coherent, self-consistent structure. We evaluate FSPO on personalized open-ended generation for up to 1,500 synthetic users across three domains: movie reviews, education, and open-ended question answering. We also run a controlled human study. Overall, FSPO achieves an 87% Alpaca Eval winrate in generating responses that are personalized to synthetic users and a 70% winrate with real human users in open-ended question answering.

Contributions

Few-shot preference optimization

We introduce FSPO, a meta-learning framework that uses the in-context learning ability of LLMs to personalize from only a few preference examples.

User description rationalization

We propose RAT, a two-stage procedure that infers a user description from few-shot preferences and uses it to improve reward modeling and generation.

Synthetic-to-real transfer

We identify design choices that make synthetic preference data transfer more effectively to real human participants in a controlled study.

Personalization benchmark

We release benchmark tasks across Reviews, Explain Like I'm X (ELIX), and Roleplay for studying personalized open-ended generation.

Method Overview

Preference fine-tuning methods typically optimize a reward function from a pooled preference dataset. This works well for population-level behavior, but it can wash out the preferences that make individual users distinct. FSPO asks a more targeted question: can a model learn a personalized reward function directly from a few examples of one user's preferences?

Meta-learning objective

FSPO meta-learning overview.

FSPO treats each user as a task in a meta-learning problem. Each task contains prompts, preferred responses, and dispreferred responses for that user. At training time, the model conditions on N preference examples and is optimized to increase the likelihood of preferred responses while decreasing the likelihood of dispreferred responses, using a preference optimization loss such as DPO or IPO.

The only additional identifier needed is a scorer ID that distinguishes users. This lets the model learn how preferences vary across users while still reusing shared structure across the broader training population.
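The per-pair objective above can be sketched in a few lines. This is a minimal, dependency-free sketch that assumes the four log-probabilities (policy and reference model, for the preferred and dispreferred response) have already been computed with the user's N few-shot preference examples prepended to the prompt; the function names and the default beta are illustrative, not the paper's exact hyperparameters.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one held-out preference pair.

    All log-probs are assumed to come from models conditioned on the
    user's few-shot preference context (the FSPO setup).
    """
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    # -log sigmoid(margin): small when the policy prefers the chosen response
    return -math.log(1.0 / (1.0 + math.exp(-margin)))


def ipo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """IPO regresses the implicit reward margin toward 1/(2*beta)."""
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return (margin - 1.0 / (2.0 * beta)) ** 2
```

Either loss plugs into the same meta-learning setup; the only difference from standard preference fine-tuning is that the conditioning context contains the user's few-shot labels.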

User description rationalization (RAT)

User description rationalization diagram.

When a user description is available, even if synthetically generated, FSPO can be framed as a two-stage prediction process. First, the model infers a user description from the few-shot preferences. Then it generates a response conditioned on the prompt, the preference examples, and the inferred description.

RAT gives the model an interpretable intermediate representation of the user's preferences. It also uses additional inference-time compute to improve reward modeling and instruction following.
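The two-stage prediction can be sketched as follows. Here `model_generate`, the prompt wording, and the preference formatting are all illustrative assumptions standing in for the trained model and the paper's actual templates.

```python
def rat_generate(model_generate, fewshot_prefs, prompt):
    """Two-stage RAT inference (sketch).

    model_generate: any text -> text LLM call (hypothetical stand-in).
    fewshot_prefs: list of (prompt, preferred, dispreferred) triples.
    """
    pref_text = "\n".join(
        f"Prompt: {p}\nPreferred: {a}\nDispreferred: {b}"
        for p, a, b in fewshot_prefs
    )
    # Stage 1: infer an interpretable user description from the labels.
    description = model_generate(
        "Describe the user whose preferences are shown below.\n" + pref_text
    )
    # Stage 2: condition on the prompt, the examples, and the description.
    return model_generate(
        f"User description: {description}\n{pref_text}\nPrompt: {prompt}\nResponse:"
    )
```

The intermediate description is what makes RAT inspectable: a practitioner can read (or correct) the inferred description before the second generation call.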

Algorithm overview

FSPO is straightforward to instantiate. We sample a minibatch of training users, draw a few-shot preference set for each user, and optionally include a user description. The model is then optimized on held-out preferences with a loss such as DPO or IPO. With RAT, the model first learns to predict the user description from few-shot examples, then conditions on that description and the examples to optimize the held-out preference response.

FSPO algorithm overview.
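The sampling procedure described above can be sketched as a single meta-training step. `sample_prefs`, `loss_fn`, and the batch sizes below are hypothetical placeholders for the actual data loader and preference-optimization loss (DPO or IPO); this is a sketch of the control flow, not the paper's implementation.

```python
import random


def fspo_training_step(users, sample_prefs, loss_fn, n_shots=4, batch_users=8):
    """One FSPO meta-training step (sketch).

    users: list of user IDs (each user is one meta-learning task).
    sample_prefs(user, k): returns k (prompt, chosen, rejected) triples.
    loss_fn(context, held_out): preference loss for the held-out pair,
        conditioned on the few-shot context.
    """
    batch = random.sample(users, min(batch_users, len(users)))
    total = 0.0
    for user in batch:
        # Draw n_shots context examples plus one held-out pair for this user.
        triples = sample_prefs(user, n_shots + 1)
        context, held_out = triples[:n_shots], triples[n_shots]
        total += loss_fn(context, held_out)
    return total / len(batch)
```

With RAT, `loss_fn` would additionally supervise the predicted user description before the held-out preference term.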

Representing users through preference labels

From an information-theoretic perspective, N binary preference labels can represent up to 2^N different personas or reward functions. Many signals can represent users, including surveys, chat histories, and interaction traces. In this work, we focus on the constrained setting of few-shot binary preferences because it isolates the personalization problem and supports transfer from synthetic personas to real users.
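The counting argument is simply that each binary label can at most halve the set of reward functions consistent with the data:

```latex
\underbrace{\{0,1\}^N}_{\text{label patterns}} \;\Longrightarrow\; \text{at most } 2^N \text{ distinguishable reward functions,}
\qquad \text{e.g. } N = 10 \text{ labels} \Rightarrow 2^{10} = 1024 \text{ personas.}
```

So even a handful of preference labels carries enough information to separate a large population of users, provided the labels are self-consistent.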

Takeaway. FSPO frames personalization as meta-learning over users, while RAT turns preference examples into an interpretable user description before response generation. Together, they let a model adapt to individual preferences from only a few labeled examples.

Benchmark Tasks

Evaluating personalization requires tasks where different users reasonably prefer different outputs. We introduce three benchmark domains that test whether models can adapt style, content, and behavior from a small set of preference examples.

Reviews

Generate reviews for movies, TV shows, and books that match a user's writing style and sentiment preferences.

ELIX

Explain concepts at the right level for a user's background, from simple educational levels to specialized expertise.

Roleplay

Generate open-ended responses that stay consistent with a user's description and transfer to real human-study participants.

We open-source the benchmark tasks, synthetic data, and evaluation scripts so the community can study few-shot personalization under a shared evaluation protocol.

Sim2Real: Synthetic Preference Data Transfers to Real Users

Personalized data is expensive to collect at scale. Human annotation is costly, and assembling a diverse set of users that reflects real-world variation is difficult. Synthetic data offers a practical alternative, but only if it captures the diversity and consistency needed for models to transfer to real people.

Domain randomization and iterative persona improvement diagram.

We propose two design principles for synthetic-to-real transfer: encourage diversity and construct structured tasks. Diversity helps synthetic data cover a broad range of user preferences. Structure ensures that each user's preferences are coherent and self-consistent.

To encourage diversity, we sample prompts from human participants and public data sources, then augment them with few-shot prompting. For responses, we use Persona-Steering or View Conditioning. To improve structure, we label preferences consistently with a modified Alpaca Eval prompt and iteratively refine user descriptions to reduce underspecification.

Roleplay dataset generation flowchart.
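The consistent-labeling step can be sketched as a persona-conditioned pairwise judge. `judge` and the prompt text below are illustrative stand-ins for an LLM call with the paper's modified Alpaca Eval prompt, not the actual template.

```python
def label_preference(judge, persona, prompt, resp_a, resp_b):
    """Label one synthetic preference pair from a persona's point of view (sketch).

    judge: text -> text LLM call returning "A" or "B" (hypothetical stand-in).
    Returns a (prompt, chosen, rejected) training triple.
    """
    verdict = judge(
        f"You are: {persona}\n"
        f"Question: {prompt}\n"
        f"Response A: {resp_a}\n"
        f"Response B: {resp_b}\n"
        "Which response would this user prefer? Answer A or B."
    )
    if verdict.strip().startswith("A"):
        chosen, rejected = resp_a, resp_b
    else:
        chosen, rejected = resp_b, resp_a
    return prompt, chosen, rejected
```

Labeling every pair for a user against the same persona description is what gives each synthetic user the coherent, self-consistent structure that the transfer results depend on.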

Takeaway. Synthetic preference data transfers best when it is both broad enough to cover diverse users and structured enough to preserve coherent preferences within each user.

Results

We compare FSPO against a base model with no personalization, few-shot prompting, few-shot preference fine-tuning (Pref-FT), aggregated IPO, Rewards-in-Context, VPL, and prompting with the ground-truth persona. Across Roleplay, ELIX, and Reviews, FSPO outperforms the baselines for personalized generation. RAT further narrows the gap to the oracle setting, where the model is prompted with the true persona.

For Reviews, we evaluate both trained users (concise, verbose, positive, negative) and interpolated users that combine these traits. For ELIX, we evaluate an easier task with five educational levels and a harder task with specialized backgrounds such as a PhD in physics. For Roleplay, we evaluate 1,500 users on held-out users and prompts, then run a controlled human study through Prolific.

Table 1: Automatic win rates on Roleplay (1,500 users)

| Method | Win rate (%) |
|---|---|
| Base (Llama 3.2 3B Instruct) | 50.0 |
| IPO | 72.4 |
| Few-shot Prompting | 63.2 |
| Few-shot Pref-FT | 62.8 |
| Rewards-in-Context | 53.3 |
| VPL | 67.3 |
| FSPO (ours, DPO) | 81.3 |
| FSPO (ours, IPO) | 82.6 |
| FSPO + RAT (ours, IPO) | 90.3 |
| Oracle (prompt with ground-truth persona) | 90.9 |
Table 2: Win rates (%) on ELIX-easy and ELIX-hard (550 users)

| Method | ELIX-easy | ELIX-hard |
|---|---|---|
| Base | 50.0 | 50.0 |
| Few-shot Prompted | 92.4 | 81.4 |
| Few-shot Pref-FT | 91.2 | 82.9 |
| FSPO (ours) | 97.8 | 91.8 |
Table 3: Review win rates (%) for trained and interpolated users

| Method | Trained | Interpolated |
|---|---|---|
| Base (Llama 3.2 3B Instruct) | 50.0 | 50.0 |
| Few-shot Prompted (4-shot) | 66.6 | 61.9 |
| Few-shot Pref-FT (4-shot) | 66.5 | 66.1 |
| FSPO (4-shot, ours) | 78.4 | 71.3 |
| Few-shot Prompted (8-shot) | 69.1 | 59.1 |
| Few-shot Pref-FT (8-shot) | 65.6 | 70.7 |
| FSPO (8-shot, ours) | 80.4 | 73.6 |
| FSPO + RAT (8-shot, ours) | 92.3 | 84.6 |
Table 4: Roleplay human evaluation win rates

| Comparison | Win rate (%) |
|---|---|
| FSPO vs. Base | 68.2 ± 1.93 |
| FSPO vs. SFT | 72.3 ± 1.34 |

Takeaway. Across the three benchmark tasks, FSPO averages an 87% Alpaca Eval win rate for responses personalized to synthetic users. In the controlled human study, FSPO reaches a 70% win rate with real users for open-ended question answering.

BibTeX

@misc{singh2025fspo,
  title={FSPO: Few-Shot Optimization of Synthetic Preferences Personalizes to Real Users},
  author={Anikait Singh and Sheryl Hsu and Kyle Hsu and Eric Mitchell and Stefano Ermon and Tatsunori Hashimoto and Archit Sharma and Chelsea Finn},
  year={2025},
  eprint={2502.19312},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2502.19312}
}