Excited to announce Contrastive Preference Learning (CPL), a simple RL-free method for RLHF that works with arbitrary MDPs and off-policy data. arXiv: arxiv.org/abs/2310.13639 With @rm_rafailov @harshit_sikchi @chelseabfinn @scottniekum W. Brad Knox @DorsaSadigh A thread🧵👇
While most RLHF methods assume that preferences are distributed according to reward, recent work (Knox et al., 2023) shows they are better modeled by *regret* under the optimal policy. For example, we prefer behaviors closer to our intended policy, like moving toward a goal. 2/8
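A minimal sketch of the distinction under a Bradley-Terry preference model: the classic RLHF assumption scores a segment by its discounted reward sum, while the regret-based view scores it by the discounted sum of optimal advantages A*(s, a) (low regret = high advantage). Function names and the discount γ=0.99 here are illustrative, not from the paper:

```python
import numpy as np

def bradley_terry_prob(score_a, score_b):
    # P(segment A preferred over segment B) given per-segment scores
    return np.exp(score_a) / (np.exp(score_a) + np.exp(score_b))

def reward_score(rewards, gamma=0.99):
    # Classic RLHF assumption: score = discounted sum of rewards r(s, a)
    discounts = gamma ** np.arange(len(rewards))
    return float(np.dot(discounts, rewards))

def regret_score(advantages, gamma=0.99):
    # Regret-based model (Knox et al., 2023): score = discounted sum of
    # optimal advantages A*(s, a); negating this gives the segment's regret
    discounts = gamma ** np.arange(len(advantages))
    return float(np.dot(discounts, advantages))
```

Under the regret model, a segment that moves toward the goal accrues high advantage at every step and is preferred, even if sparse rewards would score both segments identically.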