Excited to announce Contrastive Preference Learning (CPL), a simple RL-free method for RLHF that works with arbitrary MDPs and off-policy data. arXiv: arxiv.org/abs/2310.13639 With @rm_rafailov @harshit_sikchi @chelseabfinn @scottniekum W. Brad Knox @DorsaSadigh A thread🧵👇
While most RLHF methods assume that preferences are distributed according to reward, recent work (Knox et al., 2023) shows they are better modeled by *regret* under the optimal policy. For example, we prefer behaviors closer to our intended policy, like moving toward a goal. 2/8
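A minimal sketch of the distinction under a Bradley-Terry preference model: the classic RLHF assumption scores a segment by its discounted reward sum, while the regret-based view scores it by the discounted sum of optimal advantages A*(s, a) (low regret = high advantage). Function names and the discount γ=0.99 here are illustrative, not from the paper:

```python
import numpy as np

def bradley_terry_prob(score_a, score_b):
    # P(segment A preferred over segment B) given per-segment scores
    return np.exp(score_a) / (np.exp(score_a) + np.exp(score_b))

def reward_score(rewards, gamma=0.99):
    # Classic RLHF assumption: score = discounted sum of rewards r(s, a)
    discounts = gamma ** np.arange(len(rewards))
    return float(np.dot(discounts, rewards))

def regret_score(advantages, gamma=0.99):
    # Regret-based model (Knox et al., 2023): score = discounted sum of
    # optimal advantages A*(s, a); negating this gives the segment's regret
    discounts = gamma ** np.arange(len(advantages))
    return float(np.dot(discounts, advantages))
```

Under the regret model, a segment that moves toward the goal accrues high advantage at every step and is preferred, even if sparse rewards would score both segments identically.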