Reinforcement Learning with Human Feedback

PPO


DPO

DPO starts from preference data pairs (prompt, accepted response, rejected response) and an SFT model; unlike PPO, it requires no separate reward model. One thing worth mentioning: the preference data should be in-distribution, i.e., the responses should be generated by the SFT model itself. If they are not, consider first fine-tuning the SFT model on some of the preference data to alleviate distribution shift before the actual DPO training.
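To make the objective concrete, here is a minimal sketch of the DPO loss for a single preference pair, using plain Python (no framework). The inputs are assumed to be the summed token log-probabilities of the accepted and rejected responses under the trainable policy and under the frozen reference (SFT) model; `beta` is the usual KL-strength hyperparameter.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (accepted, rejected) preference pair.

    Each argument is the total log-probability of the full response
    under the policy or the frozen reference (SFT) model.
    """
    # Implicit reward margins: how much more the policy favors each
    # response than the reference model does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    # Bradley-Terry style logit; beta scales the implicit KL penalty.
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)): shrinks as the policy widens the gap
    # between accepted and rejected relative to the reference.
    return -math.log(1.0 / (1.0 + math.exp(-logits)))
```

At initialization the policy equals the reference, so both margins are zero and the loss is `log 2`; training lowers the loss by raising the accepted response's likelihood relative to the rejected one, anchored to the SFT model. In practice this is averaged over a batch and the log-probabilities come from the model's token logits.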
