Reinforcement Learning from Human Feedback (RLHF)
PPO
DPO
DPO starts from a preference dataset of (prompt, chosen response, rejected response) triples and an SFT model; unlike PPO, it does not require training a separate reward model. One thing worth mentioning is that the preference data should be in-distribution, i.e., generated by the SFT model itself. If it is not, consider first fine-tuning the SFT model on some of the preference data (e.g., the chosen responses) to alleviate distribution shift before the actual DPO training.
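To make the objective concrete, below is a minimal PyTorch sketch of the DPO loss from the paper's formula, -log sigmoid(beta * (log-ratio of chosen - log-ratio of rejected)). It assumes you have already computed, per example, the summed token log-probabilities of each completion under the trainable policy and under the frozen SFT reference model; the argument names and the beta value are illustrative, not from the original note.

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(y_chosen | x), shape [batch]
    policy_rejected_logps: torch.Tensor,  # log pi_theta(y_rejected | x), shape [batch]
    ref_chosen_logps: torch.Tensor,       # log pi_ref(y_chosen | x) from frozen SFT model
    ref_rejected_logps: torch.Tensor,     # log pi_ref(y_rejected | x) from frozen SFT model
    beta: float = 0.1,                    # strength of the implicit KL constraint
) -> torch.Tensor:
    # Log-ratios of the policy against the frozen SFT reference
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps

    # DPO objective: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio)),
    # averaged over the batch
    logits = beta * (chosen_logratios - rejected_logratios)
    return -F.logsigmoid(logits).mean()
```

In practice the four log-probability tensors come from two forward passes per batch (policy and reference) over both completions; only the policy receives gradients, while the reference model stays frozen at the SFT checkpoint.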