Reinforcement Learning with Human Feedback
PPO
read more
DPO
DPO starts from preference data pairs (a prompt together with an accepted and a rejected response) and an SFT model; it does not require training a separate reward model. One thing worth mentioning is that the preference data should be in-distribution, i.e., generated by the SFT model itself. Otherwise, consider first fine-tuning the SFT model on some of the preference data (e.g., the accepted responses) to alleviate the distribution shift before the actual DPO training. A minimal sketch of the objective follows below.
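The sketch below assumes the summed log-probabilities of the accepted and rejected responses have already been computed under both the policy (the SFT model being trained) and a frozen reference copy of the SFT model; the function and variable names (dpo_loss, beta, etc.) are illustrative, not a specific library's API.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Per-batch DPO loss: -log sigmoid(beta * (chosen margin - rejected margin))."""
    # Implicit rewards: how much more likely the policy makes each response
    # compared to the frozen reference (SFT) model.
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    # Push the margin of the accepted response above that of the rejected one.
    logits = beta * (chosen_margin - rejected_margin)
    return -F.logsigmoid(logits).mean()
```

Here beta controls how strongly the policy is kept close to the reference model; smaller values allow the policy to drift further from the SFT initialization.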
read more