
But this is just the SFT ("distilled") model, not the one optimized with RL, right?


Oh, I think it's SFT + RL, as mentioned in the paper. They said combining both actually performs better than RL alone.



