Sumon Biswas
reasoning
Plan Then Action: High-Level Planning Guidance Reinforcement Learning for LLM Reasoning
We propose PTA-GRPO, a two-stage framework that improves LLM reasoning by combining high-level planning guidance with guidance-aware reinforcement learning.
Zhihao Dou, Qinjian Zhao, Zhongwei Wan, Dinggen Zhang, Weida Wang, Towsif Raiyan, Benteng Chen, Qingtao Pan, Yang Ouyang, Zhiqiang Gao, Shufei Zhang, Sumon Biswas
Preprint