Paper Speed-Reading Notes | 2026.05

news 2026/5/1 18:27:08

On Variational Bounds of Mutual Information
On the Role of Iterative Computation in Reinforcement Learning
WileReward: Learning Reward Models from In-the-Wild Human Interactions
Can We Really Learn One Representation to Optimize All Rewards?
The Magic Correlations: Understanding Knowledge Transfer from Pretraining to Supervised Fine-Tuning
Improving Interactive In-Context Learning from Natural Language Feedback
Learning to Learn with Contrastive Meta-Objective
Multi-Type Preference Learning: Empowering Preference-Based Reinforcement Learning with Equal Preferences
MetaCURE: Meta Reinforcement Learning with Empowerment-Driven Exploration
Absolute Zero: Reinforced Self-play Reasoning with Zero Data
CIC: Contrastive Intrinsic Control for Unsupervised Skill Discovery
auto-curriculum learning (Jiang et al., 2021b)
Meta-Motivo (Tirinzoni et al., 2025), zero-shot goal-conditioned RL
Unsupervised Skill Discovery via Recurrent Skill Training
Learning to Discover Skills through Guidance
One After Another: Learning Incremental Skills for a Changing World
Direct then Diffuse: Incremental Unsupervised Skill Discovery for State Covering and Goal Reaching
Horizon Generalization in Reinforcement Learning
HIQL: Offline Goal-Conditioned RL with Latent States as Actions
Contrastive Preference Learning: Learning from Human Feedback without RL
Few is More: Task-Efficient Skill-Discovery for Multi-Task Offline Multi-Agent Reinforcement Learning
Rethinking Reward Modeling in Preference-based Large Language Model Alignment
DOPL: Direct Online Preference Learning for Restless Bandits with Preference Feedback
Fewer May Be Better: Enhancing Offline Reinforcement Learning with Reduced Dataset
Data Center Cooling System Optimization Using Offline Reinforcement Learning
SpikeLLM: Scaling up Spiking Neural Network to Large Language Models via Saliency-based Spiking
Rethinking Inverse Reinforcement Learning: from Data Alignment to Task Alignment
Why Distillation can Outperform Zero-RL: The Role of Flexible Reasoning
Thinkless: LLM Learns When to Think
Learning to Reason without External Rewards

On Variational Bounds of Mutual Information

Source: recommended by a collaborator; it covers the various bounds on mutual information. ICML 2019.
arXiv: https://arxiv.org/abs/1905.06922
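
For orientation, one well-known member of this family is the InfoNCE lower bound, which scores K paired samples with a critic f and can never exceed log K. Below is a minimal sketch in plain Python, assuming a hypothetical score matrix `scores[i][j] = f(x_i, y_j)` whose diagonal holds the true pairs. The function names are illustrative and are not taken from the paper's code.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def infonce_lower_bound(scores):
    """InfoNCE estimate from a K x K critic score matrix.

    scores[i][j] is assumed to be f(x_i, y_j); the diagonal entries
    are the scores of the jointly sampled (paired) examples.
    """
    k = len(scores)
    total = 0.0
    for i, row in enumerate(scores):
        # log( e^{f(x_i, y_i)} / ((1/K) * sum_j e^{f(x_i, y_j)}) )
        total += row[i] - (log_sum_exp(row) - math.log(k))
    return total / k


# A critic that cannot tell pairs apart gives an estimate of zero,
# while a very confident critic approaches the log K ceiling.
uninformative = infonce_lower_bound([[0.0, 0.0], [0.0, 0.0]])
confident = infonce_lower_bound([[50.0, 0.0], [0.0, 50.0]])
```

The log K ceiling is why this estimator needs large batches when the true mutual information is high.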

On the Role of Iterative Computation in Reinforcement Learning

Source: a new paper from Eysenbach, related to RL. The abstract seems appealing; I probably haven't read this kind of paper before, so I'm a bit curious.
arXiv: https://arxiv.org/abs/2602.05999

WileReward: Learning Reward Models from In-the-Wild Human Interactions

Source: the latest work from an expert in the area; it extracts information from human trajectories (?) to train a reward model.
arXiv: https://arxiv.org/abs/2602.08829

Can We Really Learn One Representation to Optimize All Rewards?

Source: a new paper from Eysenbach and Chongyi Zheng. I can't quite follow it, but I'm a bit curious, so I'll give it a quick skim.
arXiv: https://arxiv.org/abs/2602.11399

The Magic Correlations: Understanding Knowledge Transfer from Pretraining to Supervised Fine-Tuning

Source: Google Scholar email alert. It seems to study what SFT brings to a pretrained LLM; worth a quick skim.
arXiv: https://arxiv.org/abs/2602.11217

Improving Interactive In-Context Learning from Natural Language Feedback

Source: Google Scholar email alert. It seems to study how to convert static data into multi-turn data for training LLMs; worth a quick skim.
arXiv: https://arxiv.org/abs/2602.16066

Learning to Learn with Contrastive Meta-Objective

Source: came across it by chance. NeurIPS 2025 oral.
arXiv: https://arxiv.org/abs/2410.05975

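(Not read yet. This paper looks fairly classical: it works on traditional ML rather than LLMs.)
(Can this be applied to LLMs? These days, whenever I see something, I wonder whether it can be used on LLMs.)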

Multi-Type Preference Learning: Empowering Preference-Based Reinforcement Learning with Equal Preferences

Source: found by chance while searching. ICRA 2025.
arXiv: https://arxiv.org/abs/2409.07268
GitHub: https://github.com/FeiCuiLengMMbb/paper_MTPL
Curious whether it is simply multi-type + PbRL.

MetaCURE: Meta Reinforcement Learning with Empowerment-Driven Exploration

arXiv: https://arxiv.org/abs/2006.08170
Source: a skill + meta-RL paper that a collaborator found interesting. ICML 2021.

Absolute Zero: Reinforced Self-play Reasoning with Zero Data

arXiv: https://arxiv.org/abs/2505.03335
Source: a NeurIPS 2025 spotlight paper by yue yang, first author of a NeurIPS 2025 best paper. The title caught my attention; purely out of curiosity, I'd like to read it.

CIC: Contrastive Intrinsic Control for Unsupervised Skill Discovery

arXiv: https://arxiv.org/abs/2202.00161
Source: it came to mind and I want to take a look.

auto-curriculum learning (Jiang et al., 2021b)

Source: RSD. It seems to enable automatic curriculum learning, which might be inspiring.

Meta-Motivo (Tirinzoni et al., 2025), zero-shot goal-conditioned RL

Source: RGSD. It may include a skill library, so I'd like to look at this too. A quick skim is enough.

Unsupervised Skill Discovery via Recurrent Skill Training

Source: prior work on skill discovery recommended by a collaborator.

Learning to Discover Skills through Guidance

Source: same as above.

One After Another: Learning Incremental Skills for a Changing World

Source: same as above.

Direct then Diffuse: Incremental Unsupervised Skill Discovery for State Covering and Goal Reaching

Source: same as above.

Horizon Generalization in Reinforcement Learning

arXiv: https://arxiv.org/abs/2501.02709
website: https://horizon-generalization.github.io/
Source: a new arXiv paper from Benjamin Eysenbach; a classmate said it's interesting.

HIQL: Offline Goal-Conditioned RL with Latent States as Actions

arXiv: https://arxiv.org/abs/2307.11949
website: https://seohong.me/projects/hiql/
Source: recommended by a collaborator; Benjamin Eysenbach also appears to be among the authors.

Contrastive Preference Learning: Learning from Human Feedback without RL

arXiv: https://arxiv.org/abs/2310.13639
GitHub: https://github.com/jhejna/cpl
Source: found by chance while searching, ICLR 2024; I think I've read it before.
Main content:

Few is More: Task-Efficient Skill-Discovery for Multi-Task Offline Multi-Agent Reinforcement Learning

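arXiv: https://arxiv.org/abs/2502.08985
Source: a classmate's latest work.
Main content:
- The setting is offline multi-task MARL. In particular, the agents are trained only on scenarios where (say) three agents cooperate, and can then generalize to scenarios with any number of cooperating agents. The story my classmate tells is that a transformer acts as a translator, translating the cooperative actions of three agents into those of many agents; I find this story very compelling.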

Rethinking Reward Modeling in Preference-based Large Language Model Alignment

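arXiv: https://arxiv.org/abs/2411.04991
OpenReview: https://openreview.net/forum?id=rfdblE10qm
Source: ICLR 2025 oral.
Main content:
- This paper focuses on RLHF for LLMs. Reportedly, instead of using the Bradley-Terry model for the reward model, it directly trains a classifier to learn whether an (x, y) pair is good or bad, and then uses the classifier's logit as the RLHF reward.
- Does it use unpaired comparisons \((x_1, y_1^+, x_2, y_2^-)\), rather than shuffling the paired comparisons \((x, y^+, y^-)\) (?)
- Are the experiments too toy (?) What does the theory roughly say (?)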
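
To make the contrast in the first bullet concrete, here is a minimal sketch of the two objectives. It is only an illustration of the general idea as described in these notes, not the paper's implementation. The function names and the scalar inputs `r_chosen` and `r_rejected` (reward-model outputs for a preferred and a dispreferred response) are hypothetical.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def bradley_terry_loss(r_chosen, r_rejected):
    """Pairwise loss: depends only on the difference of the two scores."""
    return -log_sigmoid(r_chosen - r_rejected)


def pointwise_loss(logit, label):
    """Binary cross-entropy on a single (x, y) with label 1 = good, 0 = bad."""
    return -log_sigmoid(logit) if label == 1 else -log_sigmoid(-logit)


def classification_loss(r_chosen, r_rejected):
    """Classifier view: each response is scored on its own (assumed setup)."""
    return pointwise_loss(r_chosen, 1) + pointwise_loss(r_rejected, 0)


# Shifting both scores by a constant leaves the Bradley-Terry loss unchanged
# but changes the classification loss, since the latter pins down an absolute scale.
bt_small, bt_shifted = bradley_terry_loss(2.0, 1.0), bradley_terry_loss(102.0, 101.0)
clf_small, clf_shifted = classification_loss(2.0, 1.0), classification_loss(102.0, 101.0)
```

Because each pointwise term involves a single (x, y), the classification loss can in principle be computed on responses to different prompts, which is the situation the second bullet asks about. Whether the paper actually trains that way is left open here.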