
Paper Speed-Reading Log | 2025.11

news 2026/7/12 10:03:01

CIC: Contrastive Intrinsic Control for Unsupervised Skill Discovery
auto-curriculum learning (Jiang et al., 2021b)
Meta-Motivo (Tirinzoni et al., 2025), zero-shot goal-conditioned RL
Unsupervised Skill Discovery via Recurrent Skill Training
Learning to Discover Skills through Guidance
One After Another: Learning Incremental Skills for a Changing World
Direct then Diffuse: Incremental Unsupervised Skill Discovery for State Covering and Goal Reaching
Horizon Generalization in Reinforcement Learning
HIQL: Offline Goal-Conditioned RL with Latent States as Actions
Contrastive Preference Learning: Learning from Human Feedback without RL
Controlled Diversity with Preference: Towards Learning a Diverse Set of Desired Skills
Human-Aligned Skill Discovery Balancing Behaviour Exploration and Alignment
Few is More: Task-Efficient Skill-Discovery for Multi-Task Offline Multi-Agent Reinforcement Learning
SMAC-R1: The Emergence of Intelligence in Decision-Making Tasks
Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables
VariBAD: A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning
Rethinking Reward Modeling in Preference-based Large Language Model Alignment
DOPL: Direct Online Preference Learning for Restless Bandits with Preference Feedback
Fewer May Be Better: Enhancing Offline Reinforcement Learning with Reduced Dataset
Data Center Cooling System Optimization Using Offline Reinforcement Learning
SpikeLLM: Scaling up Spiking Neural Network to Large Language Models via Saliency-based Spiking
Rethinking Inverse Reinforcement Learning: from Data Alignment to Task Alignment
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
Why Distillation can Outperform Zero-RL: The Role of Flexible Reasoning
Thinkless: LLM Learns When to Think
Learning to Reason without External Rewards

CIC: Contrastive Intrinsic Control for Unsupervised Skill Discovery

Source: it came to mind and I want to take a look.
arxiv: https://arxiv.org/abs/2202.00161

auto-curriculum learning (Jiang et al., 2021b)

Source: RSD. It seems to support automatic curriculum learning, which may be inspiring.

Meta-Motivo (Tirinzoni et al., 2025), zero-shot goal-conditioned RL

Source: RGSD. It may include a skill library, so I want to look at it too. A quick skim is enough.

Unsupervised Skill Discovery via Recurrent Skill Training

Learning to Discover Skills through Guidance

One After Another: Learning Incremental Skills for a Changing World

Direct then Diffuse: Incremental Unsupervised Skill Discovery for State Covering and Goal Reaching

Horizon Generalization in Reinforcement Learning

arxiv: https://arxiv.org/abs/2501.02709
website: https://horizon-generalization.github.io/
Source: new work by Benjamin Eysenbach, an arxiv paper; a classmate said it is interesting.
Main content:

HIQL: Offline Goal-Conditioned RL with Latent States as Actions

arxiv: https://arxiv.org/abs/2307.11949
website: https://seohong.me/projects/hiql/
Source: recommended by a collaborator; it also seems to be published by Benjamin Eysenbach.

Contrastive Preference Learning: Learning from Human Feedback without RL

arxiv: https://arxiv.org/abs/2310.13639
GitHub: https://github.com/jhejna/cpl
Source: found by accident while searching; ICLR 2024. I think I have read it before.
Main content:

Controlled Diversity with Preference: Towards Learning a Diverse Set of Desired Skills

arxiv: https://arxiv.org/abs/2303.04592
Source: [mask]

Human-Aligned Skill Discovery Balancing Behaviour Exploration and Alignment

arxiv: https://arxiv.org/abs/2501.17431
Source: [mask]

Few is More: Task-Efficient Skill-Discovery for Multi-Task Offline Multi-Agent Reinforcement Learning

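arxiv: https://arxiv.org/abs/2502.08985
Source: a classmate's latest work.
Main content:
- The setting this paper studies is offline multi-task MARL. In particular, the agents are trained only on scenarios with (say) three cooperating agents and can then generalize to scenarios with any number of cooperating agents. The story my classmate tells is that a transformer acts as a translator, turning the cooperative actions of three agents into those of many agents. This sounds like a very good story.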

SMAC-R1: The Emergence of Intelligence in Decision-Making Tasks

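arxiv: https://arxiv.org/abs/2410.16024
Source: seen on Zhihu, though the Zhihu post no longer seems to be findable.
Main content:
- Uses an LLM to generate Python decision-tree code that plays SMAC.
- Specific method: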

Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables

arxiv: https://arxiv.org/abs/1903.08254
Source: [mask]
Main content:
- This paper proposes the PEARL method.

VariBAD: A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning

arxiv: https://arxiv.org/abs/1910.08348
Source: [mask]
Main content:
- This paper proposes the VariBAD method.

Rethinking Reward Modeling in Preference-based Large Language Model Alignment

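arxiv: https://arxiv.org/abs/2411.04991
OpenReview: https://openreview.net/forum?id=rfdblE10qm
Source: ICLR 2025 oral.
Main content:
- This paper focuses on RLHF for LLMs. Reportedly, instead of using the Bradley-Terry model to build the reward model, it directly trains a classifier to learn whether an (x, y) pair is good or bad, then uses the classifier's probability logit as the RLHF reward.
- Does it use unpaired comparisons \((x_1, y_1^+, x_2, y_2^-)\) rather than shuffling the paired comparisons \((x, y^+, y^-)\) (?)
- Are the experiments too toy (?) What does the theory roughly say (?)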
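
To make the contrast in the first bullet concrete, here is a minimal sketch of the two training losses, assuming scalar reward scores and logits. It is not taken from the paper: the function names (`bradley_terry_loss`, `classification_loss`) and the example numbers are hypothetical and only illustrate the general idea of a pairwise Bradley-Terry objective versus a per-sample good/bad classifier.

```python
import math


def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def bradley_terry_loss(r_pos: float, r_neg: float) -> float:
    """Pairwise loss: -log sigmoid(r_pos - r_neg).

    Needs two scores that are compared against each other;
    only their difference matters.
    """
    return softplus(-(r_pos - r_neg))


def classification_loss(logit: float, label: int) -> float:
    """Per-sample binary cross-entropy on one (x, y) pair.

    label = 1 means "good", label = 0 means "bad". No partner
    response is required, so the sample can come from any prompt.
    """
    return softplus(-logit) if label == 1 else softplus(logit)


# Hypothetical scores for a preferred and a rejected response.
r_good, r_bad = 2.0, 0.5

# Bradley-Terry: one loss term per comparison.
bt = bradley_terry_loss(r_good, r_bad)

# Classifier: one loss term per labeled response; the logit of the
# trained classifier would then serve as the RLHF reward.
clf = classification_loss(r_good, 1) + classification_loss(r_bad, 0)

print(f"Bradley-Terry loss: {bt:.4f}")
print(f"Classification loss: {clf:.4f}")
```

One difference the sketch makes visible: the Bradley-Terry loss is unchanged if the same constant is added to both scores, whereas the classification loss depends on each logit's absolute value. Because the classifier scores each response on its own, it can in principle be trained on responses to different prompts, which is what the second bullet asks about.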

DOPL: Direct Online Preference Learning for Restless Bandits with Preference Feedback

arxiv: https://arxiv.org/abs/2410.05527
OpenReview: https://openreview.net/forum?id=2iYVBqRHK4
Source: recommended by a collaborator.
Main content:
- preference-based index policy (?)

Fewer May Be Better: Enhancing Offline Reinforcement Learning with Reduced Dataset

Source: a paper by a senior labmate.

Data Center Cooling System Optimization Using Offline Reinforcement Learning

arxiv: https://arxiv.org/pdf/2501.15085
Source: a new paper from Xianyuan Zhan's group.
Main content:
- T-symmetry.

SpikeLLM: Scaling up Spiking Neural Network to Large Language Models via Saliency-based Spiking

arxiv: https://arxiv.org/abs/2407.04752
Source: a mysterious paper recommended by a senior labmate; ICLR 2025 poster.

Rethinking Inverse Reinforcement Learning: from Data Alignment to Task Alignment

arxiv: https://arxiv.org/abs/2410.23680
Source: a paper I came across by chance.

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Source: mentioned in passing by a senior labmate; a paper by other people in the department.

Why Distillation can Outperform Zero-RL: The Role of Flexible Reasoning

arxiv: https://arxiv.org/abs/2505.21067
Source: a paper I came across by chance.

Thinkless: LLM Learns When to Think

arxiv: https://arxiv.org/abs/2505.13379
Source: a paper I came across by chance.

Learning to Reason without External Rewards

arxiv: https://arxiv.org/abs/2505.19590
Source: a paper I came across by chance.

