Didi at KDD 2018: Order Dispatch with Reinforcement Learning

专注坚持

2019-06-30


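Plain-language walkthrough

Offline learning

In essence, every moment and every location is discretized into a spatiotemporal grid cell. Dispatch records (including drivers who were available for dispatch but received no order) are then used to compute, for each cell, the expected income from that cell until the end of the day.

Key question: how is the expected income computed?

Dynamic programming approach: suppose the time steps span [0, T). First compute the expected income of every grid cell at time T-1. At that point future income is 0 and only the immediate income counts, so this amounts to averaging the immediate income. Then compute the expected income of every grid cell at time T-2, where each record contributes its immediate income plus the already-computed value of the cell in which it ends; and so on, backward in time.

This yields, for every spatiotemporal grid cell, the expected income until the end of the day.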
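The backward sweep described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the record layout `(t, cell, income, t_next, cell_next)`, the function name `evaluate_values`, and the toy data are all assumptions made for the example. An idle record (driver available, no order) carries zero income and ends in the next time step. Cells that never appear in the records default to a value of 0, which is also an assumption.

```python
from collections import defaultdict


def evaluate_values(transitions, horizon, gamma=1.0):
    """Estimate the value of each (time step, cell) by a backward sweep.

    transitions: iterable of (t, cell, income, t_next, cell_next), t_next > t.
    horizon: number of time steps T in a day; steps are 0 .. T-1.
    gamma: discount per time step (1.0 means no discount).
    """
    by_time = defaultdict(list)
    for t, cell, income, t_next, cell_next in transitions:
        by_time[t].append((cell, income, t_next, cell_next))

    values = {}
    # Sweep from the last step to the first so that the value of every
    # successor state is already known when it is needed.
    for t in range(horizon - 1, -1, -1):
        totals = defaultdict(float)
        counts = defaultdict(int)
        for cell, income, t_next, cell_next in by_time.get(t, []):
            # States at or beyond the horizon, or never observed, are worth 0.
            future = values.get((t_next, cell_next), 0.0)
            totals[cell] += income + gamma ** (t_next - t) * future
            counts[cell] += 1
        for cell, total in totals.items():
            values[(t, cell)] = total / counts[cell]
    return values


# Toy records for a 3-step day with two cells, "A" and "B".
records = [
    (2, "A", 10.0, 3, "B"),  # order worth 10, finishes after the day ends
    (2, "A", 0.0, 3, "A"),   # idle driver: no income
    (2, "B", 4.0, 3, "A"),
    (1, "A", 6.0, 2, "B"),
    (1, "A", 0.0, 2, "A"),
    (0, "B", 8.0, 2, "A"),
    (0, "B", 0.0, 1, "A"),
]
values = evaluate_values(records, horizon=3)
print(values[(0, "B")])
```

Including the idle records matters: without them the average would only cover drivers who did get an order, and the value of a cell would be overstated.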


Key point: why is the value function obtained this way reasonable?

The resultant value function captures spatiotemporal patterns of both the demand side and the supply side. To make it clearer, as a special case, when using no discount and an episode-length of a day, the state-value function in fact corresponds to the expected revenue that this driver will earn on average from the current time until the end of the day.
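Written out, the special case in the quote reads as below. The symbols are notation chosen for this note rather than taken from the paper: $s_t$ is the spatiotemporal state at time step $t$, $r_k$ is the income collected in step $k$, and $T$ is the number of steps in a day.

```latex
V(s_t) = \mathbb{E}\left[\sum_{k=t}^{T-1} r_k \,\middle|\, s_t\right], \qquad \gamma = 1
```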

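Online planning

The match score between an order and a driver is given by a formula (an image in the original post) that behaves as follows:

The higher the price, the higher the match score
The higher the value of the driver's current location, the lower the match score
The higher the value of the future location, the higher the match score
Pickup distance enters implicitly: the longer it is, the later the estimated arrival time, the smaller the discount factor, and the lower the match score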
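One form consistent with the four properties above is score = gamma^dt * V(future) - V(current) + price, where dt is the number of time steps until the order is completed (pickup plus trip). The sketch below is a reconstruction under that assumption, not a transcription of the paper's formula. The function name, the parameter names, and the choice to leave the price undiscounted are illustrative.

```python
def match_score(price, v_current, v_future, steps_to_finish, gamma=0.9):
    """Score a (driver, order) pair; a higher score means a better match.

    price: income from the order (left undiscounted here for simplicity).
    v_current: learned value of the driver's current spatiotemporal cell.
    v_future: learned value of the cell where the driver ends up.
    steps_to_finish: time steps until drop-off, including the pickup leg.
    gamma: discount per time step, in (0, 1].
    """
    if not 0.0 < gamma <= 1.0:
        raise ValueError("gamma must be in (0, 1]")
    if steps_to_finish < 0:
        raise ValueError("steps_to_finish must be non-negative")
    # A longer pickup leg raises steps_to_finish, which shrinks the
    # discount gamma ** steps_to_finish applied to the future value.
    return gamma ** steps_to_finish * v_future - v_current + price


base = match_score(price=20.0, v_current=5.0, v_future=8.0, steps_to_finish=2)
farther_pickup = match_score(price=20.0, v_current=5.0, v_future=8.0, steps_to_finish=4)
print(base > farther_pickup)
```

The pickup-distance effect only shows up when gamma is below 1 and the future value is positive; with gamma equal to 1 the score ignores travel time entirely.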

The assignment is solved with the KM (Kuhn-Munkres) algorithm.
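A compact pure-Python sketch of the KM (Hungarian) algorithm for this step is shown below. It assumes a score matrix with one row per driver and one column per order, with no more drivers than orders, and it matches every driver. How the production system handles unmatched drivers or infeasible pairs is not covered here. The function name `km_max_matching` and the toy scores are made up for illustration.

```python
def km_max_matching(score):
    """Return (driver, order) pairs that maximize the total score.

    score[i][j] is the match score of driver i with order j.
    Requires len(score) <= len(score[0]); every driver gets one order.
    """
    n, m = len(score), len(score[0])
    if n > m:
        raise ValueError("expected no more drivers (rows) than orders (columns)")
    inf = float("inf")
    # Work on negated scores (a minimization); indices are 1-based and
    # index 0 is a sentinel column.
    u = [0.0] * (n + 1)    # row potentials
    v = [0.0] * (m + 1)    # column potentials
    owner = [0] * (m + 1)  # owner[j] = row currently matched to column j
    way = [0] * (m + 1)    # previous column on the augmenting path
    for i in range(1, n + 1):
        owner[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = owner[j0], inf, 0
            for j in range(1, m + 1):
                if used[j]:
                    continue
                cur = -score[i0 - 1][j - 1] - u[i0] - v[j]
                if cur < minv[j]:
                    minv[j], way[j] = cur, j0
                if minv[j] < delta:
                    delta, j1 = minv[j], j
            # Shift potentials so at least one new column becomes tight.
            for j in range(m + 1):
                if used[j]:
                    u[owner[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if owner[j0] == 0:  # reached a free column
                break
        # Flip the matching along the augmenting path.
        while j0 != 0:
            j1 = way[j0]
            owner[j0] = owner[j1]
            j0 = j1
    return sorted((owner[j] - 1, j - 1) for j in range(1, m + 1) if owner[j] != 0)


# Rows are drivers, columns are orders.
scores = [
    [3.0, 1.0, 2.0],
    [2.0, 4.0, 6.0],
    [5.0, 2.0, 1.0],
]
pairs = km_max_matching(scores)
print(pairs, sum(scores[d][o] for d, o in pairs))
```

In this toy matrix a greedy pass that takes the best score for driver 0 first would give driver 0 order 0 and block driver 2 from its best order, which is why a global assignment is solved instead.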

Evaluation

A/B test design

we adopted a customized A/B testing design that splits traffic according to large time slices (three or six hours). For example, a three-hour split sets the first three hours in Day 1 to run variant A and the next three hours for variant B. The order is then reversed for Day 2. Such experiments will last for two weeks to eliminate the daily difference. We select large time slices to observe long-term impacts generated by order dispatch approaches.
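The alternating time-slice assignment described in the quote can be sketched as below. The function name `variant_for`, the 0-based day index (Day 1 in the quote is day 0 here), and the rule that slices keep alternating through the rest of the day are assumptions for illustration; the quote only specifies the first two slices and the day-to-day reversal.

```python
def variant_for(day, hour, slice_hours=3):
    """Return "A" or "B" for a given day index and hour of day.

    day: 0-based day index within the experiment.
    hour: hour of day, 0 .. 23.
    slice_hours: length of one time slice; must divide 24.
    """
    if not 0 <= hour < 24:
        raise ValueError("hour must be in 0..23")
    if slice_hours <= 0 or 24 % slice_hours != 0:
        raise ValueError("slice_hours must be a positive divisor of 24")
    slice_index = hour // slice_hours
    # Alternate variants slice by slice, and flip the pattern every day.
    return "A" if (slice_index + day) % 2 == 0 else "B"


# First two days with three-hour slices, one letter per slice.
for day in range(2):
    print(day, "".join(variant_for(day, h) for h in range(0, 24, 3)))
```

Over a 14-day run every hour of the day is served by each variant on exactly seven days under this rule, which is the balancing effect the two-week duration is meant to provide.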

Observed gains

the performance improvement brought by the MDP method is consistent in all cities, with gains in global GMV and completion rate ranging from 0.5% to 5%. Consistent to the previous discoveries, the MDP method achieved its best performance gain in cities with high order-driver ratios. Meanwhile, the averaged dispatch time was nearly identical to the baseline method, indicating little sacrifice in user experience.

Value function visualization

[Figure: visualization of the learned value function]

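How this is framed as reinforcement learning

The spatiotemporal grid cell is defined as the state; dispatching or not dispatching is defined as the action; the expected income of a state is defined as the state-value function.

The goal of reinforcement learning is to find the optimal policy, which is equivalent to finding the optimal value function. What makes the dispatch setting unusual is that each driver is modeled as an agent, yet the decisions are made by the platform. Drivers therefore have no policy of their own; put differently, the dispatch mechanism unifies every driver's policy into maximizing the platform's expected income. Within the reinforcement learning framework, offline learning and online planning can thus be viewed as the two steps of policy iteration: learning updates the value function, and planning is the policy update. On closer thought, though, the fit is still somewhat forced.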
