Question

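I want to implement an article recommender using Q-learning in Python. Our dataset has, for example, four categories of articles (health, sports, news, and lifestyle) with 10 articles in each category (40 articles in total). The idea is to show the user some random articles (e.g., five articles, which can come from any category) and receive his/her feedback. The agent then learns the user's preference (i.e., the category of the articles) and recommends some relevant articles again.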

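To formulate this as an RL problem, I know I should define the actions, the states, and the reward function. After reading some articles, I came up with the following: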

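Action: recommending an article;

State: I am not quite clear about this, but from what I learned from other articles, the state could be:

a) a trace of the articles the user has recently read; b) the user's interests (not sure how this can be a state)

Reward: a very simple reward. It can be +1 if the user reads the recommended article, or -1 for a useless recommendation.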

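For the Q-learning part, I am not sure how to build the Q-table, which contains the states as rows and the actions as columns.

For other simple RL problems such as MountainCar, developing the Q-table is not that difficult, but here the state is not clearly defined, which confuses me.

I would really appreciate it if you could help me with a way to formulate this as an RL problem, along with a few lines of code to show me how to start coding it.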

Answer 1

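If you are not sure about the state, you can use a multi-armed bandit algorithm, in which you simply take an action and receive a reward, with no state involved.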
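As a rough illustration of the stateless case, here is a minimal epsilon-greedy bandit sketch in which each arm is a category rather than a single article. The simulated user (`simulated_feedback`, who reads sports articles 80% of the time and anything else 10% of the time), the step count, and the value of `epsilon` are all assumptions made up for this example; only the +1/-1 reward comes from the question.

```python
import random

CATEGORIES = ["health", "sports", "news", "lifestyle"]

def simulated_feedback(category, rng):
    # Hypothetical user (an assumption): reads sports 80% of the time, others 10%.
    p_read = 0.8 if category == "sports" else 0.1
    return 1 if rng.random() < p_read else -1  # +1 if read, -1 otherwise

def run_bandit(steps=2000, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    counts = {c: 0 for c in CATEGORIES}
    values = {c: 0.0 for c in CATEGORIES}  # running mean reward per category
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.choice(CATEGORIES)           # explore
        else:
            arm = max(CATEGORIES, key=values.get)  # exploit the best estimate
        reward = simulated_feedback(arm, rng)
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
    return values, counts

values, counts = run_bandit()
best = max(values, key=values.get)
print(best, values)
```

To recommend a concrete article, you could pick any unread article from the chosen category, or use one arm per article (40 arms) at the cost of slower learning.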

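If you want to use the trace of articles the user has recently read, you can use a contextual bandit algorithm, which takes the state into account. Since each episode is only one step long, this is more of a contextual bandit problem than a reinforcement learning problem.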
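A contextual bandit can be sketched with a small table indexed by context and action, which is also one way to picture the Q-table from the question (contexts as rows, actions as columns). In this sketch the context is only the category of the last article read, and the simulated user is assumed to prefer another article from that same category; both choices, along with `alpha` and `epsilon`, are illustrative assumptions.

```python
import random

CATEGORIES = ["health", "sports", "news", "lifestyle"]
N = len(CATEGORIES)

def simulated_feedback(context, action, rng):
    # Hypothetical user (an assumption): most likely to read another
    # article from the category they read last.
    p_read = 0.9 if action == context else 0.1
    return 1 if rng.random() < p_read else -1

def train_contextual_bandit(steps=5000, epsilon=0.1, alpha=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0] * N for _ in range(N)]  # rows: contexts, columns: actions
    for _ in range(steps):
        context = rng.randrange(N)     # category of the last article read
        if rng.random() < epsilon:
            action = rng.randrange(N)  # explore
        else:
            action = max(range(N), key=lambda a: q[context][a])  # exploit
        reward = simulated_feedback(context, action, rng)
        # One-step episode: no bootstrapping term, just move toward the reward.
        q[context][action] += alpha * (reward - q[context][action])
    return q

q_table = train_contextual_bandit()
greedy = [max(range(N), key=lambda a: q_table[c][a]) for c in range(N)]
print([CATEGORIES[a] for a in greedy])
```

A longer history (for example the last five categories) fits the same scheme by using a tuple as the row key, but the table grows quickly, which is where a function approximator becomes attractive.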

Nevertheless, you can train with something like this:

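from collections import deque

# env and policy are placeholders that you have to implement yourself;
# env is assumed to follow the older Gym-style reset()/step() interface
state = env.reset()
# state_buffer holds the history of the user: the five most recent states observed
# here a state can be a vector of size four with an entry set to one
# for the category of the article the user read at that time step
# if the previous article read by the user is health then [1,0,0,0]
# if the previous article read by the user is sports then [0,1,0,0]
# the action is which article to show next
state_buffer = deque([state], maxlen=5)

# run for 1000 episodes (each episode is a single recommendation step)
for i in range(1000):
    history = list(state_buffer)
    action = policy.select_action(history)
    next_state, reward, done, info = env.step(action)
    state_buffer.append(next_state)  # the deque drops the oldest state automatically
    # the argument list below is a suggestion; the update needs the full transition
    policy.update(history, action, reward, list(state_buffer), done)
# NOTE: you will have to implement the policy class, which has a function approximator (neural network)
# policy.update() does the Q-learning update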
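To make the loop above concrete, here is a self-contained toy version that can be run as is. It is only a sketch under several assumptions: `ToyArticleEnv` and `TabularQPolicy` are hypothetical classes written for this example, the action is a category index rather than one of the 40 articles, the simulated user has a fixed favorite category, the history is shortened to two states, and a lookup table is swapped in for the neural network mentioned in the note. The `update` signature, which takes the full transition, is also an assumption.

```python
import random
from collections import defaultdict, deque

N_CATEGORIES = 4  # health, sports, news, lifestyle

class ToyArticleEnv:
    """Hypothetical Gym-style environment; the action is the category to recommend."""
    def __init__(self, favorite=1, seed=0):
        self.favorite = favorite  # assumed hidden user preference (1 = sports)
        self.rng = random.Random(seed)

    def reset(self):
        return (0,) * N_CATEGORIES  # nothing read yet

    def step(self, action):
        p_read = 0.8 if action == self.favorite else 0.1
        read = self.rng.random() < p_read
        # One-hot of the category read at this step, all zeros if nothing was read.
        next_state = tuple(int(read and i == action) for i in range(N_CATEGORIES))
        reward = 1 if read else -1
        return next_state, reward, False, {}

class TabularQPolicy:
    """Epsilon-greedy Q-learning with a lookup table instead of a neural network."""
    def __init__(self, n_actions, alpha=0.1, gamma=0.5, epsilon=0.1, seed=0):
        self.n_actions = n_actions
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.rng = random.Random(seed)
        self.q = defaultdict(lambda: [0.0] * n_actions)  # history -> action values

    def select_action(self, history):
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(self.n_actions)  # explore
        values = self.q[tuple(history)]
        return max(range(self.n_actions), key=values.__getitem__)  # exploit

    def update(self, history, action, reward, next_history, done):
        # Q-learning target: r + gamma * max_a' Q(s', a'), without bootstrap if done.
        target = reward
        if not done:
            target += self.gamma * max(self.q[tuple(next_history)])
        values = self.q[tuple(history)]
        values[action] += self.alpha * (target - values[action])

env = ToyArticleEnv()
policy = TabularQPolicy(N_CATEGORIES)
state_buffer = deque([env.reset()], maxlen=2)  # short history keeps the table small
total_reward = 0
for _ in range(5000):
    history = tuple(state_buffer)
    action = policy.select_action(history)
    next_state, reward, done, info = env.step(action)
    state_buffer.append(next_state)
    policy.update(history, action, reward, tuple(state_buffer), done)
    total_reward += reward
print(len(policy.q), total_reward)
```

With real users you would replace `ToyArticleEnv.step` with the actual feedback, and once the history gets longer or richer the table can be replaced by a function approximator with the same `select_action` and `update` methods.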

Also, you can go through this blog.

How to model an article recommender as a Q-learning problem in Python
