I have a simple PySpark snippet:
l = [
    {'userId': 'u1', 'itemId': 'a1', 'click': 1},
    {'userId': 'u1', 'itemId': 'a2', 'click': 0},
    {'userId': 'u2', 'itemId': 'b1', 'click': 1},
    {'userId': 'u2', 'itemId': 'b2', 'click': 1},
]
d = sc.parallelize(l)
Basically, the first user clicked one of two items, while the second user clicked both.
Let's group the events by userId and process each group in a function:
def fun(pair):
    # Python 3 dropped tuple parameters, so unpack the (key, values) pair here
    user_id, events = pair
    events = list(events)
    user_id = events[0]['userId']
    clicked = set()
    not_clicked = set()
    for event in events:
        item_id = event['itemId']
        if event['click'] == 1:
            clicked.add(item_id)
        else:
            not_clicked.add(item_id)
    ret = {'userId': user_id, 'click': 1}
    for item_id in clicked:
        ret['itemId'] = item_id
        yield ret
    ret['click'] = 0
    for item_id in not_clicked:
        ret['itemId'] = item_id
        yield ret

d1 = d\
    .map(lambda obj: (obj['userId'], obj))\
    .groupByKey()\
    .flatMap(fun)
d1.collect()
This is what I get:
[{'click': 1, 'itemId': 'a1', 'userId': 'u1'},
{'click': 0, 'itemId': 'a2', 'userId': 'u1'},
{'click': 1, 'itemId': 'b1', 'userId': 'u2'},
{'click': 0, 'itemId': 'b2', 'userId': 'u2'}]
The result for user u2 is incorrect.
Can someone explain why this happens and what the best practice is to prevent it?
Thanks.
Answer 0 (score: 2)
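What you see has very little to do with the Spark evaluation model. Your code is simply faulty, which is easy to see when you execute it locally. With key set to 'u2' and values set to the two u2 events, list(fun((key, values))) returns:
[{'click': 0, 'itemId': 'b2', 'userId': 'u2'},
 {'click': 0, 'itemId': 'b2', 'userId': 'u2'}]
As you can see, this makes even less sense than what you get from Spark. The problem is that you shouldn't use mutable data this way. Because you keep modifying the same dict, every yield returns exactly the same object:
(d1, d2) = list(fun((key, values)))
d1 is d2
True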
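The aliasing can be reproduced without Spark. The following is a minimal sketch rather than the original poster's code: shared_dict_gen is a hypothetical stand-in for fun that mutates and re-yields a single dict, and the events are u2's two rows from the question.

```python
def shared_dict_gen(events):
    """Hypothetical stand-in for fun: mutates and re-yields one dict."""
    ret = {'userId': events[0]['userId'], 'click': 1}
    for event in events:
        ret['itemId'] = event['itemId']
        yield ret  # the same object every time


events = [
    {'userId': 'u2', 'itemId': 'b1', 'click': 1},
    {'userId': 'u2', 'itemId': 'b2', 'click': 1},
]

# list() drains the generator first, so both entries alias one dict
first, second = list(shared_dict_gen(events))
same_object = first is second

# Copying each item as soon as it is yielded freezes its state at that moment
snapshots = [dict(item) for item in shared_dict_gen(events)]
snapshot_ids = [s['itemId'] for s in snapshots]

print(same_object)   # True
print(snapshot_ids)  # ['b1', 'b2']
```

Draining the generator before looking at the items exposes only the final state of the shared dict, whereas copying each item as it arrives preserves the intermediate states. That is why the result depends on when the consumer reads each item.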
I believe the discrepancy compared to Spark is related to batched serialization: the first item is serialized in a different batch, before the function exits, so the effective order is more or less like this:
import pickle
from itertools import islice, chain

gen = fun((key, values))

# The first batch is serialized
b1 = [pickle.dumps(x) for x in list(islice(gen, 0, 1))]

# The window is adjusted and the second batch is serialized.
# fun exits with StopIteration when we try to take the second
# element of the batch, so the code proceeds to ret['click'] = 0
# before anything in this batch is pickled
b2 = [
    pickle.dumps(x) for x in
    # Use list to eagerly take a whole batch before pickling
    list(islice(gen, 0, 2))
]

[pickle.loads(x) for x in chain(*[b1, b2])]
which gives:
[{'click': 1, 'itemId': 'b1', 'userId': 'u2'},
 {'click': 0, 'itemId': 'b2', 'userId': 'u2'}]
If you want definitive confirmation, though, you will have to check it yourself (replace the batched serializer with one that waits for all the data).
How to solve it? Simply don't reuse the same dictionary. Instead, create a new one inside the loop:
for item_id in clicked:
    yield {'userId': user_id, 'click': 1, 'itemId': item_id}
for item_id in not_clicked:
    yield {'userId': user_id, 'click': 0, 'itemId': item_id}
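Put together, a complete version of the function might look like the sketch below. It assumes Python 3, so the (key, values) pair is unpacked inside the body. The name fun_fixed is hypothetical, and sorted() is used only to make the example's output order deterministic.

```python
def fun_fixed(pair):
    """Sketch of fun that yields a fresh dict per item (hypothetical name)."""
    user_id, events = pair
    clicked = set()
    not_clicked = set()
    for event in events:
        if event['click'] == 1:
            clicked.add(event['itemId'])
        else:
            not_clicked.add(event['itemId'])
    # sorted() is only here to make the example's output order stable
    for item_id in sorted(clicked):
        yield {'userId': user_id, 'click': 1, 'itemId': item_id}
    for item_id in sorted(not_clicked):
        yield {'userId': user_id, 'click': 0, 'itemId': item_id}


key = 'u2'
values = [
    {'userId': 'u2', 'itemId': 'b1', 'click': 1},
    {'userId': 'u2', 'itemId': 'b2', 'click': 1},
]
result = list(fun_fixed((key, values)))
print(result)
print(result[0] is result[1])  # False: each item is its own dict
```

In the Spark pipeline, this function would be passed to flatMap in place of fun. Because every yielded dict is a separate object, the result no longer depends on when each item is serialized.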