Edit: I rewrote my whole question to try to be clearer, and added simplified, tested code to show exactly what I am doing.

Hello,

I would like to know if there is an efficient way in PySpark to extract data from one DataFrame based on another one: filtering on a column, counting distinct occurrences, and doing all of it within time windows.

Below I go through a detailed example of everything I do, but when facing datasets with millions of rows and thousands of distinct users, almost every step takes hours or days to complete.

I am sure I am going about this the wrong way, since I am just starting out, so it would really help if someone could answer this.

Thanks
Creating the DataFrame for the example:
schema = ['DateTimeStart','DateTimeEnd','User','DateTime']
df = sc.parallelize([
['2017-01-12 03:35:001', '2017-01-12 03:37:001','A', '2017-01-12 03:35:000'],
['2017-01-12 03:35:111', '2017-01-12 03:37:111','B', '2017-01-12 03:35:110'],
['2017-01-12 03:35:221', '2017-01-12 03:37:221','C', '2017-01-12 03:35:220'],
['2017-01-12 03:35:431', '2017-01-12 03:37:431','D', '2017-01-12 03:35:430'],
['2017-01-12 03:36:434', '2017-01-12 03:38:434','D', '2017-01-12 03:36:433'],
['2017-01-12 03:36:441', '2017-01-12 03:38:441','D', '2017-01-12 03:36:440'],
['2017-01-12 03:36:451', '2017-01-12 03:38:451','E', '2017-01-12 03:36:450'],
['2017-01-12 03:37:681', '2017-01-12 03:39:681','B', '2017-01-12 03:37:680'],
['2017-01-12 03:37:789', '2017-01-12 03:39:789','B', '2017-01-12 03:37:788'],
['2017-01-12 03:37:793', '2017-01-12 03:39:793','E', '2017-01-12 03:37:792'],
['2017-01-12 03:38:798', '2017-01-12 03:40:798','C', '2017-01-12 03:38:797'],
['2017-01-12 03:38:986', '2017-01-12 03:40:986','D', '2017-01-12 03:38:985'],
['2017-01-12 03:39:011', '2017-02-12 03:41:011','K', '2017-01-12 03:39:010'],
['2017-01-12 03:39:021', '2017-02-12 03:41:021','A', '2017-01-12 03:39:020'],
['2017-01-12 03:39:031', '2017-02-12 03:41:031','P', '2017-01-12 03:39:030'],
]).toDF(schema)
df.show()
+--------------------+--------------------+----+--------------------+
| DateTimeStart| DateTimeEnd|User| DateTime|
+--------------------+--------------------+----+--------------------+
|2017-01-12 03:35:001|2017-01-12 03:37:001| A|2017-01-12 03:35:000|
|2017-01-12 03:35:111|2017-01-12 03:37:111| B|2017-01-12 03:35:110|
|2017-01-12 03:35:221|2017-01-12 03:37:221| C|2017-01-12 03:35:220|
|2017-01-12 03:35:431|2017-01-12 03:37:431| D|2017-01-12 03:35:430|
|2017-01-12 03:36:434|2017-01-12 03:38:434| D|2017-01-12 03:36:433|
|2017-01-12 03:36:441|2017-01-12 03:38:441| D|2017-01-12 03:36:440|
|2017-01-12 03:36:451|2017-01-12 03:38:451| E|2017-01-12 03:36:450|
|2017-01-12 03:37:681|2017-01-12 03:39:681| B|2017-01-12 03:37:680|
|2017-01-12 03:37:789|2017-01-12 03:39:789| B|2017-01-12 03:37:788|
|2017-01-12 03:37:793|2017-01-12 03:39:793| E|2017-01-12 03:37:792|
|2017-01-12 03:38:798|2017-01-12 03:40:798| C|2017-01-12 03:38:797|
|2017-01-12 03:38:986|2017-01-12 03:40:986| D|2017-01-12 03:38:985|
|2017-01-12 03:39:011|2017-02-12 03:41:011| K|2017-01-12 03:39:010|
|2017-01-12 03:39:021|2017-02-12 03:41:021| A|2017-01-12 03:39:020|
|2017-01-12 03:39:031|2017-02-12 03:41:031| P|2017-01-12 03:39:030|
+--------------------+--------------------+----+--------------------+
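To make the goal concrete: for each row's [DateTimeStart, DateTimeEnd] window, I want the number of events of every user whose DateTime falls inside that window. A minimal pure-Python sketch of that computation (no Spark; shortened toy timestamps and the helper name are my own):

```python
from collections import Counter

# Toy rows mimicking the DataFrame above: (DateTimeStart, DateTimeEnd, User, DateTime).
rows = [
    ("03:35:001", "03:37:001", "A", "03:35:000"),
    ("03:35:111", "03:37:111", "B", "03:35:110"),
    ("03:36:434", "03:38:434", "D", "03:36:433"),
    ("03:36:441", "03:38:441", "D", "03:36:440"),
]

def counts_in_window(start, end, rows):
    """Per-user count of the events whose DateTime falls inside [start, end]."""
    return Counter(user for _, _, user, t in rows if start <= t <= end)

# A's window catches B's event and both of D's events, but not A's own
# event at 03:35:000, which is just before the window opens.
print(counts_in_window("03:35:001", "03:37:001", rows))
```

This is the per-window result I am trying to compute at scale with Spark below.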
First, I create a list of time-window DataFrames for each user:
Usernames = df.select(['User']).distinct().collect()
User_list = [str(Usernames[i]).split("'")[1] for i in range(len(Usernames))]
dfs_Users_windows = [df.filter(df.User == user).select('DateTimeStart','DateTimeEnd','User').collect() for user in User_list]
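The split("'") indexing used here and below leans on the textual representation of a collected Row. A stand-in sketch (using a namedtuple to imitate a Row's repr, which is an assumption on my part) shows which pieces those indices pick out:

```python
from collections import namedtuple

# Stand-in for a collected pyspark Row; its repr quotes string fields the same way.
Row = namedtuple("Row", ["DateTimeStart", "DateTimeEnd", "User"])
r = Row("2017-01-12 03:35:001", "2017-01-12 03:37:001", "A")

# Splitting on single quotes: the odd indices are the quoted field values.
parts = str(r).split("'")
start, end, user = parts[1], parts[3], parts[5]
```

Parsing fields out of a Row's string form works, but accessing the fields directly by attribute would avoid the string round-trip.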
Then, I iterate over each list of time windows created previously, to get all the information for each user inside each time window:
import re

dfs_Users_windowed = [[df.filter(df.DateTime.between(str(dfs_Users_windows[j][i]).split("'")[1],
                                                     str(dfs_Users_windows[j][i]).split("'")[3]))
                         .groupBy(['User'])
                         .count()
                         .withColumnRenamed('User',
                                            re.sub("\.|\-|\'| |\|", '_', str(dfs_Users_windows[j][i])[87:-2]) + '_neighbours')
                         .withColumnRenamed('count', 'count_window' + str(i))
                       for i in range(len(dfs_Users_windows[j]))]
                      for j in range(len(dfs_Users_windows))]
Then, I join all the DataFrames of each user. (This way of doing it may seem odd, but I found that joining the DataFrames two by two, and repeating the operation, is faster than joining them one after another.)
dfs_finals = []
for i in range(len(dfs_Users_windowed)):
    list_to_loop = dfs_Users_windowed[i]
    new_list_to_loop = []
    list_len = len(list_to_loop)
    while list_len > 1:
        if list_len % 2 != 0:
            for j in range(0, list_len - 1, 2):
                new_list_to_loop.append(list_to_loop[j].join(list_to_loop[j+1], list_to_loop[j].columns[0], "outer"))
            new_list_to_loop.append(list_to_loop[-1])
            list_len = len(new_list_to_loop)
            list_to_loop = new_list_to_loop
            new_list_to_loop = []
        else:
            for j in range(0, list_len, 2):
                new_list_to_loop.append(list_to_loop[j].join(list_to_loop[j+1], list_to_loop[j].columns[0], "outer"))
            list_len = len(new_list_to_loop)
            list_to_loop = new_list_to_loop
            new_list_to_loop = []
    dfs_finals.append(list_to_loop)
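The loop above implements a pairwise (tree) reduction: combine items two by two, carry the odd element over, and repeat until one remains. The same shape with a generic combine function (my own sketch, not a Spark API):

```python
def pairwise_reduce(items, combine):
    """Combine a list two by two, carrying the odd element over, until one is left."""
    while len(items) > 1:
        nxt = [combine(items[j], items[j + 1]) for j in range(0, len(items) - 1, 2)]
        if len(items) % 2 != 0:
            nxt.append(items[-1])  # the odd element joins the next round untouched
        items = nxt
    return items[0]

# With + as the combine function, this is just a sum computed in tree order.
print(pairwise_reduce([1, 2, 3, 4, 5], lambda a, b: a + b))
```

In my case the combine step is the outer join; whether the tree order actually beats a sequential fold is something I only observed empirically on my data.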
Then, I just get rid of the "null" values introduced by the outer joins:

dfs_finals_clean = [dfs_finals[i][0].na.fill(0) for i in range(len(dfs_finals))]

As a result, I get what I expected. Example for user E: