I have a very large dataset (> 10 million rows). A small 5-row sample is shown below. With pandas on a single-core machine, I can count occurrences of certain given terms in a column that holds lists of terms, and everything is fine: I get the expected result (10 rows). However, on the same small 5-row dataset shown here, when I experiment with Dask, the count comes back with more than 10 rows (the number depends on the number of partitions). Here is the code. I would appreciate it if someone could point out what I am misunderstanding or doing wrong.
import pandas as pd
from collections import Counter
from itertools import chain, product

def compute_total(df, term_list, cap_list):
    # count every term across all rows of this frame
    terms_counter = Counter(chain.from_iterable(df['Terms']))
    terms_series = pd.Series(terms_counter)
    terms_df = pd.DataFrame({'Term': terms_series.index, 'Count': terms_series.values})
    # keep only the terms of interest
    df1 = terms_df[terms_df['Term'].isin(term_list)]
    # pair every term with every capability, then attach the counts
    product_terms = product(term_list, cap_list)
    df_cp = pd.DataFrame(product_terms, columns=['Terms', 'Capability'])
    tjt_df = df_cp.set_index('Terms').combine_first(df1.set_index('Term')).reset_index()
    tjt_df.rename(columns={'index': 'Term'}, inplace=True)
    tjt_df['Count'] = tjt_df['Count'].fillna(0.0)  # convert all NaN to 0.0
    return tjt_df
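For reference, the counting machinery used inside the function can be seen in isolation (the term lists below are made up for illustration, not the sample data):

```python
from collections import Counter
from itertools import chain, product

# chain.from_iterable flattens the per-row term lists into one stream,
# and Counter tallies every term across all rows.
terms = [['tech', 'channel', 'tech'], ['channel', 'findwindow']]
counts = Counter(chain.from_iterable(terms))
print(counts['channel'])  # -> 2

# product() pairs every term with every capability, giving the
# term x capability skeleton that the counts are later joined onto.
pairs = list(product(['channel', 'findwindow'], ['irc', 'screenshot']))
print(len(pairs))  # -> 4
```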
d = {'Title': {0: 'IRC do consider this.',
               1: 'we’re simply taking screenshot',
               2: 'Why does irc select topics?',
               3: 'Is this really a screenshot?',
               4: 'how irc is doing this?'},
     'Terms': {0: ['tech', 'channel', 'tech'],
               1: ['channel', 'findwindow', 'Italy', 'findwindow'],
               2: ['Detroit', 'topic', 'seats', 'topic'],
               3: ['tech', 'topic', 'printwindow', 'Boston', 'window'],
               4: ['privmsg', 'wheel', 'privmsg']}}
df = pd.DataFrame.from_dict(d)
term_list = ['channel', 'topic', 'findwindow', 'printwindow', 'privmsg']
cap_list = ['irc', 'screenshot']
This is the expected result, which single-core pandas produces (10 rows):

          Term Capability  Count
0      channel        irc    2.0
1      channel screenshot    2.0
2   findwindow        irc    2.0
3   findwindow screenshot    2.0
4  printwindow        irc    1.0
5  printwindow screenshot    1.0
6      privmsg        irc    2.0
7      privmsg screenshot    2.0
8        topic        irc    3.0
9        topic screenshot    3.0
Note: for npartitions, I tried num_cores = 1 and got the expected result. If I change num_cores to anything greater than 1, I get results I do not understand. For example: with num_cores = 2, the result df has 20 rows (okay... that much I get). With num_cores = 3 or 4, I still get 20 rows. With num_cores = 5 through 16, I get 40 rows! I did not try beyond that...
import dask.dataframe as dd
from dask.dataframe.utils import make_meta

num_cores = 8
ddf = dd.from_pandas(df, npartitions=num_cores * 1)
# Count comes back as float after fillna(0.0), so its meta dtype is 'f8', not 'i8'
meta = make_meta({'Term': 'U', 'Capability': 'U', 'Count': 'f8'}, index=pd.Index([], 'i8'))
count_df = ddf.map_partitions(compute_total, term_list, cap_list, meta=meta).compute(scheduler='processes')
print(count_df)
print(count_df.shape)
          Term Capability  Count
0      channel        irc    1.0
1      channel screenshot    1.0
2   findwindow        irc    0.0
3   findwindow screenshot    0.0
4  printwindow        irc    0.0
5  printwindow screenshot    0.0
6      privmsg        irc    0.0
7      privmsg screenshot    0.0
8        topic        irc    0.0
9        topic screenshot    0.0
0      channel        irc    1.0
1      channel screenshot    1.0
2   findwindow        irc    2.0
3   findwindow screenshot    2.0
4  printwindow        irc    0.0
5  printwindow screenshot    0.0
6      privmsg        irc    0.0
7      privmsg screenshot    0.0
8        topic        irc    0.0
9        topic screenshot    0.0
0      channel        irc    0.0
1      channel screenshot    0.0
2   findwindow        irc    0.0
3   findwindow screenshot    0.0
4  printwindow        irc    0.0
5  printwindow screenshot    0.0
6      privmsg        irc    0.0
7      privmsg screenshot    0.0
8        topic        irc    2.0
9        topic screenshot    2.0
0      channel        irc    0.0
1      channel screenshot    0.0
2   findwindow        irc    0.0
3   findwindow screenshot    0.0
4  printwindow        irc    1.0
5  printwindow screenshot    1.0
6      privmsg        irc    2.0
7      privmsg screenshot    2.0
8        topic        irc    1.0
9        topic screenshot    1.0
(40, 3)
Observation: after looking at this rather long result dataframe, I figured I could do one last computation on it to get the desired result: just group by Term and Capability and sum the counts. That gives me the expected result (sort of).
df1 = count_df.groupby(['Term', 'Capability'])['Count'].sum().reset_index()
However, I wonder whether this can be done in a cleaner way with Dask. I realize this problem is not an "embarrassingly parallel" one; that is, a global view of the entire dataset is needed to get the counts. So for now I have to process it in a map -> reduce fashion. Is there a cleaner way?
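Spelled out with plain pandas standing in for the per-partition Dask output (the two frames and their numbers are illustrative), the reduce step I am doing is just:

```python
import pandas as pd

# Two per-partition result frames of the shape map_partitions emits;
# each partition counted only its own rows.
part1 = pd.DataFrame({'Term': ['channel', 'topic'],
                      'Capability': ['irc', 'irc'],
                      'Count': [1.0, 0.0]})
part2 = pd.DataFrame({'Term': ['channel', 'topic'],
                      'Capability': ['irc', 'irc'],
                      'Count': [1.0, 3.0]})

# The reduce step: stack the partial results and sum the per-partition
# counts for each (Term, Capability) pair to get the global totals.
stacked = pd.concat([part1, part2])
totals = stacked.groupby(['Term', 'Capability'], as_index=False)['Count'].sum()
print(totals)
```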