I have some pandas groupby functions that write data to a file, but for some reason redundant data is being written to the file. Here is the code:
import numpy as np
def item_grouper(df):
    # Get the frequency of each tag applied to the item
    tag_counts = df['tag'].value_counts()
    # Get the most frequent tag (or tags, assuming a tie)
    max_tags = tag_counts[tag_counts==tag_counts.max()]
    # Get the total number of annotations for the item
    total_anno = len(df)
    # Now, process each user who tagged the item
    return df.groupby('uid').apply(user_grouper,total_anno,max_tags,tag_counts)
# This function gets applied to each user who tagged an item
def user_grouper(df,total_anno,max_tags,tag_counts):
    # subtract user's annotations from total annotations for the item
    total_anno = total_anno - len(df)
    # calculate weight
    weight = np.log10(total_anno)
    # check if user has used (one of) the top tag(s), and adjust max_tag_count
    if len(np.intersect1d(max_tags.index.values,df['iid']))>0:
        max_tag_count = float(max_tags[0]-1)
    else:
        max_tag_count = float(max_tags[0])
    # for each annotation...
    for i,row in df.iterrows():
        # calculate raw score
        raw_score = (tag_counts[row['tag']]-1) / max_tag_count
        # write to file (out is a file handle opened by the caller)
        out.write('\t'.join(map(str,[row['uid'],row['iid'],row['tag'],raw_score,weight]))+'\n')
    return df
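So one grouping function groups the data by iid (item ID), does some processing, and then groups each sub-dataframe by uid (user_id), does some calculations, and writes to an output file. The output file should have exactly one line per row in the original dataframe, but it doesn't! The same data keeps getting written to the file repeatedly. For example, if I run: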
out = open('data/test','w')
df.head(1000).groupby('iid').apply(item_grouper)
out.close()
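The output should have 1,000 lines (the code writes only one line per row in the dataframe), but the resulting output file has 1,997 lines. Looking at the file shows exactly the same lines written multiple times (2-4 times), seemingly at random (i.e. not all lines are written twice). Any idea what I'm doing wrong here?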
Answer 0 (score: 4)
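See the pandas docs on apply. Pandas will call the function twice on the first group (to decide between a fast and a slow code path), so any side effects of the function (here, the file I/O) will happen twice for the first group. Because your apply calls are nested, this happens for the first iid group and again for the first uid group inside every item, which is why some lines appear 2-4 times while others appear only once.
Your best option here is probably to iterate over the groups directly, like this: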
for group_name, group_df in df.head(1000).groupby('iid'):
    item_grouper(group_df)
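The same idea can be carried through both grouping levels. The following is a minimal sketch, assuming a small made-up dataframe and an in-memory `io.StringIO` buffer standing in for the real output file; `write_rows` is a hypothetical helper, not part of the original code. Since each group is visited exactly once by the `for` loops, the writing function runs once per group.

```python
import io

import pandas as pd

# Made-up sample data with the same column names as in the question
df = pd.DataFrame({
    "iid": [1, 1, 2, 2, 3],
    "uid": ["a", "b", "a", "c", "b"],
    "tag": ["x", "y", "x", "x", "z"],
})

out = io.StringIO()  # stand-in for the real output file


def write_rows(group_df):
    """Hypothetical side-effecting helper: writes one line per row."""
    for _, row in group_df.iterrows():
        out.write("\t".join(map(str, [row["uid"], row["iid"], row["tag"]])) + "\n")


# Iterate over both grouping levels explicitly instead of nesting apply calls
for iid, item_df in df.groupby("iid"):
    for uid, user_df in item_df.groupby("uid"):
        write_rows(user_df)

written = out.getvalue().splitlines()
print(len(written), len(df))
```

With explicit iteration, the number of lines written matches the number of rows, regardless of how apply decides to evaluate its function internally.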
Answer 1 (score: 3)
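I agree with chrisb's diagnosis of the problem. As a cleaner approach, consider having your user_grouper() function return the computed values rather than writing them out itself. Structured as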
def user_grouper(df, ...):
    (...)
    df['max_tag_count'] = some_calculation
    return df
results = df.groupby(...).apply(user_grouper, ...)
for i,row in results.iterrows():
    # calculate raw score
    raw_score = (tag_counts[row['tag']]-1) / row['max_tag_count']
    # write to file
    out.write('\t'.join(map(str,[row['uid'],row['iid'],row['tag'],raw_score,weight]))+'\n')
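A runnable sketch of this "return the values, write once at the end" structure might look like the following. It rests on several assumptions: the sample data is made up, `score_user` and `pieces` are hypothetical names, `max_tag_count` is simplified to the count of the most frequent tag without the per-user adjustment from the question, and explicit loops with `pd.concat` are used in place of apply so that the result does not depend on how apply treats its first group or its grouping columns.

```python
import io

import numpy as np
import pandas as pd


def score_user(user_df, total_anno, max_tag_count):
    """Hypothetical pure function: returns a scored copy and performs no I/O."""
    scored = user_df.copy()
    # weight follows the question's formula: log10 of the other users' annotations
    scored["weight"] = np.log10(total_anno - len(user_df))
    scored["max_tag_count"] = max_tag_count
    return scored


# Made-up sample data with the same column names as in the question
df = pd.DataFrame({
    "iid": [1, 1, 1, 2, 2],
    "uid": ["a", "b", "b", "a", "c"],
    "tag": ["x", "x", "y", "z", "z"],
})

pieces = []
for iid, item_df in df.groupby("iid"):
    # simplified assumption: count of the most frequent tag for this item
    max_tag_count = float(item_df["tag"].value_counts().max())
    for uid, user_df in item_df.groupby("uid"):
        pieces.append(score_user(user_df, len(item_df), max_tag_count))

results = pd.concat(pieces)

# All output happens here, in one place, after the computation is finished
buffer = io.StringIO()  # stand-in for the real output file
results.to_csv(buffer, sep="\t", header=False, index=False)
lines = buffer.getvalue().splitlines()
```

Keeping the per-group function free of side effects means that calling it more than once is harmless, and the single write at the end produces one line per row of the input.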