# This function gets applied to each item in the dataframe
def item_grouper(df):
    # Get the frequency of each tag applied to the item
    tag_counts = df['tag'].value_counts() 
    # Get the most frequent tag (or tags, assuming a tie)
    max_tags = tag_counts[tag_counts==tag_counts.max()]
    # Get the total number of annotations for the item
    total_anno = len(df)
    # Now, process each user who tagged the item
    return df.groupby('uid').apply(user_grouper,total_anno,max_tags,tag_counts)

# This function gets applied to each user who tagged an item
def user_grouper(df,total_anno,max_tags,tag_counts):
    # subtract user's annotations from total annotations for the item
    total_anno = total_anno - len(df)
    # calculate weight
    weight = np.log10(total_anno)
    # check if user has used (one of) the top tag(s), and adjust max_tag_count
    if len(np.intersect1d(max_tags.index.values,df['tag']))>0:
        max_tag_count = float(max_tags.iloc[0]-1)
    else:
        max_tag_count = float(max_tags.iloc[0])
    # for each annotation...
    for i,row in df.iterrows():
        # calculate raw score
        raw_score = (tag_counts[row['tag']]-1) / max_tag_count
        # write to file
        out.write('\t'.join(map(str,[row['uid'],row['iid'],row['tag'],raw_score,weight]))+'\n')
    return df

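So, one grouping function groups the data by iid (item ID), does some processing, then groups each sub-dataframe by uid (user_id), does some calculations, and writes to an output file. The output file should have exactly one line per row in the original dataframe, but it doesn't! The same data keeps getting written to the file multiple times. For instance, if I run: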

out = open('data/test','w')
df.head(1000).groupby('iid').apply(item_grouper)
out.close()

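The output should have 1,000 lines (the code only writes one line per row in the dataframe), but the resulting file has 1,997 lines. Looking at the file shows the exact same lines written multiple (2-4) times, seemingly at random (i.e. not all lines are double-written). Any idea what I'm doing wrong here?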

Answer 1

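See the docs on apply. Pandas will call the function twice on the first group (to decide between a fast and a slow code path), so any side effects of the function (here, the file IO) will happen twice for the first group.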
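To see the effect without touching a file, one option is to count how often the function is invoked. The sketch below is only an illustration under assumptions: the sample data and the `record_group` helper are made up, and the exact call count depends on the pandas version installed. Note also that in the question `item_grouper` itself calls `groupby('uid').apply`, so the extra call can recur for the first user of every item, which would explain duplicates that look scattered rather than confined to the top of the file.

```python
import pandas as pd

calls = []  # every invocation is recorded here, standing in for a file write


def record_group(group_df):
    # Hypothetical helper: the append is the "side effect" being counted.
    calls.append(len(group_df))
    return len(group_df)


# Made-up sample data using the question's column names.
df = pd.DataFrame({"iid": [1, 1, 2, 2, 3], "tag": list("aabbc")})

# Selecting the column explicitly keeps the grouping key out of each group.
result = df.groupby("iid")[["tag"]].apply(record_group)

n_groups = df["iid"].nunique()
print(len(calls), "calls for", n_groups, "groups")
```

If the printed call count exceeds the number of groups, the function was evaluated more than once for some group, and any write inside it would have been duplicated.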

Your best option here is probably to iterate over the groups directly, like this:

for group_name, group_df in df.head(1000).groupby('iid'):
    item_grouper(group_df)
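A fuller sketch of the same idea, iterating over both grouping levels so that each row is written exactly once. The sample data is made up, an in-memory `io.StringIO` stands in for the output file, and the written fields are simplified compared to the question's code, so treat it as an illustration rather than a drop-in replacement.

```python
import io

import pandas as pd

# Made-up sample data using the question's column names.
df = pd.DataFrame({
    "uid": [1, 2, 3, 1, 2, 3],
    "iid": [10, 10, 10, 20, 20, 20],
    "tag": ["a", "a", "b", "c", "d", "c"],
})

out = io.StringIO()  # stands in for open('data/test', 'w')

# Plain loops call the body exactly once per group, so the write
# happens exactly once per row.
for iid, item_df in df.groupby("iid"):
    tag_counts = item_df["tag"].value_counts()
    for uid, user_df in item_df.groupby("uid"):
        for _, row in user_df.iterrows():
            fields = [uid, iid, row["tag"], tag_counts[row["tag"]]]
            out.write("\t".join(map(str, fields)) + "\n")

lines = out.getvalue().splitlines()
print(len(lines) == len(df))
```

Because nothing here goes through `apply`, the number of output lines matches the number of input rows regardless of how pandas evaluates the first group.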

Answer 2

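I agree with chrisb's diagnosis of the problem. As a cleaner approach, consider having your user_grouper() function return the values instead of saving them, with a structure like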

def user_grouper(df, ...):
    (...)
    df['max_tag_count'] = some_calculation
    return df

results = df.groupby(...).apply(user_grouper, ...)
for i,row in results.iterrows():
    # calculate raw score
    raw_score = (tag_counts[row['tag']]-1) / row['max_tag_count']
    # write to file
    out.write('\t'.join(map(str,[row['uid'],row['iid'],row['tag'],raw_score,weight]))+'\n')
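One way to take the "compute first, write afterwards" idea further is to build the score columns with grouped transforms and write the whole frame in a single call. This swaps the answer's `apply` for `groupby(...).transform`, which is a substitution of my own and not what the answer literally proposes; the sample data and intermediate names are likewise assumptions, and the formulas follow my reading of the question's comments.

```python
import io

import numpy as np
import pandas as pd

# Made-up sample data using the question's column names.
df = pd.DataFrame({
    "uid": [1, 2, 3, 1, 2, 3],
    "iid": [10, 10, 10, 20, 20, 20],
    "tag": ["a", "a", "b", "c", "d", "c"],
})

# Per-row counts, aligned with df's index.
item_total = df.groupby("iid")["tag"].transform("count")
user_total = df.groupby(["iid", "uid"])["tag"].transform("count")
tag_count = df.groupby(["iid", "tag"])["tag"].transform("count")

# Highest tag frequency within each item.
max_count = tag_count.groupby(df["iid"]).transform("max")

# Did this user apply (one of) the item's top tag(s)?
used_top = (tag_count == max_count).groupby([df["iid"], df["uid"]]).transform("any")

# Discount the user's own annotation where they used a top tag.
max_tag_count = max_count - used_top.astype(int)

result = df.assign(
    raw_score=(tag_count - 1) / max_tag_count,
    weight=np.log10(item_total - user_total),
)

out = io.StringIO()  # stands in for the real output file
result.to_csv(out, sep="\t", header=False, index=False)
```

Since no function with side effects is passed to pandas, repeated evaluation of a group cannot duplicate output. The zero cases (a user who is the item's only annotator, or a top tag used once) would still need handling, as they would in the original code.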

Pandas groupby and file writing problem
