Question

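I am writing a program to analyze the contents of a Wiki dump. It has to count the number of users who edit more than 5 articles per month. This is my dataframe: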

{'revision_id': {0: 17447, 1: 23240, 2: 23241, 3: 23242, 4: 23243,
                 5: 23245, 6: 24401, 7: 3055, 8: 3056, 9: 3057},
 'page_id': {0: 4433, 1: 6639, 2: 6639, 3: 6639, 4: 6639, 5: 6639, 6: 6639, 7: 1896, 8: 1896, 9: 1896},
 'page_title': {0: 'Slow Gin Finn', 1: '43 con Leche', 2: '43 con Leche', 3: '43 con Leche', 4: '43 con Leche',
                5: '43 con Leche', 6: '43 con Leche', 7: '57 Chevy', 8: '57 Chevy', 9: '57 Chevy'},
 'page_ns': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0},
 'timestamp': {0: '2011-02-02 23:16:11', 1: '2014-03-25 00:48:27', 2: '2014-03-25 00:48:43',
               3: '2014-03-25 00:49:48', 4: '2014-03-25 00:50:22', 5: '2014-03-25 00:57:02',
               6: '2014-08-11 16:47:53', 7: '2005-04-28 22:32:02', 8: '2005-04-29 03:42:39',
               9: '2006-04-05 12:19:00'},
 'contributor_id': {0: 3096602, 1: 1416077, 2: 1416077, 3: 1416077, 4: 1416077, 5: 1416077, 6: 1416077, 7: 740443,
                    8: 740443, 9: 740560},
 'contributor_name': {0: 'Babyjabba', 1: 'Sings-With-Spirits', 2: 'Sings-With-Spirits', 3: 'Sings-With-Spirits',
                      4: 'Sings-With-Spirits', 5: 'Sings-With-Spirits', 6: 'Sings-With-Spirits', 7: 'FlexiSoft',
                      8: 'FlexiSoft', 9: 'Vampiric.Media'},
 'bytes': {0: 558, 1: 284, 2: 288, 3: 339, 4: 339, 5: 374, 6: 378, 7: 294, 8: 238, 9: 268}}

It consists of 8 columns: revision_id, page_id, page_title, page_ns, timestamp, contributor_id, contributor_name and bytes.

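I have code that processes the Wiki dump and loads it into a dataframe. To get the number of pages edited per user per month, I then created a groupby on timestamp and contributor_name. From that I managed to build another dataframe, df2, containing only the users with more than 5 edits per month:

[code block not preserved]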

Once I have the df2 dataframe, I want to apply a lambda expression to find out how many users per month have more than 5 edits:

[code block not preserved]

But it does not work. Can anyone help me with this task?

Answer 1

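I have tidied up the code a bit, since it was hard to read and to understand what was going on. Take a look here for tips on how to format and write a question that is more likely to get you help.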

import pandas as pd 
df = pd.read_csv('data.csv', sep=';', quotechar='|', index_col='revision_id') 
df['timestamp'] = pd.to_datetime(df['timestamp'])
# Filter out anonymous users: 
df = df[df['contributor_name'] != 'Anonymous']
# get the number of edits each user has done each month
monthly_edits_per_user = df.groupby([pd.Grouper(key='timestamp', freq='MS'),
                                    'contributor_name']).size()
# filter users with number >= requested 
df2 = monthly_edits_per_user[monthly_edits_per_user >= 5].to_frame(name='pages_edited').reset_index()

This produces:

   timestamp    contributor_name  pages_edited
0 2014-03-01  Sings-With-Spirits             5
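
Note that size() counts revisions, so pages_edited above is really a number of edits. If "articles" is meant as distinct pages, counting unique page_id values is one option. The following is only a sketch under that assumption, built from three columns of the sample rows in the question; the names grouped, edits, articles and key are illustrative.

```python
import pandas as pd

# Three columns taken from the sample rows in the question.
df = pd.DataFrame({
    'page_id': [4433, 6639, 6639, 6639, 6639, 6639, 6639, 1896, 1896, 1896],
    'timestamp': pd.to_datetime([
        '2011-02-02 23:16:11', '2014-03-25 00:48:27', '2014-03-25 00:48:43',
        '2014-03-25 00:49:48', '2014-03-25 00:50:22', '2014-03-25 00:57:02',
        '2014-08-11 16:47:53', '2005-04-28 22:32:02', '2005-04-29 03:42:39',
        '2006-04-05 12:19:00',
    ]),
    'contributor_name': (['Babyjabba'] + ['Sings-With-Spirits'] * 6
                         + ['FlexiSoft'] * 2 + ['Vampiric.Media']),
})

# Same grouping as above: month start plus user name.
grouped = df.groupby([pd.Grouper(key='timestamp', freq='MS'), 'contributor_name'])
edits = grouped.size()                    # revisions per user per month
articles = grouped['page_id'].nunique()   # distinct pages per user per month

key = (pd.Timestamp('2014-03-01'), 'Sings-With-Spirits')
print(edits.loc[key], articles.loc[key])  # prints: 5 1
```

In the sample, the five March 2014 edits all touch the same page, so this user would pass an edit-count threshold of 5 but not a distinct-article one.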

I have added some more dummy data here to show the next aggregation:

   timestamp    contributor_name  pages_edited
0 2014-03-01  Sings-With-Spirits             5
1 2014-05-01                 foo             7
2 2014-05-01                 bar            10
3 2014-06-01                 foo             5
4 2014-10-01                 baz             8

Now you can add new columns to this DataFrame using:

df2['monthly_sum'] = df2.groupby('timestamp')['pages_edited'].transform('sum')

   timestamp    contributor_name  pages_edited  monthly_sum
0 2014-03-01  Sings-With-Spirits             5            5
1 2014-05-01                 foo             7           17
2 2014-05-01                 bar            10           17
3 2014-06-01                 foo             5            5
4 2014-10-01                 baz             8            8
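
To make clear why transform is used here instead of a plain aggregation, a small sketch on the dummy rows above: an aggregation returns one row per month, while transform returns one value per original row, which is what allows it to be assigned back as a column. The names per_month and broadcast are illustrative.

```python
import pandas as pd

# The dummy rows shown above.
df2 = pd.DataFrame({
    'timestamp': pd.to_datetime(['2014-03-01', '2014-05-01', '2014-05-01',
                                 '2014-06-01', '2014-10-01']),
    'contributor_name': ['Sings-With-Spirits', 'foo', 'bar', 'foo', 'baz'],
    'pages_edited': [5, 7, 10, 5, 8],
})

# Aggregation: one value per group (4 distinct months).
per_month = df2.groupby('timestamp')['pages_edited'].sum()

# Transform: the group total repeated for every row of that group (5 rows).
broadcast = df2.groupby('timestamp')['pages_edited'].transform('sum')

print(len(per_month), len(broadcast))  # prints: 4 5
```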

df2['monthly_sum_per_user'] = df2.groupby(['timestamp', 'contributor_name'])['pages_edited'].transform('sum')

   timestamp    contributor_name  pages_edited  monthly_sum  monthly_sum_per_user
0 2014-03-01  Sings-With-Spirits             5            5                     5
1 2014-05-01                 foo             7           17                     7
2 2014-05-01                 bar            10           17                    10
3 2014-06-01                 foo             5            5                     5
4 2014-10-01                 baz             8            8                     8
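
The columns above are sums of edits, not a head count. For the number the question asks about, how many users per month reach the threshold, one possible approach is to count distinct contributor names per month in df2. This is a sketch on the same dummy rows; users_per_month and active_users are made-up names.

```python
import pandas as pd

# df2 already contains only (month, user) pairs that met the threshold.
df2 = pd.DataFrame({
    'timestamp': pd.to_datetime(['2014-03-01', '2014-05-01', '2014-05-01',
                                 '2014-06-01', '2014-10-01']),
    'contributor_name': ['Sings-With-Spirits', 'foo', 'bar', 'foo', 'baz'],
    'pages_edited': [5, 7, 10, 5, 8],
})

# Count the distinct users that appear in each month.
users_per_month = (df2.groupby('timestamp')['contributor_name']
                      .nunique()
                      .rename('active_users'))

print(users_per_month)
```

With the dummy data this gives 1 user for 2014-03, 2 for 2014-05, and 1 each for 2014-06 and 2014-10. No lambda is needed, since nunique is available directly on the grouped column.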

Problem applying a lambda expression to a groupby object
