Pandas DataFrame.loc is too slow

Asked: 2018-02-21 18:41:40

Tags: python pandas pandas-groupby

I have a DataFrame with more than 100K rows that looks like this:

   user  document
0  john      book
1  jane   article
2  jane      book
3  jane      book
4   jim   article
5  john      book
6   jim  blogpost
7  jane  blogpost
8  jane  blogpost
9  jane  blogpost

I need a DataFrame like this:

      blogpost  article  book
john         1        3     0
jane         0        0     1
jim          4        0     2

That is, I need the number of downloads for each (user, document) combination.

I am calling .groupby(['user', 'document']) and then using df.loc to set the download count:

df = pd.DataFrame(index=users, columns=documents)
df.fillna(0, inplace=True)

grouped = records.groupby(['user', 'document'])
for elem in grouped:
    user, document = elem[0]
    downloads = len(elem[1])
    df.loc[user, document] = downloads

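However, df.loc is very slow when used this way. I commented out the df.loc line and found that the loop finishes quickly, so it is almost certainly the df.loc access that is slow.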

How can I get this result faster?

Minimal working example:

import pandas as pd

records = pd.DataFrame([
    ('john', 'book'), 
    ('jane', 'article'),
    ('jane','book'),
    ('jane','book'),
    ('jim', 'article'), 
    ('john', 'book'),
    ('jim', 'blogpost'), 
    ('jane', 'blogpost'),
    ('jane', 'blogpost'),
    ('jane', 'blogpost')
    ], columns=['user', 'document'])
print(records)

users = list(set(records['user']))
users.sort()
documents = list(set(records['document']))
documents.sort()

print(users)
print(documents)

df = pd.DataFrame(index=users, columns=documents)
df.fillna(0, inplace=True)
print(df)

grouped = records.groupby(['user', 'document'])
for elem in grouped:
    user, document = elem[0]
    downloads = len(elem[1])
    df.loc[user, document] = downloads

4 Answers:

Answer 0 (score: 3)

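There are many ways to achieve this: pivot, pivot_table, crosstab, or groupby with a count. For example, with crosstab (here df is the raw records frame from the question):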

pd.crosstab(df.user,df.document)
Out[1283]: 
document  article  blogpost  book
user                             
jane            1         3     2
jim             1         1     0
john            0         0     2
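
As a minimal sketch of one of the other routes named above, the same table can also be built with pivot_table. It assumes the question's records frame; the helper column n and the names by_crosstab and by_pivot are illustrative only and not part of the original answer.

```python
import pandas as pd

records = pd.DataFrame(
    [('john', 'book'), ('jane', 'article'), ('jane', 'book'), ('jane', 'book'),
     ('jim', 'article'), ('john', 'book'), ('jim', 'blogpost'),
     ('jane', 'blogpost'), ('jane', 'blogpost'), ('jane', 'blogpost')],
    columns=['user', 'document'])

# crosstab counts co-occurrences of the two columns directly
by_crosstab = pd.crosstab(records['user'], records['document'])

# pivot_table needs something to aggregate, so add a column of ones
# (hypothetical helper column "n") and sum it per (user, document)
by_pivot = records.assign(n=1).pivot_table(
    index='user', columns='document', values='n',
    aggfunc='sum', fill_value=0)

print(by_crosstab)
print(by_pivot)
```

Both calls replace the Python-level loop with a single vectorized aggregation, which is where the speedup over repeated df.loc assignments comes from.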

Answer 1 (score: 1)

Let's try:

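# sum(level=0) was removed in pandas 2.0; grouping on the index level is the equivalent
df.set_index('user')['document'].str.get_dummies().groupby(level=0, sort=False).sum()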

Output:

      article  blogpost  book
user                         
john        0         0     2
jane        1         3     2
jim         1         1     0

Answer 2 (score: 1)

records.groupby(['user','document']).size().unstack('document').fillna(0)
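
One detail worth knowing about the line above: unstack introduces NaN for missing combinations, so the result is float even after fillna(0). A minimal sketch, assuming the question's records frame, that passes fill_value=0 to unstack to keep integer counts (the names as_float and counts are illustrative):

```python
import pandas as pd

records = pd.DataFrame(
    [('john', 'book'), ('jane', 'article'), ('jane', 'book'), ('jane', 'book'),
     ('jim', 'article'), ('john', 'book'), ('jim', 'blogpost'),
     ('jane', 'blogpost'), ('jane', 'blogpost'), ('jane', 'blogpost')],
    columns=['user', 'document'])

sizes = records.groupby(['user', 'document']).size()

# Original form: NaN appears first, so every column becomes float64
as_float = sizes.unstack('document').fillna(0)

# Filling during the unstack avoids the NaN step and keeps integer dtype
counts = sizes.unstack('document', fill_value=0)

print(as_float.dtypes)
print(counts.dtypes)
```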

Answer 3 (score: 1)

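You can access cell values with NumPy by converting the DataFrame to a NumPy array. This is faster than using .loc. However, you do need to know the positions of the columns. In the example below, I want the value in column B that corresponds to the 2 in column A.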

df = pd.DataFrame( {'A':[1,2,3],'B':[4,5,6],'C':[7,8,9]} )

# Make sure our A and B are where we think they are (optional)
A = df.columns.get_loc('A')
B = df.columns.get_loc('B')

# Convert to numpy array
df = df.values

# Get the value
B_val = df[:,B][ df[:,A] == 2 ][0]  

# Convert back to dataframe (optional)
df = pd.DataFrame(df, columns = ['A','B','C'])

#B_val = 5

You can also convert the DataFrame to a dictionary and access the values that way. This is slightly faster than .at[] and much faster than .loc[].

df = pd.DataFrame( {'A':[1,2,3],'B':[4,5,6],'C':[7,8,9]} )

# Convert to dictionary
df = df.set_index('A').T.to_dict('list')
num = 2
B_val = df[num][0]
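
The answer compares against .at[] without showing it. As a minimal sketch, .at is pandas' label-based accessor for a single scalar; the example reuses the small frame above and assumes column A is set as the index so that the lookup can go by label (the name indexed is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

# Index by A so rows can be addressed by their A value
indexed = df.set_index('A')

# .at takes exactly one row label and one column label
B_val = indexed.at[2, 'B']
print(B_val)
```

Unlike the NumPy and dictionary versions, this keeps the data in a DataFrame, so no conversion back is needed afterwards.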