Question

I'm very new to pandas, but I've been reading about it and about how much faster it is when dealing with big data.

I managed to create a DataFrame, and it looks something like this:

    0     1
0    1    14
1    2    -1
2    3  1817
3    3    29
4    3    25
5    3     2
6    3     1
7    3    -1
8    4    25
9    4    24
10   4     2
11   4    -1
12   4    -1
13   5    25
14   5     1

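Column 0 is the author's ID, and column 1 is the number of citations a publication by that author has received (-1 means zero citations). Each row represents a different publication by an author.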

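I'm trying to calculate the h-index for each of these authors. The h-index is defined as the largest number h such that the author has h publications that are each cited at least h times. So for these authors: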

author 1 has an h-index of 1

author 2 has an h-index of 0

author 3 has an h-index of 3

author 4 has an h-index of 2

author 5 has an h-index of 1
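
To make the definition concrete, here is a minimal pure-Python sketch that reproduces the five values above. The function name `h_index` and the `sample` dictionary are made up for illustration, and treating -1 as zero citations follows the convention described in the question.

```python
def h_index(citations):
    """Return the largest h such that at least h papers have >= h citations."""
    # Rank papers from most to least cited; -1 in the data means zero citations.
    ranked = sorted((max(c, 0) for c in citations), reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank  # the top `rank` papers all have at least `rank` citations
        else:
            break
    return h


# Citation counts per author, copied from the table in the question.
sample = {
    1: [14],
    2: [-1],
    3: [1817, 29, 25, 2, 1, -1],
    4: [25, 24, 2, -1, -1],
    5: [25, 1],
}

for author, cites in sample.items():
    print("author", author, "has h-index:", h_index(cites))
```

This is only a reference for checking results; it loops in Python and is not meant to scale to millions of authors.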

This is how I'm currently doing it, which involves a lot of looping:

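# Assumes rows are sorted by author ID, then by citations in descending order,
# and that author IDs are consecutive integers starting at 1.
current_author = 1
hindex = 0

for index, row in df.iterrows():
    if row[0] == current_author:
        # The next paper raises the h-index only if it has more citations
        # than the number of papers already counted.
        if row[1] > hindex:
            hindex += 1
    else:
        print("author", current_author, "has h-index:", hindex)
        current_author += 1
        hindex = 0
        if row[1] > hindex:
            hindex += 1

print("author", current_author, "has h-index:", hindex)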

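My actual database has over 3 million authors. If I loop over every one of them, this will take days to compute. What do you think is the fastest way to tackle this?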

Thanks in advance!

Answer 1

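I renamed your columns to 'author' and 'citations' here. We can group by author and then apply a lambda. The lambda compares each citation count against the number of publications in the group, which yields True or False (1 or 0) per publication, and we can then sum these: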

In [104]:

df['h-index'] = df.groupby('author')['citations'].transform( lambda x: (x >= x.count()).sum() )

df
Out[104]:
    author  citations  h-index
0        1         14        1
1        2         -1        0
2        3       1817        3
3        3         29        3
4        3         25        3
5        3          2        3
6        3          1        3
7        3         -1        3
8        4         25        2
9        4         24        2
10       4          2        2
11       4         -1        2
12       4         -1        2
13       5         25        1
14       5          1        1

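Edit: As @Julien Spronck pointed out, the above doesn't work properly if, for example, author 4 had citations 3,3,3. Normally you can't access the index within a group, but we can compare each citation value against its rank, which acts as a pseudo-index, though this only works if the citation values are unique: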

In [129]:

df['h-index'] = df.groupby('author')['citations'].transform(lambda x: ( x >= x.rank(ascending=False, method='first') ).sum() )

df
Out[129]:
    author  citations  h-index
0        1         14        1
1        2         -1        0
2        3       1817        3
3        3         29        3
4        3         25        3
5        3          2        3
6        3          1        3
7        3         -1        3
8        4         25        2
9        4         24        2
10       4          2        2
11       4         -1        2
12       4         -1        2
13       5         25        1
14       5          1        1
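
As a rough check of how ties affect the two approaches, the sketch below runs them on a small made-up frame (authors 6 and 7 are hypothetical and not part of the question's data). As far as I can tell, the uniqueness caveat matters for the default `method='average'`, which gives tied values the same fractional rank, whereas `method='first'` as used above assigns each publication a distinct rank.

```python
import pandas as pd

# Hypothetical data with tied citation counts (not from the question).
# True h-index: author 6 -> 3, author 7 -> 1.
ties = pd.DataFrame({
    'author':    [6, 6, 6, 6, 6, 7, 7],
    'citations': [3, 3, 3, 0, 0, 1, 1],
})
grouped = ties.groupby('author')['citations']

# Count-based comparison from the first snippet: compares against group size.
by_count = grouped.agg(lambda x: (x >= x.count()).sum())

# Rank-based comparison with method='first': tied values get distinct ranks.
by_first = grouped.agg(
    lambda x: (x >= x.rank(ascending=False, method='first')).sum()
)

# Rank-based comparison with the default method='average': ties share a rank.
by_average = grouped.agg(lambda x: (x >= x.rank(ascending=False)).sum())

print(pd.DataFrame({'count': by_count, 'first': by_first, 'average': by_average}))
```

On this data only the `method='first'` column matches the true values. Using `agg` instead of `transform` returns one value per author rather than repeating it on every row.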

Answer 2

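I don't know if it's fast enough, but here's a solution for you. In this code, I first sort the DataFrame by author and then by decreasing number of citations. I add a column containing a new index that corresponds to each author's paper number. I then create another column by comparing the paper number with the citation count. All that's left to do is sum that last column for each author.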

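# Sort by author (ascending), then by citations (descending).
df2 = df.sort_values([0, 1], ascending=[True, False])
# Number each author's papers 1, 2, 3, ... in that sorted order.
df2['newindex'] = df2.groupby(0).cumcount() + 1
# A paper counts if it has at least as many citations as its paper number.
df2['condition'] = df2[1] >= df2['newindex']
# Summing the boolean column per author gives the h-index.
hindex = df2.groupby(0)['condition'].sum()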

## 0
## 1    1
## 2    0
## 3    3
## 4    2
## 5    1
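
For anyone who wants to try the sort-and-compare idea end to end, here is a self-contained sketch that rebuilds the sample data from the question. The function name `h_index_per_author` and the column names 'author' and 'citations' are assumptions chosen for illustration.

```python
import pandas as pd


def h_index_per_author(df, author_col, cites_col):
    """Sketch: one h-index per author via sorting and a per-author paper number."""
    # Sort by author, then by citations from highest to lowest.
    ordered = df.sort_values([author_col, cites_col], ascending=[True, False])
    # Paper number within each author: 1 for the most cited, 2 for the next, ...
    paper_no = ordered.groupby(author_col).cumcount() + 1
    # A paper qualifies if its citation count is at least its paper number.
    qualifies = ordered[cites_col] >= paper_no
    # Count qualifying papers per author.
    return qualifies.groupby(ordered[author_col]).sum()


# Sample data copied from the question (-1 means zero citations).
df = pd.DataFrame({
    'author':    [1, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5],
    'citations': [14, -1, 1817, 29, 25, 2, 1, -1, 25, 24, 2, -1, -1, 25, 1],
})

hindex = h_index_per_author(df, 'author', 'citations')
print(hindex)
```

Because every step is a vectorized pandas operation, there is no Python-level loop over authors, though I have not timed it on millions of rows.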

Efficient way to calculate the h-index (impact/productivity of an author's publications) in a pandas DataFrame
