Question

I am using Pandas to process a large amount of data. I want to find the fastest way to get the first row in a DataFrame by ID.

I have two DataFrames:

school_detail
school_id detail1 detail2
1         d11     d21 
2         d12     d22 
2         d13     d23
4         d14     d24
...
It has more than 20 million rows

schools
id school_name
1  name1 
2  name2
3  name3
4  name4
...
It has 3 million rows

I need to loop over all rows in school_detail to set the type for each row.

def get_type(s_detail):
   # I need to get school name here to calculate the type so I use
   school = schools[schools.id == s_detail.school_id] # To get school by id

school_detail['type'] = school_detail.apply(lambda x: get_type(x), axis=1)

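I used %prun to time the lookup of a school by id. It takes about 0.03 seconds.

When I run it with 10,000 rows of school_detail, it takes 43 seconds.

If I run it on all 20 million rows, it could take several hours.

My question:

I would like to find a better way to get a school by ID so that this runs faster.

The id column is unique. Does pandas use binary search on this column?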

Answer 1

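Here is an example of how to do it. It should be fast on large datasets because it does not use any loop or custom function. It uses the pandas loc indexer.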

import pandas as pd
from io import StringIO  # Python 3; the standalone StringIO module was Python 2 only

data_school_detail = \
"""school_id,detail1,detail2
1,d11,d21
2,d12,d22
2,d13,d23
4,d14,d24"""

data_schools = \
"""id,school_name
1,name1
2,name2
3,name3
4,name4"""

# Creation of the dataframes
school_detail = pd.read_csv(StringIO(data_school_detail),sep = ',')
schools       = pd.read_csv(StringIO(data_schools),sep = ',', index_col = 0)
# Create a dataframe containing the schools data to be applied on
# dataframe school_detail
res = schools.loc[school_detail['school_id']]
# Reset index with school_detail index
res.index = school_detail.index
# Rename column as presented in the question
res.columns = ['type']
# Add the columns to dataframe school_detail
school_detail = school_detail.join(res)

school_detail will now contain

   school_id detail1 detail2   type
0          1     d11     d21  name1
1          2     d12     d22  name2
2          2     d13     d23  name2
3          4     d14     d24  name4
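
For what it's worth, the same lookup can probably also be written with Series.map instead of loc plus join. This is only a minimal sketch on the toy data from the question, not a benchmark; the name name_by_id is mine, and using the school name directly as the type simply mirrors the example above rather than the real type calculation.

```python
import pandas as pd

# Toy data mirroring the two frames in the question
school_detail = pd.DataFrame({
    "school_id": [1, 2, 2, 4],
    "detail1": ["d11", "d12", "d13", "d14"],
    "detail2": ["d21", "d22", "d23", "d24"],
})
schools = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "school_name": ["name1", "name2", "name3", "name4"],
})

# Build the id -> school_name lookup once (name_by_id is an illustrative name).
# The question states that id is unique, so each id maps to exactly one name.
name_by_id = schools.set_index("id")["school_name"]

# Vectorized lookup over the whole column, with no Python-level loop per row.
# Assumption: the type is just the school name, as in the example above.
school_detail["type"] = school_detail["school_id"].map(name_by_id)

print(school_detail)
```

One difference to be aware of: map leaves NaN for any school_id that is missing from schools, whereas recent pandas versions raise a KeyError when loc is given a list containing missing labels. On the binary search question, as far as I know schools[schools.id == x] compares every row on each call rather than searching, which would explain why doing it once per row is slow; a unique index is looked up through a hash table instead.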
