I am using Pandas to process a large amount of data. I want to find the fastest way to get a row from a DataFrame by ID.
I have 2 DataFrames:
school_detail
school_id detail1 detail2
1 d11 d21
2 d12 d22
2 d13 d23
4 d14 d24
...
It has more than 20 million rows
schools
id school_name
1 name1
2 name2
3 name3
4 name4
...
It has 3 million rows
I need to loop over all rows in school_detail to set the type for each row.
def get_type(s_detail):
    # I need to get school name here to calculate the type so I use
    school = schools[schools.id == s_detail.school_id]  # To get school by id

school_detail['type'] = school_detail.apply(lambda x: get_type(x), axis=1)
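I used %prun to time the get-school-by-id lookup. It takes about 0.03 seconds.
When I run it on 10,000 rows of school_detail, it takes 43 seconds.
If I run it on 20 million rows, it may take several hours.
My question:
I would like to find a better way to get a school by ID so that this runs faster.
The id column is unique. Does pandas use binary search on this column?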
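Answer 0: (score: 0)
Here is an example of how to do it. It should be fast on large datasets because it does not use any loops or per-row functions. It uses the pandas loc indexer.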
import pandas as pd
from io import StringIO  # Python 3; on Python 2 this was `from StringIO import StringIO`
data_school_detail = \
"""school_id,detail1,detail2
1,d11,d21
2,d12,d22
2,d13,d23
4,d14,d24"""
data_schools = \
"""id,school_name
1,name1
2,name2
3,name3
4,name4"""
# Creation of the dataframes
school_detail = pd.read_csv(StringIO(data_school_detail),sep = ',')
schools = pd.read_csv(StringIO(data_schools),sep = ',', index_col = 0)
# Create a dataframe containing the schools data to be applied on
# dataframe school_detail
res = schools.loc[school_detail['school_id']]
# Reset index with school_detail index
res.index = school_detail.index
# Rename column as presented in the question
res.columns = ['type']
# Add the columns to dataframe school_detail
school_detail = school_detail.join(res)
school_detail
will now contain:
   school_id detail1 detail2   type
0          1     d11     d21  name1
1          2     d12     d22  name2
2          2     d13     d23  name2
3          4     d14     d24  name4
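
As a possible alternative to the loc approach above, here is a minimal sketch using Series.map. It assumes, as in the answer's output, that the type is simply the school name. The row with school_id 5 is a hypothetical addition, not from the question, included only to show what happens when an ID has no matching school. In recent pandas versions, passing a list with a missing label to loc raises a KeyError, whereas map leaves NaN for that row.

```python
import pandas as pd

# Small stand-ins for the two DataFrames from the question.
school_detail = pd.DataFrame({
    "school_id": [1, 2, 2, 4, 5],  # 5 is hypothetical: it has no matching school
    "detail1": ["d11", "d12", "d13", "d14", "d15"],
    "detail2": ["d21", "d22", "d23", "d24", "d25"],
})
schools = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "school_name": ["name1", "name2", "name3", "name4"],
})

# Build an id -> school_name lookup Series (id is unique per the question).
name_by_id = schools.set_index("id")["school_name"]

# Look up every row in one vectorized call; unknown IDs become NaN.
school_detail["type"] = school_detail["school_id"].map(name_by_id)

print(school_detail)
```

If the real type depends on more than the school name, the same lookup can bring the name into school_detail first, and the type can then be computed with column-wise operations rather than a per-row apply.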