I have two DataFrames in Python pandas, as follows:
df1 is the raw data set
df2 holds information about df1's column
import pandas as pd
df1 = pd.DataFrame()
data_list = ['AA1','AAABB','AACCCDDDD', 'AACCCDDDDEEEEE','AA111','AA11222']
df1['DATA'] = data_list
df1 is the raw data set and looks like this:
df1
-------------------
DATA
-------------------
0 AA1
1 AAABB
2 AACCCDDDD
3 AACCCDDDDEEEEE
4 AA111
5 AA11222
The code for df2:
df2 = pd.DataFrame()
data_list = ['AA1','AAAB','AACCC']
info_list = ['TYPE1','TYPE2','TYPE3']
size_list = [3, 4, 5]
df2['DATA_CLASS'] = data_list
df2['DATA_INFO'] = info_list
df2['DATA_SIZE'] = size_list
df2 holds information about df1's 'DATA' column. df2 looks like this:
df2
-----------------------------
DATA_CLASS DATA_INFO DATA_SIZE
0 AA1 TYPE1 3
1 AAAB TYPE2 4
2 AACCC TYPE3 5
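I want to use df2[['DATA_CLASS', 'DATA_SIZE']] to fill a 'DATA_INFO' column in df1: a row of df1 gets a df2 row's 'DATA_INFO' when the first DATA_SIZE characters of 'DATA' equal DATA_CLASS.
So I wrote this: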
df1['DATA_INFO'] = ''
for idx, row in df2.iterrows():
    size = row['DATA_SIZE']
    # compare the first `size` characters of DATA with this row's DATA_CLASS
    df1.loc[df1.DATA.str[:size] == row['DATA_CLASS'], 'DATA_INFO'] = row['DATA_INFO']
As a result, df1 has the new column 'DATA_INFO':
DATA DATA_INFO
----------------------------------
0 AA1 TYPE1
1 AAABB TYPE2
2 AACCCDDDD TYPE3
3 AACCCDDDDEEEEE TYPE3
4 AA111 TYPE1
5 AA11222 TYPE1
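But I have a problem when using the DataFrame .loc function.
When df1 has more than 100,000 rows and df2 has more than 10,000 rows, it takes a long time to run.
I think the DataFrame's iterrows() is the main cause of the delay.
Does anyone know how to put the data type into df1 without using .loc?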
Answer 0 (score: 0)
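I think you can first create new columns in df1 holding every possible substring, one column per unique value of the DATA_SIZE column, then stack df1 and merge it with df2, removing repeated matches with drop_duplicates. If the order is important, reindex the result against the original df1: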
for i in df2['DATA_SIZE'].unique():
    # one new column per distinct size, holding the first i characters of DATA
    df1.loc[:, i] = df1['DATA'].str[:i]
print(df1)
DATA 3 4 5
0 AA1 AA1 AA1 AA1
1 AAABB AAA AAAB AAABB
2 AACCCDDDD AAC AACC AACCC
3 AACCCDDDDEEEEE AAC AACC AACCC
4 AA111 AA1 AA11 AA111
5 AA11222 AA1 AA11 AA112
df3 = df1.set_index('DATA').stack().reset_index(level=1,drop=True).reset_index(name='MATCH')
print(df3)
DATA MATCH
0 AA1 AA1
1 AA1 AA1
2 AA1 AA1
3 AAABB AAA
4 AAABB AAAB
5 AAABB AAABB
6 AACCCDDDD AAC
7 AACCCDDDD AACC
8 AACCCDDDD AACCC
9 AACCCDDDDEEEEE AAC
10 AACCCDDDDEEEEE AACC
11 AACCCDDDDEEEEE AACCC
12 AA111 AA1
13 AA111 AA11
14 AA111 AA111
15 AA11222 AA1
16 AA11222 AA11
17 AA11222 AA112
df = pd.merge(df3, df2, left_on="MATCH", right_on="DATA_CLASS").drop_duplicates()
df = df[['DATA','DATA_INFO']]
print(df)
DATA DATA_INFO
0 AA1 TYPE1
3 AA111 TYPE1
4 AA11222 TYPE1
5 AAABB TYPE2
6 AACCCDDDD TYPE3
7 AACCCDDDDEEEEE TYPE3
# if the order of column DATA is important
print(df.set_index('DATA').reindex(df1.set_index('DATA').index).reset_index())
DATA DATA_INFO
0 AA1 TYPE1
1 AAABB TYPE2
2 AACCCDDDD TYPE3
3 AACCCDDDDEEEEE TYPE3
4 AA111 TYPE1
5 AA11222 TYPE1
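
For reference, the steps above can be collected into one Python 3 function. This is only a sketch of a variant, not the exact code from the answer: the function name add_data_info and the helper column ROW are made up for illustration, it builds the long table with pd.concat instead of stack, and it matches on DATA_SIZE as well as on the prefix so that a prefix only counts for the size it was cut with. It also assumes that when several df2 rows match the same df1 row, keeping the first match found is acceptable, which differs from the loop in the question, where the last matching df2 row wins.

```python
import pandas as pd


def add_data_info(df1, df2):
    """Return a copy of df1 with a DATA_INFO column looked up by prefix in df2.

    add_data_info and the ROW helper column are hypothetical names.
    """
    # Build one long table: for every distinct DATA_SIZE, the prefix of that
    # length for every row of df1, tagged with the row position.
    pieces = [
        pd.DataFrame({
            "ROW": pd.RangeIndex(len(df1)),
            "DATA_CLASS": df1["DATA"].str[:int(size)].to_numpy(),
            "DATA_SIZE": int(size),
        })
        for size in df2["DATA_SIZE"].unique()
    ]
    long = pd.concat(pieces, ignore_index=True)

    # A prefix matches only if both the text and the size agree with df2.
    hits = long.merge(df2, on=["DATA_CLASS", "DATA_SIZE"])

    # Assumption: if a row matches more than once, keep the first match.
    hits = hits.drop_duplicates("ROW").set_index("ROW")

    out = df1.copy()
    # Rows without any match get an empty string, as in the question's code.
    info = hits["DATA_INFO"].reindex(pd.RangeIndex(len(df1))).fillna("")
    out["DATA_INFO"] = info.to_numpy()
    return out


df1 = pd.DataFrame(
    {"DATA": ["AA1", "AAABB", "AACCCDDDD", "AACCCDDDDEEEEE", "AA111", "AA11222"]}
)
df2 = pd.DataFrame({
    "DATA_CLASS": ["AA1", "AAAB", "AACCC"],
    "DATA_INFO": ["TYPE1", "TYPE2", "TYPE3"],
    "DATA_SIZE": [3, 4, 5],
})
out = add_data_info(df1, df2)
print(out)
```

The Python-level loop here runs once per distinct DATA_SIZE rather than once per row of df2, so its cost depends on how many different sizes df2 contains. The long table has len(df1) times that many rows, which is worth keeping in mind for memory.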