我有2个数据帧如下:
data1如下所示:
id address
1 11123451
2 78947591
data2如下所示:
lowerbound_address upperbound_address place
78392888 89000000 X
10000000 20000000 Y
我想在data1中创建另一列名为" place"其中包含id来自的位置。 例如,在上述情况下, 对于id 1,我希望place列包含Y,对于id 2,我希望place列包含X. 会有很多ID来自同一个地方。有些ids没有匹配。
我正在尝试使用以下代码进行操作。
places = []
for index, row in data1.iterrows():
for idx, r in data2.iterrows():
if r['lowerbound_address'] <= row['address'] <= r['upperbound_address']:
places.append(r['place'])
这里的地址是浮点值。
它需要永远运行这段代码。这让我想知道我的代码是否正确,或者是否有更快的执行方式。
任何帮助将不胜感激。 谢谢!
答案 0 :(得分:2)
您可以先使用merge
使用cross
加入,然后按boolean indexing
过滤值。最后按drop
删除了不必要的列:
data1['tmp'] = 1
data2['tmp'] = 1
df = pd.merge(data1, data2, on='tmp', how='outer')
df = df[(df.lowerbound_address <= df.address) & (df.upperbound_address >= df.address)]
df = df.drop(['lowerbound_address','upperbound_address', 'tmp'], axis=1)
print (df)
id address place
1 1 11123451 Y
2 2 78947591 X
使用itertuples
的另一个解决方案,最后创建DataFrame.from_records
:
places = []
for row1 in data1.itertuples():
for row2 in data2.itertuples():
#print (row1.address)
if (row2.lowerbound_address <= row1.address <= row2.upperbound_address):
places.append((row1.id, row1.address, row2.place))
print (places)
[(1, 11123451, 'Y'), (2, 78947591, 'X')]
df = pd.DataFrame.from_records(places)
df.columns=['id','address','place']
print (df)
id address place
0 1 11123451 Y
1 2 78947591 X
apply
的另一个解决方案:
def f(x):
for row2 in data2.itertuples():
if (row2.lowerbound_address <= x <= row2.upperbound_address):
return pd.Series([x, row2.place], index=['address','place'])
df = data1.set_index('id')['address'].apply(f).reset_index()
print (df)
id address place
0 1 11123451 Y
1 2 78947591 X
编辑:
<强>计时强>:
N = 1000
:
如果某些值不在范围内,则解决方案b
和c
将被忽略。检查df1
的最后一行。
In [73]: %timeit (data1.set_index('id')['address'].apply(f).reset_index())
1 loop, best of 3: 2.06 s per loop
In [74]: %timeit (a(df1a, df2a))
1 loop, best of 3: 82.2 ms per loop
In [75]: %timeit (b(df1b, df2b))
1 loop, best of 3: 3.17 s per loop
In [76]: %timeit (c(df1c, df2c))
100 loops, best of 3: 2.71 ms per loop
时间安排的代码:
np.random.seed(123)
N = 1000
data1 = pd.DataFrame({'id':np.arange(1,N+1),
'address': np.random.randint(N*10, size=N)}, columns=['id','address'])
#add last row with value out of range
data1.loc[data1.index[-1]+1, ['id','address']] = [data1.index[-1]+1, -1]
data1 = data1.astype(int)
print (data1.tail())
data2 = pd.DataFrame({'lowerbound_address':np.arange(1, N*10,10),
'upperbound_address':np.arange(10,N*10+10, 10),
'place': np.random.randint(40, size=N)})
print (data2.tail())
df1a, df1b, df1c = data1.copy(),data1.copy(),data1.copy()
df2a, df2b ,df2c = data2.copy(),data2.copy(),data2.copy()
def a(data1, data2):
data1['tmp'] = 1
data2['tmp'] = 1
df = pd.merge(data1, data2, on='tmp', how='outer')
df = df[(df.lowerbound_address <= df.address) & (df.upperbound_address >= df.address)]
df = df.drop(['lowerbound_address','upperbound_address', 'tmp'], axis=1)
return (df)
def b(data1, data2):
places = []
for row1 in data1.itertuples():
for row2 in data2.itertuples():
#print (row1.address)
if (row2.lowerbound_address <= row1.address <= row2.upperbound_address):
places.append((row1.id, row1.address, row2.place))
df = pd.DataFrame.from_records(places)
df.columns=['id','address','place']
return (df)
def f(x):
#use for ... else for add NaN to values out of range
#http://stackoverflow.com/q/9979970/2901002
for row2 in data2.itertuples():
if (row2.lowerbound_address <= x <= row2.upperbound_address):
return pd.Series([x, row2.place], index=['address','place'])
else:
return pd.Series([x, np.nan], index=['address','place'])
def c(data1,data2):
data1 = data1.sort_values('address')
data2 = data2.sort_values('lowerbound_address')
df = pd.merge_asof(data1, data2, left_on='address', right_on='lowerbound_address')
df = df.drop(['lowerbound_address','upperbound_address'], axis=1)
return df.sort_values('id')
print (data1.set_index('id')['address'].apply(f).reset_index())
print (a(df1a, df2a))
print (b(df1b, df2b))
print (c(df1c, df2c))
只有c
DataFrame
的解决方案与大型N=1M
的效果非常好:
In [84]: %timeit (c(df1c, df2c))
1 loop, best of 3: 525 ms per loop
:
Error opening the file
有关merge_asof
的更多信息。