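Data
I have a dataframe with 5 columns: the origin latitude/longitude (origin_lat, origin_lng), the destination latitude/longitude (dest_lat, dest_lng), and a score.
I also have a matrix M that contains pairs of origin and destination latitude/longitude. Some of these pairs exist in the dataframe, others do not.
Goal
My goal is twofold:
1. Select from M all pairs that are not present in the first four columns of the dataframe, apply the function func to them (to compute the score column), and append the results to the existing dataframe. Note: we should not recompute the score for rows that already exist.
2. Select, in a new dataframe dfs, all the rows defined by the selection matrix M.
Example code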
# STEP 1: Generate example data
import numpy as np
import pandas as pd

ctr_lat = 40.676762
ctr_lng = -73.926420
N = 12
N2 = 3
data = np.array([ctr_lat+np.random.random((N))/10,
                 ctr_lng+np.random.random((N))/10,
                 ctr_lat+np.random.random((N))/10,
                 ctr_lng+np.random.random((N))/10]).transpose()
# Example function - does not matter what it does
def func(x):
    return np.random.random()
# Create dataframe
geocols = ['origin_lat','origin_lng','dest_lat','dest_lng']
df = pd.DataFrame(data,columns=geocols)
df['score'] = df.apply(func,axis=1)
This gives me a dataframe df like this:
origin_lat origin_lng dest_lat dest_lng score
0 40.684887 -73.924921 40.758641 -73.847438 0.820080
1 40.703129 -73.885330 40.774341 -73.881671 0.104320
2 40.761998 -73.898955 40.767681 -73.865001 0.564296
3 40.736863 -73.859832 40.681693 -73.907879 0.605974
4 40.761298 -73.853480 40.696195 -73.846205 0.779520
5 40.712225 -73.892623 40.722372 -73.868877 0.628447
6 40.683086 -73.846077 40.730014 -73.900831 0.320041
7 40.726003 -73.909059 40.760083 -73.829180 0.903317
8 40.748258 -73.839682 40.713100 -73.834253 0.457138
9 40.761590 -73.923624 40.746552 -73.870352 0.867617
10 40.748064 -73.913599 40.746997 -73.894851 0.836674
11 40.771164 -73.855319 40.703426 -73.829990 0.010908
I can then artificially create the selection matrix M, which contains 3 rows that exist in the dataframe and 3 rows that do not.
# STEP 2: Generate data to select
# As an example, I select 3 rows that are part of the dataframe, and 3 that are not
data2 = np.array([ctr_lat+np.random.random((N2))/10,
                  ctr_lng+np.random.random((N2))/10,
                  ctr_lat+np.random.random((N2))/10,
                  ctr_lng+np.random.random((N2))/10]).transpose()
M = np.concatenate((data[4:7,:],data2))
The matrix M looks like this:
array([[ 40.7612977 , -73.85348031, 40.69619549, -73.84620489],
[ 40.71222463, -73.8926234 , 40.72237185, -73.86887696],
[ 40.68308567, -73.84607722, 40.73001434, -73.90083107],
[ 40.7588412 , -73.87128079, 40.76750639, -73.91945371],
[ 40.74686156, -73.84804047, 40.72378653, -73.92207075],
[ 40.6922673 , -73.88275402, 40.69708748, -73.87905543]])
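From here, I do not know how to find which rows of M are missing from df and add them. Nor do I know how to select from df all the rows that are in M.
Ideas
My idea was to identify the missing rows, append them to df with a nan score, and then recompute the score for the nan rows only. However, I do not know how to select these rows efficiently without looping over each element of the matrix M.
Any suggestions? Thanks a lot for your help!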
Answer 0 (score: 7)
Is there any reason not to use merge?
df2 = pd.DataFrame(M, columns=geocols)
df = df.merge(df2, how='outer')
ix = df.score.isnull()
df.loc[ix, 'score'] = df.loc[ix].apply(func, axis=1)
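It does exactly what you proposed: it adds the missing rows to df with a nan score, identifies the nans, and computes the score for those rows.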
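Both goals from the question can be sketched with a single merge. The following is a minimal sketch, not the answerer's code: it assumes merge's indicator=True option to tag where each row came from, uses a tiny made-up dataset, and swaps in a deterministic placeholder func so the result is easy to check. Note that merging on float columns relies on exact equality of the coordinates.

```python
import numpy as np
import pandas as pd

geocols = ['origin_lat', 'origin_lng', 'dest_lat', 'dest_lng']

def func(row):
    # Placeholder score (assumption): any function of one row would do
    return float(row[geocols].sum())

# Tiny made-up data standing in for the question's df and M
df = pd.DataFrame([[1.0, 2.0, 3.0, 4.0],
                   [5.0, 6.0, 7.0, 8.0]], columns=geocols)
df['score'] = df.apply(func, axis=1)
M = np.array([[5.0, 6.0, 7.0, 8.0],       # already in df
              [9.0, 10.0, 11.0, 12.0]])   # new pair
df2 = pd.DataFrame(M, columns=geocols)

# indicator=True adds a `_merge` column: 'left_only', 'right_only' or 'both'
merged = df.merge(df2, how='outer', on=geocols, indicator=True)

# Goal 1: score only the rows that came from M alone
new = merged['_merge'] == 'right_only'
merged.loc[new, 'score'] = merged.loc[new].apply(func, axis=1)

# Goal 2: dfs holds every row of M, each with its score
dfs = merged.loc[merged['_merge'] != 'left_only', geocols + ['score']]

# The updated dataframe, without the helper column
df = merged.drop(columns='_merge')
```

The `_merge` column does double duty here: 'right_only' marks the rows that still need a score, and anything other than 'left_only' marks the rows that belong to M.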
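Answer 1 (score: 2)
So, this solution loops over every row of M, but not over every element. The steps, as the code below shows, are:
1. For each row of M, check whether it matches a row of df on the first four columns.
2. If it does, record the index of the matching row in df; if not, compute its score and store the row together with the score.
3. Build M_df (the rows of M with their scores) and new_df (df plus the new rows).
Hope this helps. I realise it still has a loop, but I have not figured out how to get rid of it. Your question also only says that df may be large and that you want to avoid looping over the elements of M, which looping over rows at least avoids.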
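M_in_df = []      # indices of the df rows that match a row of M
M_not_in_df = []  # rows of M absent from df, with their new score appended
for m in M:
    # Boolean mask of the df rows whose four coordinate columns all equal m
    df_index = (df.iloc[:,:4].values == m).all(axis=1)
    if df_index.any():
        M_in_df.append(np.argmax(df_index))
    else:
        M_not_in_df.append(np.append(m, func(m)))
# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
new_rows = pd.DataFrame(M_not_in_df, columns=df.columns)
M_df = pd.concat([new_rows, df.iloc[M_in_df]], ignore_index=True)
new_df = pd.concat([df, new_rows], ignore_index=True)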
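The remaining loop over the rows of M can probably be avoided as well. The following is a minimal sketch of one option, not part of the answer above: it treats each row as a key in a pandas MultiIndex and tests membership in one call. The data is made up, and the sketch assumes the coordinate rows of df are unique (get_indexer needs unique keys) and that matches are exact float matches.

```python
import numpy as np
import pandas as pd

geocols = ['origin_lat', 'origin_lng', 'dest_lat', 'dest_lng']

# Made-up data standing in for the question's df and M
df = pd.DataFrame([[1.0, 2.0, 3.0, 4.0],
                   [5.0, 6.0, 7.0, 8.0],
                   [9.0, 10.0, 11.0, 12.0]], columns=geocols)
df['score'] = [0.1, 0.2, 0.3]
M = np.array([[5.0, 6.0, 7.0, 8.0],    # present in df (row 1)
              [0.0, 0.0, 0.0, 0.0]])   # absent from df

# One 4-tuple key per row
df_keys = pd.MultiIndex.from_frame(df[geocols])
M_keys = pd.MultiIndex.from_frame(pd.DataFrame(M, columns=geocols))

# One boolean per row of M: is this row already in df?
in_df = M_keys.isin(df_keys)
M_not_in_df = M[~in_df]

# Position in df of each row of M, -1 when the row is absent
pos = df_keys.get_indexer(M_keys)
M_in_df = pos[pos >= 0]
```

M_in_df and M_not_in_df play the same roles as in the loop version, except that M_not_in_df does not yet carry a score column.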
Answer 2 (score: 2)
Convert M to a DataFrame and concatenate it with df:
df2 = pd.DataFrame(M, columns=geocols)
df3 = pd.concat([df, df2], ignore_index=True)
Drop duplicate rows based only on the columns in geocols:
df3 = df3.drop_duplicates(subset=geocols)
Get a mask of the rows whose score is NaN:
m = df3.score.isnull()
Apply the score to the masked rows and store the result in df3.
You will get a SettingWithCopyWarning, but it works.
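The code for that last step is not shown above. A minimal sketch of the whole sequence, assuming a .loc assignment for the final step (which should also avoid the SettingWithCopyWarning), could look like this. The data and the deterministic placeholder func are made up for illustration; the existing scores are set to arbitrary values so it is visible that they are not recomputed.

```python
import numpy as np
import pandas as pd

geocols = ['origin_lat', 'origin_lng', 'dest_lat', 'dest_lng']

def func(row):
    # Placeholder score (assumption): any function of one row would do
    return float(row[geocols].sum())

# Made-up data: two scored rows in df, one known and one new row in M
df = pd.DataFrame([[1.0, 2.0, 3.0, 4.0],
                   [5.0, 6.0, 7.0, 8.0]], columns=geocols)
df['score'] = [0.5, 0.7]
M = np.array([[5.0, 6.0, 7.0, 8.0],
              [9.0, 10.0, 11.0, 12.0]])

df2 = pd.DataFrame(M, columns=geocols)
df3 = pd.concat([df, df2], ignore_index=True)

# keep='first' (the default) keeps the already-scored copy from df
df3 = df3.drop_duplicates(subset=geocols)

# Score only the rows that have no score yet, assigning through .loc
m = df3['score'].isnull()
df3.loc[m, 'score'] = df3.loc[m].apply(func, axis=1)
```

Because df comes first in the concatenation, drop_duplicates discards the unscored copy from M rather than the scored row from df.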
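Answer 3 (score: 2)
You are doing geospatial analysis, and I think it is important to adopt some standard approaches. That is, each row/entry is identified by a pair of coordinates, so it makes a lot of sense to convert them to WKT.
With WKT, all you need to check is whether the WKT of the new data is already found in the old data: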
# from shapely.wkt import dumps
# import shapely.geometry as sg
In [27]: M = np.array([[ 40.761998, -73.898955, 40.767681, -73.865001],
...: [ 40.736863, -73.859832, 40.681693, -73.907879],
...: [ 40.761298, -73.853480, 40.696195, -73.846205],
...: [ 40.7588412 , -73.87128079, 40.76750639, -73.91945371],
...: [ 40.74686156, -73.84804047, 40.72378653, -73.92207075],
...: [ 40.6922673 , -73.88275402, 40.69708748, -73.87905543]])
In [28]: df
Out[28]:
origin_lat origin_lng dest_lat dest_lng score
0 40.684887 -73.924921 40.758641 -73.847438 0.820080
1 40.703129 -73.885330 40.774341 -73.881671 0.104320
2 40.761998 -73.898955 40.767681 -73.865001 0.564296
3 40.736863 -73.859832 40.681693 -73.907879 0.605974
4 40.761298 -73.853480 40.696195 -73.846205 0.779520
5 40.712225 -73.892623 40.722372 -73.868877 0.628447
6 40.683086 -73.846077 40.730014 -73.900831 0.320041
7 40.726003 -73.909059 40.760083 -73.829180 0.903317
8 40.748258 -73.839682 40.713100 -73.834253 0.457138
9 40.761590 -73.923624 40.746552 -73.870352 0.867617
10 40.748064 -73.913599 40.746997 -73.894851 0.836674
11 40.771164 -73.855319 40.703426 -73.829990 0.010908
# Generate WKT for the original dataframe
In [29]: df['wkt'] = df.apply(lambda x: dumps(sg.LineString([x[:2], x[2:4]]),
rounding_precision=6),
axis=1)
In [29]: df
Out[29]:
origin_lat origin_lng dest_lat dest_lng score wkt
0 40.684887 -73.924921 40.758641 -73.847438 0.820080 LINESTRING (40.684887 -73.924921, 40.758641 -7...
1 40.703129 -73.885330 40.774341 -73.881671 0.104320 LINESTRING (40.703129 -73.885330, 40.774341 -7...
2 40.761998 -73.898955 40.767681 -73.865001 0.564296 LINESTRING (40.761998 -73.898955, 40.767681 -7...
3 40.736863 -73.859832 40.681693 -73.907879 0.605974 LINESTRING (40.736863 -73.859832, 40.681693 -7...
4 40.761298 -73.853480 40.696195 -73.846205 0.779520 LINESTRING (40.761298 -73.853480, 40.696195 -7...
5 40.712225 -73.892623 40.722372 -73.868877 0.628447 LINESTRING (40.712225 -73.892623, 40.722372 -7...
6 40.683086 -73.846077 40.730014 -73.900831 0.320041 LINESTRING (40.683086 -73.846077, 40.730014 -7...
7 40.726003 -73.909059 40.760083 -73.829180 0.903317 LINESTRING (40.726003 -73.909059, 40.760083 -7...
8 40.748258 -73.839682 40.713100 -73.834253 0.457138 LINESTRING (40.748258 -73.839682, 40.713100 -7...
9 40.761590 -73.923624 40.746552 -73.870352 0.867617 LINESTRING (40.761590 -73.923624, 40.746552 -7...
10 40.748064 -73.913599 40.746997 -73.894851 0.836674 LINESTRING (40.748064 -73.913599, 40.746997 -7...
11 40.771164 -73.855319 40.703426 -73.829990 0.010908 LINESTRING (40.771164 -73.855319, 40.703426 -7...
# Generate WKT for the new data
In [30]: new_wkt = [dumps(sg.LineString(r.reshape(2,2)),
rounding_precision=6)
for r in M]
In [30]: np.isin(new_wkt, df.wkt)
Out[30]: array([ True, True, True, False, False, False], dtype=bool)
# Only put the rows whose WKT is not found in the original dataframe into a new dataframe
In [31]: df2 = pd.DataFrame(M[~np.isin(new_wkt, df.wkt)], columns=['origin_lat', 'origin_lng', 'dest_lat', 'dest_lng'])
In [32]: df2['wkt'] = np.array(new_wkt)[~np.isin(new_wkt, df.wkt)]
# Only do calculation for the new entries
In [33]: df2['score'] = 0 # or do whatever score calculation needed
# Combine the new to the old
# (DataFrame.append was removed in pandas 2.0; pd.concat([df, df2]) is the replacement there)
In [34]: df.append(df2)
Out[34]:
dest_lat dest_lng origin_lat origin_lng score wkt
0 40.758641 -73.847438 40.684887 -73.924921 0.820080 LINESTRING (40.684887 -73.924921, 40.758641 -7...
1 40.774341 -73.881671 40.703129 -73.885330 0.104320 LINESTRING (40.703129 -73.885330, 40.774341 -7...
2 40.767681 -73.865001 40.761998 -73.898955 0.564296 LINESTRING (40.761998 -73.898955, 40.767681 -7...
3 40.681693 -73.907879 40.736863 -73.859832 0.605974 LINESTRING (40.736863 -73.859832, 40.681693 -7...
4 40.696195 -73.846205 40.761298 -73.853480 0.779520 LINESTRING (40.761298 -73.853480, 40.696195 -7...
5 40.722372 -73.868877 40.712225 -73.892623 0.628447 LINESTRING (40.712225 -73.892623, 40.722372 -7...
6 40.730014 -73.900831 40.683086 -73.846077 0.320041 LINESTRING (40.683086 -73.846077, 40.730014 -7...
7 40.760083 -73.829180 40.726003 -73.909059 0.903317 LINESTRING (40.726003 -73.909059, 40.760083 -7...
8 40.713100 -73.834253 40.748258 -73.839682 0.457138 LINESTRING (40.748258 -73.839682, 40.713100 -7...
9 40.746552 -73.870352 40.761590 -73.923624 0.867617 LINESTRING (40.761590 -73.923624, 40.746552 -7...
10 40.746997 -73.894851 40.748064 -73.913599 0.836674 LINESTRING (40.748064 -73.913599, 40.746997 -7...
11 40.703426 -73.829990 40.771164 -73.855319 0.010908 LINESTRING (40.771164 -73.855319, 40.703426 -7...
0 40.767506 -73.919454 40.758841 -73.871281 0.000000 LINESTRING (40.758841 -73.871281, 40.767506 -7...
1 40.723787 -73.922071 40.746862 -73.848040 0.000000 LINESTRING (40.746862 -73.848040, 40.723787 -7...
2 40.697087 -73.879055 40.692267 -73.882754 0.000000 LINESTRING (40.692267 -73.882754, 40.697087 -7...
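The membership test in this transcript works because both sides are rounded to the same precision before they are compared. A minimal sketch of the same idea without shapely, assuming plain rounded tuples as keys instead of WKT strings (the keys helper is hypothetical, and the data is a small excerpt of the values above):

```python
import numpy as np
import pandas as pd

geocols = ['origin_lat', 'origin_lng', 'dest_lat', 'dest_lng']

# One stored row (already rounded to 6 places, as in the printed dataframe)
df = pd.DataFrame([[40.761298, -73.853480, 40.696195, -73.846205]],
                  columns=geocols)
# The same pair at full precision, plus a pair that is not stored
M = np.array([[40.7612977, -73.85348031, 40.69619549, -73.84620489],
              [40.7588412, -73.87128079, 40.76750639, -73.91945371]])

def keys(a, decimals=6):
    # Hypothetical helper: one hashable tuple per row, rounded to `decimals`
    return [tuple(r) for r in np.round(np.asarray(a, dtype=float), decimals)]

known = set(keys(df[geocols].to_numpy()))

# True for the rows of M whose rounded key is not already stored
is_new = np.array([k not in known for k in keys(M)])
```

As with the WKT version, two values that sit on either side of a rounding boundary would still be treated as different, so the precision should be chosen with the data's accuracy in mind.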
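Additional comments:
1. [...] the score column (if such a calculation is involved).
2. Mind the floating-point precision (set to 6 here via rounding_precision).
3. Each update changes the dimensions of df (the df.append(df2) part). Essentially, this means that performance will suffer if such updates happen a lot.
Answer 4 (score: 0)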
Let's first shape M as a dataframe named df_temp:
In [1]: df_temp=pd.DataFrame(M,columns=('origin_lat','origin_lng','dest_lat','dest_lng'))
In [2]: df_temp
Out[2]:
origin_lat origin_lng dest_lat dest_lng
0 40.724872 -73.843830 40.768628 -73.875295
1 40.744625 -73.858908 40.770675 -73.915897
2 40.683664 -73.916877 40.700891 -73.904609
3 40.774582 -73.871768 40.703176 -73.833921
4 40.680940 -73.839505 40.752041 -73.882552
5 40.677105 -73.897702 40.743859 -73.883683
Using merge, we can now easily keep track of the elements of M that are in df:
In [3]: dfs = df.merge(df_temp,on=['origin_lat','origin_lng','dest_lat','dest_lng'],
                       right_index=True)
In [4]: dfs
Out[4]:
origin_lat origin_lng dest_lat dest_lng score
4 40.724872 -73.843830 40.768628 -73.875295 0.705182
5 40.744625 -73.858908 40.770675 -73.915897 0.724282
6 40.683664 -73.916877 40.700891 -73.904609 0.645395
Note: the right_index argument lets us keep the index of df, so that we know which rows of df are also in M.
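Recent pandas versions may reject a merge that combines on with right_index=True. A minimal sketch of another way to keep df's index, assuming a reset_index / set_index round trip around an inner merge (made-up data, not the answerer's code):

```python
import numpy as np
import pandas as pd

geocols = ['origin_lat', 'origin_lng', 'dest_lat', 'dest_lng']

# Made-up data: rows 1 and 2 of df also appear in M
df = pd.DataFrame([[1.0, 2.0, 3.0, 4.0],
                   [5.0, 6.0, 7.0, 8.0],
                   [9.0, 10.0, 11.0, 12.0]], columns=geocols)
df['score'] = [0.1, 0.2, 0.3]
M = np.array([[5.0, 6.0, 7.0, 8.0],
              [9.0, 10.0, 11.0, 12.0],
              [0.0, 0.0, 0.0, 0.0]])
df_temp = pd.DataFrame(M, columns=geocols)

# Move df's index into a column named 'index', merge, then restore it
dfs = (df.reset_index()
         .merge(df_temp, on=geocols, how='inner')
         .set_index('index'))
```

The index of dfs then holds the labels of the matching rows in df, which is what the note above is after.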
Finally, we can add to df the rows of df_temp that are not in df:
# Compute the scores of df_temp
df_temp['score'] = [func(df_temp.iloc[i]) for i in range(len(df_temp))]
# Append the rows of df_temp to df (DataFrame.append was removed in pandas 2.0)
df = pd.concat([df, df_temp], ignore_index=True)
# Drop duplicates on the coordinate columns, keeping the first (already scored) row
df = df.drop_duplicates(subset=['origin_lat','origin_lng','dest_lat','dest_lng'])
Note: the subset argument of drop_duplicates is needed here because your score function is non-deterministic.