Select rows from a pandas DataFrame on multiple columns using a NumPy 2D array

Date: 2017-09-11 17:17:16

Tags: python pandas numpy select dataframe

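Data

I have a dataframe with 5 columns:

  • origin latitude and longitude (origin_lat, origin_lng)
  • destination latitude and longitude (dest_lat, dest_lng)
  • a score computed from the other fields

I have a matrix M that contains pairs of origin and destination latitude/longitude. Some of these pairs exist in the dataframe, others do not.

Goal

My goal is twofold:

  1. Select all pairs in M that are not present in the first four columns of the dataframe, apply the function func to them (to compute the score column), and append the results to the existing dataframe. Note: the score should not be recomputed for rows that already exist.
  2. After adding the missing rows, select in a new dataframe dfs all the rows defined by the selection matrix M.

Sample code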

    import numpy as np
    import pandas as pd

    # STEP 1: Generate example data
    ctr_lat = 40.676762
    ctr_lng = -73.926420
    N = 12
    N2 = 3
    
    data = np.array([ctr_lat+np.random.random((N))/10,
                     ctr_lng+np.random.random((N))/10,
                     ctr_lat+np.random.random((N))/10,
                     ctr_lng+np.random.random((N))/10]).transpose()
    
    # Example function - does not matter what it does
    def func(x):
        return np.random.random()
    
    # Create dataframe
    geocols = ['origin_lat','origin_lng','dest_lat','dest_lng']
    df = pd.DataFrame(data,columns=geocols)
    df['score'] = df.apply(func,axis=1)
    

This gives me a dataframe df like this:

        origin_lat  origin_lng   dest_lat   dest_lng     score
    0    40.684887  -73.924921  40.758641 -73.847438  0.820080
    1    40.703129  -73.885330  40.774341 -73.881671  0.104320
    2    40.761998  -73.898955  40.767681 -73.865001  0.564296
    3    40.736863  -73.859832  40.681693 -73.907879  0.605974
    4    40.761298  -73.853480  40.696195 -73.846205  0.779520
    5    40.712225  -73.892623  40.722372 -73.868877  0.628447
    6    40.683086  -73.846077  40.730014 -73.900831  0.320041
    7    40.726003  -73.909059  40.760083 -73.829180  0.903317
    8    40.748258  -73.839682  40.713100 -73.834253  0.457138
    9    40.761590  -73.923624  40.746552 -73.870352  0.867617
    10   40.748064  -73.913599  40.746997 -73.894851  0.836674
    11   40.771164  -73.855319  40.703426 -73.829990  0.010908
    

I can then artificially create the selection matrix M, which contains 3 rows that exist in the dataframe and 3 rows that do not.

    # STEP 2: Generate data to select
    # As an example, I select 3 rows that are part of the dataframe, and 3 that are not
    data2 = np.array([ctr_lat+np.random.random((N2))/10,
                      ctr_lng+np.random.random((N2))/10,
                      ctr_lat+np.random.random((N2))/10,
                      ctr_lng+np.random.random((N2))/10]).transpose()
    
    M = np.concatenate((data[4:7,:],data2))
    

The matrix M looks like this:

    array([[ 40.7612977 , -73.85348031,  40.69619549, -73.84620489],
           [ 40.71222463, -73.8926234 ,  40.72237185, -73.86887696],
           [ 40.68308567, -73.84607722,  40.73001434, -73.90083107],
           [ 40.7588412 , -73.87128079,  40.76750639, -73.91945371],
           [ 40.74686156, -73.84804047,  40.72378653, -73.92207075],
           [ 40.6922673 , -73.88275402,  40.69708748, -73.87905543]])
    

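From here, I don't know how to find which rows of M are not in df and add them. I also don't know how to select from df all the rows that are in M.

My idea was to identify the missing rows, append them to df with a nan score, and then recompute the score only for the nan rows. However, I don't know how to select these rows efficiently without looping over every element of the matrix M.

Any suggestions? Thanks a lot for your help!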

5 Answers:

Answer 0 (score: 7)

Is there any reason not to use merge?

df2 = pd.DataFrame(M, columns=geocols) 
df = df.merge(df2, how='outer')
ix = df.score.isnull()
df.loc[ix, 'score'] = df.loc[ix].apply(func, axis=1)

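It does exactly what you suggested: it adds the missing rows to df with a nan score, identifies the nans, and computes the score for those rows.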
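For reference, here is a minimal self-contained sketch of the same merge idea on toy data. The toy values and the deterministic placeholder func are assumptions made for illustration only. It also covers the second goal with an inner merge, which is not part of the answer above. Keep in mind that merging on float columns requires exact equality, and that an outer merge may reorder rows.

```python
import numpy as np
import pandas as pd

geocols = ['origin_lat', 'origin_lng', 'dest_lat', 'dest_lng']

# Hypothetical toy data: three existing rows with known scores.
df = pd.DataFrame([[1.0, 2.0, 3.0, 4.0, 0.1],
                   [5.0, 6.0, 7.0, 8.0, 0.2],
                   [9.0, 10.0, 11.0, 12.0, 0.3]],
                  columns=geocols + ['score'])

# Selection matrix: the first row exists in df, the second does not.
M = np.array([[5.0, 6.0, 7.0, 8.0],
              [13.0, 14.0, 15.0, 16.0]])

def func(row):
    # Placeholder score function (an assumption for this sketch).
    return row['origin_lat'] + row['dest_lat']

df2 = pd.DataFrame(M, columns=geocols)

# Goal 1: the outer merge keeps every row of df and adds the rows of M
# that are missing from it, with score = NaN.
merged = df.merge(df2, how='outer', on=geocols)
missing = merged['score'].isnull()
if missing.any():
    # Only the new rows are scored; existing scores are left untouched.
    merged.loc[missing, 'score'] = merged.loc[missing].apply(func, axis=1)

# Goal 2: an inner merge keeps only the rows that appear in M.
dfs = merged.merge(df2, how='inner', on=geocols)
print(dfs)
```

If M can contain the same row more than once, calling df2.drop_duplicates() before merging avoids duplicated rows in the result.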

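Answer 1 (score: 2)

This solution does loop over each row of M, but not over each element. The steps are:

  1. Go through each row of M and determine whether it is in df. If it is, save the index. If it is not, compute the score and save it.
  2. Create the M dataframe by taking the new M rows from above and appending the rows that were found in df.
  3. Create the new version of the dataframe by simply appending the new M rows.

Hope this helps. I realize it still has a loop, but I haven't figured out how to get rid of it. Your question also only says that df may be large and that you want to avoid looping over the elements of M, which this at least avoids by looping over the rows.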

    M_in_df = []
    M_not_in_df = []
    
    for m in M:
        df_index = (df.iloc[:,:4].values == m).all(axis=1)
        if df_index.any():
            M_in_df.append(np.argmax(df_index))
        else:
            M_not_in_df.append(np.append(m, func(m)))    
    
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    M_df = pd.concat([pd.DataFrame(M_not_in_df, columns=df.columns), df.iloc[M_in_df]], ignore_index=True)
    
    new_df = pd.concat([df, pd.DataFrame(M_not_in_df, columns=df.columns)], ignore_index=True)
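One possible way to drop the row loop is NumPy broadcasting. The sketch below is not part of the original answer; the toy data and the placeholder func are assumptions. It builds a len(M) × len(df) boolean table, so memory grows with the product of both sizes, and func is still called once per new row.

```python
import numpy as np
import pandas as pd

geocols = ['origin_lat', 'origin_lng', 'dest_lat', 'dest_lng']

# Hypothetical toy data.
df = pd.DataFrame([[1.0, 2.0, 3.0, 4.0, 0.1],
                   [5.0, 6.0, 7.0, 8.0, 0.2],
                   [9.0, 10.0, 11.0, 12.0, 0.3]],
                  columns=geocols + ['score'])
M = np.array([[5.0, 6.0, 7.0, 8.0],
              [13.0, 14.0, 15.0, 16.0]])

def func(m):
    # Placeholder score function (an assumption for this sketch).
    return float(m[0] + m[2])

existing = df[geocols].to_numpy()

# eq[i, j] is True when row i of M equals row j of df on all four coordinates.
eq = (M[:, None, :] == existing[None, :, :]).all(axis=2)
in_df = eq.any(axis=1)               # which rows of M already exist in df
M_in_df = eq.argmax(axis=1)[in_df]   # positions in df of those rows

# Score only the rows of M that are not in df yet.
new_rows = M[~in_df]
new_scores = np.array([func(m) for m in new_rows])
M_not_in_df = pd.DataFrame(np.column_stack([new_rows, new_scores]),
                           columns=df.columns)

new_df = pd.concat([df, M_not_in_df], ignore_index=True)
M_df = pd.concat([M_not_in_df, df.iloc[M_in_df]], ignore_index=True)
```

For large inputs, a merge-based approach is likely to scale better than the full comparison table.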
    

Answer 2 (score: 2)

Convert M to a DataFrame and concatenate it with df:

df2 = pd.DataFrame(M, columns=geocols)
df3 = pd.concat([df, df2], ignore_index=True)

Drop duplicate rows, based only on the columns in geocols:

df3 = df3.drop_duplicates(subset=geocols)

Get a mask of the rows where score is NaN:

m = df3.score.isnull()

Then apply the score function to the masked rows and store the result back in df3.

You will get a SettingWithCopyWarning, but it works.
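The exact assignment used in the last step is not shown here, so the following is only a sketch of the whole answer under assumptions: toy data, a deterministic placeholder func, and an assignment through .loc, which normally avoids the SettingWithCopyWarning.

```python
import numpy as np
import pandas as pd

geocols = ['origin_lat', 'origin_lng', 'dest_lat', 'dest_lng']

# Hypothetical toy data.
df = pd.DataFrame([[1.0, 2.0, 3.0, 4.0, 0.1],
                   [5.0, 6.0, 7.0, 8.0, 0.2],
                   [9.0, 10.0, 11.0, 12.0, 0.3]],
                  columns=geocols + ['score'])
M = np.array([[5.0, 6.0, 7.0, 8.0],
              [13.0, 14.0, 15.0, 16.0]])

def func(row):
    # Placeholder score function (an assumption for this sketch).
    return row['origin_lat'] + row['dest_lat']

df2 = pd.DataFrame(M, columns=geocols)
df3 = pd.concat([df, df2], ignore_index=True)

# keep='first' (the default) retains the copy that came from df,
# i.e. the one that already has a score.
df3 = df3.drop_duplicates(subset=geocols).reset_index(drop=True)

# Score only the rows that have no score yet.
m = df3['score'].isnull()
if m.any():
    df3.loc[m, 'score'] = df3.loc[m].apply(func, axis=1)
print(df3)
```

Because df comes first in the concatenation, the rows that survive drop_duplicates keep their original scores.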

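Answer 3 (score: 2)

You are doing geospatial analysis, and I think it is important to adopt some standard approaches. Each row/entry is identified by a pair of coordinates, so converting them to WKT makes a lot of sense.

With WKT, all you need to check is whether the WKT of the new data is already found in the old data: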

# from shapely.wkt import dumps
# import shapely.geometry as sg

In [27]: M = np.array([[ 40.761998, -73.898955, 40.767681, -73.865001],
    ...:               [ 40.736863, -73.859832, 40.681693, -73.907879],
    ...:               [ 40.761298, -73.853480, 40.696195, -73.846205],
    ...:               [ 40.7588412 , -73.87128079,  40.76750639, -73.91945371],
    ...:               [ 40.74686156, -73.84804047,  40.72378653, -73.92207075],
    ...:               [ 40.6922673 , -73.88275402,  40.69708748, -73.87905543]])
In [28]: df
Out[28]: 
    origin_lat  origin_lng   dest_lat   dest_lng     score  
0    40.684887  -73.924921  40.758641 -73.847438  0.820080   
1    40.703129  -73.885330  40.774341 -73.881671  0.104320   
2    40.761998  -73.898955  40.767681 -73.865001  0.564296   
3    40.736863  -73.859832  40.681693 -73.907879  0.605974   
4    40.761298  -73.853480  40.696195 -73.846205  0.779520   
5    40.712225  -73.892623  40.722372 -73.868877  0.628447   
6    40.683086  -73.846077  40.730014 -73.900831  0.320041   
7    40.726003  -73.909059  40.760083 -73.829180  0.903317   
8    40.748258  -73.839682  40.713100 -73.834253  0.457138   
9    40.761590  -73.923624  40.746552 -73.870352  0.867617   
10   40.748064  -73.913599  40.746997 -73.894851  0.836674   
11   40.771164  -73.855319  40.703426 -73.829990  0.010908   

# Generate WKT for the original dataframe
In [29]: df['wkt'] = df.apply(lambda x: dumps(sg.LineString([x[:2], x[2:4]]),
                                              rounding_precision=6),
                              axis=1)

In [29]: df
Out[29]: 
    origin_lat  origin_lng   dest_lat   dest_lng     score                                                 wkt
0    40.684887  -73.924921  40.758641 -73.847438  0.820080   LINESTRING (40.684887 -73.924921, 40.758641 -7...
1    40.703129  -73.885330  40.774341 -73.881671  0.104320   LINESTRING (40.703129 -73.885330, 40.774341 -7...
2    40.761998  -73.898955  40.767681 -73.865001  0.564296   LINESTRING (40.761998 -73.898955, 40.767681 -7...
3    40.736863  -73.859832  40.681693 -73.907879  0.605974   LINESTRING (40.736863 -73.859832, 40.681693 -7...
4    40.761298  -73.853480  40.696195 -73.846205  0.779520   LINESTRING (40.761298 -73.853480, 40.696195 -7...
5    40.712225  -73.892623  40.722372 -73.868877  0.628447   LINESTRING (40.712225 -73.892623, 40.722372 -7...
6    40.683086  -73.846077  40.730014 -73.900831  0.320041   LINESTRING (40.683086 -73.846077, 40.730014 -7...
7    40.726003  -73.909059  40.760083 -73.829180  0.903317   LINESTRING (40.726003 -73.909059, 40.760083 -7...
8    40.748258  -73.839682  40.713100 -73.834253  0.457138   LINESTRING (40.748258 -73.839682, 40.713100 -7...
9    40.761590  -73.923624  40.746552 -73.870352  0.867617   LINESTRING (40.761590 -73.923624, 40.746552 -7...
10   40.748064  -73.913599  40.746997 -73.894851  0.836674   LINESTRING (40.748064 -73.913599, 40.746997 -7...
11   40.771164  -73.855319  40.703426 -73.829990  0.010908   LINESTRING (40.771164 -73.855319, 40.703426 -7...

# Generate WKT for the new data
In [30]: new_wkt = [dumps(sg.LineString(r.reshape(2,2)), 
                          rounding_precision=6)
                    for r in M]
In [30]: np.isin(new_wkt, df.wkt)
Out[30]: array([ True,  True,  True, False, False, False], dtype=bool)

# Only put the WKT not found in the original dataframe into a new dataframe
In [31]: df2 = pd.DataFrame(M[~np.isin(new_wkt, df.wkt)], columns=['origin_lat', 'origin_lng', 'dest_lat', 'dest_lng'])
In [32]: df2['wkt'] = np.array(new_wkt)[~np.isin(new_wkt, df.wkt)]

# Only do calculation for the new entries
In [33]: df2['score'] = 0  # or do whatever score calculation needed

# Combine the new to the old
In [34]: df.append(df2)
Out[34]: 
     dest_lat   dest_lng  origin_lat  origin_lng     score                                                wkt
0   40.758641 -73.847438   40.684887  -73.924921  0.820080  LINESTRING (40.684887 -73.924921, 40.758641 -7...
1   40.774341 -73.881671   40.703129  -73.885330  0.104320  LINESTRING (40.703129 -73.885330, 40.774341 -7...
2   40.767681 -73.865001   40.761998  -73.898955  0.564296  LINESTRING (40.761998 -73.898955, 40.767681 -7...
3   40.681693 -73.907879   40.736863  -73.859832  0.605974  LINESTRING (40.736863 -73.859832, 40.681693 -7...
4   40.696195 -73.846205   40.761298  -73.853480  0.779520  LINESTRING (40.761298 -73.853480, 40.696195 -7...
5   40.722372 -73.868877   40.712225  -73.892623  0.628447  LINESTRING (40.712225 -73.892623, 40.722372 -7...
6   40.730014 -73.900831   40.683086  -73.846077  0.320041  LINESTRING (40.683086 -73.846077, 40.730014 -7...
7   40.760083 -73.829180   40.726003  -73.909059  0.903317  LINESTRING (40.726003 -73.909059, 40.760083 -7...
8   40.713100 -73.834253   40.748258  -73.839682  0.457138  LINESTRING (40.748258 -73.839682, 40.713100 -7...
9   40.746552 -73.870352   40.761590  -73.923624  0.867617  LINESTRING (40.761590 -73.923624, 40.746552 -7...
10  40.746997 -73.894851   40.748064  -73.913599  0.836674  LINESTRING (40.748064 -73.913599, 40.746997 -7...
11  40.703426 -73.829990   40.771164  -73.855319  0.010908  LINESTRING (40.771164 -73.855319, 40.703426 -7...
0   40.767506 -73.919454   40.758841  -73.871281  0.000000  LINESTRING (40.758841 -73.871281, 40.767506 -7...
1   40.723787 -73.922071   40.746862  -73.848040  0.000000  LINESTRING (40.746862 -73.848040, 40.723787 -7...
2   40.697087 -73.879055   40.692267  -73.882754  0.000000  LINESTRING (40.692267 -73.882754, 40.697087 -7...

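Additional comments:

  1. With the geospatial information encoded as WKT/WKB, it is easy to use the available geospatial libraries to compute the score column (if such calculations are involved).
  2. Setting the right precision for the WKT is often a necessary consideration for geospatial data (here I set rounding_precision to 6).
  3. Performance. Every time new rows are added from the new dataframe df2 (the df.append(df2) part), the dimensions of df change. In essence, this means that if this kind of update happens often, performance will suffer.
  4. If the analysis is built around geospatial data, geopandas may be worth looking into.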
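To illustrate the precision point without depending on shapely, here is a small sketch that builds plain string keys from coordinates formatted to a fixed number of decimals. The helper make_keys and the constant PRECISION are hypothetical names introduced for this example, and the key format is not WKT. The sample values are taken from the data shown earlier.

```python
import numpy as np
import pandas as pd

geocols = ['origin_lat', 'origin_lng', 'dest_lat', 'dest_lng']
PRECISION = 6  # same number of decimals as rounding_precision above

# One existing row, stored with 6 decimals as in the printed dataframe.
df = pd.DataFrame([[40.761298, -73.853480, 40.696195, -73.846205, 0.779520]],
                  columns=geocols + ['score'])

# The first row is the same trip at full precision, the second row is new.
M = np.array([[40.7612977, -73.85348031, 40.69619549, -73.84620489],
              [40.7588412, -73.87128079, 40.76750639, -73.91945371]])

def make_keys(coords, precision=PRECISION):
    # One string key per row, with every coordinate formatted
    # to a fixed number of decimals.
    return np.array([' '.join(f'{v:.{precision}f}' for v in row)
                     for row in coords])

old_keys = make_keys(df[geocols].to_numpy())
new_keys = make_keys(M)

# Rows of M whose rounded key is not present in df yet.
is_new = ~np.isin(new_keys, old_keys)
print(is_new)
```

An exact float comparison would treat the first row of M as new, because it differs from the stored values after the sixth decimal; comparing rounded keys treats it as already present.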

Answer 4 (score: 0)

Let's first shape M as a dataframe named df_temp:

In [1]: df_temp=pd.DataFrame(M,columns=('origin_lat','origin_lng','dest_lat','dest_lng'))

In [2]: df_temp
Out[2]:
   origin_lat  origin_lng   dest_lat   dest_lng
0   40.724872  -73.843830  40.768628 -73.875295
1   40.744625  -73.858908  40.770675 -73.915897
2   40.683664  -73.916877  40.700891 -73.904609
3   40.774582  -73.871768  40.703176 -73.833921
4   40.680940  -73.839505  40.752041 -73.882552
5   40.677105  -73.897702  40.743859 -73.883683

Using merge, we can now easily track the elements of M that are in df:

In [3]: dfs = df.merge(df_temp,on=['origin_lat','origin_lng','dest_lat','dest_lng'],
right_index=True)

In [4]: dfs
Out[4]:
   origin_lat  origin_lng   dest_lat   dest_lng     score
4   40.724872  -73.843830  40.768628 -73.875295  0.705182
5   40.744625  -73.858908  40.770675 -73.915897  0.724282
6   40.683664  -73.916877  40.700891 -73.904609  0.645395

Note: the right_index argument lets us keep the index of df, so that we know which rows of df are also in M.
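If your pandas version does not accept on= together with right_index=True, one alternative way to keep df's index labels is to move the index into a column before merging. This is a sketch on toy data, not the method used in this answer; the values and the index labels are assumptions.

```python
import numpy as np
import pandas as pd

geocols = ['origin_lat', 'origin_lng', 'dest_lat', 'dest_lng']

# Hypothetical toy data with non-default index labels.
df = pd.DataFrame([[1.0, 2.0, 3.0, 4.0, 0.1],
                   [5.0, 6.0, 7.0, 8.0, 0.2],
                   [9.0, 10.0, 11.0, 12.0, 0.3]],
                  columns=geocols + ['score'],
                  index=[10, 11, 12])
M = np.array([[5.0, 6.0, 7.0, 8.0],
              [13.0, 14.0, 15.0, 16.0]])
df_temp = pd.DataFrame(M, columns=geocols)

dfs = (df.reset_index()                # unnamed index becomes a column called 'index'
         .merge(df_temp, on=geocols)   # inner join on the four coordinate columns
         .set_index('index')           # restore the original labels of df
         .rename_axis(None))
print(dfs)
```

The index of dfs then holds the labels of the df rows that also appear in M.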

Finally, we can add to df the rows of df_temp that are not in df:

# Compute the scores of df_temp
df_temp['score'] = [func(df_temp.iloc[i]) for i in range(len(df_temp))]
# Append the rows of df_temp to df (DataFrame.append was removed in pandas 2.0)
df = pd.concat([df, df_temp], ignore_index=True)
# Drop duplicates, keeping the first occurrence (the row already in df)
df = df.drop_duplicates(subset=['origin_lat','origin_lng','dest_lat','dest_lng'])

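Note: the subset argument of drop_duplicates is needed here because your score function is non-deterministic, so a recomputed row would not match the existing row on the score column.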