Selecting rows from a pandas DataFrame based on cKDTree indices

Time: 2016-05-23 23:21:33

Tags: python pandas

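I am trying to do some quick-and-dirty reverse geocoding.

I have a DataFrame poi (roughly 50,000 rows) in which every point of interest has a lat/lng coordinate.

I also have a DataFrame postcode_existing (roughly 180,000 rows) that maps lat/lng coordinates to postcodes.

I extracted the relevant coordinate columns and used cKDTree to find, for each point of interest in poi, the nearest lat/lng coordinate in postcode_existing.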

import pandas as pd
import numpy as np
from scipy.spatial import cKDTree

# read poi and postcode csv files

# Extract subset
postcode_existing_coordinates = postcode_existing[['Latitude', 'Longitude']]

# Extract subset
poi_coordinates = poi[['Latitude', 'Longitude']]

# Construct tree
tree = cKDTree(postcode_existing_coordinates)

# Query
distances, indices = tree.query(poi_coordinates)

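This gives me the relevant indices. I now want to use these indices to select rows from the DataFrame postcode_existing.

I tried postcode_existing.ix[indices], but this does not seem to return the correct rows.

For example: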

>>> postcode_existing.ix[indices].head()
       Postcode  Latitude  Longitude   Easting  Northing   GridRef  \
78579   HA3 0NS  51.57553  -0.304296  517605.0  187658.0  TQ176876   
178499      NaN       NaN        NaN       NaN       NaN       NaN   
62392       NaN       NaN        NaN       NaN       NaN       NaN   
78662   HA3 0TA  51.58409  -0.288764  518659.0  188635.0  TQ186886   
79470       NaN       NaN        NaN       NaN       NaN       NaN   

                County District    Ward DistrictCode   ...   Terminated  \
78579   Greater London    Brent  Kenton    E09000005   ...          NaN   
178499             NaN      NaN     NaN          NaN   ...          NaN   
62392              NaN      NaN     NaN          NaN   ...          NaN   
78662   Greater London    Brent  Kenton    E09000005   ...          NaN   
79470              NaN      NaN     NaN          NaN   ...          NaN   

       Parish NationalPark Population Households   Built up area  \
78579     NaN          NaN       72.0       25.0  Greater London   
178499    NaN          NaN        NaN        NaN             NaN   
62392     NaN          NaN        NaN        NaN             NaN   
78662     NaN          NaN      152.0       39.0  Greater London   
79470     NaN          NaN        NaN        NaN             NaN   

       Built up sub-division  Lower layer super output area  \
78579                  Brent                     Brent 004D   
178499                   NaN                            NaN   
62392                    NaN                            NaN   
78662                  Brent                     Brent 003E   
79470                    NaN                            NaN   

                    Rural/urban  Region  
78579   Urban major conurbation  London  
178499                      NaN     NaN  
62392                       NaN     NaN  
78662   Urban major conurbation  London  
79470                       NaN     NaN  

[5 rows x 25 columns]

However:

>>> postcode_existing.iloc[78579]
Postcode                                                  NW1 3AU
Latitude                                                  51.5237
Longitude                                               -0.143188
Easting                                                    528915
Northing                                                   182163
GridRef                                                  TQ289821
County                                             Greater London
District                                              Westminster
Ward                                       Marylebone High Street
DistrictCode                                            E09000033
WardCode                                                E05000641
Country                                                   England
CountyCode                                              E11000009
Constituency                     Cities of London and Westminster
Introduced                                             1980-01-01
Terminated                                                    NaN
Parish                                                        NaN
NationalPark                                                  NaN
Population                                                      7
Households                                                      1
Built up area                                      Greater London
Built up sub-division                         City of Westminster
Lower layer super output area                    Westminster 013A
Rural/urban                               Urban major conurbation
Region                                                     London
Name: 133733, dtype: object

Also:

>>> postcode_existing.iloc[178499]
Postcode                                        WC1E 6JL
Latitude                                         51.5236
Longitude                                      -0.135522
Easting                                           529447
Northing                                          182168
GridRef                                         TQ294821
County                                    Greater London
District                                          Camden
Ward                                          Bloomsbury
DistrictCode                                   E09000007
WardCode                                       E05000129
Country                                          England
CountyCode                                     E11000009
Constituency                      Holborn and St Pancras
Introduced                                    1980-01-01
Terminated                                           NaN
Parish                                               NaN
NationalPark                                         NaN
Population                                             1
Households                                             1
Built up area                             Greater London
Built up sub-division                             Camden
Lower layer super output area                Camden 026D
Rural/urban                      Urban major conurbation
Region                                            London
Name: 307029, dtype: object

These appear to be the correct rows.

Why does postcode_existing.ix[indices] not select the correct rows? What should I use instead?

2 answers:

Answer 0: (score: 0)

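The problem is that your index consists of integers. That makes things confusing, because pandas has to keep track of list-style positions as well as labels, and ix tries to guess which of the two you mean. With an integer index, ix interprets indices as labels rather than positions. The values returned by cKDTree.query are positions, so be explicit and use iloc.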

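Documentation:

DataFrame.ix: A primarily label-location based indexer, with integer position fallback.

.ix[] supports mixed integer and label based access. It is primarily label based, but will fall back to integer positional access unless the corresponding axis is of integer type.

.ix is the most general indexer and will support any of the inputs in .loc and .iloc. .ix also supports floating point label schemes. .ix is exceptionally useful when dealing with mixed positional and label based hierarchical indexes.

However, when an axis is integer based, ONLY label based access and NOT positional access is supported. Thus, in such cases, it is usually better to be explicit and use .iloc or .loc.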
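To make the label-versus-position distinction concrete, here is a minimal sketch. The frame, its index labels, and the postcode values are invented for illustration; the integer index deliberately does not match the row positions, as happens after rows are dropped. It uses only loc, iloc, and reindex, since ix has been removed from current pandas releases.

```python
import pandas as pd

# Hypothetical frame: the integer labels 10, 20, 30 do not match positions 0, 1, 2.
df = pd.DataFrame({"Postcode": ["A", "B", "C"]}, index=[10, 20, 30])

positions = [0, 2]

# iloc is purely positional: it picks the first and third rows.
by_position = df.iloc[positions]

# loc is purely label based: labels 10 and 30 pick the same two rows here.
by_label = df.loc[[10, 30]]

# Treating the positions as labels finds nothing, because 0 and 2 are not
# in the index; reindex makes that visible as all-NaN rows, which resembles
# the NaN rows in the question's output.
missing = df.reindex(positions)
```

If positions and labels happen to coincide (a default RangeIndex), both lookups return the same rows, which is why the mismatch only shows up once rows have been removed.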

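Answer 1: (score: 0)

I solved this. The problem was that some rows had been deleted, so positions in the DataFrame no longer matched the index labels.

To fix this, I simply reset the index: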

postcode_existing.reset_index(inplace=True, drop=True)

I can then use loc to extract the relevant rows:

postcode_existing.loc[indices]
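An alternative that leaves the index untouched: the values returned by cKDTree.query are row positions in the array the tree was built from, so iloc can be used directly. The sketch below uses invented coordinates and postcodes, and a non-contiguous index to mimic a frame with deleted rows; it is an illustration under those assumptions, not the original data.

```python
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

# Invented data; the index 5, 17, 42 mimics a frame from which rows were dropped.
postcode_existing = pd.DataFrame(
    {
        "Postcode": ["P1", "P2", "P3"],
        "Latitude": [51.0, 52.0, 53.0],
        "Longitude": [-0.1, -0.2, -0.3],
    },
    index=[5, 17, 42],
)
poi = pd.DataFrame({"Latitude": [52.9, 51.1], "Longitude": [-0.3, -0.1]})

# Build the tree on the postcode coordinates and query it with the POI coordinates.
tree = cKDTree(postcode_existing[["Latitude", "Longitude"]].to_numpy())
distances, indices = tree.query(poi[["Latitude", "Longitude"]].to_numpy())

# indices are positions in the array passed to cKDTree, so select with iloc.
nearest = postcode_existing.iloc[indices]

# Attach the matched postcode to each point of interest, aligned by position.
poi["Postcode"] = nearest["Postcode"].to_numpy()
```

Converting to a NumPy array before assigning avoids pandas aligning the two frames on their (different) index labels.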