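I'm trying to do some quick-and-dirty geocoding.
I have a DataFrame poi (about 50,000 rows) in which every point of interest has a lat/lng coordinate.
I also have a DataFrame postcode_existing (about 180,000 rows) that maps lat/lng coordinates to postcodes.
I extracted the relevant coordinate columns and used cKDTree to find, for each point of interest in poi, the nearest lat/lng coordinate in postcode_existing.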
import pandas as pd
import numpy as np
from scipy.spatial import cKDTree
# read poi and postcode csv files
# Extract subset
postcode_existing_coordinates = postcode_existing[['Latitude', 'Longitude']]
# Extract subset
poi_coordinates = poi[['Latitude', 'Longitude']]
# Construct tree
tree = cKDTree(postcode_existing_coordinates)
# Query
distances, indices = tree.query(poi_coordinates)
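This gives me the relevant indices. I now want to use them to select rows from the DataFrame postcode_existing.
I tried postcode_existing.ix[indices], but this does not seem to return the right rows.
For example: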
>>> postcode_existing.ix[indices].head()
Postcode Latitude Longitude Easting Northing GridRef \
78579 HA3 0NS 51.57553 -0.304296 517605.0 187658.0 TQ176876
178499 NaN NaN NaN NaN NaN NaN
62392 NaN NaN NaN NaN NaN NaN
78662 HA3 0TA 51.58409 -0.288764 518659.0 188635.0 TQ186886
79470 NaN NaN NaN NaN NaN NaN
County District Ward DistrictCode ... Terminated \
78579 Greater London Brent Kenton E09000005 ... NaN
178499 NaN NaN NaN NaN ... NaN
62392 NaN NaN NaN NaN ... NaN
78662 Greater London Brent Kenton E09000005 ... NaN
79470 NaN NaN NaN NaN ... NaN
Parish NationalPark Population Households Built up area \
78579 NaN NaN 72.0 25.0 Greater London
178499 NaN NaN NaN NaN NaN
62392 NaN NaN NaN NaN NaN
78662 NaN NaN 152.0 39.0 Greater London
79470 NaN NaN NaN NaN NaN
Built up sub-division Lower layer super output area \
78579 Brent Brent 004D
178499 NaN NaN
62392 NaN NaN
78662 Brent Brent 003E
79470 NaN NaN
Rural/urban Region
78579 Urban major conurbation London
178499 NaN NaN
62392 NaN NaN
78662 Urban major conurbation London
79470 NaN NaN
[5 rows x 25 columns]
However:
>>> postcode_existing.iloc[78579]
Postcode NW1 3AU
Latitude 51.5237
Longitude -0.143188
Easting 528915
Northing 182163
GridRef TQ289821
County Greater London
District Westminster
Ward Marylebone High Street
DistrictCode E09000033
WardCode E05000641
Country England
CountyCode E11000009
Constituency Cities of London and Westminster
Introduced 1980-01-01
Terminated NaN
Parish NaN
NationalPark NaN
Population 7
Households 1
Built up area Greater London
Built up sub-division City of Westminster
Lower layer super output area Westminster 013A
Rural/urban Urban major conurbation
Region London
Name: 133733, dtype: object
Also:
>>> postcode_existing.iloc[178499]
Postcode WC1E 6JL
Latitude 51.5236
Longitude -0.135522
Easting 529447
Northing 182168
GridRef TQ294821
County Greater London
District Camden
Ward Bloomsbury
DistrictCode E09000007
WardCode E05000129
Country England
CountyCode E11000009
Constituency Holborn and St Pancras
Introduced 1980-01-01
Terminated NaN
Parish NaN
NationalPark NaN
Population 1
Households 1
Built up area Greater London
Built up sub-division Camden
Lower layer super output area Camden 026D
Rural/urban Urban major conurbation
Region London
Name: 307029, dtype: object
These appear to be correct.
Why doesn't postcode_existing.ix[indices] select the right rows? What should I use instead?
Answer 0 (score: 0)
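The problem is that your index is made of integers. This gets confusing because pandas keeps track of both positions and labels, and ix has to guess which one you mean. With an integer index it treats indices as labels, whereas the values returned by cKDTree are row positions. Use iloc in this case. From the documentation:
DataFrame.ix: A primarily label-location based indexer, with integer position fallback.
.ix[] supports mixed integer and label based access. It is primarily label based, but will fall back to integer positional access unless the corresponding axis is of integer type.
.ix is the most general indexer and will support any of the inputs in .loc and .iloc. .ix also supports floating point label schemes. .ix is exceptionally useful when dealing with mixed positional and label based hierarchical indexes.
However, when an axis is integer based, ONLY label based access and not positional access is supported. Thus, in such cases, it's usually better to be explicit and use .iloc or .loc.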
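The difference between the two indexers is easier to see on a small scale. The following is a minimal sketch using a made-up frame (the name df, its labels and its values are hypothetical) whose integer index has gaps, as happens after rows are dropped. Since .ix was deprecated in pandas 0.20 and removed in 1.0, the sketch uses only .loc and .iloc.

```python
import pandas as pd

# Hypothetical frame: integer labels with gaps, as if rows had been deleted
df = pd.DataFrame({"Postcode": ["A", "B", "C"]}, index=[10, 20, 30])

# iloc selects by position: the second row, whatever its label is
by_position = df.iloc[1]["Postcode"]

# loc selects by label: the row whose index label is 20
by_label = df.loc[20]["Postcode"]

# Position 1 and label 20 are the same row here, but label 1 does not exist,
# so a label-based lookup of 1 cannot return that row
label_1_exists = 1 in df.index
```

Here position 1 and label 20 refer to the same row, while label 1 is absent from the index, which is the kind of mismatch that produces the NaN rows in the question.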
Answer 1 (score: 0)
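I solved the problem. It was caused by a mismatch between row positions and index labels in the DataFrame, because some rows had been deleted.
To fix it, I simply reset the index: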
postcode_existing.reset_index(inplace=True, drop=True)
Then I can use loc to pull out the relevant rows:
postcode_existing.loc[indices]
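As a rough end-to-end check of this fix, the sketch below builds tiny stand-in frames (postcodes and points are hypothetical names with invented coordinates, not the real data) where the postcode frame has a gappy integer index. It assumes, as in the question, that the tree is built from the Latitude and Longitude columns. It shows that the cKDTree result can be used directly with iloc, and that resetting the index and using loc gives the same rows.

```python
import pandas as pd
from scipy.spatial import cKDTree

# Hypothetical stand-ins for the real data; note the gaps in the integer index
postcodes = pd.DataFrame(
    {
        "Postcode": ["P0", "P1", "P2"],
        "Latitude": [51.0, 52.0, 53.0],
        "Longitude": [0.0, 0.0, 0.0],
    },
    index=[5, 17, 42],
)
points = pd.DataFrame({"Latitude": [52.1, 50.9], "Longitude": [0.0, 0.1]})

# Build the tree on the postcode coordinates and query the nearest neighbour
tree = cKDTree(postcodes[["Latitude", "Longitude"]].to_numpy())
distances, indices = tree.query(points[["Latitude", "Longitude"]].to_numpy())

# indices are row positions in the array given to cKDTree, so iloc applies
nearest_iloc = postcodes.iloc[indices]

# Alternative: reset the index so labels equal positions, then use loc
nearest_loc = postcodes.reset_index(drop=True).loc[indices]
```

The iloc form leaves the original index labels intact, which may be preferable if those labels are needed later; the reset_index form discards them.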