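I have a pandas DataFrame (1 billion records) and need to look up location information from another DataFrame. The approach below works, but I would like to know whether there is a better way to do it.
First, create the geo DataFrame: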
import pandas as pd
import shapefile
from matplotlib import path
#downloaded and unzipped https://www.zillowstatic.com/static/shp/ZillowNeighborhoods-NY.zip
sf = shapefile.Reader('ZillowNeighborhoods-NY.shp')
cols = ['State', 'County', 'City', 'Name', 'RegionID']
geo = pd.DataFrame(sf.records(), columns=cols)
geo['Path'] = [path.Path(s.points) for s in sf.iterShapes()]
Second, create a DataFrame containing my data. The real one has 1 billion records.
df = pd.DataFrame([('some data 1', (-73.973943, 40.760632)),
                   ('some data 2', (-74.010087, 40.709546))],
                  columns=['h1', 'latlon'])
Third, look up the geo information.
Is there a more efficient or cleaner way to write this? I suspect there is a pandas way to avoid iterrows().
def get_location(row):
    for _, g in geo.iterrows():
        match = g.Path.contains_point(row['latlon'])
        if match:
            return g[['City', 'Name']]
df.join(df.apply(get_location, axis=1))
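Answer 0 (score: 1)
This answer avoids iterrows (and is therefore faster), but it still uses apply(axis=1), which is not great, especially with the billion rows you estimate. I am also using geopandas and shapely here.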
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point
geopandas has read_file(), which works well for shapefiles
geo = gpd.read_file('ZillowNeighborhoods-NY.shp')
Use Shapely to build the points
df = pd.DataFrame([('some data 1', (-73.973943, 40.760632)),
                   ('some data 2', (-74.010087, 40.709546))],
                  columns=['h1', 'latlon'])
df['point'] = [Point(xy) for xy in df['latlon']]
Use geopandas contains() and some boolean indexing. Note: you may need to add some logic to handle the "no match" case
def get_location(row):
    return pd.Series(geo[geo.contains(row['point'])][['City', 'Name']].values[0])
df.join(df.apply(get_location, axis=1))
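The "no match" note above can be made concrete. The following is a minimal sketch, not the answerer's code: it assumes two toy square polygons in place of the Zillow shapefile, and the `hits` variable and the choice to return missing values for unmatched points are illustrative assumptions.

```python
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Toy stand-in for the neighborhood shapefile: two unit squares (hypothetical data).
geo = gpd.GeoDataFrame(
    {'City': ['X', 'Y'], 'Name': ['A', 'B']},
    geometry=[Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),
              Polygon([(2, 0), (3, 0), (3, 1), (2, 1)])],
)

df = pd.DataFrame({'h1': ['inside A', 'inside B', 'nowhere'],
                   'point': [Point(0.5, 0.5), Point(2.5, 0.5), Point(9, 9)]})

def get_location(row):
    # Select the polygons that contain this point.
    hits = geo.loc[geo.contains(row['point']), ['City', 'Name']]
    if hits.empty:
        # No polygon matched: return missing values instead of raising IndexError.
        return pd.Series({'City': None, 'Name': None})
    # Keep the first match if several polygons contain the point.
    return pd.Series(hits.iloc[0].to_dict())

result = df.join(df.apply(get_location, axis=1))
print(result[['h1', 'City', 'Name']])
```

Without the `hits.empty` check, `.values[0]` on an empty selection would raise an IndexError for the third point.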
Answer 1 (score: 1)
The OP, E.K., found a nice geopandas function called sjoin
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point
Read the shapefile
geo = gpd.read_file('ZillowNeighborhoods-NY.shp')
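Convert our pandas DataFrame into a geopandas GeoDataFrame. Note: we use the same coordinate reference system (CRS) as the shapefile. This is necessary for joining the two frames together
df = pd.DataFrame([('some data 1', (-73.973943, 40.760632)),
                   ('some data 2', (-74.010087, 40.709546))],
                  columns=['h1', 'latlon'])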
geometry = [Point(xy) for xy in df['latlon']]
gdf = gpd.GeoDataFrame(df, crs=geo.crs, geometry=geometry)
print (geo.crs, gdf.crs)
>> {'init': 'epsg:4269'} {'init': 'epsg:4269'}
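The list comprehension above creates one Point per row in Python, which may become a bottleneck on very large frames. A minimal sketch of a vectorized alternative follows, assuming a geopandas version that provides `gpd.points_from_xy`. The `coords`, `lon` and `lat` names are hypothetical, and the literal 'EPSG:4269' stands in for `geo.crs` (taken from the printed output) so that the snippet runs without the shapefile.

```python
import pandas as pd
import geopandas as gpd

df = pd.DataFrame([('some data 1', (-73.973943, 40.760632)),
                   ('some data 2', (-74.010087, 40.709546))],
                  columns=['h1', 'latlon'])

# Split the (lon, lat) tuples into two numeric columns (hypothetical names).
coords = pd.DataFrame(df['latlon'].tolist(), columns=['lon', 'lat'], index=df.index)

# Build all point geometries in one call; 'EPSG:4269' is a stand-in for geo.crs.
gdf = gpd.GeoDataFrame(
    df,
    geometry=gpd.points_from_xy(coords['lon'], coords['lat']),
    crs='EPSG:4269',
)
print(gdf.crs)
```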
Now join using 'within', i.e. find which points in gdf lie within the polygons of geo
gpd.tools.sjoin(gdf, geo, how='left', op='within')
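To show what the left join returns, including a point that matches nothing, here is a small sketch on toy data rather than the Zillow shapefile. The squares and column values are hypothetical. Newer geopandas releases name the keyword `predicate` instead of `op`; the try/except fallback is an assumption added here so the sketch runs on either.

```python
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Toy polygons standing in for the neighborhoods (hypothetical data).
geo = gpd.GeoDataFrame(
    {'City': ['X', 'Y'], 'Name': ['A', 'B']},
    geometry=[Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),
              Polygon([(2, 0), (3, 0), (3, 1), (2, 1)])],
    crs='EPSG:4269',
)

df = pd.DataFrame([('inside A', (0.5, 0.5)),
                   ('inside B', (2.5, 0.5)),
                   ('nowhere', (9.0, 9.0))],
                  columns=['h1', 'latlon'])
geometry = [Point(xy) for xy in df['latlon']]
gdf = gpd.GeoDataFrame(df, crs=geo.crs, geometry=geometry)

try:
    # Keyword name in newer geopandas releases.
    joined = gpd.sjoin(gdf, geo, how='left', predicate='within')
except TypeError:
    # Older releases only accept the original keyword.
    joined = gpd.sjoin(gdf, geo, how='left', op='within')

print(joined[['h1', 'City', 'Name']])
```

With how='left', every row of gdf is kept, and the point outside both squares gets missing values in the City and Name columns.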
Some timings:
OP's solution
import pandas as pd
import shapefile
from matplotlib import path
#downloaded and unzipped https://www.zillowstatic.com/static/shp/ZillowNeighborhoods-NY.zip
sf = shapefile.Reader('ZillowNeighborhoods-NY.shp')
cols = ['State', 'County', 'City', 'Name', 'RegionID']
geo = pd.DataFrame(sf.records(), columns=cols)
geo['Path'] = [path.Path(s.points) for s in sf.iterShapes()]
df = pd.DataFrame([('some data 1', (-73.973943, 40.760632)),
                   ('some data 2', (-74.010087, 40.709546))],
                  columns=['h1', 'latlon'])
def get_location(row):
    for _, g in geo.iterrows():
        match = g.Path.contains_point(row['latlon'])
        if match:
            return g[['City', 'Name']]
%timeit df.join(df.apply(get_location, axis=1))
>> 10 loops, best of 3: 91.1 ms per loop
My first answer, using geopandas, apply() and boolean indexing
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point
geo = gpd.read_file('ZillowNeighborhoods-NY.shp')
df = pd.DataFrame([('some data 1', (-73.973943, 40.760632)),
                   ('some data 2', (-74.010087, 40.709546))],
                  columns=['h1', 'latlon'])
df['geometry'] = [Point(xy) for xy in df['latlon']]
def get_location(row):
    return pd.Series(geo[geo.contains(row['geometry'])][['City', 'Name']].values[0])
%timeit df.join(df.apply(get_location, axis=1))
>> 100 loops, best of 3: 15.3 ms per loop
Using sjoin
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point
geo = gpd.read_file('ZillowNeighborhoods-NY.shp')
df = pd.DataFrame([('some data 1', (-73.973943, 40.760632)),
                   ('some data 2', (-74.010087, 40.709546))],
                  columns=['h1', 'latlon'])
geometry = [Point(xy) for xy in df['latlon']]
gdf = gpd.GeoDataFrame(df, crs=geo.crs, geometry=geometry)
%timeit gpd.tools.sjoin(gdf, geo, how='left', op='within')
>> 10 loops, best of 3: 53.3 ms per loop
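Although sjoin is not the fastest here, it is probably the best option (it handles the no-match case and offers more choices of join type and spatial operation)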