I am completely new to AWS EMR and Apache Spark. I am trying to assign GeoIDs to residential properties using shapefiles, but I cannot read the shapefiles from the S3 bucket. Please help me understand what is going on, because I cannot find anything on the internet that explains the exact problem.
<!-- language: python 3.4 -->
import shapefile
import pandas as pd
from shapely.geometry import shape  # shape() below comes from shapely

def read_shapefile(shp_path):
    """
    Read a shapefile into a pandas DataFrame with a 'coords' column holding
    the geometry information. This uses the pyshp package.
    """
    # read the file, then parse out the records and shapes
    sf = shapefile.Reader(shp_path)
    fields = [x[0] for x in sf.fields][1:]
    records = sf.records()
    shps = [s.points for s in sf.shapes()]
    center = [shape(s).centroid.coords[0] for s in sf.shapes()]
    # write everything into a DataFrame
    df = pd.DataFrame(columns=fields, data=records)
    df = df.assign(coords=shps, centroid=center)
    return df

read_shapefile("s3a://uim-raw-datasets/census-bureau/tabblock-2010/tabblock-by-fips/tl_2010_01001_tabblock10")
The call above fails with an error when reading straight from the bucket.
I really want to read these shapefiles in an AWS EMR cluster, since there is no way I can process them all individually on my local machine. Any help is appreciated.
Answer 0 (score: 1)
I was able to read the shapefiles from the S3 bucket as binary objects first, build a wrapper function around them, and finally parse the individual file objects to the shapefile.Reader() method as the .shp, .shx, and .dbf components respectively.
This happened because PySpark cannot read formats that are not provided in the SparkContext. I found this link helpful: Using pyshp to read a file-like object from a zipped archive.
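For context, here is a minimal sketch of what sc.binaryFiles yields (using the same bucket path as the question; sc is the SparkContext that the pyspark shell on EMR already provides):

rdd = sc.binaryFiles("s3a://uim-raw-datasets/census-bureau/tabblock-2010/tabblock-by-fips/tl_2010_01001_tabblock10*")
# each element is a (path, bytes) pair, one per file that matches the pattern
for path, payload in rdd.collect():
    print(path, len(payload))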
My solution
def read_shapefile(shp_path):
    import io
    import shapefile
    import pandas as pd                 # pd was used but never imported here
    from shapely.geometry import shape  # shape() was used but never imported here

    # read every part of the shapefile as raw bytes; binaryFiles returns
    # an RDD of (path, bytes) pairs, one per file matching shp_path
    blocks = sc.binaryFiles(shp_path)
    block_dict = dict(blocks.collect())

    # hand each component (.shp, .shx, .dbf) to pyshp as a file-like object
    sf = shapefile.Reader(
        shp=io.BytesIO(block_dict[[i for i in block_dict.keys() if i.endswith(".shp")][0]]),
        shx=io.BytesIO(block_dict[[i for i in block_dict.keys() if i.endswith(".shx")][0]]),
        dbf=io.BytesIO(block_dict[[i for i in block_dict.keys() if i.endswith(".dbf")][0]]))

    fields = [x[0] for x in sf.fields][1:]
    records = sf.records()
    shps = [s.points for s in sf.shapes()]
    center = [shape(s).centroid.coords[0] for s in sf.shapes()]

    # write everything into a DataFrame
    df = pd.DataFrame(columns=fields, data=records)
    df = df.assign(coords=shps, centroid=center)
    return df
block_shapes = read_shapefile("s3a://uim-raw-datasets/census-bureau/tabblock-2010/tabblock-by-fips/tl_2010_01001_tabblock10*")
This works fine without breaking.
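With the DataFrame in hand, assigning a GeoID to a property reduces to a point-in-polygon test against the block polygons. A minimal sketch with shapely, assuming the properties are plain (longitude, latitude) pairs and that GEOID10 is the attribute you want from the TIGER tabblock file (both are my assumptions, not part of the answer above):

from shapely.geometry import Point, Polygon

def assign_geoid(blocks_df, lon, lat):
    # return the GEOID10 of the first census block whose polygon contains
    # the point, or None if no block matches
    pt = Point(lon, lat)
    for _, row in blocks_df.iterrows():
        if Polygon(row["coords"]).contains(pt):
            return row["GEOID10"]
    return None

geoid = assign_geoid(block_shapes, -86.49, 32.47)  # hypothetical property location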