I'm using Python 2.7 with Spark 1.5.1, and I get this:
df = sqlContext.read.parquet(".....").cache()
df = df.filter(df.foo == 1).select("a","b","c")
def myfun(row):
    return pyspark.sql.Row(....)
rdd = df.map(myfun).cache()
rdd.first()
==> UnpicklingError: NEWOBJ class argument has NULL tp_new
What is going wrong?
Answer 0 (score: 3)
myfun closes over an unpicklable object. As usual in such cases, the solution is to use mapPartitions:
import pygeoip

def get_geo(rows):
    # the GeoIP handle is created once per partition, on the worker,
    # so it never has to be pickled
    db = pygeoip.GeoIP("/usr/share/GeoIP/GeoIPCity.dat")
    for row in rows:
        d = row.asDict()
        d["new"] = db.record_by_addr(row.client_ip) if row.client_ip else "noIP"
        yield d

rdd.mapPartitions(get_geo)
instead of map:
import pygeoip

# created on the driver and captured in get_geo's closure, so Spark
# tries to pickle it; this is what triggers the UnpicklingError
db = pygeoip.GeoIP("/usr/share/GeoIP/GeoIPCity.dat")

def get_geo(row):
    d = row.asDict()
    d["new"] = db.record_by_addr(row.client_ip) if row.client_ip else "noIP"
    return d

rdd.map(get_geo)
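The map version fails because the module-level db handle is captured in get_geo's closure, and Spark has to pickle that closure to ship it to the executors; pygeoip's handle wraps an open database file, which doesn't survive the round trip. If you do want plain map, lazy per-worker initialization is a common workaround. A minimal sketch, not part of the original answer; the _db global is my own naming:

import pygeoip

_db = None  # left unset on the driver, so nothing unpicklable is shipped

def get_geo(row):
    global _db
    if _db is None:
        # opened at most once per Python worker process, never pickled
        _db = pygeoip.GeoIP("/usr/share/GeoIP/GeoIPCity.dat")
    d = row.asDict()
    d["new"] = _db.record_by_addr(row.client_ip) if row.client_ip else "noIP"
    return d

rdd.map(get_geo)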
Answer 1 (score: 0)
I'm not sure what you're trying to do, but maybe:
rdd = df.rdd.cache()
rdd.first()
.rdd converts the DataFrame into an RDD of Row objects.
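A minimal sketch of how this combines with the code from the question (assuming myfun itself only closes over picklable things):

rdd = df.rdd.map(myfun).cache()  # df.rdd yields pyspark.sql.Row objects
rdd.first()

Note that in Spark 1.5, df.map(f) is shorthand for df.rdd.map(f), so the explicit conversion alone will not make the pickling error go away if myfun still captures an unpicklable object.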