PySpark.RDD.first -> UnpicklingError: NEWOBJ class argument has NULL tp_new

Asked: 2015-10-13 20:54:14

Tags: pyspark

I'm using Python 2.7 with Spark 1.5.1, and I get this:

df = sqlContext.read.parquet(".....").cache()
df = df.filter(df.foo == 1).select("a","b","c")
def myfun (row):
    return pyspark.sql.Row(....)
rdd = df.map(myfun).cache()
rdd.first()
==> UnpicklingError: NEWOBJ class argument has NULL tp_new
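For context, this family of error appears when Spark has to serialize something whose type cannot be reconstructed by pickle, typically a C extension object. A minimal, Spark-free illustration of the same class of failure (assuming plain CPython; the exact exception type and message differ from Spark's worker-side UnpicklingError):

```python
import pickle
import threading

# A lock is a C extension object that pickle cannot serialize,
# analogous to the extension-type objects myfun may capture.
lock = threading.Lock()

def myfun(row):
    with lock:  # closing over `lock` makes myfun unshippable
        return row

try:
    pickle.dumps(lock)  # serializing the captured resource fails
except TypeError as exc:
    print("pickling failed:", exc)
```

In Spark the failure is reported only when the task is deserialized on an executor, which is why it surfaces at the first action (`rdd.first()`) rather than at the `map` call.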

What is going wrong?

2 Answers:

Answer 0 (score: 3)

As usual, the pickling error comes down to myfun closing over an unpicklable object.

And, as usual, the solution is to use mapPartitions:

import pygeoip
def get_geo(rows):
    # open the DB handle once per partition, on the worker,
    # so it is never part of the serialized closure
    db = pygeoip.GeoIP("/usr/share/GeoIP/GeoIPCity.dat")
    for row in rows:
        d = row.asDict()
        d["new"] = db.record_by_addr(row.client_ip) if row.client_ip else "noIP"
        yield d
rdd.mapPartitions(get_geo)
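The key point of the mapPartitions version is that the handle is created inside the function, once per partition, on the worker, so only the (picklable) function itself is shipped. A Spark-free sketch of the same pattern, using a plain dict as a stand-in for the pygeoip.GeoIP handle:

```python
def lookup_partition(rows, open_db=dict):
    # open the resource once per partition, on the worker,
    # instead of capturing it in a driver-side closure
    db = open_db()  # stand-in for pygeoip.GeoIP(...)
    for row in rows:
        yield (row, db.get(row, "noIP"))

# simulate an RDD with two partitions
partitions = [["1.2.3.4", "5.6.7.8"], ["9.9.9.9"]]
results = [list(lookup_partition(p)) for p in partitions]
```

Because `lookup_partition` is a generator, rows are still processed lazily, one at a time, while the setup cost is paid once per partition rather than once per row.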

instead of map:

import pygeoip
# driver-side handle: it gets captured by get_geo's closure
# and cannot be pickled for shipment to the executors
db = pygeoip.GeoIP("/usr/share/GeoIP/GeoIPCity.dat")
def get_geo(row):
    d = row.asDict()
    d["new"] = db.record_by_addr(row.client_ip) if row.client_ip else "noIP"
    return d
rdd.map(get_geo)
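To see why the map version fails, note that the function references a handle created on the driver, and Spark must serialize that handle along with the function. A hypothetical stand-in shows the failing step in isolation (using an open file, which, like the GeoIP handle, wraps a non-picklable OS resource):

```python
import os
import pickle

# stand-in for the driver-side pygeoip handle
db = open(os.devnull)

def get_geo(row):
    # references the driver-side db; Spark would have to
    # serialize db together with this function
    return (row, db)

try:
    pickle.dumps(db)  # this is the step that fails when the task is shipped
except TypeError as exc:
    print("cannot ship get_geo's dependency:", exc)
finally:
    db.close()
```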

Answer 1 (score: 0)

I'm not sure what you are trying to do, but perhaps:

rdd =  df.rdd.cache()
rdd.first()

.rdd converts the DataFrame to an RDD.