Pyspark: PicklingError: Could not serialize object:

Date: 2017-11-12 13:21:09

Tags: pyspark pickle user-defined-functions

I have the following two dataframes: df_whitelist and df_text.


In df_whitelist, each keyword corresponds to a set of whitelist terms; for example, the keyword LA corresponds to "LA city" and "US LA in da". In df_text, I have texts along with the keywords found in each text. What I want to do is: for each text, e.g. "the client has ada..", and for each of its keywords, e.g. "client" and "ada", look up all the whitelist terms for that keyword and count how many times each term appears in the text. The two dataframes look like this:

+-------+--------------------+
|keyword|    whitelist_terms |
+-------+--------------------+
|     LA|             LA city|
|     LA|        US LA in da |
| client|this client has i...|
| client|our client has do...|
+-------+--------------------+
+--------------------+----------+
|                Text|  Keywords|
+--------------------+----------+
|the client as ada...|client;ada|
|this client has l...| client;LA|
+--------------------+----------+
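To make the intent concrete, here is the computation as a plain-Python sketch over toy stand-ins for the two dataframes (the dict and list below are illustrative only, not the real data):

```python
# Toy stand-ins for df_whitelist and df_text (illustrative only).
whitelist = {
    "LA": ["LA city", "US LA in da"],
    "client": ["this client has i", "our client"],
}
texts = [
    ("the client as ada", "client;ada"),
    ("this client has l", "client;LA"),
]

def whitelist_count(text, keywords):
    # For every keyword attached to the text, count occurrences
    # of each of its whitelist terms inside the text.
    total = 0
    for k in keywords.split(";"):
        for term in whitelist.get(k, []):
            total += text.count(term)
    return total

counts = [whitelist_count(t, kw) for t, kw in texts]
```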

Here is what I tried, which raises the PicklingError:

import pyspark.sql.functions as F
import pyspark.sql.types as T
import re
def whitelisting(text,listOfKeyword,df_whitelist):
    keywords = listOfKeyword.split(";")
    found_whiteterms_count = 0
    for k in keywords:
        if df_whitelist.filter(df_whitelist.keyword == k).count() == 0:
            found_whiteterms_count = found_whiteterms_count + 0
        else:
            df = df_whitelist.filter(df_whitelist.keyword == k).select("whitelist_terms")
            n = df.rdd.map(lambda x:len(re.findall(x["whitelist_terms"],text))).reduce(lambda x, y: x+y)
            found_whiteterms_count = found_whiteterms_count + n    
    return found_whiteterms_count     
whitelisting_udf = F.udf(lambda text,listOfKeyword: whitelisting(text,listOfKeyword,df_whitelist),T.IntegerType())
text.withColumn("whitelist_counts", whitelisting_udf(text.Text,text.Keywords))

After struggling with it for a while, I cannot figure it out. Can anyone help point out the problem and how to fix it? Thanks.

1 Answer:

Answer 0 (score: 4):

You are passing a pyspark dataframe, df_whitelist, to a UDF; pyspark dataframes cannot be pickled. You are also doing computations on a dataframe inside a UDF, which is not acceptable (not possible). Keep in mind that your function is going to be called as many times as there are rows in the dataframe, so you should keep computations simple, and only do them if they cannot be done with pyspark sql functions.

What you need to do is join the two dataframes on keyword. Let's start from the two sample dataframes you provided:

df_whitelist = spark.createDataFrame(
    [["LA", "LA city"], ["LA", "US LA in da"], ["client", "this client has i"], ["client", "our client"]], 
    ["keyword", "whitelist_terms"])
df_text = spark.createDataFrame(
    [["the client as ada", "client;ada"], ["this client has l", "client;LA"]], 
    ["Text", "Keywords"])

The Keywords column of df_text needs some processing: we have to turn the string into an array and then explode it, so that we have only one item per row:

import pyspark.sql.functions as F
df_text = df_text.select("Text", F.explode(F.split("Keywords", ";")).alias("keyword"))

    +-----------------+-------+
    |             Text|keyword|
    +-----------------+-------+
    |the client as ada| client|
    |the client as ada|    ada|
    |this client has l| client|
    |this client has l|     LA|
    +-----------------+-------+
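The split-and-explode step mirrors the following plain-Python expansion (toy rows, for illustration):

```python
rows = [("the client as ada", "client;ada"),
        ("this client has l", "client;LA")]

# F.split turns the string into an array; F.explode emits one
# output row per array element, copying the other columns along.
exploded = [(text, kw)
            for text, keywords in rows
            for kw in keywords.split(";")]
```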

Now we can join the two dataframes on keyword:

df = df_text.join(df_whitelist, "keyword", "leftouter")

    +-------+-----------------+-----------------+
    |keyword|             Text|  whitelist_terms|
    +-------+-----------------+-----------------+
    |     LA|this client has l|          LA city|
    |     LA|this client has l|      US LA in da|
    |    ada|the client as ada|             null|
    | client|the client as ada|this client has i|
    | client|the client as ada|       our client|
    | client|this client has l|this client has i|
    | client|this client has l|       our client|
    +-------+-----------------+-----------------+
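A left outer join keeps every left-hand row and fills in null when the join key has no match on the right; that is how the "keyword not in whitelist" case gets encoded. A plain-Python sketch of the same semantics over the toy rows:

```python
whitelist = [("LA", "LA city"), ("LA", "US LA in da"),
             ("client", "this client has i"), ("client", "our client")]
left = [("the client as ada", "client"), ("the client as ada", "ada"),
        ("this client has l", "client"), ("this client has l", "LA")]

joined = []
for text, kw in left:
    matches = [terms for k, terms in whitelist if k == kw]
    if matches:
        joined.extend((kw, text, terms) for terms in matches)
    else:
        # No match on the right side: keep the left row with None,
        # like Spark's leftouter join producing null.
        joined.append((kw, text, None))
```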
  • The first condition you were checking in your UDF can be translated as follows: if the keyword in df_text does not exist in df_whitelist, then 0. It is equivalent to saying that after the left join, the values of the df_whitelist columns will be NULL, since the keyword only appears in the left dataframe.

  • The second condition: you count the number of times whitelist_terms appears in Text: Text.count(whitelist_terms).

We will write a UDF to do this:

from pyspark.sql.types import IntegerType
count_terms = F.udf(lambda Text, term: Text.count(term) if term is not None else 0, IntegerType())
df = df.select(
    "Text",
    "keyword",
    F.when(F.isnull("whitelist_terms"), 0).otherwise(count_terms("Text", "whitelist_terms")).alias("whitelist_counts"))

    +-----------------+-------+----------------+
    |             Text|keyword|whitelist_counts|
    +-----------------+-------+----------------+
    |this client has l|     LA|               0|
    |this client has l|     LA|               0|
    |the client as ada|    ada|               0|
    |the client as ada| client|               0|
    |the client as ada| client|               0|
    |this client has l| client|               0|
    |this client has l| client|               0|
    +-----------------+-------+----------------+
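The lambda inside the counting UDF is just a null-safe str.count, so its behaviour can be checked outside Spark:

```python
# Same logic as the counting UDF body, runnable without Spark.
def count_terms_py(text, term):
    # Null-safe substring count, mirroring the UDF's lambda.
    return text.count(term) if term is not None else 0

print(count_terms_py("this client has l", "LA city"))  # → 0, term not present
print(count_terms_py("the client as ada", None))       # → 0, null branch
```

The counts in the sample output are all 0 only because the example texts are truncated and never contain a full whitelist term.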


Finally, we can aggregate to get back a dataframe with only the distinct Text:

res = df.groupBy("Text").agg(
    F.collect_set("keyword").alias("Keywords"),
    F.sum("whitelist_counts").alias("whitelist_counts"))
res.show()

    +-----------------+-------------+----------------+
    |             Text|     Keywords|whitelist_counts|
    +-----------------+-------------+----------------+
    |this client has l| [client, LA]|               0|
    |the client as ada|[ada, client]|               0|
    +-----------------+-------------+----------------+
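The final aggregation can be mirrored in plain Python: collect_set becomes a set of keywords per text, and sum a running total (toy rows below, for illustration):

```python
rows = [("this client has l", "LA", 0),
        ("this client has l", "client", 0),
        ("the client as ada", "ada", 0),
        ("the client as ada", "client", 0)]

# Group by Text: gather distinct keywords, sum the counts.
agg = {}
for text, kw, cnt in rows:
    keywords, total = agg.get(text, (set(), 0))
    keywords.add(kw)
    agg[text] = (keywords, total + cnt)
```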
