I have the following two data frames: df_whitelist and df_text.
In df_whitelist, each keyword corresponds to a set of terms; for example, the keyword LA corresponds to "LA city" and "US LA in da". In df_text, I have texts together with some keywords found in them. What I want to do is: for each text, e.g. "the client has ada..", and for each of its keywords, e.g. "client" and "ada", check all the whitelist terms of that keyword and count how many times each term appears in the text. The two data frames look like this:
+-------+--------------------+
|keyword| whitelist_terms |
+-------+--------------------+
| LA| LA city|
| LA| US LA in da |
| client|this client has i...|
| client|our client has do...|
+-------+--------------------+
+--------------------+----------+
| Text| Keywords|
+--------------------+----------+
|the client as ada...|client;ada|
|this client has l...| client;LA|
+--------------------+----------+
Here is what I tried; it gives me an error:
import pyspark.sql.functions as F
import pyspark.sql.types as T
import re

def whitelisting(text, listOfKeyword, df_whitelist):
    keywords = listOfKeyword.split(";")
    found_whiteterms_count = 0
    for k in keywords:
        if df_whitelist.filter(df_whitelist.keyword == k).count() == 0:
            found_whiteterms_count = found_whiteterms_count + 0
        else:
            df = df_whitelist.filter(df_whitelist.keyword == k).select("whitelist_terms")
            n = df.rdd.map(lambda x: len(re.findall(x["whitelist_terms"], text))).reduce(lambda x, y: x + y)
            found_whiteterms_count = found_whiteterms_count + n
    return found_whiteterms_count

whitelisting_udf = F.udf(lambda text, listOfKeyword: whitelisting(text, listOfKeyword, df_whitelist), T.IntegerType())
text.withColumn("whitelist_counts", whitelisting_udf(text.Text, text.Keywords))
After struggling with this for a while, I can't figure it out. Can anyone help point out the problem and how to fix it? Thanks.
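For reference, the per-row counting the question aims for can be sketched in plain Python (the whitelist dict and function name below are made up for illustration; this ignores Spark entirely and only pins down the intended counting semantics):

```python
import re

# Hypothetical in-memory stand-in for df_whitelist.
whitelist = {
    "LA": ["LA city", "US LA in da"],
    "client": ["this client has i", "our client"],
}

def count_whitelist_terms(text, keywords):
    """For every keyword in the semicolon-separated list, count how many
    times each of its whitelist terms occurs in the text."""
    total = 0
    for k in keywords.split(";"):
        for term in whitelist.get(k, []):
            # re.escape so terms are matched literally, not as regex patterns
            total += len(re.findall(re.escape(term), text))
    return total

print(count_whitelist_terms("the client as ada", "client;ada"))  # 0
```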
Answer 0 (score: 4)
You are passing a pyspark data frame, df_whitelist, to a UDF; pyspark data frames cannot be pickled. You are also doing computations on a data frame inside a UDF, which is not acceptable (not possible). Keep in mind that your function is going to be called as many times as there are rows in your data frame, so you should keep computations simple, and only resort to a UDF when it can't be done with pyspark sql functions.
What you need to do instead is join the two data frames on keyword.
Let's start from the two sample data frames you provided:
df_whitelist = spark.createDataFrame(
    [["LA", "LA city"], ["LA", "US LA in da"], ["client", "this client has i"], ["client", "our client"]],
    ["keyword", "whitelist_terms"])
df_text = spark.createDataFrame(
    [["the client as ada", "client;ada"], ["this client has l", "client;LA"]],
    ["Text", "Keywords"])
The Keywords column in df_text needs some processing: we have to turn the string into an array and then explode it, so that we only have one item per row:
import pyspark.sql.functions as F
df_text = df_text.select("Text", F.explode(F.split("Keywords", ";")).alias("keyword"))
+-----------------+-------+
| Text|keyword|
+-----------------+-------+
|the client as ada| client|
|the client as ada| ada|
|this client has l| client|
|this client has l| LA|
+-----------------+-------+
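The split-and-explode step is easy to mimic with plain Python lists, which may help check the expected shape (a sketch, not Spark code):

```python
rows = [("the client as ada", "client;ada"),
        ("this client has l", "client;LA")]

# Splitting Keywords on ";" and "exploding" gives one (Text, keyword) pair per keyword.
exploded = [(text, kw) for text, keywords in rows for kw in keywords.split(";")]
for pair in exploded:
    print(pair)
```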
Now we can join the two data frames on keyword:
df = df_text.join(df_whitelist, "keyword", "leftouter")
+-------+-----------------+-----------------+
|keyword| Text| whitelist_terms|
+-------+-----------------+-----------------+
| LA|this client has l| LA city|
| LA|this client has l| US LA in da|
| ada|the client as ada| null|
| client|the client as ada|this client has i|
| client|the client as ada| our client|
| client|this client has l|this client has i|
| client|this client has l| our client|
+-------+-----------------+-----------------+
The first condition in your UDF can be translated as follows: if a keyword from df_text is not present in df_whitelist, then the count is 0. This is equivalent to saying that after the left join, the df_whitelist columns will be NULL for those rows, since those keywords only appear in the left data frame.
The second condition: you count the number of times whitelist_terms appears in Text, i.e. Text.count(whitelist_terms). We'll write a UDF to do this:
from pyspark.sql.types import IntegerType
count_terms = F.udf(lambda Text, term: Text.count(term) if term is not None else 0, IntegerType())
df = df.select(
    "Text",
    "keyword",
    F.when(F.isnull("whitelist_terms"), 0).otherwise(count_terms("Text", "whitelist_terms")).alias("whitelist_counts"))
+-----------------+-------+----------------+
| Text|keyword|whitelist_counts|
+-----------------+-------+----------------+
|this client has l| LA| 0|
|this client has l| LA| 0|
|the client as ada| ada| 0|
|the client as ada| client| 0|
|the client as ada| client| 0|
|this client has l| client| 0|
|this client has l| client| 0|
+-----------------+-------+----------------+
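One caveat worth noting: str.count counts non-overlapping literal substring occurrences, which agrees with the question's re.findall approach only when the whitelist terms contain no regex metacharacters. A quick plain-Python check:

```python
import re

text = "our client called; our client paid"
term = "our client"

# For a plain (regex-free) term, the two counting methods agree.
print(text.count(term))             # 2
print(len(re.findall(term, text)))  # 2
```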
Finally, we can aggregate to get back a data frame containing only the distinct Text values:
res = df.groupBy("Text").agg(
    F.collect_set("keyword").alias("Keywords"),
    F.sum("whitelist_counts").alias("whitelist_counts"))
res.show()
+-----------------+-------------+----------------+
| Text| Keywords|whitelist_counts|
+-----------------+-------------+----------------+
|this client has l| [client, LA]| 0|
|the client as ada|[ada, client]| 0|
+-----------------+-------------+----------------+
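As a sanity check, the whole join-count-aggregate pipeline can be replayed on the sample data in plain Python (a sketch mirroring the Spark steps above; variable names are made up):

```python
whitelist = [("LA", "LA city"), ("LA", "US LA in da"),
             ("client", "this client has i"), ("client", "our client")]
texts = [("the client as ada", "client;ada"),
         ("this client has l", "client;LA")]

result = {}
for text, keywords in texts:
    kws = keywords.split(";")
    # Sum substring occurrences of every whitelist term of every keyword.
    total = sum(text.count(term)
                for k in kws
                for kw, term in whitelist if kw == k)
    result[text] = (sorted(kws), total)

for text, (kws, total) in result.items():
    print(text, kws, total)
```

On this sample data both totals come out to 0, matching res.show() above.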