Converting a list of strings to a binary list in PySpark

Time: 2019-10-09 11:48:41

Tags: apache-spark pyspark apache-spark-sql pyspark-dataframes

I have a DataFrame like this:

data = [(("ID1", ['October', 'September', 'August'])), (("ID2", ['August', 'June', 'May'])), 
    (("ID3", ['October', 'June']))]
df = spark.createDataFrame(data, ["ID", "MonthList"])
df.show(truncate=False)

+---+----------------------------+
|ID |MonthList                   |
+---+----------------------------+
|ID1|[October, September, August]|
|ID2|[August, June, May]         |
|ID3|[October, June]             |
+---+----------------------------+

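I want to compare each row against a default list: for every entry of the default list, assign 1 if the row contains that value and 0 otherwise.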

default_month_list = ['October', 'September', 'August', 'July', 'June', 'May']

So my expected output is:

+---+----------------------------+------------------+
|ID |MonthList                   |Binary_MonthList  |
+---+----------------------------+------------------+
|ID1|[October, September, August]|[1, 1, 1, 0, 0, 0]|
|ID2|[August, June, May]         |[0, 0, 1, 0, 1, 1]|
|ID3|[October, June]             |[1, 0, 0, 0, 1, 0]|
+---+----------------------------+------------------+

I can do this in plain Python, but I don't know how to do it in PySpark.

3 Answers:

Answer 0 (score: 4)

You can try a udf like this:

from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, IntegerType

default_month_list = ['October', 'September', 'August', 'July', 'June', 'May']

def_month_list_func = udf(lambda x: [1 if i in x else 0 for i in default_month_list], ArrayType(IntegerType()))

df = df.withColumn("Binary_MonthList", def_month_list_func(col("MonthList")))

df.show()
# output
+---+--------------------+------------------+
| ID|           MonthList|  Binary_MonthList|
+---+--------------------+------------------+
|ID1|[October, Septemb...|[1, 1, 1, 0, 0, 0]|
|ID2| [August, June, May]|[0, 0, 1, 0, 1, 1]|
|ID3|     [October, June]|[1, 0, 0, 0, 1, 0]|
+---+--------------------+------------------+
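
One case the lambda above does not cover is a null MonthList: `i in x` raises a TypeError when x is None. Below is a minimal sketch of a null-safe plain-Python helper that could be wrapped in the udf instead. The name encode_months is hypothetical (it is not part of PySpark), and treating a null list as "no months present" is an assumption.

```python
default_month_list = ['October', 'September', 'August', 'July', 'June', 'May']

def encode_months(month_list, vocabulary=default_month_list):
    """Return one 0/1 flag per vocabulary entry, in vocabulary order."""
    # Assumption: a missing (None) list is treated as "no months present".
    present = set(month_list or [])
    return [1 if month in present else 0 for month in vocabulary]

print(encode_months(['August', 'June', 'May']))  # [0, 0, 1, 0, 1, 1]
print(encode_months(None))                       # [0, 0, 0, 0, 0, 0]
```

If this fits your data, it should be possible to pass the helper to udf in place of the lambda, e.g. udf(encode_months, ArrayType(IntegerType())), with the rest of the answer unchanged.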

Answer 1 (score: 3)

How about using array_contains():

from pyspark.sql.functions import array, array_contains

# One boolean column per default month, cast to int and packed into a single array column
df.withColumn(
    'Binary_MonthList',
    array([array_contains('MonthList', c).astype('int') for c in default_month_list])
).show()
+---+--------------------+------------------+
| ID|           MonthList|  Binary_MonthList|
+---+--------------------+------------------+
|ID1|[October, Septemb...|[1, 1, 1, 0, 0, 0]|
|ID2| [August, June, May]|[0, 0, 1, 0, 1, 1]|
|ID3|     [October, June]|[1, 0, 0, 0, 1, 0]|
+---+--------------------+------------------+

Answer 2 (score: 2)

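pissall's answer is perfectly fine. I'm just posting a more general solution that works without a udf and does not require you to know the possible values in advance.

CountVectorizer does pretty much what you need. The algorithm adds every distinct value to its vocabulary, as long as the value meets certain criteria (for example a minimum or maximum number of occurrences). You can then apply the fitted model to the DataFrame, and it returns a sparse vector column (which can be converted to a dense vector column) that encodes, one-hot style, which vocabulary terms appear in the given input column.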

from pyspark.ml.feature import CountVectorizer

data = [(("ID1", ['October', 'September', 'August']))
        , (("ID2", ['August', 'June', 'May', 'August']))
        , (("ID3", ['October', 'June']))]
df = spark.createDataFrame(data, ["ID", "MonthList"])

df.show(truncate=False)

# binary=True only records whether a vocabulary term is present, not how often it occurs
# vocabSize sets the maximum size of the vocabulary
# minDF=1.0 sets how many rows a value must appear in to be added to the vocabulary (1.0 means one row is enough)
cv = CountVectorizer(inputCol="MonthList", outputCol="Binary_MonthList", vocabSize=12, minDF=1.0, binary=True)

cvModel = cv.fit(df)

df = cvModel.transform(df)

df.show(truncate=False)

cvModel.vocabulary

Output:

+---+----------------------------+
|ID |MonthList                   |
+---+----------------------------+
|ID1|[October, September, August]|
|ID2|[August, June, May, August] |
|ID3|[October, June]             |
+---+----------------------------+

+---+----------------------------+-------------------------+
|ID |MonthList                   |Binary_MonthList         |
+---+----------------------------+-------------------------+
|ID1|[October, September, August]|(5,[1,2,3],[1.0,1.0,1.0])|
|ID2|[August, June, May, August] |(5,[0,1,4],[1.0,1.0,1.0])|
|ID3|[October, June]             |(5,[0,2],[1.0,1.0])      |
+---+----------------------------+-------------------------+

['June', 'August', 'October', 'September', 'May']
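
Note that the sparse vectors above are indexed by the fitted vocabulary, not by default_month_list, so they do not line up with the expected output in the question directly. Here is a plain-Python sketch of how one row's active indices map back onto the default order. The helper name sparse_to_default_order is hypothetical, and the vocabulary and indices are copied from the output above.

```python
def sparse_to_default_order(indices, vocabulary, default_list):
    """Map the active indices of a sparse vector onto default_list order."""
    # Terms whose index is active in the sparse vector.
    present = {vocabulary[i] for i in indices}
    # Months missing from the fitted vocabulary (e.g. 'July') simply get 0.
    return [1 if month in present else 0 for month in default_list]

vocabulary = ['June', 'August', 'October', 'September', 'May']
default_month_list = ['October', 'September', 'August', 'July', 'June', 'May']

# ID1 row: (5,[1,2,3],[1.0,1.0,1.0])
print(sparse_to_default_order([1, 2, 3], vocabulary, default_month_list))
# [1, 1, 1, 0, 0, 0]
```

Because binary=True stores 1.0 for every active index, the indices alone are enough here. If the order of the default list matters inside Spark itself, building the model from a fixed vocabulary with CountVectorizerModel.from_vocabulary (Spark 2.4+) instead of fitting it may be worth a look.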