How do I create a new column based on matching values between two different lists in a dataframe?

Date: 2019-11-19 06:20:54

Tags: python-3.x pyspark apache-spark-sql pyspark-sql pyspark-dataframes

I have a pyspark dataframe like this:

+--------------------+--------------------+
|               label|           sentences|
+--------------------+--------------------+
|[things, we, eati...|<p>I am construct...|
|[elephants, nordi...|<p><strong>Edited...|
|[bee, cross-entro...|<p>I have a data ...|
|[milking, markers...|<p>There is an Ma...|
|[elephants, tease...|<p>I have Score d...|
|[references, gene...|<p>I'm looking fo...|
|[machines, exitin...|<p>I applied SVM ...|
+--------------------+--------------------+

And a top_ten list that looks like this:

['bee', 'references', 'milking', 'expert', 'bombardier', 'borscht', 'distributions', 'wires', 'keyboard', 'correlation']

I need to create a new_label column with a value of 1.0 if at least one of the label values is present in the top_ten list (per row, of course).
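
In plain Python, the per-row logic I am after boils down to something like this (made-up values, just to illustrate the goal):

top_ten = ['bee', 'references', 'milking']   # shortened for the illustration
labels = ['bee', 'cross-entropy']            # the label value of one row
new_label = 1.0 if any(l in top_ten for l in labels) else 0.0   # -> 1.0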

While the logic seems sound, my inexperience with the syntax is getting in the way. Surely there is a short answer to this?

Here is what I have tried:

temp = train_df.withColumn('label', F.when(lambda x: x.isin(top_ten), 1.0).otherwise(0.0))

and this:

def matching_top_ten(top_ten, labels):
    for label in labels:
        if label.isin(top_ten):
            return 1.0
        else:
            return 0.0

After that last attempt I figured out that functions like this can't simply be mapped onto a dataframe. So I suppose I could convert the column to an RDD, map it, and then .join() it back, but that sounds unnecessarily tedious.
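
For reference, the detour I have in mind would look roughly like this (untested sketch; it maps whole rows so nothing actually needs to be joined back):

from pyspark.sql import Row

top_ten_set = set(top_ten)
temp = train_df.rdd.map(
    lambda row: Row(
        label=row['label'],
        sentences=row['sentences'],
        new_label=1.0 if any(l in top_ten_set for l in row['label']) else 0.0
    )
).toDF()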

**Update:** I tried using the above function as a UDF as well, with no luck either...

from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType
matching_udf = udf(matching_top_ten, FloatType())
temp = train_df.select('label', matching_udf(top_ten, 'label').alias('new_labels'))
----
TypeError: Invalid argument, not a string or column: [...top_ten list values...] of type <class 'list'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.

I have found other similar questions on SO; however, none of them involve the logic of checking a list against another list (at best, a single value against a list).

3 Answers:

Answer 0 (score: 2):

You don't need to use a udf, and you can avoid the expense of explode + agg.

Spark version 2.4+

You can use pyspark.sql.functions.arrays_overlap:

import pyspark.sql.functions as F

top_ten_array = F.array(*[F.lit(val) for val in top_ten])

temp = train_df.withColumn(
    'new_label', 
    F.when(F.arrays_overlap('label', top_ten_array), 1.0).otherwise(0.0)
)

Alternatively, you should be able to use pyspark.sql.functions.array_intersect():

temp = train_df.withColumn(
    'new_label', 
    F.when(
        F.size(F.array_intersect('label', top_ten_array)) > 0, 1.0
    ).otherwise(0.0)
)

Both of these check whether the intersection of label and top_ten is non-empty.
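
For illustration, here is a minimal, self-contained run of the arrays_overlap version on made-up data (Spark 2.4+ assumed; the values are not from the question):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

top_ten = ['bee', 'references', 'milking']  # shortened for the illustration
df = spark.createDataFrame(
    [(['bee', 'cross-entropy'],), (['machines', 'exiting'],)],
    ['label']
)

top_ten_array = F.array(*[F.lit(v) for v in top_ten])
df.withColumn(
    'new_label',
    F.when(F.arrays_overlap('label', top_ten_array), 1.0).otherwise(0.0)
).show()
# the first row gets new_label = 1.0 ('bee' matches), the second gets 0.0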


For Spark versions 1.5 through 2.3, you can use array_contains in a loop over top_ten:

from operator import or_
from functools import reduce

temp = train_df.withColumn(
    'new_label',
    F.when(
        reduce(or_, [F.array_contains('label', val) for val in top_ten]),
        1.0
    ).otherwise(0.0)
)

This tests whether label contains any of the values in top_ten and reduces the results with a bitwise OR. It will only return True if one of the values in label is contained in top_ten.
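
In other words, the reduce simply builds one chained boolean column expression; for a three-value top_ten it would be equivalent to writing:

# hand-expanded version of the reduce for top_ten = ['bee', 'references', 'milking']
condition = (
    F.array_contains('label', 'bee')
    | F.array_contains('label', 'references')
    | F.array_contains('label', 'milking')
)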

Answer 1 (score: 0):

You can create a new column holding the top-ten list as an array, split the sentence column into its individual words as an array, and then apply a udf in the following way:

import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType

top_ten_list = ['bee', 'references', 'milking', 'expert', 'bombardier', 'borscht', 'distributions', 'wires', 'keyboard', 'correlation']
df = df.withColumn("top_ten_list", F.array([F.lit(x) for x in top_ten_list]))

def matching_top_ten(words, top_ten_ls):
    # both arguments arrive as array<string> columns, i.e. Python lists of strings
    if len(set(words).intersection(set(top_ten_ls))) > 0:
        return 1
    return 0

matching_top_ten_udf = F.udf(matching_top_ten, IntegerType())

df = df.withColumn("label_flag", matching_top_ten_udf(F.col("label"), F.col("top_ten_list")))
df = df.withColumn("split_sentence", F.split("sentence", " ")).withColumn("label_flag", matching_top_ten_udf(F.col("split_sentence"), F.col("top_ten_list")))

You can skip the first step, since I can see you already have your top-ten list and the label column is already an array (a trimmed sketch adapted to your schema follows the sample output below).

Sample output from the df I used (which does not have the same schema as yours):

  customer  Month  year  spend        ls1                    sentence                      sentence_split  label
0        a     11  2018   -800  [new, me]  This is a new thing for me  [This, is, a, new, thing, for, me]      1
1        a     12  2018   -800  [new, me]  This is a new thing for me  [This, is, a, new, thing, for, me]      1
2        a      1  2019    300  [new, me]  This is a new thing for me  [This, is, a, new, thing, for, me]      1
3        a      2  2019    150  [new, me]  This is a new thing for me  [This, is, a, new, thing, for, me]      1
4        a      3  2019    300  [new, me]  This is a new thing for me  [This, is, a, new, thing, for, me]      1
5        a      4  2019   -500  [new, me]  This is a new thing for me  [This, is, a, new, thing, for, me]      1
6        a      5  2019   -800  [new, me]  This is a new thing for me  [This, is, a, new, thing, for, me]      1
7        a      6  2019    600  [new, me]  This is a new thing for me  [This, is, a, new, thing, for, me]      1
8        a      7  2019   -400  [new, me]  This is a new thing for me  [This, is, a, new, thing, for, me]      1
9        a      8  2019   -800  [new, me]  This is a new thing for me  [This, is, a, new, thing, for, me]      1
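
Adapted to the question's schema (only the label column, which is already an array, needs checking), a trimmed sketch of the same idea might look like this; the FloatType return type is just to match the 1.0/0.0 the question asks for:

import pyspark.sql.functions as F
from pyspark.sql.types import FloatType

def matching_top_ten(labels, top_ten_ls):
    # both arguments arrive as Python lists of strings inside the udf
    return 1.0 if set(labels) & set(top_ten_ls) else 0.0

matching_top_ten_udf = F.udf(matching_top_ten, FloatType())

train_df = train_df.withColumn("top_ten_list", F.array([F.lit(x) for x in top_ten_list]))
train_df = train_df.withColumn("new_label", matching_top_ten_udf("label", "top_ten_list"))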

Answer 2 (score: 0):

You can explode the label column and join the dataframe with a dataframe created from the list, which avoids using an inefficient UDF:

from pyspark.sql.functions import monotonically_increasing_id, explode, col, when, collect_list, first, sum

# creating an id to group the exploded rows back together later
train_df = train_df.withColumn("id", monotonically_increasing_id())

# Exploding column
train_df = train_df.withColumn("label", explode("label"))

# Creation of dataframe with the top ten list
top_df = sqlContext.createDataFrame(
    [(x,) for x in ['bee', 'references', 'milking', 'expert', 'bombardier', 'borscht',
                    'distributions', 'wires', 'keyboard', 'correlation']],
    ['top']
)

# Join to keep elements
train_df = train_df.join(top_df, col("label") == col("top"), "left")

# Replace nulls with 0s or 1s
train_df = train_df.withColumn("top", when(col("top").isNull(),0).otherwise(1))

# Group results
train_df = train_df.groupby("id").agg(collect_list("label").alias("label"), first("sentences").alias("sentences"), sum("top").alias("new_label"))

# drop id and transform label column to be 1 or 0
train_df = train_df.withColumn("new_label", when(col("new_label")>0,1).otherwise(0))
train_df = train_df.drop("id")