Question

I have a DataFrame, so assume my data is in tabular format.

|ID   |Serial                      |Updated           |
-------------------------------------------------------
|10   |pers1                       |                  |
|20   |                            |                  |
|30   |entity_1, entity_2, entity_3|entity_1, entity_3|

Now, using withColumn("Serial", explode(split("Serial", ","))), I have split the column into multiple rows, as shown below. This was part 1 of the requirement.

|ID   |Serial          |Updated           |
-------------------------------------------
|10   |pers1           |                  |
|20   |                |                  |
|30   |entity_1        |entity_1, entity_3|
|30   |entity_2        |entity_1, entity_3|
|30   |entity_3        |entity_1, entity_3|

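Now, for rows where the column has no value, the result should be 0. For a value present in the 'Serial' column, it should be searched for in the 'Updated' column: if the value is present in 'Updated', it should show '1', otherwise '2'.

So in this case, for entity_1 and entity_3 -> 1 must be shown, and for entity_2 -> 2 should be shown.

How can I achieve this?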

Answer 1

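AFAIK, there is no way to directly check whether one column contains, or is a substring of, another column without using a udf.

However, if you want to avoid using a udf, one approach is to explode the "Updated" column. You can then check for equality between the "Serial" column and the exploded "Updated" column and apply your conditions (1 if they match, 2 otherwise); call this "contains".

Finally, you can groupBy("ID", "Serial", "Updated") and select the minimum of the "contains" column.

For example, after the two calls to explode() and checking your condition, you will have a DataFrame like this: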

import pyspark.sql.functions as f

df.withColumn("Serial", f.explode(f.split("Serial", ",")))\
    .withColumn("updatedExploded", f.explode(f.split("Updated", ",")))\
    .withColumn(
        "contains",
        f.when(
            # 0: either column is null or an empty string
            f.isnull("Serial") |
            f.isnull("Updated") |
            (f.col("Serial") == "") |
            (f.col("Updated") == ""),
            0
        ).when(
            # 1: this Serial value equals this exploded Updated value
            f.col("Serial") == f.col("updatedExploded"),
            1
        ).otherwise(2)  # 2: no match for this pair
    )\
    .show(truncate=False)
#+---+--------+-----------------+---------------+--------+
#|ID |Serial  |Updated          |updatedExploded|contains|
#+---+--------+-----------------+---------------+--------+
#|10 |pers1   |                 |               |0       |
#|20 |        |                 |               |0       |
#|30 |entity_1|entity_1,entity_3|entity_1       |1       |
#|30 |entity_1|entity_1,entity_3|entity_3       |2       |
#|30 |entity_2|entity_1,entity_3|entity_1       |2       |
#|30 |entity_2|entity_1,entity_3|entity_3       |2       |
#|30 |entity_3|entity_1,entity_3|entity_1       |2       |
#|30 |entity_3|entity_1,entity_3|entity_3       |1       |
#+---+--------+-----------------+---------------+--------+

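The "trick" of grouping by ("ID", "Serial", "Updated") and taking the minimum of "contains" works because:

If either "Serial" or "Updated" is null (or, in this case, equal to an empty string), the value will be 0.
If at least one of the values in "Updated" matches "Serial", one of the rows will have a 1.
If there are no matches, there will only be 2s.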
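
To make the reasoning above concrete, here is a minimal plain-Python model of the explode-then-min step. It does not use Spark at all; pair_flag and row_flag are hypothetical helper names introduced only for illustration, and the sample strings are taken from the example output above.

```python
# Plain-Python model of the explode + min aggregation (no Spark required).
# pair_flag and row_flag are hypothetical helper names, not Spark APIs.

def pair_flag(serial, updated, updated_item):
    """Flag for one (Serial, updatedExploded) pair, mirroring the when() chain."""
    if serial is None or updated is None or serial == "" or updated == "":
        return 0  # a missing value on either side wins first
    if serial == updated_item:
        return 1  # this pair matches
    return 2      # this pair does not match


def row_flag(serial, updated):
    """Minimum flag over every item of Updated, like groupBy(...).agg(min)."""
    items = updated.split(",") if updated is not None else [None]
    return min(pair_flag(serial, updated, item) for item in items)


flags = {s: row_flag(s, "entity_1,entity_3")
         for s in ["entity_1", "entity_2", "entity_3"]}
print(flags)                  # {'entity_1': 1, 'entity_2': 2, 'entity_3': 1}
print(row_flag("pers1", ""))  # 0
```

Because 0 < 1 < 2, taking the minimum gives "missing" priority over "match", and "match" priority over "no match", which is exactly the ordering the requirement asks for.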

Final output:

# Assumes: import pyspark.sql.functions as f
df.withColumn("Serial", f.explode(f.split("Serial", ",")))\
    .withColumn("updatedExploded", f.explode(f.split("Updated", ",")))\
    .withColumn(
        "contains",
        f.when(
            # 0: either column is null or an empty string
            f.isnull("Serial") |
            f.isnull("Updated") |
            (f.col("Serial") == "") |
            (f.col("Updated") == ""),
            0
        ).when(
            # 1: this Serial value equals this exploded Updated value
            f.col("Serial") == f.col("updatedExploded"),
            1
        ).otherwise(2)  # 2: no match for this pair
    )\
    .groupBy("ID", "Serial", "Updated")\
    .agg(f.min("contains").alias("contains"))\
    .sort("ID")\
    .show(truncate=False)
#+---+--------+-----------------+--------+
#|ID |Serial  |Updated          |contains|
#+---+--------+-----------------+--------+
#|10 |pers1   |                 |0       |
#|20 |        |                 |0       |
#|30 |entity_3|entity_1,entity_3|1       |
#|30 |entity_2|entity_1,entity_3|2       |
#|30 |entity_1|entity_1,entity_3|1       |
#+---+--------+-----------------+--------+

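I am chaining calls to pyspark.sql.functions.when() to check the conditions. The first part checks whether either column is null or equal to an empty string. I believe you probably only need to check for null in your actual data, but I added the check for the empty string based on how you displayed your example DataFrame.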
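
One way to sanity-check the Spark result on a small sample is to compute the expected labels in plain Python. The sketch below is only a reference model under stated assumptions: expected_flag is a hypothetical helper (not part of PySpark), it treats None and empty strings alike, and it trims whitespace around each item, since the question's data uses ", " as the separator while the example above splits on "," alone. In Spark itself, the second argument of split is a regular expression, so a pattern such as ",\s*" may serve the same purpose. Also note that explode drops rows whose array is null, so explode_outer may be needed if the real data holds nulls; this is worth verifying on your Spark version.

```python
# Reference labels computed in plain Python, for cross-checking Spark output.
# expected_flag is a hypothetical helper, not part of PySpark.

def expected_flag(serial, updated):
    """Return 0 for missing values, 1 if serial is listed in updated, else 2."""
    serial = (serial or "").strip()
    updated = (updated or "").strip()
    if not serial or not updated:
        return 0
    # Trim each item so "entity_1, entity_3" and "entity_1,entity_3" behave alike.
    listed = {item.strip() for item in updated.split(",")}
    return 1 if serial in listed else 2


print(expected_flag(" entity_3", "entity_1, entity_3"))  # 1
print(expected_flag("entity_2", "entity_1, entity_3"))   # 2
print(expected_flag(None, "entity_1"))                    # 0
```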

How to match 2 columns in Apache Spark - PySpark
