Question

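I have found many solutions that deal with the join case. My question is: if the duplicates exist in the DataFrame itself, how can I detect and remove them? The example below just shows how I create a DataFrame with duplicate columns.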

df = spark.createDataFrame([
(1, "A", "X1"),
(2, "B", "X2"),
(3, "B", "X3"),
], ["ID", "TYPE", "CODE"])

df1 = df.withColumn("TYPE1", df["TYPE"]).withColumn("TYPE2", df["TYPE"])

+---+----+----+
| ID|TYPE|CODE|
+---+----+----+
|  1|   A|  X1|
|  2|   B|  X2|
|  3|   B|  X3|
+---+----+----+

+---+----+----+-----+-----+
| ID|TYPE|CODE|TYPE1|TYPE2|
+---+----+----+-----+-----+
|  1|   A|  X1|    A|    A|
|  2|   B|  X2|    B|    B|
|  3|   B|  X3|    B|    B|
+---+----+----+-----+-----+

Suppose I have just been given df1. How can I remove the duplicate columns to get back df? Thanks!

Answer 1

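You can remove the duplicate columns by comparing every unique pair of columns that might be identical. The combinations function from the itertools library generates these unique pairs: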

from itertools import combinations
# select the columns that might be identical; this can also be a hardcoded list
candidates = [c for c in df1.columns if 'TYPE' in c]
# we only want pairwise comparisons, so the second argument of combinations is 2
# (kept as a list of tuples so that each pair can be indexed under Python 3)
permutations = list(combinations(candidates, 2))
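To see which pairs this produces for the example, here is a small sketch in plain Python (no Spark needed). The column list is copied by hand from df1 in the question, and the variable names are only illustrative.

```python
from itertools import combinations

# Column names of df1 from the question, written out by hand.
columns = ["ID", "TYPE", "CODE", "TYPE1", "TYPE2"]

# Keep only the columns that might be duplicates of each other.
candidates = [c for c in columns if "TYPE" in c]

# Every unordered pair appears exactly once.
pairs = list(combinations(candidates, 2))
print(pairs)
# prints [('TYPE', 'TYPE1'), ('TYPE', 'TYPE2'), ('TYPE1', 'TYPE2')]
```

Three candidate columns give three comparisons; in general n candidates give n*(n-1)/2, so it pays to keep the candidate list short.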

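For each of these pairs, you can then check whether the two columns are completely identical by combining a filter statement with a count.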

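columns_to_drop = set()
for first, second in permutations:
    # no row where the two columns differ means they are identical
    # (note: != evaluates to null when either value is null, and filter skips such rows)
    if df1.filter(df1[first] != df1[second]).count() == 0:
        columns_to_drop.add(second)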

This gives you the set of columns to drop. You can then use the following list comprehension to drop these duplicate columns.

df1.select([c for c in df1.columns if c not in columns_to_drop]).show()

For your example, this gives the following output:

+---+----+----+
| ID|TYPE|CODE|
+---+----+----+
|  1|   A|  X1|
|  2|   B|  X2|
|  3|   B|  X3|
+---+----+----+
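One caveat with the != comparison: when either value is null the result is null, so filter skips that row and two columns that differ only where one of them is null would be reported as identical. A possible variant, as a sketch only, uses Column.eqNullSafe instead (available in recent PySpark versions). The helper names columns_equal and find_duplicate_columns are made up for this illustration, and the pair logic is kept separate from Spark so it can be tried on plain lists.

```python
from itertools import combinations


def columns_equal(df, first, second):
    """Spark check: True when no row differs between the two columns.

    Assumes df is a pyspark.sql.DataFrame. eqNullSafe treats two nulls as
    equal and a null versus a value as different; limit(1) means at most
    one differing row has to be returned before the count is taken.
    """
    return df.filter(~df[first].eqNullSafe(df[second])).limit(1).count() == 0


def find_duplicate_columns(candidates, same):
    """Return the later column of every identical pair, in discovery order.

    same(a, b) is any callable that says whether two columns are identical.
    """
    to_drop = []
    for first, second in combinations(candidates, 2):
        if first in to_drop or second in to_drop:
            continue  # already known to duplicate an earlier column
        if same(first, second):
            to_drop.append(second)
    return to_drop


# Spark usage with df1 from the question (commented out: needs a SparkSession):
# candidates = [c for c in df1.columns if "TYPE" in c]
# dupes = find_duplicate_columns(candidates, lambda a, b: columns_equal(df1, a, b))
# df1.drop(*dupes).show()

# The pair logic can be tried without Spark on plain lists:
data = {"TYPE": ["A", "B", "B"], "CODE": ["X1", "X2", "X3"],
        "TYPE1": ["A", "B", "B"], "TYPE2": ["A", "B", "B"]}
dupes = find_duplicate_columns(list(data), lambda a, b: data[a] == data[b])
print(dupes)
# prints ['TYPE1', 'TYPE2']
```

Skipping pairs that involve an already flagged column avoids launching a Spark job whose answer is already implied by an earlier comparison.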
