Removing duplicate columns after joining DFs in Spark

Time: 2017-10-26 01:33:40

Tags: python pyspark

When you join two DFs with similar column names:

df = df1.join(df2, df1['id'] == df2['id'])

The join works fine, but you can't reference the id column afterwards because it is ambiguous, and you get the following exception:

pyspark.sql.utils.AnalysisException: "Reference 'id' is ambiguous, could be: id#5691, id#5918.;"

This makes id effectively unusable...
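
For example, a select on the shared name (a hypothetical call, assuming the join above) is what triggers that exception:

df.select('id')   # raises pyspark.sql.utils.AnalysisException: Reference 'id' is ambiguous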

The following function works around the problem:

def join(df1, df2, cond, how='left'):
    df = df1.join(df2, cond, how=how)
    repeated_columns = [c for c in df1.columns if c in df2.columns]
    for col in repeated_columns:
        df = df.drop(df2[col])
    return df
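
A quick usage sketch of the helper above, assuming the same df1 and df2 with a shared id column:

joined = join(df1, df2, df1['id'] == df2['id'], how='inner')
joined.select('id').show()   # 'id' now resolves unambiguously: only df1's copy remains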

What I don't like is that I have to iterate over the column names and drop them one by one. This seems really clunky...

Do you know of any other solution that will either join and remove duplicates more elegantly, or drop multiple columns without iterating over each of them?

7 Answers:

Answer 0 (score: 9)

If the join column has the same name in both DataFrames and you only need an equi-join, you can specify the join columns as a list, in which case the result keeps only one copy of each join column:

df1.show()
+---+----+
| id|val1|
+---+----+
|  1|   2|
|  2|   3|
|  4|   4|
|  5|   5|
+---+----+

df2.show()
+---+----+
| id|val2|
+---+----+
|  1|   2|
|  1|   3|
|  2|   4|
|  3|   5|
+---+----+

df1.join(df2, ['id']).show()
+---+----+----+
| id|val1|val2|
+---+----+----+
|  1|   2|   2|
|  1|   2|   3|
|  2|   3|   4|
+---+----+----+

Otherwise you need to give the joined DataFrames aliases and refer to the duplicated columns through those aliases afterwards:

df1.alias("a").join(
    df2.alias("b"), df1['id'] == df2['id']
).select("a.id", "a.val1", "b.val2").show()
+---+----+----+
| id|val1|val2|
+---+----+----+
|  1|   2|   2|
|  1|   2|   3|
|  2|   3|   4|
+---+----+----+

Answer 1 (score: 7)

With df.join(other, on, how), when on is a column-name string or a list of column-name strings, the returned DataFrame will not contain duplicate columns. When on is a join expression, duplicate columns will result. We can then use .drop(df.a) to remove the duplicated column. Example:

cond = [df.a == other.a, df.b == other.bb, df.c == other.ccc]
# result will have duplicate column a
result = df.join(other, cond, 'inner').drop(df.a)
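
A self-contained sketch of the snippet above (the sample data and extra column names here are assumptions for illustration, not from the original answer):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# assumed example data matching the snippet above
df = spark.createDataFrame([(1, 10, 100)], ['a', 'b', 'c'])
other = spark.createDataFrame([(1, 10, 100, 'x')], ['a', 'bb', 'ccc', 'extra'])

cond = [df.a == other.a, df.b == other.bb, df.c == other.ccc]
# df.a is dropped, so only other's copy of column a remains in the result
result = df.join(other, cond, 'inner').drop(df.a)
result.show()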

Answer 2 (score: 2)

Assume "a" is a DataFrame with an "id" column and "b" is another DataFrame with an "id" column.

I use the following two methods to remove the duplicates:

Method 1: use a string join expression instead of a boolean expression. This automatically removes the duplicate column for you:

a.join(b, 'id')

Method 2: rename the column before the join, then drop it after the join:

b = b.withColumnRenamed('id', 'b_id')   # withColumnRenamed returns a new DataFrame
joinexpr = a['id'] == b['b_id']
a.join(b, joinexpr).drop('b_id')

Answer 3 (score: 1)

The following code works with Spark 1.6.0 and above.

salespeople_df.show()
+---+------+-----+
|Num|  Name|Store|
+---+------+-----+
|  1| Henry|  100|
|  2| Karen|  100|
|  3|  Paul|  101|
|  4| Jimmy|  102|
|  5|Janice|  103|
+---+------+-----+

storeaddress_df.show()
+-----+--------------------+
|Store|             Address|
+-----+--------------------+
|  100|    64 E Illinos Ave|
|  101|         74 Grand Pl|
|  102|          2298 Hwy 7|
|  103|No address available|
+-----+--------------------+

Assuming, in this example, that the shared column has the same name in both DataFrames:

joined=salespeople_df.join(storeaddress_df, ['Store'])
joined.orderBy('Num', ascending=True).show()

+-----+---+------+--------------------+
|Store|Num|  Name|             Address|
+-----+---+------+--------------------+
|  100|  1| Henry|    64 E Illinos Ave|
|  100|  2| Karen|    64 E Illinos Ave|
|  101|  3|  Paul|         74 Grand Pl|
|  102|  4| Jimmy|          2298 Hwy 7|
|  103|  5|Janice|No address available|
+-----+---+------+--------------------+

.join will prevent the duplication of the shared column.

Let's say you also want to remove the column Num in this example; you can simply use .drop('colname'):

joined=joined.drop('Num')
joined.show()

+-----+------+--------------------+
|Store|  Name|             Address|
+-----+------+--------------------+
|  103|Janice|No address available|
|  100| Henry|    64 E Illinos Ave|
|  100| Karen|    64 E Illinos Ave|
|  101|  Paul|         74 Grand Pl|
|  102| Jimmy|          2298 Hwy 7|
+-----+------+--------------------+

Answer 4 (score: 1)

In my case, my DataFrame had multiple duplicate columns after a join, and when I tried to write that DataFrame out in csv format I got an error because of the duplicate columns. I followed the steps below to remove them. The code is in Scala.

1) Rename all the duplicate columns and make a new dataframe
2) Make a separate list of all the renamed columns
3) Make a new dataframe with all columns (including the ones renamed in step 1)
4) Drop all the renamed columns

import scala.collection.mutable
import org.apache.spark.sql.DataFrame

private def removeDuplicateColumns(dataFrame: DataFrame): DataFrame = {
  val allColumns: mutable.MutableList[String] = mutable.MutableList()
  val dup_Columns: mutable.MutableList[String] = mutable.MutableList()
  dataFrame.columns.foreach((i: String) => {
    if (allColumns.contains(i)) {
      // already seen this name: keep a renamed "dup_" copy and remember it for dropping
      allColumns += "dup_" + i
      dup_Columns += "dup_" + i
    } else {
      allColumns += i
    }
  })
  // rename all columns (duplicates get the "dup_" prefix), then drop the renamed duplicates
  val columnSeq = allColumns.toSeq
  val df = dataFrame.toDF(columnSeq: _*)
  val unDF = df.drop(dup_Columns: _*)
  unDF
}

To call the above function, use the code below and pass in your dataframe that contains the duplicate columns:

val uniColDF = removeDuplicateColumns(df)

Answer 5 (score: 0)

After joining multiple tables together, I run them through a simple function that drops columns from the DF if it encounters duplicates while walking from left to right. Alternatively, you could rename these columns too.

If Names is a table with columns ['Id', 'Name', 'DateId', 'Description'] and Dates is a table with columns ['Id', 'Date', 'Description'], the columns Id and Description will be duplicated after the join.

Names = sparkSession.sql("SELECT * FROM Names")
Dates = sparkSession.sql("SELECT * FROM Dates")
NamesAndDates = Names.join(Dates, Names.DateId == Dates.Id, "inner")
NamesAndDates = dropDupeDfCols(NamesAndDates)
NamesAndDates.saveAsTable("...", format="parquet", mode="overwrite", path="...")

where dropDupeDfCols is defined as:

def dropDupeDfCols(df):
    newcols = []
    dupcols = []
    for i in range(len(df.columns)):
        if df.columns[i] not in newcols:
            newcols.append(df.columns[i])
        else:
            dupcols.append(i)
    # temporarily rename every column to its positional index so duplicates can be dropped unambiguously
    df = df.toDF(*[str(i) for i in range(len(df.columns))])
    for dupcol in dupcols:
        df = df.drop(str(dupcol))
    return df.toDF(*newcols)

The resulting data frame will contain the columns ['Id', 'Name', 'DateId', 'Description', 'Date'].

Answer 6 (score: 0)

In pyspark you can join on multiple columns, along the lines of the sketch below.

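A minimal sketch of such a multi-column equi-join (the DataFrame contents and column names here are assumptions for illustration); passing the shared names as a list also keeps only one copy of each join column:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# assumed example data: both frames share the 'id' and 'date' columns
df1 = spark.createDataFrame([(1, '2017-10-26', 'a')], ['id', 'date', 'val1'])
df2 = spark.createDataFrame([(1, '2017-10-26', 'b')], ['id', 'date', 'val2'])

# joining on a list of shared column names keeps a single copy of each join column
df = df1.join(df2, ['id', 'date'], how='inner')
df.show()   # columns: id, date, val1, val2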

The original answer comes from: How to perform union on two DataFrames with different amounts of columns in spark?