When you join two DataFrames with similar column names:
df = df1.join(df2, df1['id'] == df2['id'])
The join works fine, but you can't call the id column afterwards because it is ambiguous, and you get the following exception:
pyspark.sql.utils.AnalysisException: "Reference 'id' is ambiguous, could be: id#5691, id#5918.;"
This makes id no longer usable...
The following function solves the problem:
def join(df1, df2, cond, how='left'):
    df = df1.join(df2, cond, how=how)
    repeated_columns = [c for c in df1.columns if c in df2.columns]
    for col in repeated_columns:
        df = df.drop(df2[col])
    return df
What I don't like about it is that I have to iterate over the column names and drop them one by one. This looks really clunky...
Do you know of any other solution that will either join and remove duplicates more elegantly, or delete multiple columns without iterating over each of them?
Answer 0 (score: 9)
If the join columns in both data frames have the same names and you only need an equi join, you can specify the join columns as a list, in which case the result will keep only one of the join columns:
df1.show()
+---+----+
| id|val1|
+---+----+
| 1| 2|
| 2| 3|
| 4| 4|
| 5| 5|
+---+----+
df2.show()
+---+----+
| id|val2|
+---+----+
| 1| 2|
| 1| 3|
| 2| 4|
| 3| 5|
+---+----+
df1.join(df2, ['id']).show()
+---+----+----+
| id|val1|val2|
+---+----+----+
| 1| 2| 2|
| 1| 2| 3|
| 2| 3| 4|
+---+----+----+
Otherwise you need to give the joined data frames aliases and refer to the duplicated columns by the alias later:
df1.alias("a").join(
df2.alias("b"), df1['id'] == df2['id']
).select("a.id", "a.val1", "b.val2").show()
+---+----+----+
| id|val1|val2|
+---+----+----+
| 1| 2| 2|
| 1| 2| 3|
| 2| 3| 4|
+---+----+----+
Answer 1 (score: 7)
With df.join(other, on, how), when on is a column name string, or a list of column name strings, the returned data frame will prevent duplicate columns. When on is a join expression, the result will contain duplicate columns, which we can remove with .drop(df.a). Example:
cond = [df.a == other.a, df.b == other.bb, df.c == other.ccc]
# result will have duplicate column a
result = df.join(other, cond, 'inner').drop(df.a)
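To make that example concrete, here is a self-contained sketch; the SparkSession and the sample frames df and other are invented so that their column names match the join condition above:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Invented sample data whose column names match cond above.
df = spark.createDataFrame([(1, 2, 3)], ['a', 'b', 'c'])
other = spark.createDataFrame([(1, 2, 3)], ['a', 'bb', 'ccc'])

cond = [df.a == other.a, df.b == other.bb, df.c == other.ccc]
result = df.join(other, cond, 'inner').drop(df.a)
print(result.columns)  # ['b', 'c', 'a', 'bb', 'ccc'] -- only one 'a' remains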
Answer 2 (score: 2)
Suppose "a" is a data frame with column "id" and "b" is another data frame with column "id".
I use the following two methods to remove duplicates:
Method 1: Use a string join expression instead of a boolean expression. This automatically removes the duplicated column for you:
a.join(b, 'id')
Method 2: Rename the column before the join and drop it afterwards:
b = b.withColumnRenamed('id', 'b_id')
joinexpr = a['id'] == b['b_id']
a.join(b, joinexpr).drop('b_id')
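For completeness, a runnable sketch of method 2 might look like this; the SparkSession and the sample data are invented for illustration:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Invented sample frames sharing an 'id' column.
a = spark.createDataFrame([(1, 'x'), (2, 'y')], ['id', 'val1'])
b = spark.createDataFrame([(1, 'p'), (3, 'q')], ['id', 'val2'])

# Rename the right-hand join column, join on it, then drop the renamed copy.
b = b.withColumnRenamed('id', 'b_id')
joined = a.join(b, a['id'] == b['b_id'], 'inner').drop('b_id')
joined.show()
# +---+----+----+
# | id|val1|val2|
# +---+----+----+
# |  1|   x|   p|
# +---+----+----+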
Answer 3 (score: 1)
The following code works with Spark 1.6.0 and above.
salespeople_df.show()
+---+------+-----+
|Num| Name|Store|
+---+------+-----+
| 1| Henry| 100|
| 2| Karen| 100|
| 3| Paul| 101|
| 4| Jimmy| 102|
| 5|Janice| 103|
+---+------+-----+
storeaddress_df.show()
+-----+--------------------+
|Store| Address|
+-----+--------------------+
| 100| 64 E Illinos Ave|
| 101| 74 Grand Pl|
| 102| 2298 Hwy 7|
| 103|No address available|
+-----+--------------------+
Assuming, in this example, that the shared column has the same name in both data frames:
joined=salespeople_df.join(storeaddress_df, ['Store'])
joined.orderBy('Num', ascending=True).show()
+-----+---+------+--------------------+
|Store|Num| Name| Address|
+-----+---+------+--------------------+
| 100| 1| Henry| 64 E Illinos Ave|
| 100| 2| Karen| 64 E Illinos Ave|
| 101| 3| Paul| 74 Grand Pl|
| 102| 4| Jimmy| 2298 Hwy 7|
| 103| 5|Janice|No address available|
+-----+---+------+--------------------+
.join will prevent the duplication of the shared column.
Assuming that you want to remove the column Num in this example, you can simply use .drop('colname'):
joined=joined.drop('Num')
joined.show()
+-----+------+--------------------+
|Store| Name| Address|
+-----+------+--------------------+
| 103|Janice|No address available|
| 100| Henry| 64 E Illinos Ave|
| 100| Karen| 64 E Illinos Ave|
| 101| Paul| 74 Grand Pl|
| 102| Jimmy| 2298 Hwy 7|
+-----+------+--------------------+
Answer 4 (score: 1)
In my case, my data frame had multiple duplicated columns after a join, and when I tried to write that data frame in csv format I got an error because of the duplicated columns. I followed the steps below to drop the duplicated columns. The code is in Scala.
1) Rename all the duplicate columns and make a new dataframe
2) Make a separate list of all the renamed columns
3) Make a new dataframe with all columns (including the renamed ones from step 1)
4) Drop all the renamed columns
import scala.collection.mutable
import org.apache.spark.sql.DataFrame

private def removeDuplicateColumns(dataFrame: DataFrame): DataFrame = {
  var allColumns: mutable.MutableList[String] = mutable.MutableList()
  val dup_Columns: mutable.MutableList[String] = mutable.MutableList()
  dataFrame.columns.foreach((i: String) => {
    if (allColumns.contains(i)) {
      // Already seen: register a renamed copy so it can be dropped later.
      allColumns += "dup_" + i
      dup_Columns += "dup_" + i
    } else {
      allColumns += i
    }
    println(i)
  })
  // Rebuild the frame with the renamed duplicates, then drop them.
  val columnSeq = allColumns.toSeq
  val df = dataFrame.toDF(columnSeq: _*)
  val unDF = df.drop(dup_Columns: _*)
  unDF
}
To call the above function, use the code below and pass in your dataframe containing the duplicate columns:
val uniColDF = removeDuplicateColumns(df)
Answer 5 (score: 0)
After joining multiple tables together, run them through a simple function that drops columns from the DF if it encounters duplicates while walking from left to right. Alternatively, you could rename these columns too.
If Names is a table with columns ['Id', 'Name', 'DateId', 'Description'] and Dates is a table with columns ['Id', 'Date', 'Description'], then the columns Id and Description will be duplicated after the join.
Names = sparkSession.sql("SELECT * FROM Names")
Dates = sparkSession.sql("SELECT * FROM Dates")
NamesAndDates = Names.join(Dates, Names.DateId == Dates.Id, "inner")
NamesAndDates = dropDupeDfCols(NamesAndDates)
NamesAndDates.saveAsTable("...", format="parquet", mode="overwrite", path="...")
Where dropDupeDfCols is defined as:
def dropDupeDfCols(df):
    newcols = []
    dupcols = []
    for i in range(len(df.columns)):
        if df.columns[i] not in newcols:
            newcols.append(df.columns[i])
        else:
            dupcols.append(i)
    # Rename every column to its positional index so the duplicates can be
    # dropped unambiguously, then restore the de-duplicated names.
    df = df.toDF(*[str(i) for i in range(len(df.columns))])
    for dupcol in dupcols:
        df = df.drop(str(dupcol))
    return df.toDF(*newcols)
The resulting data frame will contain the columns ['Id', 'Name', 'DateId', 'Description', 'Date'].
Answer 6 (score: 0)
In pyspark you can join on multiple columns as shown below:
df = (df.groupby([df['User'].ne(df['User'].shift(1)).cumsum().values, 'User'])['Message']
        .agg(' '.join).reset_index(level=1))
Original answer from: How to perform union on two DataFrames with different amounts of columns in spark?
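Note that the snippet above uses the pandas API rather than PySpark. For reference, an equi-join on multiple columns in PySpark itself would look like the following sketch; the sample frames and column names are invented for illustration:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Invented sample data for illustration.
left = spark.createDataFrame([(1, 'a', 10)], ['key1', 'key2', 'val_l'])
right = spark.createDataFrame([(1, 'a', 20)], ['key1', 'key2', 'val_r'])

# Passing the join columns as a list keeps a single copy of key1 and key2.
joined = left.join(right, ['key1', 'key2'], 'inner')
print(joined.columns)  # ['key1', 'key2', 'val_l', 'val_r']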