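How to resolve duplicate column names while joining two dataframes in PySpark?

Time: 2019-03-11 14:23:59

Tags: python apache-spark pyspark apache-spark-sql

I have two files, A and B, which are exactly the same. I am trying to perform inner and outer joins on these two dataframes. Since all of my columns are duplicates, the existing answers were no help. The other questions I have gone through have one or two duplicated columns; my issue is that the whole files are duplicates of each other, both in data and in column names.

My code: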

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql import DataFrameReader, DataFrameWriter
from datetime import datetime

import time

# @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

print("All imports were successful.")

df = spark.read.orc(
    's3://****'
)
print("First dataframe read with headers set to True")
df2 = spark.read.orc(
    's3://****'
)
print("Second dataframe read with headers set to True")

# df3 = df.join(df2, ['c_0'], "outer")

# df3 = df.join(
#     df2,
#     df["column_test_1"] == df2["column_1"],
#     "outer"
# )

df3 = df.alias('l').join(df2.alias('r'), on='c_0') #.collect()

print("Dataframes have been joined successfully.")
output_file_path = 's3://****'

df3.write.orc(
    output_file_path
)
print("Dataframe has been written to csv.")
job.commit()

The error I am facing is:

pyspark.sql.utils.AnalysisException: u'Duplicate column(s): "c_4", "c_38", "c_13", "c_27", "c_50", "c_16", "c_23", "c_24", "c_1", "c_35", "c_30", "c_56", "c_34", "c_7", "c_46", "c_49", "c_57", "c_45", "c_31", "c_53", "c_19", "c_25", "c_10", "c_8", "c_14", "c_42", "c_20", "c_47", "c_36", "c_29", "c_15", "c_43", "c_32", "c_5", "c_37", "c_18", "c_54", "c_3", "__created_at__", "c_51", "c_48", "c_9", "c_21", "c_26", "c_44", "c_55", "c_2", "c_17", "c_40", "c_28", "c_33", "c_41", "c_22", "c_11", "c_12", "c_52", "c_6", "c_39" found, cannot save to file.;'
End of LogType:stdout
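
The error lists every column except the join key `c_0`, which suggests that each non-key name appears once from each side of the join. A minimal sketch for checking this ahead of time is shown below; the helper name `colliding_columns` and the short column lists are hypothetical stand-ins for `df.columns` and `df2.columns`, not part of the original job.

```python
def colliding_columns(left_cols, right_cols, join_keys):
    """Return the non-key column names that appear in both column lists."""
    keys = set(join_keys)
    right = set(right_cols)
    # Preserve the left-hand column order so the result is easy to read.
    return [c for c in left_cols if c in right and c not in keys]


# Hypothetical column lists standing in for df.columns and df2.columns.
left_cols = ["c_0", "c_1", "c_2", "__created_at__"]
right_cols = ["c_0", "c_1", "c_2", "__created_at__"]

print(colliding_columns(left_cols, right_cols, ["c_0"]))
# → ['c_1', 'c_2', '__created_at__']
```

Any name this returns would need to be renamed on at least one side before the joined dataframe can be written out.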

3 Answers:

Answer 0: (score: 0)

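There is no shortcut here. PySpark expects the left and right dataframes to have distinct sets of field names (apart from the join key).

One solution is to prefix each field name with "left_" or "right_", as follows:

# Obtain columns lists
left_cols = df.columns
right_cols = df2.columns

# Prefix each dataframe's field with "left_" or "right_"
df = df.selectExpr([col + ' as left_' + col for col in left_cols])
df2 = df2.selectExpr([col + ' as right_' + col for col in right_cols])

# Perform join on the prefixed key columns
# (after the renaming above, neither dataframe has a plain "c_0" column)
df3 = df.join(df2, df['left_c_0'] == df2['right_c_0'])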
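
A variation on this approach is to leave the join key unprefixed, so that the join can still be written as `on='c_0'` and the key appears only once in the result. The sketch below only builds the expression strings; the function name `prefixed_exprs` and the short column list are assumptions made for illustration, and the Spark calls are shown in comments rather than executed.

```python
def prefixed_exprs(columns, prefix, join_keys):
    """Build selectExpr strings that prefix every column except the join keys."""
    keys = set(join_keys)
    return [
        # Backticks keep unusual names (e.g. leading underscores) safe in SQL.
        f"`{c}`" if c in keys else f"`{c}` as `{prefix}{c}`"
        for c in columns
    ]


# Hypothetical column list standing in for df.columns / df2.columns.
cols = ["c_0", "c_1", "__created_at__"]
left_exprs = prefixed_exprs(cols, "left_", ["c_0"])
right_exprs = prefixed_exprs(cols, "right_", ["c_0"])
print(left_exprs)
# → ['`c_0`', '`c_1` as `left_c_1`', '`__created_at__` as `left___created_at__`']

# With real dataframes (not run here), the lists would be used as:
#   df3 = df.selectExpr(left_exprs).join(
#       df2.selectExpr(right_exprs), on="c_0", how="outer"
#   )
```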

Answer 1: (score: 0)

I did something similar, but in Scala; you can convert the same idea to PySpark...

  • Rename the columns in each dataframe

    // dataFrame1 and dataFrame2 must be declared with `var` so they can be reassigned
    dataFrame1.columns.foreach(columnName => {
      dataFrame1 = dataFrame1.withColumnRenamed(columnName, s"left_$columnName")
    })

    dataFrame2.columns.foreach(columnName => {
      dataFrame2 = dataFrame2.withColumnRenamed(columnName, s"right_$columnName")
    })

  • Now join, referring to the renamed columns

    val resultDF = dataFrame1.join(dataFrame2, dataFrame1("left_c_0") === dataFrame2("right_c_0"))
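
Since this answer suggests converting the idea to PySpark, here is one possible translation, offered as a sketch rather than the answerer's own code. It relies on `DataFrame.toDF(*names)` to rename all columns in one call instead of looping over `withColumnRenamed`; the function names `add_prefix` and `join_prefixed` are hypothetical.

```python
def add_prefix(df, prefix):
    """Return a copy of `df` with `prefix` prepended to every column name."""
    # DataFrame.toDF(*names) renames all columns positionally in one call.
    return df.toDF(*[prefix + name for name in df.columns])


def join_prefixed(df1, df2, key, how="inner"):
    """Prefix both sides, then join on the (now distinct) prefixed key columns."""
    left = add_prefix(df1, "left_")
    right = add_prefix(df2, "right_")
    return left.join(right, left["left_" + key] == right["right_" + key], how)
```

With the dataframes from the question, this would be called as `join_prefixed(df, df2, "c_0", "outer")`. The result keeps both `left_c_0` and `right_c_0`, so every column name in the output is unique.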
    

Answer 2: (score: 0)

Here is a helper function that joins two dataframes while adding aliases to the right-hand columns:


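def join_with_aliases(left, right, on, how, right_prefix):
    # `on` must be a list of join-key column names.
    # Every non-key column of `right` is renamed to "<col>_<right_prefix>"
    # (the value is appended as a suffix); the join keys keep their names.
    renamed_right = right.selectExpr(
        [
            col + f" as {col}_{right_prefix}"
            for col in right.columns
            if col not in on
        ]
        + on
    )
    return left.join(renamed_right, on=on, how=how)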