Function to remove duplicate columns from a large dataset

Time: 2018-12-19 22:24:20

Tags: pyspark-sql

Trying to remove duplicate column names in a pyspark df after joining hdfs tables?

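Hi, I am trying to join multiple datasets, ending up with 200+ columns. Because of the requirements and the high column count, I cannot select specific columns while joining. Is there a way to drop the duplicate columns after the join? I know this can be done through the .join method of a spark df, but the base tables I am joining are not spark dfs, and I am trying to avoid converting them into spark dfs before the join.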

Original pyspark join query to create the Spark DF

cust_base=sqlc.sql('''
Select distinct *
FROM db.tbl1 as t1
LEFT JOIN db.tbl2 as t2 ON (t1.acct_id=t2.acct_id) 
LEFT JOIN db.tbl3 as t3 ON (t1.cust_id=t3.cust_id)
WHERE t1.acct_subfam_mn IN ('PIA','PIM','IAA')
AND t1.active_acct_ct <> 0
AND t1.efectv_dt = '2018-10-31'
AND (t2.last_change_dt<='2018-10-31' AND (t2.to_dt is null OR t2.to_dt > 
'2018-10-31'))
AND (t3.last_change_dt<='2018-10-31' AND (t3.to_dt is null OR t3.to_dt > 
'2018-10-31'))
''').registerTempTable("df1")

Error when checking the distinct count of cust_id

 a=sqlc.sql('''
 Select 
 count(distinct a.cust_id) as CT_ID
 From df1
 ''')

AnalysisException: "Reference 'cust_id' is ambiguous, could be: cust_id#7L, 
cust_id#171L.; line 3 pos 15"

The 'cust_id' field is present more than once because of the join.

I want to remove the duplicate columns from the resulting joined df. Thanks in advance.
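
For context, the ambiguity comes from `Select distinct *`, which carries every column of t1, t2 and t3 into df1, so the join keys appear more than once. One possible workaround, sketched here under assumptions rather than taken from the thread, is to generate an explicit select list from each table's column names instead of typing 200+ columns by hand. The helper name `build_select_list`, its `join_keys` parameter, the `<alias>_<column>` aliasing scheme, and the short column lists below are all hypothetical; only the column names themselves come from the query above.

```python
def build_select_list(table_columns, join_keys=()):
    """Build a select list in which every output column name is unique.

    table_columns: list of (alias, [column names]) pairs in join order.
    join_keys: names that are kept only from the first table that has them.
    Other repeated names are kept and aliased as <alias>_<name>.
    """
    seen = set()
    parts = []
    for alias, columns in table_columns:
        for name in columns:
            if name not in seen:
                # First time this name appears: keep it unchanged.
                seen.add(name)
                parts.append("{0}.{1}".format(alias, name))
            elif name not in join_keys:
                # Repeated non-key column: keep it under a prefixed alias.
                parts.append("{0}.{1} AS {0}_{1}".format(alias, name))
            # Repeated join keys are skipped entirely.
    return ", ".join(parts)


# Hypothetical, shortened column lists for the three tables in the query.
example_tables = [
    ("t1", ["acct_id", "cust_id", "efectv_dt"]),
    ("t2", ["acct_id", "last_change_dt", "to_dt"]),
    ("t3", ["cust_id", "last_change_dt", "to_dt"]),
]
select_list = build_select_list(example_tables, join_keys=("acct_id", "cust_id"))
print(select_list)
```

The resulting string could replace the `*` in the query, for example through `'Select distinct {} FROM ...'.format(select_list)`. In practice the column lists might be read with something like `sqlc.table("db.tbl1").columns`, which only looks up the schema. The sketch does not guard against an alias such as `t3_to_dt` colliding with a real column name.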

1 Answer:

Answer 0 (score: 0)

I can help with a function that finds the duplicate columns in a given dataframe.

Say the dataframe below has duplicate columns:

+------+----------------+----------+------+----------------+----------+
|emp_id|emp_joining_date|emp_salary|emp_id|emp_joining_date|emp_salary|
+------+----------------+----------+------+----------------+----------+
|     3|      2018-12-06|     92000|     3|      2018-12-06|     92000|
+------+----------------+----------+------+----------------+----------+

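import collections

def finddups(*args):
    # Each positional argument is a list of column names, e.g. df.columns.
    dupes = []
    for cols in args:
        # Keep every name that occurs more than once in this list.
        dupes.extend(item for item, count in collections.Counter(cols).items() if count > 1)
    # Return only after all lists have been checked.
    return dupes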

>>> duplicatecols = finddups(df.columns)
>>> print(duplicatecols)
['emp_id', 'emp_joining_date', 'emp_salary']
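
The function above only reports which names are duplicated; it does not remove anything. A minimal sketch of one way to go from there to a dataframe with unique columns is to rename the columns by position so that every name is unique, then keep only the first copy of each. The helpers `unique_names` and `first_occurrences` and the `_dup<n>` suffix are hypothetical names chosen for illustration, not part of PySpark.

```python
import collections


def unique_names(columns):
    """Return the column names with repeats renamed to <name>_dup<n>."""
    seen = collections.Counter()
    renamed = []
    for name in columns:
        seen[name] += 1
        if seen[name] == 1:
            # First occurrence keeps its original name.
            renamed.append(name)
        else:
            # Later occurrences get a numbered suffix.
            renamed.append("{}_dup{}".format(name, seen[name] - 1))
    return renamed


def first_occurrences(columns):
    """Return each distinct column name once, in order of first appearance."""
    kept = []
    for name in columns:
        if name not in kept:
            kept.append(name)
    return kept


# Column names from the example dataframe above.
cols = ["emp_id", "emp_joining_date", "emp_salary",
        "emp_id", "emp_joining_date", "emp_salary"]
print(unique_names(cols))
print(first_occurrences(cols))
```

With a Spark DataFrame `df`, this might be applied as `df.toDF(*unique_names(df.columns)).select(*first_occurrences(df.columns))`, since `toDF` assigns new names by position. That keeps the first copy of each column and assumes the later copies hold the same data, as in the example table; it also assumes no existing column is already named like `emp_id_dup1`.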