I want to compare two PySpark dataframes.
I have Df1 with several hundred columns (Col1, Col2, ..., Col800) and Df2 with several hundred corresponding rows.
Df2 describes the limit values for each of the 800 columns in Df1. If a value is too low or too high, I want the result in Final_Df, where I create a Problem column that checks whether any column is out of its limits.
I thought about transposing Df2 with pivot, but it requires an aggregation function, so I'm not sure it is a relevant solution (see the sketch after the tables below).
I also don't see how to join the two Dfs for the comparison, since they don't share any common columns.
Df1:
+-----------+-----------+------+------+------+
| X         | Y         | Col1 | Col2 | Col3 |
+-----------+-----------+------+------+------+
| Value_X_1 | Value_Y_1 | 5000 | 250 | 500 |
+-----------+-----------+------+------+------+
| Value_X_2 | Value_Y_2 | 1000 | 30 | 300 |
+-----------+-----------+------+------+------+
| Value_X_3 | Value_Y_3 | 0 | 100 | 100 |
+-----------+-----------+------+------+------+
Df2:
+------+------+-----+
| name | max | min |
+------+------+-----+
| Col1 | 2500 | 0 |
+------+------+-----+
| Col2 | 120 | 0 |
+------+------+-----+
| Col3 | 400 | 0 |
+------+------+-----+
Final_Df (after the comparison):
+-----------+-----------+------+------+------+---------+
| X | Y | Col1 | Col2 | Col3 | Problem |
+-----------+-----------+------+------+------+---------+
| Value_X_1 | Value_Y_1 | 5000 | 250 | 500 | Yes |
+-----------+-----------+------+------+------+---------+
| Value_X_2 | Value_Y_2 | 1000 | 30 | 300 | No |
+-----------+-----------+------+------+------+---------+
| Value_X_3 | Value_Y_3 | 0 | 100 | 100 | No |
+-----------+-----------+------+------+------+---------+
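For what it's worth, the pivot idea mentioned above can work despite the aggregation requirement: since each name appears exactly once in Df2, first() acts as a no-op aggregate. A minimal sketch, assuming the dataframes shown above (limits and joined are illustrative names, not part of the original question):
from pyspark.sql import functions as F

# Transpose Df2 into a single row of limits; first() is a no-op
# aggregate here because each name occurs exactly once in Df2.
limits = df2.groupBy().pivot('name').agg(
    F.first('min').alias('min'),
    F.first('max').alias('max'))
# limits now has columns Col1_min, Col1_max, Col2_min, ... and can be
# attached to every row of Df1 with a cross join for the comparison:
joined = df1.crossJoin(limits)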
Answer 0 (score: 2):
If df2 is not a large dataframe, you can convert it into a dictionary and then check the status with a list comprehension and the when function, for example:
from pyspark.sql import functions as F
>>> df1.show()
+---------+---------+----+----+----+
| X| Y|Col1|Col2|Col3|
+---------+---------+----+----+----+
|Value_X_1|Value_Y_1|5000| 250| 500|
|Value_X_2|Value_Y_2|1000| 30| 300|
|Value_X_3|Value_Y_3| 0| 100| 100|
+---------+---------+----+----+----+
>>> df2.show()
+----+----+---+
|name| max|min|
+----+----+---+
|Col1|2500| 0|
|Col2| 120| 0|
|Col3| 400| 0|
+----+----+---+
# concerned columns
cols = df1.columns[2:]
>>> cols
['Col1', 'Col2', 'Col3']
Note: I assume that the data types of the above cols in df1, as well as df2.min and df2.max, are already set to integer.
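If they are not, they could be cast first; a minimal sketch, assuming all concerned values are parseable as integers:
# Cast the concerned columns of df1 and the limit columns of df2 to int
df1 = df1.select(df1.columns[:2] + [F.col(c).cast('int').alias(c) for c in cols])
df2 = df2.withColumn('min', F.col('min').cast('int')) \
         .withColumn('max', F.col('max').cast('int'))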
Create a map from df2:
map1 = { r.name:[r.min, r.max] for r in df2.collect() }
>>> map1
{u'Col1': [0, 2500], u'Col2': [0, 120], u'Col3': [0, 400]}
Add the new field 'Problem' based on two when() functions, using a list comprehension to iterate over all concerned columns:
- F.when(df1[c].between(min, max), 0).otherwise(1)
- F.when(sum(...) > 0, 'Yes').otherwise('No')
We use the first when() function to set a flag (0 or 1) for each concerned column, then sum the flags. If the sum is greater than 0, then Problem = 'Yes', otherwise 'No':
df_new = df1.withColumn(
    'Problem',
    F.when(sum([F.when(df1[c].between(map1[c][0], map1[c][1]), 0).otherwise(1)
                for c in cols]) > 0, 'Yes').otherwise('No')
)
>>> df_new.show()
+---------+---------+----+----+----+-------+
| X| Y|Col1|Col2|Col3|Problem|
+---------+---------+----+----+----+-------+
|Value_X_1|Value_Y_1|5000| 250| 500| Yes|
|Value_X_2|Value_Y_2|1000| 30| 300| No|
|Value_X_3|Value_Y_3| 0| 100| 100| No|
+---------+---------+----+----+----+-------+
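If you also want to know how many limits each row violates, the same flag sum can be kept as a count instead of being collapsed to Yes/No (a hypothetical variant; n_violations is an illustrative name):
# Count how many concerned columns fall outside their [min, max] range
df_cnt = df1.withColumn('n_violations', sum(
    [F.when(df1[c].between(map1[c][0], map1[c][1]), 0).otherwise(1)
     for c in cols]))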
Answer 1 (score: 1):
Using a UDF and a dictionary, I was able to solve it. Let me know if it helps.
from pyspark.sql import functions as F
from pyspark.sql.functions import col, lit
from pyspark.sql.types import StringType

# Create a map like: name -> max#min
df = df.withColumn('name_max_min', F.create_map('name', F.concat(col('max'), lit("#"), col('min'))))

# Handle nulls in the limits. Python has float("inf"), which supplies both
# infinities, but large sentinel values are used with fillna here instead.
positiveInf = float("inf")
negativeInf = float("-inf")
df = df.fillna({'max': 999999999, 'min': -999999999})
### df is:
+----+----+---+-------------------+
|name| max|min| name_max_min|
+----+----+---+-------------------+
|Col1|2500| 0|Map(Col1 -> 2500#0)|
|Col2| 120| 0| Map(Col2 -> 120#0)|
|Col3| 400| 0| Map(Col3 -> 400#0)|
+----+----+---+-------------------+
# Create a dictionary out of it
v = df.select('name_max_min').rdd.flatMap(lambda x: x).collect()
keys = []
values = []
for p in v:
    for r, s in p.items():
        keys.append(str(r).strip())
        values.append(str(s).strip().split('#'))
max_dict = dict(zip(keys,values))
# max_dict = {'Col1': ['2500', '0'], 'Col2': ['120', '0'], 'Col3': ['400', '0']}
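As an aside, the intermediate name_max_min column is not strictly required; the same dictionary could be built by collecting the limits dataframe directly (a sketch preserving the [max, min] ordering above):
# Equivalent shortcut: build max_dict straight from the rows of df
max_dict = {str(r['name']): [str(r['max']), str(r['min'])] for r in df.collect()}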
# Create a UDF which can help you to assess the conditions.
def problem_udf(c1):
    # GENERAL WAY (if the column names differ):
    # p = all([int(max_dict.get(r)[1]) <= int(c1[r]) <= int(max_dict.get(r)[0]) for r in c1.__fields__])
    # Check that every ColN value lies within its [min, max] limits
    p = all([int(max_dict.get("Col" + str(r))[1]) <= int(c1["Col" + str(r)]) <= int(max_dict.get("Col" + str(r))[0]) for r in range(1, len(c1) + 1)])
    if p:
        return "No"
    else:
        return "Yes"

callnewColsUdf = F.udf(problem_udf, StringType())
col_names = ['Col'+str(i) for i in range(1,4)]
# GENERAL WAY
# col_names = df1.schema.names
df1 = df1.withColumn('Problem', callnewColsUdf(F.struct(col_names)))
## Results in:
+---------+---------+----+----+----+-------+
| X| Y|Col1|Col2|Col3|Problem|
+---------+---------+----+----+----+-------+
|Value_X_1|Value_Y_1|5000| 250| 500| Yes|
|Value_X_2|Value_Y_2|1000| 30| 300| No|
|Value_X_3|Value_Y_3|   0| 100| 100|     No|
+---------+---------+----+----+----+-------+
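Note that a Python UDF ships every row through the Python interpreter, so with hundreds of columns it will generally be slower than the native when/between expression in answer 0; the UDF route is mainly worthwhile when the per-row logic is too awkward to express with built-in Column functions.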