Below is a sample of my input data. Starting from column C there can be multiple columns, each with different scores.
The output should follow this logic: for a given value of A, columns A, B and E are fixed for each row, and each input column (C, E ... X) corresponds to one output row. If a null is encountered it must be discarded and the search moves to the next row. Once C or D has been found for a particular value of A, I move on to the next value of A. In short, for each value of A we need the minimum of C and D (see the sketch below for the essence of this reduction).
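To make the intent concrete: this is essentially a per-A minimum after ignoring nulls. A minimal sketch of just that reduction (Spark's min() already skips nulls); it does not carry the E column along, which the answer below handles:

from pyspark.sql import functions as F

# Per-A minima of C and D; min() ignores nulls, so no explicit filter is needed
df.groupBy('A').agg(F.min('C').alias('min_C'), F.min('D').alias('min_D')).show()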
Answer:
You said 'In short, for each value of A we need the minimum of C and D.' Following that logic, I computed the minimum of C and D for each particular A. The third row of your expected output does not match mine, because the minimum of D for A=130 is 100.09. If the logic differs somewhat, the code can be adapted accordingly.
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, StringType, FloatType
from pyspark.sql.functions import array, col, explode, struct, lit

schema = StructType([StructField('A', StringType()), StructField('B', FloatType()),
                     StructField('C', FloatType()), StructField('D', FloatType()),
                     StructField('E', FloatType())])
rows = [Row(A='123', B=None, C=100.22, D=None, E=3501.88),
        Row(A='123', B=None, C=102.212, D=101.2187, E=3502.88),
        Row(A='123', B=None, C=103.22, D=103.22, E=3503.22),
        Row(A='130', B=None, C=None, D=101.22, E=355.0),
        Row(A='130', B=None, C=None, D=102.28, E=356.8),
        Row(A='130', B=None, C=100.09, D=100.09, E=357.8)]
df = spark.createDataFrame(rows, schema)
df.show()
+---+----+-------+--------+-------+
| A| B| C| D| E|
+---+----+-------+--------+-------+
|123|null| 100.22| null|3501.88|
|123|null|102.212|101.2187|3502.88|
|123|null| 103.22| 103.22|3503.22|
|130|null| null| 101.22| 355.0|
|130|null| null| 102.28| 356.8|
|130|null| 100.09| 100.09| 357.8|
+---+----+-------+--------+-------+
# This function unpivots the DataFrame from wide to long format
def to_long(df, by):
    # Filter dtypes and split into column names and type description
    cols, dtypes = zip(*((c, t) for (c, t) in df.dtypes if c not in by))
    # Spark SQL supports only homogeneous columns inside an array
    assert len(set(dtypes)) == 1, "All columns have to be of the same type"
    # Create and explode an array of (column_name, column_value) structs
    kvs = explode(array([
        struct(lit(c).alias("key"), col(c).alias("val")) for c in cols
    ])).alias("kvs")
    return df.select(by + [kvs]).select(by + ["kvs.key", "kvs.val"])
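As an aside, the same wide-to-long unpivot can be written with Spark's built-in stack() SQL function instead of the explode/struct helper. A minimal sketch for the two score columns in this example:

# Equivalent unpivot: stack(2, ...) emits one ('C', C) and one ('D', D) row per input row
long_df = df.selectExpr('A', 'E', "stack(2, 'C', C, 'D', D) as (key, val)")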
df = to_long(df[['A', 'C', 'D', 'E']], ['A', 'E'])
df = df.select(col('A'), col('key').alias('XX'), col('val').alias('Score'), col('E').alias('ZZ'))
# Drop null scores so they never compete in the minimum
df = df.where(col('Score').isNotNull())
df.registerTempTable('table_view')
# Keep only the rows whose Score equals the per-(A, XX) minimum
df1 = sqlContext.sql(
    'SELECT A, XX, Score, ZZ FROM '
    '(SELECT *, min(Score) OVER (PARTITION BY A, XX) AS minScore FROM table_view) M '
    'WHERE Score = minScore'
)
df1.show()
+---+---+--------+-------+
| A| XX| Score| ZZ|
+---+---+--------+-------+
|123| C| 100.22|3501.88|
|123| D|101.2187|3502.88|
|130| C| 100.09| 357.8|
|130| D| 100.09| 357.8|
+---+---+--------+-------+
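For reference, the same window-minimum filter can also be expressed with the DataFrame API instead of a temp table and SQL; a minimal sketch assuming the long-format df built above:

from pyspark.sql import Window
from pyspark.sql import functions as F

# Minimum Score per (A, XX) window, then keep only the matching rows
w = Window.partitionBy('A', 'XX')
df1 = (df.withColumn('minScore', F.min('Score').over(w))
         .where(F.col('Score') == F.col('minScore'))
         .select('A', 'XX', 'Score', 'ZZ'))
df1.show()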