I don't understand the behavior of this simple PySpark code snippet:
# Create simple test dataframe
l = [('Alice', 1),('Pierre', 3),('Jack', 5), ('Paul', 2)]
df_test = sqlcontext.createDataFrame(l, ['name', 'age'])
# Perform filter then Take 2 oldest
df_test = df_test.sort('age', ascending=False) \
    .filter('age < 4') \
    .limit(2)
df_test.show(2)
# This outputs as expected :
# +------+---+
# |  name|age|
# +------+---+
# |Pierre|  3|
# |  Paul|  2|
# +------+---+
df_test.collect()
# This outputs unexpectedly :
# [Row(name=u'Pierre', age=3), Row(name=u'Alice', age=1)]
Is this the expected behavior of the collect() function? How can I retrieve a column as a list that keeps the correct order?
Thanks
Answer 0: (score: 0)
I had to use a sorter UDF to work around this problem:
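import operator
from pyspark.sql.functions import udf

def sorter(l):
    # l is a list of (sort_key, value) pairs: order by the key, then keep the values
    res = sorted(l, key=operator.itemgetter(0))
    values = [item[1] for item in res]
    # return " ".join(str(x) for x in values)
    return "".join(values)

sort_udf = udf(sorter)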
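
The answer does not show what sorter receives, so here is a minimal plain-Python sketch of its logic, assuming the input is a list of (sort_key, value) pairs whose values are strings. The sample pairs are hypothetical and only stand in for rows that arrive in an arbitrary order.

```python
import operator


def sorter(pairs):
    # Order the (sort_key, value) pairs by their key, then concatenate the values.
    ordered = sorted(pairs, key=operator.itemgetter(0))
    return "".join(item[1] for item in ordered)


# Hypothetical input: pairs in arbitrary order, as a collected list might be.
pairs = [(3, "c"), (1, "a"), (2, "b")]
result = sorter(pairs)
print(result)  # abc
```

In Spark, such pairs could plausibly come from collect_list over a struct of the sort key and the value, with sort_udf applied to the resulting array column; that wiring is an assumption, since the answer only defines the UDF.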