I am trying to print threshold values for dataframe rows using PySpark. Below is the R code I wrote, but I want to do the same thing in PySpark and I do not know how. Any help would be appreciated!
The values dataframe looks like this:
vote
0.3
0.1
0.23
0.45
0.9
0.80
0.36
# loop through all link weight values, from the lowest to the highest
for (i in 1:nrow(values)){
# print status
print(paste0("Iterations left: ", nrow(values) - i, " Threshold: ", values[i, w_vote]))
}
What I have tried in PySpark is below, but this is where I am stuck:
for row in values.collect():
    print('iterations left:', row - i, 'Threshold:', ...)
Answer 0 (score: 1)
Every language or tool handles this differently. Below is an answer along the lines of what you attempted:
df = sqlContext.createDataFrame([
[0.3],
[0.1],
[0.23],
[0.45],
[0.9],
[0.80],
[0.36]
], ["vote"])
values = df.collect()
total_values = len(values)

# Rows returned by collect() are not sorted by default; sorted() is used below
# to order them by the vote column in ascending order.
# If you would rather not sort at the Python level, sort at the Spark level instead:
# values = df.sort("vote", ascending=True).collect()
# enumerate() gives the index of each row as we iterate.
for index, row in enumerate(sorted(values, key=lambda x: x.vote, reverse=False)):
    print('iterations left:', total_values - (index + 1), 'Threshold:', row.vote)
iterations left: 6 Threshold: 0.1
iterations left: 5 Threshold: 0.23
iterations left: 4 Threshold: 0.3
iterations left: 3 Threshold: 0.36
iterations left: 2 Threshold: 0.45
iterations left: 1 Threshold: 0.8
iterations left: 0 Threshold: 0.9
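As the comment above suggests, the sorting can also be pushed down to Spark before collecting. A minimal sketch of that variant, assuming the same df and vote column:

# Let Spark sort ascending by vote, so Python does not need to re-sort the collected rows
sorted_values = df.sort("vote", ascending=True).collect()
total_values = len(sorted_values)

for index, row in enumerate(sorted_values):
    print('iterations left:', total_values - (index + 1), 'Threshold:', row.vote)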
Using collect() is not recommended: if you are processing big data, pulling every row back to the driver can crash the program.
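If the table is large, one way to avoid collecting everything at once is to compute the counter with a window function and stream rows with toLocalIterator(). A rough sketch, assuming the same df with a vote column:

from pyspark.sql import Window, functions as F

total = df.count()
w = Window.orderBy("vote")

# "iterations left" computed inside Spark: total rows minus the ascending rank of each vote
ranked = df.withColumn("iterations_left", F.lit(total) - F.row_number().over(w))

# toLocalIterator() streams rows to the driver one partition at a time instead of all at once
for row in ranked.orderBy("vote").toLocalIterator():
    print('iterations left:', row.iterations_left, 'Threshold:', row.vote)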