I am trying to print threshold values for dataframe rows using PySpark. Below is the R code I wrote, but I want to do the same thing in PySpark and I do not know how. Any help would be appreciated!
The values dataframe looks like this:
vote
0.3
0.1
0.23
0.45
0.9
0.80
0.36
# loop through all link weight values, from the lowest to the highest
for (i in 1:nrow(values)){
# print status
print(paste0("Iterations left: ", nrow(values) - i, " Threshold: ", values[i, w_vote]))
}
What I have tried in PySpark is below, but this is where I am stuck:
for row in values.collect():
    print('iterations left:', row - i, 'Threshold:', ...)
Answer 0 (score: 1)
Every language or tool handles this differently. Below is an answer along the lines of what you attempted:
df = sqlContext.createDataFrame([
[0.3],
[0.1],
[0.23],
[0.45],
[0.9],
[0.80],
[0.36]
], ["vote"])
values = df.collect()
total_values = len(values)

# Rows returned by collect() are not sorted by default; sorted() is used below
# to order them by the vote column in ascending order.
# If you would rather not sort at the Python level, sort at the Spark level instead:
# values = df.sort("vote", ascending=True).collect()
# enumerate() gives the index of each row as we iterate.
for index, row in enumerate(sorted(values, key=lambda x: x.vote, reverse=False)):
    print('iterations left:', total_values - (index + 1), 'Threshold:', row.vote)
iterations left: 6 Threshold: 0.1
iterations left: 5 Threshold: 0.23
iterations left: 4 Threshold: 0.3
iterations left: 3 Threshold: 0.36
iterations left: 2 Threshold: 0.45
iterations left: 1 Threshold: 0.8
iterations left: 0 Threshold: 0.9
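As the comment above suggests, the sorting can also be pushed down to Spark before collecting. A minimal sketch of that variant, assuming the same df and vote column:

# Let Spark sort ascending by vote, so Python does not need to re-sort the collected rows
sorted_values = df.sort("vote", ascending=True).collect()
total_values = len(sorted_values)

for index, row in enumerate(sorted_values):
    print('iterations left:', total_values - (index + 1), 'Threshold:', row.vote)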
Using collect() is not recommended: if you are processing big data, pulling every row back to the driver can crash the program.
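If the table is large, one way to avoid collecting everything at once is to compute the counter with a window function and stream rows with toLocalIterator(). A rough sketch, assuming the same df with a vote column:

from pyspark.sql import Window, functions as F

total = df.count()
w = Window.orderBy("vote")

# "iterations left" computed inside Spark: total rows minus the ascending rank of each vote
ranked = df.withColumn("iterations_left", F.lit(total) - F.row_number().over(w))

# toLocalIterator() streams rows to the driver one partition at a time instead of all at once
for row in ranked.orderBy("vote").toLocalIterator():
    print('iterations left:', row.iterations_left, 'Threshold:', row.vote)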