from pyspark import SparkConf, SparkContext
conf = SparkConf().setAppName("Bingo").setMaster("local[*]")
sc = SparkContext(conf=conf)
def count_iterable(i):
    # Count elements of any iterable without materialising a list.
    return sum(1 for e in i)
input_data = sc.textFile("path to my csv file")  # placeholder path
input_filtered = input_data.map(lambda row: row.split(",")) \
.filter(lambda row: row[14] == "2") \
.groupBy(lambda row: row[19] and row[20] and row[21]) \
.sortBy(lambda x: count_iterable(x[1]))
I want the RDD sorted by the number of elements in each Iterable, but I get this error: `PermissionError: [WinError 32] The process cannot access the file because it is being used by another process`
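Independent of the Windows file-lock error, the grouping-and-sorting logic the pipeline is meant to compute can be sanity-checked in plain Python. This is a minimal sketch with hypothetical sample triples standing in for `(row[19], row[20], row[21])` after the filter step; it mimics `groupBy` plus `sortBy(count_iterable)` locally, it is not the Spark API itself:

```python
from collections import defaultdict

def count_iterable(i):
    # Count elements of any iterable without materialising a list.
    return sum(1 for e in i)

# Hypothetical parsed rows: composite-key triples standing in for
# (row[19], row[20], row[21]) from the CSV.
rows = [
    ("US", "CA", "SF"),
    ("US", "CA", "SF"),
    ("US", "NY", "NYC"),
    ("DE", "BE", "Berlin"),
    ("US", "NY", "NYC"),
    ("US", "CA", "SF"),
]

# Mimic .groupBy(lambda row: (row[19], row[20], row[21])):
# each key maps to the list of rows sharing that key.
groups = defaultdict(list)
for r in rows:
    groups[r].append(r)

# Mimic .sortBy(lambda x: count_iterable(x[1])): order groups
# by how many elements each group's iterable holds (ascending).
ordered = sorted(groups.items(), key=lambda kv: count_iterable(kv[1]))

sizes = [(key, len(members)) for key, members in ordered]
# smallest group first, largest group last
```

If this local version orders the groups as expected, the sort logic is fine and the `WinError 32` points at the Spark-on-Windows environment (a file held open under the temp/work directory) rather than at the transformation chain.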