I'm running into a problem when converting a DataFrame to an RDD. The DataFrame was originally built from CSV files:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, BooleanType
columns = 'trip_date;trip_id;op_id;op_abk;op_name;transport_type;train_id;train_service;none;train_service_full;'\
'additional_trip;failed;stop_id;stop_name;sched_arrival;actual_arrival;arrival_status;sched_departure;' \
'actual_departure;departure_status;no_stop'.split(';')
types = [StringType()] * 6 + [IntegerType()] + [StringType()] * 3 + [BooleanType()] * 2 + [StringType()] * 8 + [BooleanType()]
schema = StructType([StructField(name, t, False) for (name, t) in zip(columns, types)])
Then I read the CSV files and filter them:
df = spark.read.csv('/datasets/project/istdaten/*/*', schema=schema, sep=';', header=True)
filtered = df.filter(df.stop_id.isin(selected_stops))
selected_stops is a list of strings. Then I count the distinct stops:
filtered.select(filtered.stop_id).distinct().count()      # 1004 distinct stops
filtered.rdd.map(lambda r: r.stop_id).distinct().count()  # 69 distinct stops
I have no idea what's going on here. Could it be a repartitioning issue?