Querying with dataframes - finding the 10 most frequent routes

Time: 2018-12-28 02:19:31

Tags: python dataframe pyspark bigdata

I have a school assignment based on this question website - we use the dataset provided there. I am trying to do the following: for every hour of every day of the week, find the 10 most frequent routes. The output should be: weekday, hour, [route 1, ..., route 10]. My code is:

from pyspark.sql import *
from pyspark.sql.types import * 
from pyspark.sql.functions import * 

import datetime  
import time 

start_time = time.time() 
spark = SparkSession.builder.master('local[*]').appName('taxis').getOrCreate()
sc = spark.sparkContext
sc.setLogLevel("ERROR")

timeformat = "yyyy-MM-dd HH:mm:ss" 
dateformat = "EEEE"

lat = 41.474937   #first cell is (1,1) 
long = -74.913585 
south = 0.004491556 
east = 0.005986

try :
    lines = sc.textFile('sorted_data.csv')
    taxisRows = lines.filter( lambda line : len(line) > 0 )   \
                        .map( lambda line : line.split(',') ) \
                        .filter( lambda split_line : (float(split_line[6]) != 0) \
                                                    and (float(split_line[7]) != 0) \
                                                    and (float(split_line[8]) != 0) \
                                                    and (float(split_line[9]) != 0)) \
                        .map( lambda arr : Row(pickup_datetime = arr[2], dropoff_datetime = arr[3], \
                                                pickup_longitude = (float(arr[6]) - long), \
                                                pickup_latitude = (float(arr[7]) - lat), \
                                                dropoff_longitude = (float(arr[8]) - long), \
                                                dropoff_latitude = (float(arr[9]) - lat), \
                                                )) 

    taxisRowsDF = spark.createDataFrame( taxisRows )

    taxisRowsDF = taxisRowsDF.withColumn('route', struct( struct((round((abs(taxisRowsDF.pickup_latitude)/south)+1)), (round((abs(taxisRowsDF.pickup_longitude)/east)+1))) , \
                                                        struct((round((abs(taxisRowsDF.dropoff_latitude)/south)+1)), (round((abs(taxisRowsDF.dropoff_longitude)/east)+1))) ) )


    taxisRowsDF = taxisRowsDF.withColumn("weekday",date_format('pickup_datetime', format= 'E'))
    taxisRowsDF = taxisRowsDF.withColumn("hour", date_format("pickup_datetime", format = 'H'))

    routesFrequencyDF = taxisRowsDF.groupBy('weekday', 'hour', 'route').count().orderBy('count',ascending = False)
    tenMostFrequent = routesFrequencyDF.groupBy('weekday', 'hour').agg(collect_set('route').alias('List of Routes'))
    tenMostFrequent.show()

    #tenMostFrequent1 = tenMostFrequent.select('List of Routes', size('List of Routes').alias('Number of Routes'))

    #tenMostFrequent.show(tenMostFrequent.count, False)
    #tenMostFrequent.show(10)
#     taxisRowsDF.show(10)                
#     routesFrequencyDF.show(10)
    print("---%s seconds---"% (time.time()-start_time))
    sc.stop()
except Exception as e:
    print(e)
    sc.stop()

With routesFrequencyDF I get the frequency of each route, and with agg(collect_set()) I can create a set of values (so the output would be a list of the 10 most frequent routes per weekday and hour), but I cannot combine these two pieces of information. Does anyone have any suggestions?

0 Answers:

No answers yet