PySpark: How to convert an array of strings in a DataFrame to an array of timestamps

Time: 2020-01-30 00:16:53

Tags: pyspark pyspark-sql pyspark-dataframes

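I run a simple query with PySpark SQL that returns each cookie as a string and its timestamps as an array. I want to pass both to my user-defined function, but the timestamp array arrives as an array of unicode strings instead of timestamps. Can someone help me solve this? Thanks.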

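import os
import sys
from datetime import datetime

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, udf
from pyspark.sql.types import StringType


@udf(returnType=StringType())
def PrintDetails(cookie, timestamps, current_day, current_hourly_threshold, current_daily_threshold):
    # The UDF runs on the executors, so this prints to the worker's stdout.
    print(type(timestamps[0]))
    # Also return the element type's name so it is visible in the result column.
    return type(timestamps[0]).__name__


def main(argv):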
    spark = SparkSession \
        .builder \
        .appName("parquet_test") \
        .config("spark.debug.maxToStringFields", "100") \
        .getOrCreate()

    inputPath = r'D:\Hadoop\Spark\parquet_input_files'
    inputFiles = os.path.join(inputPath, '*.parquet')

    impressionDate = datetime.strptime("2019_12_31", '%Y_%m_%d')
    current_hourly_threshold = 40
    current_daily_threshold = 200

    parquetFile = spark.read.parquet(inputFiles)
    parquetFile.createOrReplaceTempView("parquetFile")
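    # date_format() returns strings, so imp_times is an array of strings.
    # Pattern uses 'yyyy' (calendar year) and 'HH' (24-hour clock); 'YYYY' is the week-based year and 'hh' the 12-hour clock.
    cookie_and_time = spark.sql("SELECT cookie, collect_list(date_format(from_unixtime(ts), 'yyyy-MM-dd-HH:mm:ss')) AS imp_times FROM parquetFile GROUP BY 1")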

    cookie_df = cookie_and_time.withColumn("cookies", PrintDetails(cookie_and_time['cookie'], cookie_and_time['imp_times'], lit(impressionDate), lit(current_hourly_threshold), lit(current_daily_threshold)))
    cookie_df.show()

if __name__ == "__main__":
    main(sys.argv)
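
One possible way to get real timestamps inside the UDF is to parse the strings there. The sketch below is only an illustration under assumptions: `parse_timestamps` and `TS_FORMAT` are hypothetical names that do not appear in the code above, and the format assumes strings shaped like `2019-12-31-13:45:00`.

```python
from datetime import datetime

# Assumed layout of the strings produced by date_format() in the query.
TS_FORMAT = "%Y-%m-%d-%H:%M:%S"


def parse_timestamps(values, fmt=TS_FORMAT):
    """Convert a list of timestamp strings into datetime objects.

    Hypothetical helper: call it at the top of the UDF body, e.g.
    timestamps = parse_timestamps(timestamps).
    """
    # A null array column reaches the UDF as None, so treat it as empty.
    return [datetime.strptime(value, fmt) for value in (values or [])]
```

Alternatively, the conversion to strings could probably be avoided altogether: dropping `date_format` and collecting something like `CAST(from_unixtime(ts) AS timestamp)` should make `imp_times` an array of timestamps, which a Python UDF receives as a list of `datetime.datetime` objects. This assumes `ts` holds Unix epoch seconds.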

0 answers:

No answers yet