I have a bunch of raw data spread across multiple JSON files on Azure Blob Storage. I tried spark.read.json(),
which seems to work but is very slow... I'm wondering whether there's a way to read and process the files one at a time instead?
My processing is just filtering: I want to drop certain attributes and transform the data, e.g. split strings and cast them to numbers.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, split, lower

spark = SparkSession.builder.getOrCreate()

# Read every JSON file under the container path at once.
rawQuestionsDf = spark.read.json("wasb://....blob.core.windows.net/pquestionsjson/datajson%2F*.json")

# Keep only the attributes I need, then transform them:
# split the pipe-delimited tags into an array, lowercase the title,
# and cast the count columns to integers.
questionsDf = rawQuestionsDf.select('id', 'title', 'tags', 'owner_user_id', 'accepted_answer_id', 'view_count',
                                    'answer_count', 'comment_count', 'creation_date', 'favorite_count') \
    .withColumn('tags', split(col('tags'), r'\|')) \
    .withColumn('title', lower(col('title'))) \
    .withColumn('view_count', col('view_count').cast('integer')) \
    .withColumn('answer_count', col('answer_count').cast('integer')) \
    .withColumn('comment_count', col('comment_count').cast('integer')) \
    .withColumn('favorite_count', col('favorite_count').cast('integer'))

# Write the filtered result out as Parquet.
questionsDf.write.parquet("wasb://data@cs4225.blob.core.windows.net/filtered/questions.parquet")
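
One thing I suspect is the schema inference pass, since spark.read.json scans the data to infer types before reading it. Would supplying an explicit schema help? A sketch of what I mean (the field names match my select() above; the types are my assumption, since everything seems to arrive as strings in the raw JSON):

from pyspark.sql.types import StructType, StructField, StringType

# Hypothetical explicit schema -- my guess at the field types.
questionsSchema = StructType([
    StructField('id', StringType()),
    StructField('title', StringType()),
    StructField('tags', StringType()),
    StructField('owner_user_id', StringType()),
    StructField('accepted_answer_id', StringType()),
    StructField('view_count', StringType()),
    StructField('answer_count', StringType()),
    StructField('comment_count', StringType()),
    StructField('creation_date', StringType()),
    StructField('favorite_count', StringType()),
])

# With a schema supplied up front, Spark can skip the inference scan.
rawQuestionsDf = spark.read.schema(questionsSchema).json(
    "wasb://....blob.core.windows.net/pquestionsjson/datajson%2F*.json")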
More generally, I'm wondering if there's a better way to do this? Would wholeTextFiles
be better?
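
For context, this is roughly what I had in mind with wholeTextFiles. It's just a sketch: it assumes each file holds a single JSON document, and all the filtering and casting above would then have to be redone on RDDs instead of DataFrames:

import json

# sc.wholeTextFiles returns an RDD of (path, file_content) pairs,
# one element per file, so each file is read and handled individually.
rawRdd = spark.sparkContext.wholeTextFiles(
    "wasb://....blob.core.windows.net/pquestionsjson/datajson%2F*.json")

# Hypothetical per-file parsing -- assumes one JSON document per file.
parsedRdd = rawRdd.map(lambda pair: json.loads(pair[1]))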