PySpark: reading in multiple files for processing with limited RAM

Date: 2018-04-07 04:22:17

Tags: python apache-spark pyspark

I have a bunch of raw data spread across multiple JSON files (on Azure Blob Storage). I've tried spark.read.json(), which seems to work but is slow... I'm wondering whether there's a way to read and process the files one at a time?
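Something like this is what I had in mind: read one file at a time so only a single file's data has to be handled at once. This is just an untested sketch; the data@example account/container names are placeholders, and I'm assuming the blobs can be listed through the Hadoop FileSystem API that Spark exposes via its JVM gateway:

# Untested sketch: list the JSON blobs, then read and process them one by one.
# The account/container names below are placeholders.
hadoop_fs = spark._jvm.org.apache.hadoop.fs
conf = spark._jsc.hadoopConfiguration()
root = hadoop_fs.Path("wasb://data@example.blob.core.windows.net/pquestionsjson/")
fs = root.getFileSystem(conf)

for status in fs.listStatus(root):
    file_path = status.getPath().toString()
    if file_path.endswith(".json"):
        df = spark.read.json(file_path)   # only this one file is read
        # ... filter / transform df here, then append the result ...
        df.write.mode("append").parquet(
            "wasb://data@example.blob.core.windows.net/filtered/questions.parquet")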

My processing is just filtering: I want to drop certain attributes and transform the data, e.g. split strings apart and cast them to numbers.

from pyspark.sql.functions import col, split, lower

rawQuestionsDf = spark.read.json("wasb://....blob.core.windows.net/pquestionsjson/datajson%2F*.json")

# Keep only the columns I need, split the pipe-delimited tags into an array,
# lowercase the title, and cast the count columns to integers.
questionsDf = rawQuestionsDf.select('id', 'title', 'tags', 'owner_user_id', 'accepted_answer_id', 'view_count',
                                    'answer_count', 'comment_count', 'creation_date', 'favorite_count') \
    .withColumn('tags', split(col('tags'), r'\|')) \
    .withColumn('title', lower(col('title'))) \
    .withColumn('view_count', col('view_count').cast('integer')) \
    .withColumn('answer_count', col('answer_count').cast('integer')) \
    .withColumn('comment_count', col('comment_count').cast('integer')) \
    .withColumn('favorite_count', col('favorite_count').cast('integer'))

questionsDf.write.parquet("wasb://data@cs4225.blob.core.windows.net/filtered/questions.parquet")

I'm wondering whether there's a better way to do this? Would wholeTextFiles be a better fit?
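For context on that idea: wholeTextFiles is an RDD method that returns (path, content) pairs, so each file's entire content becomes a single string in one record. A rough sketch of that approach (placeholder path, and assuming each file holds exactly one JSON document):

import json

# Sketch: wholeTextFiles yields (file_path, file_content) pairs.
# Each file is materialized as one string, which might be worse with limited RAM.
pairs = spark.sparkContext.wholeTextFiles(
    "wasb://data@example.blob.core.windows.net/pquestionsjson/")
parsed = pairs.map(lambda kv: json.loads(kv[1]))   # assumes one JSON doc per file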

0 Answers:

No answers yet