I am reading json.gz files as follows:
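import re
from nltk.corpus import stopwords      # assumption: stopwords_list comes from NLTK
from nltk.stem import PorterStemmer    # assumption: ps is an NLTK PorterStemmer
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

# `spark` is assumed to be an existing SparkSession (e.g. from the pyspark shell)
ps = PorterStemmer()
stopwords_list = set(stopwords.words('english'))

qa_df = spark.read.json('qa_Clothing_Shoes_and_Jewelry.json.gz')
re_df = spark.read.json('reviews_Clothing_Shoes_and_Jewelry_5.json.gz')
print('data cleaning started')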

def clean(data):
    # Lowercase, strip punctuation, tokenize, drop stopwords, then stem
    data = str(data)
    data = data.lower()
    data = re.sub("[^a-zA-Z0-9 -]", ' ', data)
    data = re.sub(" +", " ", data)  # collapse runs of spaces
    data = data.split()
    data = [ps.stem(x) for x in data if x not in stopwords_list]
    return data

udf_function = udf(clean, ArrayType(StringType()))
re_df = re_df.withColumn("tokenized_review", udf_function("reviewText"))
qa_df = qa_df.withColumn("tokenized_question", udf_function("question"))
qa_df = qa_df.withColumn("tokenized_answer", udf_function("answer"))
print("Data cleaning complete")
Then I checked the node status with the following command: yarn node -list. The number of running containers was 0 for every node.
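
To rule out the cleaning logic itself, the regex and split steps of clean above can be exercised on a plain string without Spark. This is only a minimal sketch: clean_local, SAMPLE_STOPWORDS, and the sample sentence are hypothetical names and data made up for illustration, and the stemming step is left out so that the check needs neither NLTK nor a cluster.

```python
import re

# Hypothetical stand-in for the real stopword list (assumption for this sketch)
SAMPLE_STOPWORDS = {"the", "a", "is", "and"}

def clean_local(text, stopwords=SAMPLE_STOPWORDS):
    """Mirror clean() minus the stemming step, so it runs without NLTK or Spark."""
    text = str(text).lower()
    # Replace everything except letters, digits, spaces and hyphens with a space
    text = re.sub(r"[^a-z0-9 -]", " ", text)
    # split() with no arguments also collapses repeated whitespace
    return [tok for tok in text.split() if tok not in stopwords]

tokens = clean_local("The fit is GREAT, and the well-made strap lasts!")
print(tokens)  # prints ['fit', 'great', 'well-made', 'strap', 'lasts']
```

One thing this makes visible: because of the str() call, a missing value is turned into the literal token "none" rather than an empty list, so null reviewText, question, or answer fields would each produce one spurious token.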