Spark not using nodes

Time: 2018-11-19 08:34:06

Tags: apache-spark pyspark

I am reading json.gz files as follows:

qa_df = spark.read.json('qa_Clothing_Shoes_and_Jewelry.json.gz')
re_df = spark.read.json('reviews_Clothing_Shoes_and_Jewelry_5.json.gz')
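(For context: spark.read.json expects line-delimited JSON, one record per line, which gzip compresses transparently. A minimal stdlib-only sketch of that file layout, using a hypothetical file name and records not taken from the question:)

```python
import gzip
import json

# Write a tiny line-delimited JSON file (the layout spark.read.json expects),
# then read it back with the stdlib. 'sample.json.gz' and the records are
# made-up examples, not the question's actual data.
records = [{"reviewText": "great shoes"}, {"reviewText": "too small"}]
with gzip.open("sample.json.gz", "wt", encoding="utf-8") as f:
    for r in records:
        f.write(json.dumps(r) + "\n")

# Each line decodes independently back into one record.
with gzip.open("sample.json.gz", "rt", encoding="utf-8") as f:
    rows = [json.loads(line) for line in f]
print(rows[0]["reviewText"])  # great shoes
```

Note that a single .json.gz file is not splittable, so Spark reads each one in a single task regardless of cluster size.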

# Imports the snippet needs; ps and stopwords_list were undefined in the
# original, so NLTK's PorterStemmer and English stopword list are assumed here.
import re
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

ps = PorterStemmer()
stopwords_list = set(stopwords.words('english'))

print('data cleaning started')

def clean(data):
    data = str(data).lower()
    data = re.sub("[^a-zA-Z0-9- ]", " ", data)  # replace punctuation with spaces
    tokens = data.split()                       # split() collapses runs of whitespace
    return [ps.stem(x) for x in tokens if x not in stopwords_list]

udf_function = udf(clean, ArrayType(StringType()))
re_df = re_df.withColumn("tokenized_review", udf_function("reviewText"))
qa_df = qa_df.withColumn("tokenized_question", udf_function("question"))
qa_df = qa_df.withColumn("tokenized_answer", udf_function("answer"))
print("Data cleaning complete")
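(The clean function can be exercised as plain Python before wrapping it in a UDF. A minimal sketch with toy stand-ins for ps and stopwords_list, since those are not defined in the question; the toy suffix-stripping "stemmer" is only an illustration, not PorterStemmer:)

```python
import re

# Hypothetical stand-ins for the question's undefined names.
stopwords_list = {"the", "a", "is", "and"}
stem = lambda w: w[:-1] if w.endswith("s") else w  # toy stemmer, not PorterStemmer

def clean(data):
    data = str(data).lower()
    data = re.sub("[^a-z0-9- ]", " ", data)  # replace punctuation with spaces
    tokens = data.split()                    # split() collapses runs of whitespace
    return [stem(x) for x in tokens if x not in stopwords_list]

print(clean("The shoes are comfortable and stylish!"))
# → ['shoe', 'are', 'comfortable', 'stylish']
```

Checking the function locally like this separates logic bugs from cluster issues: if clean works here but the job still runs on one node, the problem is in how Spark schedules the work, not in the UDF.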

Then I checked the node status with: yarn node -list — every node shows 0 running containers.

0 Answers:

No answers yet.