How do I create a DataFrame from a text file in PySpark?

Date: 2019-07-11 13:56:16

Tags: dataframe text pyspark df

I am new to PySpark, and I want to convert a txt file into a DataFrame in PySpark so I can clean up my data. Any help? Thanks.

I have already tried converting it to an RDD and then to a DataFrame, but that did not work for me, so I decided to convert it from the txt file to a DataFrame directly.

I am trying the following, but so far without success.

    # read the input text file into an RDD
    lines = sc.textFile("/home/h110-3/workspace/spark/weather01.txt")

    # collect the RDD to a list on the driver
    llist = lines.collect()

    # print the list
    for line in llist:
        print(line)

I cannot convert it to a DataFrame. Please help.

1 Answer:

Answer 0 (score: 0)

You can read it via the text reader. Example here:

! cat sample.txt
hello there
loading line by line
via apache spark
text df api
print(spark.version)
df = spark.read.text("sample.txt")
df.printSchema()
df.show()
df.selectExpr("split(value, ' ') as rows").show(3, False)

2.4.3
root
 |-- value: string (nullable = true)

+--------------------+
|               value|
+--------------------+
|         hello there|
|loading line by line|
|    via apache spark|
|         text df api|
+--------------------+
+-------------------------+
|rows                     |
+-------------------------+
|[hello, there]           |
|[loading, line, by, line]|
|[via, apache, spark]     |
+-------------------------+
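Since the question also mentions that the RDD-to-DataFrame route failed: a common cause is calling toDF() on an RDD of bare strings, because each record must first be split into a list or tuple of fields. Below is a minimal sketch of that approach, assuming a whitespace-delimited file; the column names ("col1", "col2", "col3") are hypothetical placeholders, not taken from the question.

```python
def parse_line(line):
    """Split one whitespace-delimited line into a list of fields."""
    return line.split()

try:
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("txt-to-df").getOrCreate()
    lines = spark.sparkContext.textFile(
        "/home/h110-3/workspace/spark/weather01.txt")

    # toDF() expects row-like records (lists/tuples), not bare strings,
    # so split each line into fields before converting.
    # Column names here are hypothetical; adjust to the actual file layout.
    df = lines.map(parse_line).toDF(["col1", "col2", "col3"])
    df.show()
except ImportError:
    # pyspark is not installed; parse_line can still be used locally.
    print(parse_line("hello there"))
```

The number of names passed to toDF() must match the number of fields produced per line, so this sketch assumes every line in the file has the same number of whitespace-delimited tokens.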