I am new to PySpark, and I want to convert a txt file into a DataFrame in PySpark. I am trying to tidy up my data in PySpark. Any help? Thanks.
I have already tried converting it to an RDD and then to a DataFrame, but that did not work for me, so I decided to go straight from the txt file to a DataFrame in one step.
This is what I am trying, but without success so far.
# read input text file to RDD
lines = sc.textFile("/home/h110-3/workspace/spark/weather01.txt")
# collect the RDD to a list
llist = lines.collect()
# print the list
for line in llist:
    print(line)
I cannot convert it into a DataFrame. Please help.
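For reference, the RDD route I was attempting would look roughly like this. This is only a minimal sketch: the column names station/date/temp are placeholders, not the real layout of weather01.txt.
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("txt-to-df").getOrCreate()
sc = spark.sparkContext

# read the text file into an RDD of lines
lines = sc.textFile("/home/h110-3/workspace/spark/weather01.txt")

# split each line on whitespace and wrap the pieces in Rows with named fields
# (station/date/temp are placeholder names -- the real file has its own layout)
rows = lines.map(lambda line: line.split()) \
            .map(lambda parts: Row(station=parts[0], date=parts[1], temp=parts[2]))

df = spark.createDataFrame(rows)
df.printSchema()
df.show(5, truncate=False)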
Answer 0 (score: 0)
You can read it in with the text reader... example here:
! cat sample.txt
hello there
loading line by line
via apache spark
text df api
print(spark.version)
df = spark.read.text("sample.txt")
df.printSchema()
df.show()
df.selectExpr("split(value, ' ') as rows").show(3, False)
2.4.3
root
|-- value: string (nullable = true)
+--------------------+
| value|
+--------------------+
| hello there|
|loading line by line|
| via apache spark|
| text df api|
+--------------------+
+-------------------------+
|rows |
+-------------------------+
|[hello, there] |
|[loading, line, by, line]|
|[via, apache, spark] |
+-------------------------+
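If you need individual columns rather than an array of tokens, you can index into the split result with getItem. A minimal sketch building on the df above; word1/word2 are placeholder names, and for a real weather file you would use its actual field names:
from pyspark.sql.functions import split, col

# split each line on spaces, then pull the tokens out into named columns
# (word1/word2 are placeholders for illustration only)
tokens = df.select(split(col("value"), " ").alias("parts"))
result = tokens.select(
    col("parts").getItem(0).alias("word1"),
    col("parts").getItem(1).alias("word2"),
)
result.show(truncate=False)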