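I created a PySpark RDD (converted from XML to CSV) that has no header row. I need to convert it to a DataFrame with headers so that I can run some SparkSQL queries against it. I can't seem to find a simple way to add the headers. Most examples start from a dataset that already has a header row:
df = spark.read.csv('some.csv', header=True, schema=schema)
However, I need to attach the headers myself.
This seems like a trivial problem, and I'm not sure why I can't find a working solution. Thanks.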
Answer 0 (score: 2)
rdd.toDF(schema=['a', 'b', 'c', 'd'])
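Since the goal in the question is to run SparkSQL queries, the one-liner above can be extended into a small helper that names the columns and registers the result as a temporary view. This is only a sketch: the function name `register_with_header` and the view name are hypothetical, and it assumes an active `SparkSession` and an RDD whose elements are already tuples or lists (not raw text lines) with one value per column.

```python
def register_with_header(rdd, column_names, view_name):
    """Attach column names to an RDD of tuples and expose it to SparkSQL.

    Assumes every element of `rdd` is a tuple/list with exactly
    len(column_names) values; column types are inferred by Spark.
    """
    # toDF accepts a plain list of column names as its schema argument
    df = rdd.toDF(column_names)
    # Register the DataFrame under a name that SQL queries can refer to
    df.createOrReplaceTempView(view_name)
    return df
```

With a real session this might be used as `df = register_with_header(rdd, ['a', 'b', 'c', 'd'], 'records')` followed by `spark.sql("SELECT a, b FROM records").show()`.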
Answer 1 (score: 0)
Something like this... If your CSV does not contain a header row, you need to specify a schema and set .option("header", "false").
spark.version
'2.3.2'
! cat sample.csv
1, 2.0,"hello"
3, 4.0, "there"
5, 6.0, "how are you?"
PATH = "sample.csv"
from pyspark.sql.types import StructType, StructField, IntegerType, FloatType, StringType

# Explicit schema, since the file has no header row to take column names from
schema = StructType([
    StructField("col1", IntegerType(), True),
    StructField("col2", FloatType(), True),
    StructField("col3", StringType(), True),
])

csvFile = (
    spark.read.format("csv")
    .option("header", "false")
    .schema(schema)
    .load(PATH)
)
csvFile.show()
+----+----+---------------+
|col1|col2|           col3|
+----+----+---------------+
|   1| 2.0|          hello|
|   3| 4.0|        "there"|
|   5| 6.0| "how are you?"|
+----+----+---------------+
# if you have rdd and want to convert straight to df
rdd = sc.textFile(PATH)
# just showing rows
for i in rdd.collect(): print(i)
1, 2.0,"hello"
3, 4.0, "there"
5, 6.0, "how are you?"
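# use Row to give each field a name and a type; toDF() infers the schema from it
from pyspark.sql import Row

def parse_line(line):
    # split once, then convert each field to its target type
    fields = line.split(",")
    return Row(col1=int(fields[0]), col2=float(fields[1]), col3=str(fields[2]))

csvDF = rdd.map(parse_line).toDF()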
csvDF.show()
+----+----+---------------+
|col1|col2|           col3|
+----+----+---------------+
|   1| 2.0|        "hello"|
|   3| 4.0|        "there"|
|   5| 6.0| "how are you?"|
+----+----+---------------+
csvDF.printSchema()
root
|-- col1: long (nullable = true)
|-- col2: double (nullable = true)
|-- col3: string (nullable = true)
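Note that the Row-based version keeps the quotes and leading spaces in col3, because a plain split on "," does no CSV unquoting. If that matters, one possible variant is to parse each line with Python's standard csv module before building the rows. This is a sketch rather than part of the original answer: the name `parse_csv_line` is hypothetical, and it assumes every line has exactly three fields in the int, float, string order of the sample file.

```python
import csv

def parse_csv_line(line):
    """Parse one CSV text line into an (int, float, str) tuple.

    Assumes three fields per line, as in sample.csv.
    """
    # csv.reader strips the surrounding quotes; skipinitialspace=True
    # ignores the blanks that follow each comma in the sample data
    fields = next(csv.reader([line], skipinitialspace=True))
    return int(fields[0]), float(fields[1]), fields[2]

print(parse_csv_line('5, 6.0, "how are you?"'))  # (5, 6.0, 'how are you?')
```

Applied to the RDD above, this could look like `rdd.map(parse_csv_line).toDF(["col1", "col2", "col3"])`, which also supplies the column names in the same step.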