I am trying to read a csv file with pyspark, but it raises some errors. Can you tell me the correct procedure for reading a csv file?
Python code:
from pyspark.sql import *
df = spark.read.csv("D:\Users\SPate233\Downloads\iMedical\query1.csv", inferSchema = True, header = True)
I have also tried the following approach:
sqlContext = SQLContext
df = sqlContext.load(source="com.databricks.spark.csv", header="true", path = "D:\Users\SPate233\Downloads\iMedical\query1.csv")
Error:
Traceback (most recent call last):
File "<pyshell#18>", line 1, in <module>
df = spark.read.csv("D:\Users\SPate233\Downloads\iMedical\query1.csv", inferSchema = True, header = True)
NameError: name 'spark' is not defined
and
Traceback (most recent call last):
File "<pyshell#26>", line 1, in <module>
df = sqlContext.load(source="com.databricks.spark.csv", header="true", path = "D:\Users\SPate233\Downloads\iMedical\query1.csv")
AttributeError: type object 'SQLContext' has no attribute 'load'
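Separately from the NameError, Windows paths in ordinary Python string literals are fragile: backslash sequences such as `\U` or `\t` are treated as escapes (`"\U..."` is even a syntax error in Python 3). A minimal sketch, using a hypothetical shorter path modeled on the question's:

```python
# Raw strings keep backslashes literal; forward slashes also work on Windows.
raw_path = r"D:\data\query1.csv"   # raw string: no escape processing
fwd_path = "D:/data/query1.csv"    # forward slashes are accepted too

print(raw_path)  # D:\data\query1.csv
print(fwd_path)  # D:/data/query1.csv
```

Either form can be passed straight to `spark.read.csv`.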
Answer 0 (score: 1)
First, you need to create a SparkSession as shown below:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("yarn").appName("MyApp").getOrCreate()
Your csv must be on HDFS; then you can read it with spark.read.csv:
df = spark.read.csv('/tmp/data.csv', header=True)
Here /tmp/data.csv is an HDFS path.
Answer 1 (score: 0)
The simplest way to read a csv in pyspark is with Databricks' spark-csv module (on Spark 2.0+ the built-in `spark.read.csv` replaces it):
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('file.csv')
You can also read the file as plain text and split each line on the delimiter:
reader = sc.textFile("file.csv").map(lambda line: line.split(","))
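One caveat with the `line.split(",")` approach above: it breaks on quoted fields that contain commas. A small pure-Python sketch with a hypothetical sample row, comparing the naive split against the stdlib `csv` module, which the same lambda could delegate to:

```python
import csv

row = '1,"Smith, John",ok'  # hypothetical CSV row with a quoted comma

naive = row.split(",")            # splits inside the quoted field
proper = next(csv.reader([row]))  # respects the quoting

print(naive)   # ['1', '"Smith', ' John"', 'ok']
print(proper)  # ['1', 'Smith, John', 'ok']
```

If your data may contain quoted delimiters, prefer `spark.read.csv` or `csv.reader` over a plain split.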