PySpark vs. pandas CSV reader

Date: 2018-07-08 17:37:25

Tags: python pandas csv apache-spark pyspark

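I am trying to read a CSV file in PySpark in which a double-quoted field contains commas (and doubled quotes). PySpark does not parse it correctly, but pandas and Excel both read it without problems.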

Sample CSV file (test.csv):

id,loan_amount,activity,use,country_code,country
1163317,1475.0,Clothing Sales,"to buy traditional clothing such as ""guipiles"" [traditional tunics or blouses],  ""fajas"" [woven belts or sashes]",GT,Guatemala

PySpark code:

df = spark.read.format("csv").option("header", True).option("quote", "\"").load("test.csv")
df.toPandas()

pandas code:

pdf = pd.read_csv("test.csv")

In PySpark, option("mode", "DROPMALFORMED") can be used to drop malformed rows, but I do not want to drop these rows.
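
One thing that may be worth checking: the sample row writes literal quotes inside the quoted field by doubling them (""), while Spark's CSV reader uses a backslash as its default escape character. Below is a minimal sketch, not a confirmed fix. It first uses Python's standard csv module to show how the sample row splits when "" is treated as a literal quote, then defines a hypothetical helper, read_with_quote_escape, that sets Spark's escape option to the quote character. The helper name and the assumption that spark is an existing SparkSession are mine, not part of the question.

```python
import csv
import io

# The sample from the question: the "use" field is quoted and encodes
# literal quotes by doubling them ("").
SAMPLE = (
    'id,loan_amount,activity,use,country_code,country\n'
    '1163317,1475.0,Clothing Sales,"to buy traditional clothing such as '
    '""guipiles"" [traditional tunics or blouses],  ""fajas"" '
    '[woven belts or sashes]",GT,Guatemala\n'
)


def parse_with_doubled_quotes(text):
    """Parse CSV text with the stdlib reader, which treats "" inside a
    quoted field as one literal quote by default (doublequote=True)."""
    return list(csv.reader(io.StringIO(text)))


def read_with_quote_escape(spark, path):
    """Hypothetical helper: read a CSV with Spark, using '"' as the escape
    character instead of the default backslash.

    Assumption: `spark` is an existing SparkSession and the file doubles
    its quotes rather than backslash-escaping them.
    """
    return (
        spark.read.format("csv")
        .option("header", True)
        .option("quote", '"')
        .option("escape", '"')
        .load(path)
    )


# Sanity check without Spark: the row should split into six fields,
# matching the six header columns.
header, row = parse_with_doubled_quotes(SAMPLE)
print(len(header), len(row))
print(row[3])
```

With a running session, the helper would be called as read_with_quote_escape(spark, "test.csv"). Whether this resolves the issue for the full dataset is untested here; it depends on every row following the same doubled-quote convention.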

0 answers:

No answers