I am trying to read a CSV file in PySpark where quoted fields contain commas and escaped double quotes. It does not parse correctly in PySpark, but it reads fine in pandas and Excel.
Sample CSV file (test.csv):
id,loan_amount,activity,use,country_code,country
1163317,1475.0,Clothing Sales,"to buy traditional clothing such as ""guipiles"" [traditional tunics or blouses], ""fajas"" [woven belts or sashes]",GT,Guatemala
PySpark code:
df = spark.read.format("csv").option("header", True).option("quote", "\"").load("test.csv")
df.toPandas()
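For comparison, here is a sketch (not verified against the full file) of how the reader's escape character can also be set; Spark's CSV source defaults to a backslash as the escape character, so a doubled quote inside a quoted field is not treated as an escaped quote unless this option is changed:

df = (
    spark.read.format("csv")
    .option("header", True)
    .option("quote", "\"")
    .option("escape", "\"")  # assumption: the file escapes quotes by doubling them, as in the sample row
    .load("test.csv")
)
df.toPandas()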
pandas code:
import pandas as pd

pdf = pd.read_csv("test.csv")
In PySpark I could use option("mode", "DROPMALFORMED") to drop the malformed rows, but I do not want to drop them.
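For reference, a minimal sketch of how the DROPMALFORMED mode mentioned above would be set (shown only for completeness, since the goal here is to keep these rows rather than drop them):

df_dropped = (
    spark.read.format("csv")
    .option("header", True)
    .option("quote", "\"")
    .option("mode", "DROPMALFORMED")  # silently drops rows the CSV parser cannot parse
    .load("test.csv")
)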