Question

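I have a PySpark dataframe with a text column. If the text contains a ",", the content of the text column seems to get truncated.

This is how I read and save the file.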


from pyspark.sql.types import StringType

read_file = spark.read.csv('C://data/myfile.csv', header=True, inferSchema=True)

# Do some processing, then save the file as CSV
read_file = read_file.select(read_file.text_col.cast(StringType()))
read_file.coalesce(1).write.csv('text.csv', mode='overwrite', header=True)

Sample of text:

Bot[10/26/2019 09:21:44]: Hi there, welcome to XXX. I will be your virtual assistant today.

After saving it will output this:
>>> Bot[10/26/2019 09:21:44]: Hi there

I tried casting the column to StringType, but a column that contains ',' still gets truncated.

Answer 1

tl;dr Use .option("delimiter", "|"), replacing "|" with whatever delimiter (separator) the input dataset actually uses.

"I have a PySpark dataframe with a text column."

That suggests using the text() method instead of csv().

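"If the PySpark dataframe contains ',', it truncates the content of the text column."

That is the csv() method (really the CSV data source) loading the dataset with its default configuration, which assumes , (comma) is the separator. It does not truncate anything; it parses each line according to the separator, so the text after the comma ends up in a different column.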
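To illustrate the "parse, not truncate" point without Spark, here is a small sketch using Python's standard csv module as a stand-in for the CSV data source. It is only an analogy: Spark's parser is a different implementation, but the effect of the separator on the sample line from the question is the same.

```python
import csv
import io

# Sample line from the question; it contains exactly one comma.
line = "Bot[10/26/2019 09:21:44]: Hi there, welcome to XXX. I will be your virtual assistant today."

# Default comma separator: the line is split into two fields at the comma.
comma_fields = next(csv.reader(io.StringIO(line)))

# A separator that does not occur in the text keeps the line as a single field.
pipe_fields = next(csv.reader(io.StringIO(line), delimiter="|"))

print(comma_fields[0])   # the part that looked "truncated"
print(len(comma_fields)) # number of fields with the comma separator
print(len(pipe_fields))  # number of fields with the pipe separator
```

The rest of the sentence is not lost with the comma separator; it simply sits in the second field, which the select on text_col in the question then drops.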

Using the delimiter (or sep) option should "fix" it.
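A minimal sketch of both suggestions follows. The helper names read_delimited and read_lines are hypothetical, the spark argument is assumed to be an existing SparkSession as in the question, and "|" is only a placeholder: the separator has to match whatever the input file really uses.

```python
def read_delimited(spark, path, sep="|"):
    """Read a delimited file, telling the CSV data source which separator to use.

    `spark` is assumed to be an existing SparkSession; `sep` must match the
    separator actually present in the file ("|" is just an example).
    """
    return spark.read.csv(path, header=True, sep=sep)


def read_lines(spark, path):
    """Read every line as one string, with no separator parsing at all.

    The resulting DataFrame has a single string column named `value`.
    """
    return spark.read.text(path)
```

For example, read_delimited(spark, 'C://data/myfile.csv') would keep a comma-containing text_col intact, provided the file is in fact pipe-delimited. When writing the result back out, write.csv accepts the same sep option.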
