My CSV has newline characters inside column values. Here is my sample:
"A","B","C"
1,"This is csv with
newline","This is another column"
"This is newline
and another line","apple","cat"
I can read the file in Spark, but the newlines inside a column are treated as separate rows.
How can I parse this as CSV, with text fields enclosed in double quotes?
I am only using the Apache CSV plugin and Apache Spark to read the file:
alarms = sc.textFile("D:\Dataset\oneday\oneday.csv")
This gives me an RDD:
**example.take(5)**
[u'A,B,C', u'1,"This is csv with ', u'newline",This is another column', u'"This is newline', u'and another line",apple,cat']
Spark version: 1.4
Answer 0 (score: 2)
The csv module from the standard Python library works out of the box:
>>> txt = '''"A","B","C"
1,"This is csv with
newline","This is another column"
"This is newline
and another line","apple","cat"'''
>>> import csv
>>> import io
>>> with io.BytesIO(txt) as fd:
...     rd = csv.reader(fd)
...     for row in rd:
...         print row
...
['A', 'B', 'C']
['1', 'This is csv with \nnewline', 'This is another column']
['This is newline\nand another line', 'apple', 'cat']
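Note (my addition, not part of the original answer): on Python 3 the csv module wants text rather than bytes, so the same approach would use io.StringIO and print() as a function:

import csv
import io

txt = '"A","B","C"\n1,"This is csv with\nnewline","This is another column"'
for row in csv.reader(io.StringIO(txt)):   # StringIO instead of BytesIO on Python 3
    print(row)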
This can be used with binaryFiles (at a significant performance penalty compared to textFile):
>>> (sc.binaryFiles(path)
...     .values()
...     .flatMap(lambda x: csv.reader(io.BytesIO(x))))
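A minimal usage sketch of the snippet above (my assumption: sc is the SparkContext from the shell, and path is the file from the question; binaryFiles yields (filename, bytes) pairs, hence the .values() call):

import csv
import io

path = "D:\Dataset\oneday\oneday.csv"                    # path reused from the question
rows = (sc.binaryFiles(path)
        .values()                                        # keep only the file contents
        .flatMap(lambda x: csv.reader(io.BytesIO(x))))   # parse each whole file as CSV
print rows.take(3)                                       # quoted newlines stay inside one field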
Answer 1 (score: 0)
You don't need to import anything. The solution below writes a second file for demonstration purposes only; you can use the repaired lines directly without writing them anywhere (see the in-memory sketch after the demo).
with open(r'C:\Users\evkouni\Desktop\test_in.csv', 'r') as fin:
    with open(r'C:\Users\evkouni\Desktop\test_out.csv', 'w') as fout:
        cont = fin.readlines()
        for line in cont[:-1]:
            # an odd number of quotes means the quoted field continues on the next line
            if line.count('"') % 2 == 1 and '"\n' not in line:
                line = line.replace('\n', '')
            fout.write(line)
#DEMO
#test_in.csv
#------------
#"A";"B";"C"
#1;"This is csv with
#newline";"This is another column"
#"This is newline
#test_out.csv
#------------
#"A";"B";"C"
#1;"This is csv with newline";"This is another column"
#"This is newline
Let me know if anything is unclear.
Answer 2 (score: 0)
If you want to create a DataFrame from a CSV with embedded newlines and double-quoted fields without reinventing the wheel, use the spark-csv and commons-csv libraries:
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)   # sc is the SparkContext from the shell
df = sqlContext.load(header="true", source="com.databricks.spark.csv",
                     path="hdfs://analytics.com.np:8020/hdp/badcsv.csv")
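As a side note, the same load can be written with Spark 1.4's DataFrameReader API (a sketch, reusing the answer's HDFS path):

df = (sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")   # treat the first line as column names
      .load("hdfs://analytics.com.np:8020/hdp/badcsv.csv"))
df.printSchema()                  # confirm columns A, B and C were detected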