Suppose we have a comma-separated (CSV) file like this:
"name of movie","starring","director","release year"
"dark knight rises","christian bale, anna hathaway","christopher nolan","2012"
"the dark knight","christian bale, heath ledger","christopher nolan","2008"
"The "day" when earth stood still","Michael Rennie,the 'strong' man","robert wise","1951"
"the 'gladiator'","russel "the awesome" crowe","ridley scott","2000"
As you can see above, lines 4 and 5 have quotes inside quotes. The output should look like this:
"name of movie","starring","director","release year"
"dark knight rises","christian bale, anna hathaway","christopher nolan","2012"
"the dark knight","christian bale, heath ledger","christopher nolan","2008"
"The day when earth stood still","Michael Rennie,the strong man","robert wise","1951"
"the gladiator","russel the awesome crowe","ridley scott","2000"
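How do I get rid of quotes (both single and double) that appear inside the quoted fields of a CSV file like this? Note that commas inside a single field are fine, since a parser will recognize that they are inside quotes and treat the whole thing as one field. This is just a preprocessing step to tidy up the CSV file so it can be fed to various parsers and converted into whatever format we want. Bash, awk, or Python are all fine. Please no Perl, I'm sick of that language :D Thanks in advance!
Answer 0 (score: 3)
How about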
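import csv

def remove_quotes(s):
    # Drop every single and double quote character from a field.
    return ''.join(c for c in s if c not in ('"', "'"))

# Python 3: open both files in text mode with newline="" as the csv module expects.
with open("fixquote.csv", newline="") as infile, open("fixed.csv", "w", newline="") as outfile:
    reader = csv.reader(infile)
    # QUOTE_ALL wraps every output field in double quotes.
    writer = csv.writer(outfile, quoting=csv.QUOTE_ALL)
    for line in reader:
        writer.writerow([remove_quotes(elem) for elem in line])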
which produces
~/coding$ cat fixed.csv
"name of movie","starring","director","release year"
"dark knight rises","christian bale, anna hathaway","christopher nolan","2012"
"the dark knight","christian bale, heath ledger","christopher nolan","2008"
"The day when earth stood still","Michael Rennie,the strong man","robert wise","1951"
"the gladiator","russel the awesome crowe","ridley scott","2000"
By the way, you might want to check the spelling of some of those names...
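To try the same idea on the question's sample rows without creating any files, here is a minimal sketch that runs the reader and writer over in-memory strings. `SAMPLE` and `clean_csv_text` are names made up for this illustration, and it assumes Python 3. It leans on the csv module's default, non-strict handling of stray quotes, so a comma that comes after an inner double quote in the same field may end up splitting that field.

```python
import csv
import io

# Sample rows copied from the question (header plus the two problem rows).
SAMPLE = '''"name of movie","starring","director","release year"
"The "day" when earth stood still","Michael Rennie,the 'strong' man","robert wise","1951"
"the 'gladiator'","russel "the awesome" crowe","ridley scott","2000"
'''

def remove_quotes(s):
    # Drop every single and double quote character from a field.
    return ''.join(c for c in s if c not in ('"', "'"))

def clean_csv_text(text):
    # Hypothetical helper: same reader/writer loop, but over strings.
    infile = io.StringIO(text, newline="")
    outfile = io.StringIO(newline="")
    reader = csv.reader(infile)
    # lineterminator="\n" keeps the output easy to compare line by line.
    writer = csv.writer(outfile, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for row in reader:
        writer.writerow([remove_quotes(elem) for elem in row])
    return outfile.getvalue()

print(clean_csv_text(SAMPLE), end="")
```

On this sample it should print the three rows with the inner quotes gone and the comma in the "starring" field left alone.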
Answer 1 (score: 0)
Split the values into an array. Iterate over the array and remove any quote other than the first and last character. Hope it helps.
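A rough sketch of that idea in Python. Splitting on the `","` sequence between fields (rather than on bare commas) is my assumption, so that commas inside a field survive; `strip_inner_quotes` is a hypothetical helper name, and it assumes every field on the line is wrapped in double quotes.

```python
def strip_inner_quotes(line):
    """Remove every quote except the ones that delimit the fields."""
    body = line.rstrip("\r\n")
    # Assumption: the line starts and ends with a double quote, and fields
    # are separated by the exact sequence '","'.
    fields = body[1:-1].split('","')
    # Remove any single or double quote left inside each field.
    cleaned = [f.replace('"', "").replace("'", "") for f in fields]
    # Put the delimiting quotes back around every field.
    return ",".join('"' + f + '"' for f in cleaned)

rows = [
    '"The "day" when earth stood still","Michael Rennie,the \'strong\' man","robert wise","1951"',
    '"the \'gladiator\'","russel "the awesome" crowe","ridley scott","2000"',
]
for row in rows:
    print(strip_inner_quotes(row))
```

This would get confused by an inner quote sitting right next to a comma (for example `"a "b","c"`), since that looks exactly like a field boundary.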
Answer 2 (score: 0)
With awk, you could do something like this:
awk -v Q='"' -v S="'" 'BEGIN { FS = OFS = Q "," Q }
{ gsub("^" Q "|" Q "$", "") ; for (i = 1; i <= NF; i++) gsub("[" Q S "]", "", $i) ; print Q $0 Q }'