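I am working with a CSV file like the one below. It is comma-delimited and every cell is wrapped in double quotes, but some cells contain double quotes and/or commas inside the quoted field. The actual file has about 300 columns and 200,000 rows.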
"Column1","Column2","Column3","Column4","Column5","Column6","Column7"
"abc","abc","this, but with "comma" and a quote","18"" inch TV","abc","abc","abc"
"cde","cde","cde","some other, "cde" here","cde","cde","cde"
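I need to remove some unneeded columns, merge the last few columns using </br> between them instead of ",", and move the second column to the end. Whatever is inside a cell should stay the same, with double quotes and commas exactly as in the original file. Below is an example of the output I need.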
"Column1","Column4","Column5","Column2"
"abc","18"" inch TV","abc</br>abc</br>abc","abc"
"cde","some other, "cde" here","cde</br>cde</br>cde","cde"
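In this example, I want to remove Column3 and merge columns 5, 6 and 7.
Below is the code I am trying to use, but it misreads the embedded double quotes and/or commas, so the rows end up different from what I expected.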
awk -vFPAT='([^,]*)|("[^"]+")' -vOFS=, '{print $1,$4,$5"</br>"$6"</br>"$7,$2}' inputfile.csv
sed -i 's@"</br>"@</br>@g' inputfile.csv
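The sed command removes the closing and opening double quotes between the merged cells.
In the output I am getting now, if a field contains a double quote, that quote is treated as the start of a new cell, so the following values are usually shifted by one column.
The other code I tried treats every comma as the start of a new cell, so it does not work either.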
awk -F',' 'BEGIN{OFS=",";} {print $1,$4,$5"</br>"$6"</br>"$7,$2}' inputfile.csv
sed -i 's@"</br>"@</br>@g' inputfile.csv
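Any help is greatly appreciated. Thanks!
Answer 0 (score: 2)
CSV is a loosely defined format, and files can differ in subtle ways. Your particular variant may or may not be expressible with a regular grammar or regular expression. (See this question for a discussion of the issue.) Even if your variant can be expressed with a regex, it may be easier to use a parser from an existing library.
This is not the bash/awk/sed solution you may want or need, but Python has a csv module for parsing CSV files, with a number of options to adjust the format. Try something like this: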
#!/usr/bin/env python3
import csv

# In Python 3 the csv module expects text-mode files opened with newline=''.
with open('infile.csv', newline='') as infile, \
        open('outfile.csv', 'w', newline='') as outfile:
    inreader = csv.reader(infile)
    outwriter = csv.writer(outfile, quoting=csv.QUOTE_ALL)
    for row in inreader:
        # Merge fields 5,6,7 (indexes 4,5,6) into one
        row[4] = "</br>".join(row[4:7])
        del row[5:7]
        # Copy second field to the end
        row.append(row[1])
        # Remove second and third fields
        del row[1:3]
        # Write manipulated row
        outwriter.writerow(row)
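Note that in Python, indexes start at 0 (e.g. row[1] is the second field). The first index of a slice is inclusive and the last is exclusive (row[1:3] covers only row[1] and row[2]). Your format seems to require quotes around every field, hence quoting=csv.QUOTE_ALL. More options are described under Dialects and Formatting Parameters.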
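As a quick illustration of the slicing and quoting behaviour described above, here is a minimal sketch. It writes to an in-memory io.StringIO buffer instead of a file, and the sample row is made up for the demonstration.

```python
import csv
import io

row = ["a", "b", "c", "d", "e", "f", "g"]

# Slices include the first index and exclude the last one.
assert row[1:3] == ["b", "c"]
assert row[4:7] == ["e", "f", "g"]

# QUOTE_ALL wraps every field in quotes and doubles any embedded quote.
buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
writer.writerow(["abc", '18" inch TV', "x, y"])
print(buffer.getvalue(), end="")  # "abc","18"" inch TV","x, y"
```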
The code above produces the following output:
"Column1","Column4","Column5</br>Column6</br>Column7","Column2"
"abc","18"" inch TV","abc</br>abc</br>abc","abc"
"cde","some other, cde"" here""","cde</br>cde</br>cde","cde"
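There are two problems with this:
It does not treat the first row differently, so the headers of columns 5, 6 and 7 are merged just like the other rows.
Your input CSV contains "some other, "cde" here" (third row, fourth column) with unescaped quotes around cde. There is another such case in the second row, but it disappears because it is in column 3, which is removed. The result contains incorrect quotes.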
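To see the second problem in isolation, the sketch below parses the third sample row both ways from an in-memory string. The helper names parse and is_well_formed are made up for this illustration; strict=True is a standard csv formatting parameter that makes the reader raise csv.Error on malformed input instead of guessing.

```python
import csv
import io

bad = '"cde","cde","cde","some other, "cde" here","cde","cde","cde"\n'
good = '"cde","cde","cde","some other, ""cde"" here","cde","cde","cde"\n'

def parse(text, strict=False):
    """Parse a single CSV line and return its list of fields."""
    return next(csv.reader(io.StringIO(text), strict=strict))

# Lenient parsing silently drops the quote in front of cde.
print(parse(bad)[3])   # some other, cde" here"
# With doubled quotes the cell content is recovered as intended.
print(parse(good)[3])  # some other, "cde" here

def is_well_formed(text):
    """Return True if the line parses without errors in strict mode."""
    try:
        parse(text, strict=True)
    except csv.Error:
        return False
    return True

print(is_well_formed(bad), is_well_formed(good))  # False True
```

Running the strict check over the real file first is one way to find out how many rows contain unescaped quotes before deciding how to repair them.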
If these quotes are escaped properly, your sample input CSV file becomes
infile.csv (with escaped quotes):
"Column1","Column2","Column3","Column4","Column5","Column6","Column7"
"abc","abc","this, but with ""comma"" and a quote","18"" inch TV","abc","abc","abc"
"cde","cde","cde","some other, ""cde"" here","cde","cde","cde"
Now consider this modified Python script, which does not merge the columns of the first row:
#!/usr/bin/env python3
import csv

# In Python 3 the csv module expects text-mode files opened with newline=''.
with open('infile.csv', newline='') as infile, \
        open('outfile.csv', 'w', newline='') as outfile:
    inreader = csv.reader(infile)
    outwriter = csv.writer(outfile, quoting=csv.QUOTE_ALL)
    first_row = True
    for row in inreader:
        if first_row:
            # Header row: keep "Column5" as is, do not merge
            first_row = False
        else:
            # Merge fields 5,6,7 (indexes 4,5,6) into one
            row[4] = "</br>".join(row[4:7])
        del row[5:7]
        # Copy second field (index 1) to the end
        row.append(row[1])
        # Remove second and third fields
        del row[1:3]
        # Write manipulated row
        outwriter.writerow(row)
The output outfile.csv is
"Column1","Column4","Column5","Column2"
"abc","18"" inch TV","abc</br>abc</br>abc","abc"
"cde","some other, ""cde"" here","cde</br>cde</br>cde","cde"
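This is your sample output, except that "some other, ""cde"" here" is now escaped properly.
This may not be what you were after, since it is not a sed or awk solution, but I hope it is still useful. Handling a more complex format may justify a more complex tool, and using an existing library also removes some opportunities for mistakes.
Answer 1 (score: 0)
This may be an oversimplification of the problem, but it worked on my test data: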
cat /tmp/inputfile.csv | sed 's@\"\,\"@|@g' | sed 's@"</br>"@</br>@g' | awk 'BEGIN {FS="|"} {print $1 "," $4 "," $5 "</br>" $6 "</br>" $7 "," $2}'
Note that I am on a Mac, which may be why I had to wrap the commas in the AWK script in quotes.
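For comparison, the same idea of splitting on the literal "," separator can be sketched in Python. This is not the answer author's code. It assumes every cell is quoted, that no cell contains the three-character sequence "," itself, and that no cell spans several lines. The function names split_fields and transform are hypothetical. Cell contents are copied verbatim, so unescaped quotes survive unchanged.

```python
def split_fields(line):
    """Split one line on the literal "," separator, keeping cell text verbatim."""
    body = line.rstrip("\r\n")
    # Drop the opening quote of the first cell and the closing quote of the last.
    return body[1:-1].split('","')

def transform(line, is_header=False):
    """Keep columns 1 and 4, merge 5-7 with </br>, and move column 2 to the end."""
    fields = split_fields(line)
    # The header keeps only "Column5"; data rows merge columns 5, 6 and 7.
    merged = fields[4] if is_header else "</br>".join(fields[4:7])
    kept = [fields[0], fields[3], merged, fields[1]]
    return '"' + '","'.join(kept) + '"'

sample = [
    '"Column1","Column2","Column3","Column4","Column5","Column6","Column7"\n',
    '"abc","abc","this, but with "comma" and a quote","18"" inch TV","abc","abc","abc"\n',
    '"cde","cde","cde","some other, "cde" here","cde","cde","cde"\n',
]
for number, line in enumerate(sample):
    print(transform(line, is_header=(number == 0)))
```

On the three sample lines this prints the output requested in the question. If a cell in the real file can contain "," the split produces too many fields, so checking that len(split_fields(line)) matches the expected column count is a cheap safeguard.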