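I have very large .csv files containing raw data. Many fields have leading and trailing spaces, and many multi-word field values that should have only one space between words contain extra spaces, e.g.
'12   Anywhere    Street'
should be:
'12 Anywhere Street'
The leading, trailing, and extra spaces range from one to six extra spaces. I can load the files into my database and run scripts to trim them. The leading and trailing trim scripts work well and run quickly; however, the script that removes the extra spaces between words is longer and takes much more time. I would prefer to remove the extra spaces between words from the raw .csv files on the command line before loading them into my database.
I basically need to run a replace function that replaces any run of two, three, four, ... spaces (up to six or so) with a single space. I would greatly appreciate some help in achieving this.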
Answer 0 (score: 0)
In Part 1 of this response, I'll first assume that your CSV file has a field separator (say ",") that does NOT occur within any field. In Part 2, I'll deal with the more general case.
Part 1.
awk -F, '
function trim(s) {
sub(/^ */,"",s); sub(/ *$/,"",s); gsub(/ +/," ",s); return s;
}
BEGIN {OFS=FS}
{for (i=1;i<=NF;i++) { $i=trim($i) }; print }'
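As a quick sanity check, here is a minimal sketch of the same command wrapped in a shell function and run on one sample line. The function name clean_csv and the sample data are made up for illustration, and it still assumes that no field contains a comma.

```shell
# Hypothetical helper: trim and squeeze spaces in every field of a simple CSV.
# Assumes the separator "," never occurs inside a field.
clean_csv() {
  awk -F, '
    function trim(s) {
      sub(/^ +/, "", s)      # drop leading spaces
      sub(/ +$/, "", s)      # drop trailing spaces
      gsub(/ +/, " ", s)     # collapse each run of spaces to a single space
      return s
    }
    BEGIN { OFS = FS }
    { for (i = 1; i <= NF; i++) $i = trim($i); print }
  ' "$@"
}

# Reads stdin when no file is given; pass file names to process files instead.
printf '%s\n' '  12   Anywhere    Street ,  Springfield  , 12345' | clean_csv
# prints: 12 Anywhere Street,Springfield,12345
```

For a real file, something like clean_csv raw.csv > clean.csv should work, with both file names being placeholders.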
Part 2.
To handle the general case, it's best to use a CSV-aware tool (such as Excel or one of the csv2tsv command-line tools) to convert the CSV to a simple format wherein the value-separator does not literally occur within the values. The TSV format (with tab-separated values) is particularly appropriate since it allows a representation of tabs to be included within fields.
Then run the above awk command using awk -F"\t" instead of awk -F,.
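To illustrate why the tab-separated form is safer, here is a small sketch of the tab variant applied to a field that contains a comma. The name clean_tsv and the sample line are hypothetical; the input is assumed to be TSV already, with no literal tabs inside fields.

```shell
# Hypothetical helper: same cleanup as in Part 1, but with tab as the separator,
# so commas inside a field are left alone.
clean_tsv() {
  awk -F"\t" '
    function trim(s) {
      sub(/^ +/, "", s)      # drop leading spaces
      sub(/ +$/, "", s)      # drop trailing spaces
      gsub(/ +/, " ", s)     # collapse each run of spaces to a single space
      return s
    }
    BEGIN { OFS = FS }
    { for (i = 1; i <= NF; i++) $i = trim($i); print }
  ' "$@"
}

# Two tab-separated fields; the first one contains a comma.
printf '  12   Main   St, Apt 4 \t  Springfield \n' | clean_tsv
# prints: 12 Main St, Apt 4<TAB>Springfield
```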
To recover the original format, use a tool such as Excel, tsv2csv, or jq. Here is the jq incantation assuming you want a "standard" CSV file:
jq -Rr 'split("\t") | @csv'
In a pinch, the following will probably be sufficient:
awk -F"\t" '
BEGIN{OFS=","; QQ="\"";}
function q(s) { if (index(s,OFS)) { return QQ s QQ }; return s}
function qq(s) { gsub( QQ, QQ QQ, s); return QQ s QQ }
function wrap(s) { if (index(s,QQ)) { return qq(s) } return q(s)}
{ s=wrap($1); for (i=2;i<=NF;i++) {s=s OFS wrap($i)}; print s}'
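To see what the fallback converter does with awkward values, here is a sketch that wraps the same awk program in a function (tsv_to_csv is a made-up name) and feeds it one line with a comma, an embedded double quote, and a plain value. Unlike jq's @csv, it only quotes fields that need it, and it does not handle fields containing newlines.

```shell
# Hypothetical wrapper around the fallback TSV-to-CSV converter.
tsv_to_csv() {
  awk -F"\t" '
    BEGIN { OFS = ","; QQ = "\"" }
    # Quote a field only if it contains the output separator.
    function q(s)    { if (index(s, OFS)) return QQ s QQ; return s }
    # Double any embedded quotes, then quote the whole field.
    function qq(s)   { gsub(QQ, QQ QQ, s); return QQ s QQ }
    function wrap(s) { if (index(s, QQ)) return qq(s); return q(s) }
    { s = wrap($1); for (i = 2; i <= NF; i++) s = s OFS wrap($i); print s }
  ' "$@"
}

printf 'a,b\tsay "hi"\tplain\n' | tsv_to_csv
# prints: "a,b","say ""hi""",plain
```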
Answer 1 (score: 0)
On MacOS or Linux you can do:
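tr -s '[:space:]' < data.csv > formatted.csv
This will not trim each value, but it will squeeze every run of a repeated whitespace character into a single one. Note that [:space:] also covers tabs and newlines, so blank lines are collapsed as well; use tr -s ' ' instead to squeeze spaces only. The character class is quoted so the shell cannot expand it as a filename pattern. Maybe this will get you going.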
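If trimming is needed as well, one possible sketch uses sed instead of tr. It swaps in a different technique from the one in this answer, and it assumes that commas never occur inside a field and that spaces next to a comma are never wanted. The function name squeeze_and_trim is made up, and -E (extended regular expressions) is assumed to be available, as it is in GNU and BSD sed.

```shell
# Hypothetical helper: squeeze runs of spaces, then remove spaces around
# commas and at the start and end of each line.
# Assumes no field contains a comma.
squeeze_and_trim() {
  sed -E 's/ +/ /g; s/ ?, ?/,/g; s/^ //; s/ $//' "$@"
}

printf '%s\n' '  12   Anywhere    Street ,  Springfield  , 12345 ' | squeeze_and_trim
# prints: 12 Anywhere Street,Springfield,12345
```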