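From an external source I receive huge CSV files (about 16GB) whose fields are optionally enclosed in double quotes ("). Fields are separated by semicolons (;). When a field contains a double quote in its content, it is escaped as two double quotes.
Currently I import this data into a MySQL database, which understands the "" semantics.
I am considering migrating to Amazon Redshift, but it (or perhaps PostgreSQL) requires the quote to be escaped with a backslash, as \".
Now I am looking for the fastest command-line tool (perhaps awk or sed?) and the exact syntax to convert the file.
Example input: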
"""start of line";"""beginning "" middle and end """;"end of line"""
12345;"Tell me an ""intelligent"" joke; I tell you one in return"
54321;"Your mom is ""nice"""
"";"";""
"However, if;""Quotes""; are present"
Example output:
"\"start of line";"\"beginning \" middle and end \"";"end of line\""
12345;"Tell me an \"intelligent\" joke; I tell you one in return"
54321;"Your mom is \"nice\""
"";"";""
"However, if;\"Quotes\"; are present"
Edit: added more test cases.
Answer 0: (score: 3)
There are a few edge cases to watch out for:
sed -r '
# at the start of a line or the start of a field,
# replace """ with "\"
s/(^|;)"""/\1"\\"/g
# replace any doubled double-quote with an escaped double-quote.
# this affects any "inner" quote pair as well as end of field or end of line
# if there is an escaped quote from the previous command, do not be fooled by
# the quote that follows it.
s/([^\\])""/\1\\"/g
# the above step will destroy empty strings. fix them here. this uses a
# conditional loop: if there are 2 consecutive empty fields, they will
# share a delimiter, so we have to process the line more than once
:fix_empty_fields
s/(^|;)\\"($|;)/\1""\2/g
tfix_empty_fields
' <<'END'
"""start of line";"""beginning "" middle and end """;"end of line"""
"";"";"";"""";"""""";"";""
END
Output:
"\"start of line";"\"beginning \" middle and end \"";"end of line\""
"";"";"";"\"";"\"\"";"";""
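sed is an efficient tool, but a 16GB file will still take a while. You had also better have at least 16GB of free disk space to write the updated file (even sed's -i in-place edit uses a temporary file behind the scenes).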
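The edge cases above all come down to one question: does a given quote open a field, close one, or belong to an escaped pair? As an alternative sketch (not the method used in this answer), a small awk program can track that state explicitly instead of patching things up with several substitutions. The function name convert_quotes is hypothetical. The sketch assumes that a quoted field never contains a newline and that the data contains no backslashes that would themselves need escaping. Walking each line character by character is likely to be noticeably slower than sed on a 16GB file.

```shell
#!/bin/sh
# Hypothetical helper: reads CSV on stdin, writes the converted CSV to stdout.
convert_quotes() {
  awk '
  {
    out = ""
    in_q = 0                          # assumption: a quoted field never spans lines
    n = length($0)
    for (i = 1; i <= n; i++) {
      c = substr($0, i, 1)
      if (c != "\"") {                # ordinary character: copy it through
        out = out c
      } else if (!in_q) {             # opening quote of a field
        in_q = 1
        out = out c
      } else if (substr($0, i + 1, 1) == "\"") {
        out = out "\\\""              # doubled quote inside a field becomes \"
        i++                           # skip the second quote of the pair
      } else {                        # closing quote of the field
        in_q = 0
        out = out c
      }
    }
    print out
  }'
}

# Usage: convert_quotes < input.csv > output.csv
```

Because the quote state is tracked, empty fields ("") pass through unchanged without a separate repair step.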
Answer 1: (score: 0)
I would use sed, as you suggested in your post:
$ sed 's@""@\\"@g' input
12345;"Tell me an \"intelligent\" joke; I tell you one in return"
54321;"Your mom is \"nice\""
Answer 2: (score: 0)
I would go with sed:
$ sed 's:"":\\":g' your_csv.csv
When tested on the following:
new """
test ""
"hows "" this "" "
I got:
new \""
test \"
"hows \" this \" "
Answer 3: (score: 0)
This line should work:
sed 's/""/\\"/g' file
Answer 4: (score: 0)
Using sed:
sed 's/""/\\"/g' input_file
$ cat n.txt
12345;"Tell me an ""intelligent"" joke; I tell you one in return"
54321;"Your mom is ""nice"""
$ sed 's/""/\\"/g' n.txt
12345;"Tell me an \"intelligent\" joke; I tell you one in return"
54321;"Your mom is \"nice\""
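Note that the two rows tested here contain neither an empty field nor a field that begins with an escaped quote. As a quick check, the sketch below runs the same substitution against two of the other sample rows from the question. The function name naive_convert is only an illustrative wrapper.

```shell
#!/bin/sh
# The one-line substitution from above, wrapped in a function (hypothetical name).
naive_convert() {
  sed 's/""/\\"/g'
}

# Empty quoted fields are rewritten too, although the expected output keeps them as "".
printf '%s\n' '"";"";""' | naive_convert
# prints: \";\";\"

# A field that starts with an escaped quote loses its opening quote.
printf '%s\n' '"""start of line"' | naive_convert
# prints: \""start of line"
```

Both results differ from the example output in the question, so the plain substitution is probably only safe when the data is known to contain neither case.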