在回复 Recursive Descent CSV parser in BASH 时,我(两个帖子中the original author)已尝试将其转换为AWK脚本,以便快速比较数据处理与这些脚本语言。由于有几个缓解因素,翻译不是1:1的翻译,但对于那些感兴趣的人来说,这种实现在字符串处理方面比在另一方面更快。
最初,由于Jonathan Leffler,我们有一些问题已被撤销。虽然标题为CSV
,但我们已将代码更新为DSV
,这意味着如果您认为有必要,可以将任何单个字符指定为字段分隔符。
此代码现已准备好摊牌。
基本功能
"
[1] 引用字段是文字内容,因此不会对引用内容执行转义序列解释。然而,可以在单个字段中连接引号,纯文本和解释序列以实现期望的效果。例如:
one,two,three:\t"Little Endians," and one Big Endian Chief
是CSV的三字段行,其中第三个字段相当于:
three: Little Endians, and one Big Endian Chief
[2] 参考资料中描述的例子为"具体实施"或拥有"未定义的行为"将不受支持,因为它们根据定义不可移植,或者太模糊而不可靠。如果此处或参考材料中未定义转义序列,则将忽略反斜杠,并将单个最后面的字符视为纯文本值。不支持整数值字符转义序列它是一种不可靠的方法,不能在多个平台上很好地扩展,并且不必要地增加了验证代理解析的复杂性。
[3] 八进制字符转义必须采用3位八进制格式。如果它不是3位八进制转义序列,则它是单个数字的空转义序列。十六进制转义序列必须采用2位十六进制格式。如果转义序列标识符后面的前两个字符无效,则不会进行解释,并且将在标准错误上打印消息。任何剩余的十六进制数字都将被忽略。
[4] 自定义输入分隔符iDelimiter
必须是单个字符。不支持多行记录,并且总是不赞成使用这种矛盾。这降低了数据记录的可移植性,使其特定于文件的位置和原点(在该文件中)可能是未知的。例如,grep
内容文件可能会返回不完整的记录,因为内容可能在任何前一行开始,将数据采集限制为完全自上而下的数据库解析。
[5] 自定义输出分隔符oDelimiter
可以是任何所需的字符串值。脚本输出始终由单个换行符终止。这是正确的终端应用程序输出的功能。否则,您解析的CSV输出和终端提示将消耗相同的行,从而产生令人困惑的情况。此外,大多数解释器(如控制台)都是基于行的设备,他们希望换行符表示I / O记录的结束。如果您发现尾随换行符不合适,请将其修剪掉。
[6] 16位Unicode转义序列可通过以下表示法获得:
\uHHHH Unicode character with hex value HHHH (4 digits)通过以下方式支持
和32位Unicode转义序列:
\UHHHHHHHH Unicode character with hex value HHHHHHHH (8 digits)
特别感谢SO社区的所有成员,他们的经验,时间和投入使我创建了一个非常有用的信息处理工具。
代码清单:dsv.awk
#!/bin/awk -f
#
###############################################################
#
# ZERO LIABILITY OR WARRANTY LICENSE YOU MAY NOT OWN ANY
# COPYRIGHT TO THIS SOFTWARE OR DATA FORMAT IMPOSED HEREIN
# THE AUTHOR PLACES IT IN THE PUBLIC DOMAIN FOR ALL USES
# PUBLIC AND PRIVATE THE AUTHOR ASKS THAT YOU DO NOT REMOVE
# THE CREDIT OR LICENSE MATERIAL FROM THIS DOCUMENT.
#
###############################################################
#
# Special thanks to Jonathan Leffler, whose wisdom, and
# knowledge defined the output logic of this script.
#
# Special thanks to GNU.org for the base conversion routines.
#
# Credits and recognition to the original Author:
# Triston J. Taylor whose countless hours of experience,
# research and rationalization have provided us with a
# more portable standard for parsing DSV records.
#
###############################################################
#
# This script accepts and parses a single line of DSV input
# from <STDIN>.
#
# Record fields are seperated by command line varibale
# 'iDelimiter' the default value is comma.
#
# Ouput is seperated by command line variable 'oDelimiter'
# the default value is line feed.
#
# To learn more about this tool visit StackOverflow.com:
#
# http://stackoverflow.com/questions/10578119/
#
# You will find there a wealth of information on its
# standards and development track.
#
###############################################################
function NextSymbol() {
strIndex++;
symbol = substr(input, strIndex, 1);
return (strIndex < parseExtent);
}
function Accept(query) {
#print "query: " query " symbol: " symbol
if ( symbol == query ) {
#print "matched!"
return NextSymbol();
}
return 0;
}
function Expect(query) {
# special case: empty query && symbol...
if ( query == nothing && symbol == nothing ) return 1;
# case: else
if ( Accept(query) ) return 1;
msg = "dsv parse error: expected '" query "': found '" symbol "'";
print msg > "/dev/stderr";
return 0;
}
function PushData() {
field[fieldIndex++] = fieldData;
fieldData = nothing;
}
function Quote() {
while ( symbol != quote && symbol != nothing ) {
fieldData = fieldData symbol;
NextSymbol();
}
Expect(quote);
}
function GetOctalChar() {
qOctalValue = substr(input, strIndex+1, 3);
# This isn't really correct but its the only way
# to express 0-255. On unicode systems it won't
# matter anyway so we don't restrict the value
# any further than length validation.
if ( qOctalValue ~ /^[0-7]{3}$/ ) {
# convert octal to decimal so we can print the
# desired character in POSIX awks...
n = length(qOctalValue)
ret = 0
for (i = 1; i <= n; i++) {
c = substr(qOctalValue, i, 1)
if ((k = index("01234567", c)) > 0)
k-- # adjust for 1-basing in awk
ret = ret * 8 + k
}
strIndex+=3;
return sprintf("%c", ret);
# and people ask why posix gets me all upset..
# Special thanks to gnu.org for this contrib..
}
return sprintf("\0"); # if it wasn't 3 digit octal just use zero
}
function GetHexChar(qHexValue) {
rHexValue = HexToDecimal(qHexValue);
rHexLength = length(qHexValue);
if ( rHexLength ) {
strIndex += rHexLength;
return sprintf("%c", rHexValue);
}
# accept no non-sense!
printf("dsv parse error: expected " rHexLength) > "/dev/stderr";
printf("-digit hex value: found '" qHexValue "'\n") > "/dev/stderr";
}
function HexToDecimal(hexValue) {
if ( hexValue ~ /^[[:xdigit:]]+$/ ) {
# convert hex to decimal so we can print the
# desired character in POSIX awks...
n = length(hexValue)
ret = 0
for (i = 1; i <= n; i++) {
c = substr(hexValue, i, 1)
c = tolower(c)
if ((k = index("0123456789", c)) > 0)
k-- # adjust for 1-basing in awk
else if ((k = index("abcdef", c)) > 0)
k += 9
ret = ret * 16 + k
}
return ret;
# and people ask why posix gets me all upset..
# Special thanks to gnu.org for this contrib..
}
return nothing;
}
function BackSlash() {
# This could be optimized with some constants.
# but we generate the data here to assist in
# translation to other programming languages.
if (symbol == iDelimiter) { # separator precedes all sequences
fieldData = fieldData symbol;
} else if (symbol == "a") { # alert
fieldData = sprintf("%s\a", fieldData);
} else if (symbol == "b") { # backspace
fieldData = sprintf("%s\b", fieldData);
} else if (symbol == "f") { # form feed
fieldData = sprintf("%s\f", fieldData);
} else if (symbol == "n") { # line feed
fieldData = sprintf("%s\n", fieldData);
} else if (symbol == "r") { # carriage return
fieldData = sprintf("%s\r", fieldData);
} else if (symbol == "t") { # horizontal tab
fieldData = sprintf("%s\t", fieldData);
} else if (symbol == "v") { # vertical tab
fieldData = sprintf("%s\v", fieldData);
} else if (symbol == "0") { # null or 3-digit octal character
fieldData = fieldData GetOctalChar();
} else if (symbol == "x") { # 2-digit hexadecimal character
fieldData = fieldData GetHexChar( substr(input, strIndex+1, 2) );
} else if (symbol == "u") { # 4-digit hexadecimal character
fieldData = fieldData GetHexChar( substr(input, strIndex+1, 4) );
} else if (symbol == "U") { # 8-digit hexadecimal character
fieldData = fieldData GetHexChar( substr(input, strIndex+1, 8) );
} else { # symbol didn't match the "interpreted escape scheme"
fieldData = fieldData symbol; # just concatenate the symbol
}
NextSymbol();
}
function Line() {
if ( Accept(quote) ) {
Quote();
Line();
}
if ( Accept(backslash) ) {
BackSlash();
Line();
}
if ( Accept(iDelimiter) ) {
PushData();
Line();
}
if ( symbol != nothing ) {
fieldData = fieldData symbol;
NextSymbol();
Line();
} else if ( fieldData != nothing ) PushData();
}
BEGIN {
# State Variables
symbol = ""; fieldData = ""; strIndex = 0; fieldIndex = 0;
# Output Variables
field[itemIndex] = "";
# Control Variables
parseExtent = 0;
# Formatting Variables (optionally set on invocation line)
if ( iDelimiter != "" ) {
# the algorithm in place does not support multi-character delimiter
if ( length(iDelimiter) > 1 ) { # we have a problem
msg = "dsv parse: init error: multi-character delimiter detected:";
printf("%s '%s'", msg, iDelimiter);
exit 1;
}
} else {
iDelimiter = ",";
}
if ( oDelimiter == "" ) oDelimiter = "\n";
# Symbol Classes
nothing = "";
quote = "\"";
backslash = "\\";
getline input;
parseExtent = (length(input) + 2);
# parseExtent exceeds length because the loop would terminate
# before parsing was complete otherwise.
NextSymbol();
Line();
Expect(nothing);
}
END {
if (fieldIndex) {
fieldIndex--;
for (i = 0; i < fieldIndex; i++)
{
printf("%s", field[i] oDelimiter);
}
print field[i];
}
}
如何运行脚本&#34;像Pro&#34;
# Spit out some CSV "newline" delimited:
echo 'one,two,three,AWK,CSV!' | awk -f dsv.awk
# Spit out some CSV "tab" delimited:
echo 'one,two,three,AWK,CSV!' | awk -v oDelimiter=$'\t' -f dsv.awk
# Spit out some CSV "ASCII Group Separator" delimited:
echo 'one,two,three,AWK,CSV!' | awk -v oDelimiter=$'\29' -f dsv.awk
如果您需要一些自定义输出控制分隔符但不确定要使用的内容,可以参考 this handy ASCII chart
未来计划:
哲学
应始终使用转义序列在基于行的数据库中创建多行字段数据,并且应始终使用引号来保留和连接记录字段内容。这是实现此类记录解析器的最简单(也是最有效)的方法。我鼓励所有的软件开发人员和教育机构采取这种方向,以确保可移植性和准确获取基于行的分隔符分隔记录。
除了RFC 4180之外,CSV没有官方规范,也没有定义任何有用的便携式记录类型。我希望作为一名拥有超过15年经验的开发人员,这将成为Portable CSV / DSV Records的官方认可标准。答案 0 :(得分:1)
原始版本的代码中有太多空白行,这使得难以阅读。修改后的代码减少了空行,更容易阅读;相关的行是可以一起读取的块。感谢。
awk
就像C;它将0视为false,将任何非零视为true。因此,任何大于0的东西都是正确的,但是小于0的东西也是如此。
在标准awk
中,没有直接打印到stderr
的方法。 GNU AWK记录了print "message" > "/dev/stderr"
(名称为字符串!)的使用,并暗示它甚至可以在没有实际设备的系统上工作。在具有awk
设备的系统上,它也适用于标准/dev/stderr
。
用于处理数组中每个索引的awk
惯用法是for (i in array) { ... }
。然而,
因为你有一个索引itmIndex
,告诉你数组中有多少项,你应该使用
for (i = 0; i < itmIndex; i++) { printf("%s%s", item[i], delim); }
然后在结尾处输出换行符。这对我的思维方式来说只有一个分隔符,但这是bash
代码正在做的事情的转录。我通常的诀窍是:
pad = ""
for (i = 0; i < itmIndex; i++)
{
printf("%s%s", pad, item[i])
pad = delim
}
print "";
您可以使用-v var=value
将变量传递到脚本中(或省略-v
)。请参阅之前列出的POSIX URL。