Question

I have a text file containing several lines in the following format:

name,list_of_subjects,list_of_sports,school
Eg1: john,science\,social,football,florence_school
Eg2: james,painting,tennis\,ping_pong\,chess,highmount_school

I need to parse the text file and print the fields while ignoring the escaped commas. The fields I need are field 2 or 3, like this:

science, social
tennis, ping_pong, chess

I don't know how to ignore the escaped characters. How can I do this in the terminal with awk or sed?

Answer 1

Replace \, with a character that your records do not normally contain (for example \n), and restore it before printing. For example:

$ awk -F',' 'NR>1{ if(gsub(/\\,/,"\n")) gsub(/\n/,",",$2); print $2 }' file
science,social
painting

Because the first gsub is performed on the whole record (i.e. $0), awk is forced to recompute the fields. The second gsub is performed only on the second field (i.e. $2), so it does not affect the other fields. See: Changing Fields.

To extract several fields with their escaped commas restored correctly, run the second gsub on every field in a for loop instead of on $2 only.
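A minimal sketch of such a loop, assuming the header line and field layout from the question. The helper name unescape_fields and the decision to print fields 2 and 3 are illustrative assumptions, not part of the original answer:

```shell
# Hypothetical helper: reads records on stdin and prints fields 2 and 3
# of every data row, with each escaped comma (\,) restored to a plain comma.
unescape_fields() {
    awk -F',' 'NR>1 {
        # Hide the escaped commas; modifying $0 makes awk re-split the fields.
        if (gsub(/\\,/, "\n")) {
            # Restore the commas field by field, after the split has happened.
            for (i = 1; i <= NF; i++) gsub(/\n/, ",", $i)
        }
        print $2
        print $3
    }'
}

printf '%s\n' \
    'name,list_of_subjects,list_of_sports,school' \
    'john,science\,social,football,florence_school' \
    'james,painting,tennis\,ping_pong\,chess,highmount_school' |
unescape_fields
# prints:
# science,social
# football
# painting
# tennis,ping_pong,chess
```

Restoring the commas only after awk has re-split the record is what keeps an escaped comma from ever acting as a field separator.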

See also: What's the most robust way to efficiently parse CSV using awk?

Answer 2

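You can replace the \, sequence with another character that does not occur in your text, split the text on the remaining commas, and then replace the chosen character with commas.

In this case I use the ASCII control character "Unit Separator" (decimal 31, \x1F), which I'm pretty sure your input won't contain.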
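The replace, split, restore steps can be sketched with sed, cut and tr. This is only an illustration, assuming the header line from the question and that the Unit Separator never occurs in the input; field_without_escapes and its optional field-number argument are hypothetical names:

```shell
# Hypothetical helper: prints one field (default: 2) of every data row,
# with escaped commas (\,) turned back into plain commas.
field_without_escapes() {
    us=$(printf '\037')        # ASCII Unit Separator, assumed absent from the input
    tail -n +2 |               # skip the header line
    sed "s/\\\\,/$us/g" |      # 1. hide every escaped comma behind the placeholder
    cut -d, -f"${1:-2}" |      # 2. split on the remaining (real) commas
    tr "$us" ','               # 3. turn the placeholder back into a comma
}

printf '%s\n' \
    'name,list_of_subjects,list_of_sports,school' \
    'john,science\,social,football,florence_school' \
    'james,painting,tennis\,ping_pong\,chess,highmount_school' |
field_without_escapes 3
# prints:
# football
# tennis,ping_pong,chess
```

Passing the placeholder through a shell variable avoids relying on \x escapes inside sed, which not every sed implementation supports.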

You can try it here.

Answer 3

Why awk and sed when bash with coreutils is enough:

# Sorry my cat. Using `cat` as input pipe
cat <<EOF |
name,list_of_subjects,list_of_sports,school
Eg1: john,science\,social,football,florence_school
Eg2: james,painting,tennis\,ping_pong\,chess,highmount_school
EOF
# remove first line!
tail -n+2 |
# substitute `\,` by an unreadable character:
sed 's/\\\,/\xff/g' |
# read the comma separated list
while IFS=, read -r name list_of_subjects list_of_sports school; do
     # read the \xff separated list into an array
     IFS=$'\xff' read -r -d '' -a list_of_subjects < <(printf "%s" "$list_of_subjects")
     # read the \xff separated list into an array
     IFS=$'\xff' read -r -d '' -a list_of_sports < <(printf "%s" "$list_of_sports")

     echo "list_of_subjects : ${list_of_subjects[@]}"
     echo "list_of_sports   : ${list_of_sports[@]}"
done

This will output:

list_of_subjects : science social
list_of_sports   : football
list_of_subjects : painting
list_of_sports   : tennis ping_pong chess

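Note that this will probably be slower than the awk solutions.

Note also that the principle is the same as in the other answers: replace the \, string with some other unique character, then use that character to iterate over the elements of the second and third fields.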

Answer 4

You might be able to join the columns with a function.

function joincol(col,    i) {
    $col=$col FS $(col+1)
    for (i=col+1; i<NF; i++) {
        $i=$(i+1)
    }
    NF--
}

It might be used like this:

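{
    for (col=1; col<=NF; col++) {
        # keep joining while the field still ends with a backslash,
        # so that a run such as a\,b\,c collapses into one field
        while ($col ~ /\\$/) {
            joincol(col)
        }
    }
}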

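Note that decrementing NF is undefined behavior in POSIX. An implementation may delete the last field or leave it in place and still be POSIX-compliant. This works for me in BSD awk and Gawk. YMMV. May contain nuts.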
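If decrementing NF is a concern, the same merging can be sketched without touching NF by collecting the joined pieces in an array. This is an alternative illustration rather than the joincol method itself; merged_field, its field-number argument and the array name parts are assumptions. Unlike joincol, it also drops the escaping backslash:

```shell
# Hypothetical helper: prints the requested field (default: 2) of every data
# row after merging pieces that were split at an escaped comma. NF is never modified.
merged_field() {
    awk -F',' -v want="${1:-2}" 'NR>1 {
        n = 0; cur = ""; pending = 0
        for (i = 1; i <= NF; i++) {
            # append to the unfinished field, or start a new one
            cur = (pending ? cur FS $i : $i)
            if (cur ~ /\\$/) {
                # trailing backslash: drop it and wait for the next piece
                cur = substr(cur, 1, length(cur) - 1)
                pending = 1
            } else {
                parts[++n] = cur
                pending = 0
            }
        }
        if (pending) parts[++n] = cur   # the record itself ended with a backslash
        print (want <= n ? parts[want] : "")
    }'
}

printf '%s\n' \
    'name,list_of_subjects,list_of_sports,school' \
    'james,painting,tennis\,ping_pong\,chess,highmount_school' |
merged_field 3
# prints:
# tennis,ping_pong,chess
```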

Answer 5

Using gawk's FPAT:

awk -v FPAT='(\\\\.|[^,\\\\]*)+' '{print $3}' file
#list_of_sports
#football
#tennis\,ping_pong\,chess

Then use gensub to remove the backslashes:

awk -v FPAT='(\\\\.|[^,\\\\]*)+' '{print gensub("\\\\", "", "g", $3)}' file
#list_of_sports
#football
#tennis,ping_pong,chess

Answer 6

Using Perl. Change \, to a control character such as \x01, then replace it with , again.

$ cat laxman.txt
john,science\,social,football,florence_school
james,painting,tennis\,ping_pong\,chess,highmount_school
$ perl -ne ' s/\\,/\x01/g and print ' laxman.txt  | perl -F, -lane ' for(@F) { if( /\x01/ ) { s/\x01/,/g ; print } } '
science,social
tennis,ping_pong,chess

Answer 7

This might work for you (GNU sed):

sed -E 's/\\,/\n/g;y/,\n/\n,/;s/^[^,]*$//Mg;s/\n//g;/^$/d' file

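Replace the escaped commas with newlines, then swap newlines and commas. Empty every line in the pattern space that contains no comma, remove the remaining newlines, and delete the result if it is empty.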

Ignore a comma after a backslash in a line of a text file using awk or sed
