Split a line into multiple lines by splitting a specific field

Time: 2013-08-25 11:17:32

Tags: bash sed awk split

I have multiple lines such as:

"390";"902";"from 4670000 to 4679999, from 4680000 to 4689999, from 9960000 to 9969999";"something1";"something2";"20.09.04"
"390";"903";"from 0770000 to 0779999, from 9170000 to 9179999";"something3";"something4";"09.09.04"

What I need is:

"390";"902";"from 4670000 to 4679999";"something1";"something2";"20.09.04"
"390";"902";"from 4680000 to 4689999";"something1";"something2";"20.09.04"
"390";"902";"from 9960000 to 9969999";"something1";"something2";"20.09.04"
"390";"903";"from 0770000 to 0779999";"something3";"something4";"09.09.04"
"390";"903";"from 9170000 to 9179999";"something3";"something4";"09.09.04"

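As you can see, I need to split field 3 into its individual from/to ranges (note that there is sometimes a space between the items inside the "...").

Ideally, I would like to produce this result: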

"390";"902";"4670000";"4679999";"something1";"something2";"20.09.04"
"390";"902";"4680000";"4689999";"something1";"something2";"20.09.04"
"390";"902";"9960000";"9969999";"something1";"something2";"20.09.04"
"390";"903";"0770000";"0779999";"something3";"something4";"09.09.04"
"390";"903";"9170000";"9179999";"something3";"something4";"09.09.04"

I have found that I can do the split with awk, but I am not sure how to replicate the rest of the line:

awk -F\, '{                       
  for (i = 0; ++i <= NF;)
    print i, $i
  }' <<<'from 4670000 to 4679999, from 4680000 to 4689999, from 9960000 to 9969999'
1 from 4670000 to 4679999
2  from 4680000 to 4689999
3  from 9960000 to 9969999

Sorry, this is my first question here; please feel free to point out how I should correct it so that it can be answered fully.

Thanks!

7 answers:

Answer 0 (score: 4)

Input:

"390";"902";"from 4670000 to 4679999, from 4680000 to 4689999, from 9960000 to 9969999";"something1";"something2";"20.09.04"
"390";"903";"from 0770000 to 0779999, from 9170000 to 9179999";"something3";"something4";"09.09.04"

This code

#!/usr/bin/awk -f

BEGIN {
    FS = ";"
}

{
    t = $3
    gsub(/"/, "", t)
    n = split(t, a, /, /)
    for (i = 1; i <= n; ++i) {
        print $1 ";" $2 ";\"" a[i] "\";" $4 ";" $5 ";" $6
    }
}

gives

"390";"902";"from 4670000 to 4679999";"something1";"something2";"20.09.04"
"390";"902";"from 4680000 to 4689999";"something1";"something2";"20.09.04"
"390";"902";"from 9960000 to 9969999";"something1";"something2";"20.09.04"
"390";"903";"from 0770000 to 0779999";"something3";"something4";"09.09.04"
"390";"903";"from 9170000 to 9179999";"something3";"something4";"09.09.04"

Condensed form (I don't think it can really be called a true "one-liner"):

awk -F ";" -- '{ t = $3; gsub(/"/, "", t); n = split(t, a, /, /); for (i = 1; i <= n; ++i) print $1 ";" $2 ";\"" a[i] "\";" $4 ";" $5 ";" $6 }'
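
For readers who find awk hard to follow, the same idea can be sketched in Python. This is only an illustration, not part of the answer: it assumes no field contains a literal `;` and that ranges are always separated by `, ` exactly as in the sample input. The function name `split_ranges` is made up for this sketch.

```python
def split_ranges(line):
    """Return one output line per range found in the third field."""
    # Assumption: ';' never occurs inside a field (same as FS = ";" in awk).
    fields = line.rstrip("\n").split(";")
    # Drop the quotes around field 3, then split it on ", " (gsub + split in awk).
    ranges = fields[2].replace('"', "").split(", ")
    # Re-emit the untouched fields around each re-quoted range.
    return [";".join(fields[:2] + ['"%s"' % r] + fields[3:]) for r in ranges]


if __name__ == "__main__":
    import sys
    for raw in sys.stdin:
        for out in split_ranges(raw):
            print(out)
```

Saved under a name of your choice, it would read the input on standard input just like the awk script does.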

This code

#!/usr/bin/awk -f

BEGIN {
    FS = ";"
}

{
    t = $3
    gsub(/"|from /, "", t)
    n = split(t, a, /, | to /)
    for (i = 1; i <= n; i += 2) {
        print $1 ";" $2 ";\"" a[i] "\";\"" a[i + 1] "\";"$4 ";" $5 ";" $6
    }
}

gives

"390";"902";"4670000";"4679999";"something1";"something2";"20.09.04"
"390";"902";"4680000";"4689999";"something1";"something2";"20.09.04"
"390";"902";"9960000";"9969999";"something1";"something2";"20.09.04"
"390";"903";"0770000";"0779999";"something3";"something4";"09.09.04"
"390";"903";"9170000";"9179999";"something3";"something4";"09.09.04"

Condensed form:

awk -F ";" -- '{ t = $3; gsub(/"|from /, "", t); n = split(t, a, /, | to /); for (i = 1; i <= n; i += 2) print $1 ";" $2 ";\"" a[i] "\";\"" a[i + 1] "\";"$4 ";" $5 ";" $6; }'

The scripts were tested with gawk, nawk and mawk.
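
The second script relies on one trick: after removing the quotes and every `from `, splitting on either `, ` or ` to ` yields a flat list in which the bounds alternate (lower, upper, lower, upper, ...). A rough Python equivalent, with the hypothetical name `split_bounds` and the same assumptions about the input format as the awk version:

```python
import re


def split_bounds(line):
    """Return one output line per range, with lower and upper bound as separate fields."""
    fields = line.rstrip("\n").split(";")
    # Remove quotes and every "from " so only numbers and separators remain.
    t = re.sub(r'"|from ', "", fields[2])
    # Splitting on ", " or " to " gives: lower, upper, lower, upper, ...
    parts = re.split(r", | to ", t)
    out = []
    # Walk the list two items at a time; unlike the awk loop, a dangling
    # odd item at the end is skipped instead of printed with an empty bound.
    for i in range(0, len(parts) - 1, 2):
        pair = ['"%s"' % parts[i], '"%s"' % parts[i + 1]]
        out.append(";".join(fields[:2] + pair + fields[3:]))
    return out
```

Stepping by two is what pairs each lower bound with its upper bound; it only works if every range really has the form `from X to Y`.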

Answer 1 (score: 3)

An awk one-liner:

awk -F'";"' -v OFS='";"' '{n=split($3,a,/,\s*/);for(i=1;i<=n;i++){$3=a[i];print}}' file

Output:

kent$  cat f
"390";"902";"from 4670000 to 4679999, from 4680000 to 4689999, from 9960000 to 9969999";"something1";"something2";"20.09.04"
"390";"903";"from 0770000 to 0779999, from 9170000 to 9179999";"something3";"something4";"09.09.04"

kent$  awk -F'";"' -v OFS='";"' '{n=split($3,a,/,\s*/);for(i=1;i<=n;i++){$3=a[i];print}}' f
"390";"902";"from 4670000 to 4679999";"something1";"something2";"20.09.04"
"390";"902";"from 4680000 to 4689999";"something1";"something2";"20.09.04"
"390";"902";"from 9960000 to 9969999";"something1";"something2";"20.09.04"
"390";"903";"from 0770000 to 0779999";"something3";"something4";"09.09.04"
"390";"903";"from 9170000 to 9179999";"something3";"something4";"09.09.04"

Edit

If you also want to parse the from...to part, it is still an awk one-liner:

awk -F'";"' -v OFS='";"' '{n=split($3,a,/,\s*/);for(i=1;i<=n;i++)
{$3=a[i];sub(/\s*to\s*/,"\";\"",$3);sub(/\s*from\s*/,"",$3);print}}' file

Tested with the same input file:

kent$  awk -F'";"' -v OFS='";"' '{n=split($3,a,/,\s*/);for(i=1;i<=n;i++){$3=a[i];sub(/\s*to\s*/,"\";\"",$3);sub(/\s*from\s*/,"",$3);print}}' f                              
"390";"902";"4670000";"4679999";"something1";"something2";"20.09.04"
"390";"902";"4680000";"4689999";"something1";"something2";"20.09.04"
"390";"902";"9960000";"9969999";"something1";"something2";"20.09.04"
"390";"903";"0770000";"0779999";"something3";"something4";"09.09.04"
"390";"903";"9170000";"9179999";"something3";"something4";"09.09.04"
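
What makes this one-liner so short is the choice of `";"` as both input and output separator: the inner quotes are consumed by the split and restored by the join, so field 3 can be replaced without any re-quoting. Note that `\s` in the awk regexes is, as far as I know, a GNU awk extension. Below is a small Python sketch of the same trick; the names `expand` and `parse_bounds` are invented for the illustration.

```python
import re

SEP = '";"'  # same role as -F'";"' and OFS='";"' in the awk command


def expand(line, parse_bounds=False):
    """Split field 3 on commas; optionally turn 'from X to Y' into X";"Y."""
    # The first field keeps its leading quote and the last its trailing quote.
    fields = line.rstrip("\n").split(SEP)
    out = []
    for rng in re.split(r",\s*", fields[2]):
        if parse_bounds:
            # Replace the first " to " with the separator, then drop "from ".
            rng = re.sub(r"\s*to\s*", SEP, rng, count=1)
            rng = re.sub(r"\s*from\s*", "", rng, count=1)
        out.append(SEP.join(fields[:2] + [rng] + fields[3:]))
    return out
```

As in the awk version, the pattern for `to` is not anchored to word boundaries, so it assumes the range text contains no other `to`.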

Answer 2 (score: 2)

$ cat tst.awk
BEGIN{ FS=OFS="\";\"" }
{
    gsub(/from /,"",$3)
    split($3,a,/ *, */)
    for (i=1;i in a;i++) {
        $3 = a[i]
        sub(/ to /,OFS,$3)
        print
    }
}
$
$ awk -f tst.awk file
"390";"902";"4670000";"4679999";"something1";"something2";"20.09.04"
"390";"902";"4680000";"4689999";"something1";"something2";"20.09.04"
"390";"902";"9960000";"9969999";"something1";"something2";"20.09.04"
"390";"903";"0770000";"0779999";"something3";"something4";"09.09.04"
"390";"903";"9170000";"9179999";"something3";"something4";"09.09.04"

Answer 3 (score: 2)

This might work for you (GNU sed):

sed -r 's/, /","/g;s/^(([^;]*;){2})([^,]*),([^;]*)(.*)/\1\3\5\n\1\4\5/;P;D' file
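
As I read it, the command peels one range off field 3 per cycle: the second substitution rewrites the pattern space into two lines (the first range with the surrounding fields, then the remaining ranges with the same fields), `P` prints the first of them and `D` deletes it and restarts the cycle on what is left. A rough Python analogue of that loop follows; it is a sketch of my reading, not a translation the answer provides, and `peel_ranges` is a made-up name.

```python
import re

# Two leading fields, the first range, a comma, the remaining ranges, the tail.
PEEL = re.compile(r'^((?:[^;]*;){2})([^,]*),([^;]*)(.*)$')


def peel_ranges(line):
    """Emit one line per range by repeatedly peeling off the first one."""
    out = []
    # s/, /","/g : give every range its own pair of quotes.
    pending = line.rstrip("\n").replace(", ", '","')
    while True:
        m = PEEL.match(pending)
        if not m:
            # No comma left in field 3: print what remains and stop.
            out.append(pending)
            return out
        head, first, rest, tail = m.groups()
        out.append(head + first + tail)  # what P prints
        pending = head + rest + tail     # what D leaves for the next cycle
```

Like the sed command, this assumes that commas occur only inside field 3.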

Answer 4 (score: 1)

#!/bin/bash

filename='file.txt'
temp=$(mktemp)

sed 's/, */";"/g' "$filename" > "$temp" # replace commas with ;

echo -n > "$filename" # clear our file
while IFS= read -r line; do
    IFS=';' read -a fields <<< "$line" # make an array out of the string

    for ((i=2; i<${#fields[@]}-3; i++)); do
        from=$(echo "${fields[$i]}" | cut -d ' ' -f2)
        to=$(echo "${fields[$i]}" | cut -d ' ' -f4)
        echo "${fields[0]};${fields[1]};\"$from\";\"$to;${fields[-3]};${fields[-2]};${fields[-1]}" >> "$filename"
    done
done < "$temp"

rm "$temp"

exit 0

It also handles optional spaces after the commas.
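
The approach here differs from the awk answers: every comma is first turned into `";"`, so each range becomes an ordinary field, and everything between the first two and the last three fields is then treated as a range. A minimal Python sketch of that logic is shown below, assuming commas appear only in the range field. The name `expand_line` is hypothetical, and unlike the script it returns the lines instead of overwriting `file.txt`.

```python
import re


def expand_line(line):
    """Turn each range into its own field, then emit one line per range."""
    # Same effect as: sed 's/, */";"/g'
    fields = re.sub(r", *", '";"', line.rstrip("\n")).split(";")
    # Two leading fields, any number of ranges, three trailing fields.
    head, ranges, tail = fields[:2], fields[2:-3], fields[-3:]
    out = []
    for rng in ranges:
        # "from X to Y" -> words 1 and 3 are the bounds (cut -f2 and -f4).
        words = rng.strip('"').split(" ")
        bounds = ['"%s"' % words[1], '"%s"' % words[3]]
        out.append(";".join(head + bounds + tail))
    return out
```

Counting the trailing fields from the end is what lets a line carry any number of ranges.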

Answer 5 (score: 1)

Here is another way to do this in Bash:

#!/bin/bash

shopt -s extglob

IFS=';'

while read -a FIELDS; do
    TEMP=${FIELDS[2]//\"}
    read -a RANGES <<< "${TEMP//,?( )/;}"
    for A in "${RANGES[@]}"; do
        echo "${FIELDS[0]};${FIELDS[1]};\"$A\";${FIELDS[*]:3}"
    done
done

Run it with:

bash script.sh < file

This gives the first expected output.

Or

#!/bin/bash

shopt -s extglob

IFS=';'

while read -a FIELDS; do
    TEMP=${FIELDS[2]//@(\"|from )}
    read -a RANGES <<< "${TEMP//@(,?( )| to )/;}"
    for (( I = 0; I < ${#RANGES[@]}; I += 2 )); do
        echo "${FIELDS[0]};${FIELDS[1]};\"${RANGES[I]}\";\"${RANGES[I + 1]}\";${FIELDS[*]:3}"
    done
done

which gives the second expected output.

Answer 6 (score: 0)

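Here is one way using Python. I know you didn't tag it, but it seems easier to me to process a csv file with a good parser. It splits the third field (row[2]) on commas, then splits each resulting string on whitespace and extracts the odd-indexed words, i.e. the two numbers (v.split()[1::2]).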
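
To make the two expressions concrete, here is a tiny stand-alone illustration that works on an already parsed row. The helper name `expand_row` and the shortened sample row are made up for this sketch; they are not part of the script.

```python
def expand_row(row):
    """Return one new row per range found in row[2]."""
    result = []
    for v in row[2].split(", "):
        # "from X to Y".split() -> ["from", "X", "to", "Y"]; [1::2] keeps X and Y.
        bounds = v.split()[1::2]
        newrow = list(row)      # copy, so the original row is left untouched
        newrow[2:3] = bounds    # slice assignment: one element replaced by two
        result.append(newrow)
    return result


if __name__ == "__main__":
    sample = ["390", "903", "from 0770000 to 0779999, from 9170000 to 9179999", "something3"]
    for r in expand_row(sample):
        print(r)
```

The slice assignment is why each output row ends up one field longer than the input row.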

Contents of script.py:

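#!/usr/bin/env python3

import csv
import sys

# newline='' lets the csv module handle line endings itself
with open(sys.argv[1], newline='') as f:
    csvfile = csv.reader(f, delimiter=';')
    csvout = csv.writer(sys.stdout, delimiter=';', quoting=csv.QUOTE_ALL)
    for row in csvfile:
        # split the third field into its individual "from X to Y" ranges
        for v in row[2].split(', '):
            newrow = list(row)  # a shallow copy is enough for a list of strings
            # words 1 and 3 of "from X to Y" are the two bounds
            newrow[2:3] = v.split()[1::2]
            csvout.writerow(newrow)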

Run it like this:

python3 script.py infile

Output:

"390";"902";"4670000";"4679999";"something1";"something2";"20.09.04"
"390";"902";"4680000";"4689999";"something1";"something2";"20.09.04"
"390";"902";"9960000";"9969999";"something1";"something2";"20.09.04"
"390";"903";"0770000";"0779999";"something3";"something4";"09.09.04"
"390";"903";"9170000";"9179999";"something3";"something4";"09.09.04"