Question

My file contains data in this format:

<Tag1>content  
<Tag2>optional tag content  
<Tag3>content

<Tag1>other content  
<Tag3>other content

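Each block of tags represents the data needed to populate one object. Some of the tags are optional.

Currently I'm processing the data file with this code: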

#!/bin/bash
tag1=""
tag2=""
tag3=""

while read line; do

 if  [[ $line == '<Tag1>'* ]]
  then
   tag1=`echo $line | cut -c 6- | tr -d '\r'`
 elif  [[ $line == '<Tag2>'* ]]
  then
   tag2=`echo $line | cut -c 6- | tr -d '\r'`
 elif  [[ $line == '<Tag3>'* ]]
  then
   tag3=`echo $line | cut -c 6- | tr -d '\r'`
   #write new object to output file and reset tag variables
 fi

done <file.dat

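where cut gets the data after the tag and tr removes any carriage return after the data.

This code is very slow, especially when you have hundreds of files to process, each with thousands of lines.

Is there a faster way to do this, and to handle the optional tag (just passing "" if it is missing), with something like awk?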

EDIT:

I'm using this to populate a SQL table, so I use the output to create INSERT statements:

echo "INSERT INTO MyTable VALUES('$tag1','$tag2','$tag3');" >> output.sql

SECOND EDIT

Given an input of

<Tag1>Some sample text including don't
<Tag2>http://google.com
<Tag3>$100

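the ideal output would be INSERT INTO MyTable VALUES("Some sample text including don't", "http://google.com", "$100");

Obviously, if I were to pass the values in with single quotes rather than double quotes, I'd have to double the apostrophe in "don't" so that it doesn't terminate the value early.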

Answer 1

It's not clear from your question, since you didn't show the expected output, but this might be what you're looking for:

$ cat tst.awk
BEGIN {
    RS = ""
    FS = "\n"
    fmt = "INSERT INTO MyTable VALUES(\047%s\047, \047%s\047, \047%s\047);\n"
}
{
    delete v
    for (i=1;i<=NF;i++) {
        tag = val = $i
        gsub(/^<|>.*/,"",tag)
        sub(/^[^>]+>/,"",val)
        v[tag] = val
    }
    printf fmt, v["Tag1"], v["Tag2"], v["Tag3"]
}

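Here is the kind of input file you should have asked us to test against, since it contains some traditionally problematic characters and strings: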

$ cat file
<Tag1>with 'single\' quotes
<Tag2>http://foo.com
<Tag3>trailing backslash\

<Tag1>With <some> "double\" quotes
<Tag3>with \1 backrefs & here

Here is the output the above script produces given that input:

$ awk -f tst.awk file
INSERT INTO MyTable VALUES('with 'single\' quotes', 'http://foo.com', 'trailing backslash\');
INSERT INTO MyTable VALUES('With <some> "double\" quotes', '', 'with \1 backrefs & here');

If any of that isn't what you want, then edit your question to show that input (or similar) and the output you want for it.
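The question mentions doubling the apostrophe when the values go into single-quoted SQL literals. A minimal sketch of how that could be added to the script above, wrapped in a shell function so it can be run directly. The name gen_inserts is made up for illustration, and the sketch assumes that only apostrophes need escaping (backslashes are passed through untouched, which may not be right for every database):

```shell
#!/bin/bash
# Sketch only: the same paragraph-mode approach as tst.awk, plus apostrophe doubling.
# gen_inserts is a hypothetical helper name, not part of the original answer.
gen_inserts() {
    awk '
    BEGIN {
        RS = ""          # paragraph mode: blank lines separate records
        FS = "\n"        # each line of a block becomes one field
        q  = "\047"      # a single quote, written in octal to survive shell quoting
        fmt = "INSERT INTO MyTable VALUES(\047%s\047, \047%s\047, \047%s\047);\n"
    }
    {
        delete v
        for (i = 1; i <= NF; i++) {
            tag = val = $i
            gsub(/^<|>.*/, "", tag)     # keep only the tag name
            sub(/^[^>]+>/, "", val)     # keep only the text after the tag
            gsub(q, q q, val)           # double every single quote in the value
            v[tag] = val
        }
        # a missing optional tag prints as an empty string
        printf fmt, v["Tag1"], v["Tag2"], v["Tag3"]
    }' "$@"
}
```

Called as `gen_inserts file.dat >> output.sql`, this should turn the sample from the second edit into `INSERT INTO MyTable VALUES('Some sample text including don''t', 'http://google.com', '$100');`.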

Answer 2

An awk solution would probably be faster, but this Bash solution should be faster than the original code:

#!/bin/bash

regex="^<Tag([1-3])>(.*)$"

while IFS= read -r line
do
    if [[ $line =~ $regex ]]
    then
        case ${BASH_REMATCH[1]} in
            1) tag1=${BASH_REMATCH[2]} ;;
            2) tag2=${BASH_REMATCH[2]} ;;
            3) echo "INSERT INTO MyTable VALUES('$tag1','$tag2','${BASH_REMATCH[2]}');" >> output.sql
               tag1= ; tag2= ;;
        esac
    fi
done <file.dat

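Note that all lines are matched against the same regular expression, and the 1/2/3 is handled by the case statement. Obviously, the above is very sensitive to whitespace inside the tags and to upper/lower case, so if you need it to tolerate variations, consider your actual data and make any necessary adjustments to the regex.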
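As an illustration of that kind of adjustment, here is a rough sketch of a more forgiving matcher. It assumes the variations are limited to letter case, whitespace around the tag name, and a trailing carriage return (which the original script removed with tr). The helper name parse_line and the `number|content` output format are invented for the example:

```shell
#!/bin/bash
# Sketch only: a more tolerant version of the regex match.
# parse_line is a hypothetical helper name, not part of the original answer.
shopt -s nocasematch   # make [[ =~ ]] ignore case, so <tag1> and <TAG1> match too

# Optional whitespace is allowed before the tag and just inside the angle brackets.
regex='^[[:space:]]*<[[:space:]]*Tag([1-3])[[:space:]]*>(.*)$'

parse_line() {
    local line=${1%$'\r'}     # drop one trailing carriage return, if present
    if [[ $line =~ $regex ]]; then
        # print the tag number and its content, separated by a pipe
        printf '%s|%s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}"
    else
        return 1              # not a tag line
    fi
}
```

Inside the read loop, the same regex and the carriage-return trim can be used directly in place of the stricter pattern; the function form is only there to make the matching easy to try on individual lines.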

Speed up this bash script
