Question

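I have several compressed XML files that contain transactions stored in <trans> XML tags. Some of these transactions (not all!) contain a <customer> XML tag, which in turn contains a customer number tag (<customer_custnumber>). I need to anonymize this customer number in the XML files so that the files can be used for development purposes. The XML is structured like this: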

<transactions>
  <trans>
  ...
    <customer>
    ...
      <customer_custnumber>123456789</customer_custnumber>
    ...
    </customer>
    ...
  </trans>
</transactions>

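Because of my downstream processes, I need to preserve the maximum length of the customer number when hashing. For this I wrote a tool in Java that hashes a customer number to a unique hash within a specific numeric range.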

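My first approach was to read all customer numbers in the XML file and call the hash tool for each occurrence. The problem with this was the runtime: because I call the Java tool about 5000 times per file, each file takes 5-6 minutes (and I have more than 40 files per day).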

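My second approach is to use zgrep and awk to extract all customer numbers in the order they appear in the XML file, write them to a text file, and run my Java tool to hash each line of that file. This is much faster: 5000 numbers take only a few seconds. But now my problem is replacing the original customer numbers with the hashed values from the text file. I know they are in order, so the first occurrence in the XML file corresponds to the first hash in the text file, and so on. But how do I do the replacement?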

This is my code so far:

#!/bin/bash

tempFile=cardNumber_tmp.txt
hashedTempFile=hashed_cardNumber_tmp.txt

for file in ${DIR_SRC}/input.xml.zip ; do
    declare listOfIds
    listOfIds=$(zgrep "<customer_custnumber>" "$file" | awk -F">" '{print $2}' | awk -F"<" '{print $1}')
    # $listOfIds contains all IDs separated by whitespace
    # use tr to replace whitespace with newlines
    echo $listOfIds | tr " " "\n" > "${DIR_TEMP}/${tempFile}"
    # call HashCustNumber.jar for tempFile with type customer
    java -jar HashCustNumber.jar "${DIR_TEMP}/${tempFile}" "customer"
    # HashCustNumber.jar writes result into $hashedTempFile
    declare -a arr
    readarray -t arr < "${DIR_TEMP}/${hashedTempFile}"
    # Array arr contains Hashes without newline

    # ??

done

# delete tempFiles
rm ${DIR_TEMP}/${tempFile} 
rm ${DIR_TEMP}/${hashedTempFile}

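I have also read that I should not use sed or awk to extract data from XML files, but I cannot use xmlstarlet because it is not installed on my company's server. Any ideas on how to replace the values with the hashed values in a way that does not involve thousands of calls to the hash program?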

Answer 1

Since the hash function is written in Java, it would be more efficient to do the whole process in Java, using the zip API and JAXP/SAX.

Otherwise, the expensive part is starting a JVM for each ID. Also, zgrep is just a script that uses gzip and grep, and should not be used with .zip files.
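As a rough sketch of the zgrep point, assuming the archives really are .zip files holding a single XML member and that `unzip` is installed on the server, the numbers could be streamed out with `unzip -p` rather than `zgrep`. The function name `extract_custnumbers` and the inline sample data are made up for illustration, and the regex approach only holds as long as the tag content never contains a `<`.

```shell
#!/bin/bash
# Hypothetical helper: print every customer number found on stdin,
# one per line, in document order.
extract_custnumbers() {
    grep -o '<customer_custnumber>[^<]*' | sed 's/^<customer_custnumber>//'
}

# Intended use with a real archive (paths taken from the question's script):
#   unzip -p "${DIR_SRC}/input.xml.zip" | extract_custnumbers > "${DIR_TEMP}/cardNumber_tmp.txt"

# Small demonstration on inline XML, so the sketch runs without an archive.
sample='<trans><customer><customer_custnumber>123456789</customer_custnumber></customer></trans>
<trans><customer><customer_custnumber>987654321</customer_custnumber></customer></trans>'
printf '%s\n' "$sample" | extract_custnumbers
```

Because `grep -o` prints each match on its own line, this also copes with several customer numbers on one line, which the two-stage awk pipeline in the question would not.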

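First, unzip into a working directory, and re-zip at the end.
Then, assuming the list of IDs (the hashed values, one per line) is in the file "$num_file", the unzipped file to modify is "$xml_file", and the target file is "$new_xml_file", the replacement can be done with the following perl command: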

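perl -pe '
    BEGIN {
        # first argument: file with one replacement value per line
        $num_file = shift @ARGV;
        {
            local @ARGV = $num_file;
            @ids = map { chomp; $_ } <>;
        }
    }
    # replace the content of each tag with the next value from the list
    s/<customer_custnumber>\K[^<]*/shift @ids/e
' "$num_file" "$xml_file" > "$new_xml_file"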

Note that, given the awk command used earlier to extract the numbers, this is safe because the perl substitution matches the same expression.

How to replace the value of a specific XML tag with a hashed value?
