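How to split a concatenated XML file and name the output files from a string

Date: 2017-06-04 07:03:56

Tags: xml bash awk sed text-processing

How can I split a large concatenated XML file into individual XML files, naming each file after a string found inside it?

input.xml: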

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE type-of-doc SYSTEM "file.dtd" [ ]>
<type-of-doc lang="EN" dtd-version="v1" file="1001-20170101.XML" date="20170101">
</type-of-doc>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE type-of-doc SYSTEM "file.dtd" [ ]>
<type-of-doc lang="EN" dtd-version="v1" file="1002-20170101.XML" date="20170101">
</type-of-doc>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE type-of-doc SYSTEM "file.dtd" [ ]>
<type-of-doc lang="EN" dtd-version="v1" file="1008-20170101.XML" date="20170101">
</type-of-doc>

I want to read the string file="xxxx-yyyyyyyy.XML" and create an output file named xxxx.xml

Output XML files:

1001.xml

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE type-of-doc SYSTEM "file.dtd" [ ]>
<type-of-doc lang="EN" dtd-version="v1" file="1001-20170101.XML" date="20170101">
</type-of-doc>

1002.xml

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE type-of-doc SYSTEM "file.dtd" [ ]>
<type-of-doc lang="EN" dtd-version="v1" file="1002-20170101.XML" date="20170101">
</type-of-doc>

1008.xml

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE type-of-doc SYSTEM "file.dtd" [ ]>
<type-of-doc lang="EN" dtd-version="v1" file="1008-20170101.XML" date="20170101">
</type-of-doc>

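My preference is to use bash shell tools such as cat, awk, and sed, plus XML tools such as xmllint or similar, and to log stdout and stderr to a log file.

I would appreciate an explanation of the approach and a testable solution.

1 answer:

Answer 0 (score: 1)

Consider the following gawk approach (assuming your input is structured line by line, as in the question):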

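gawk '
/^<\?xml version/ {                     # XML declaration: a new document starts
    getline dt; getline typedoc         # next two lines: DOCTYPE and root tag
    if (match(typedoc, /file="([0-9]+)-[^"]+\.XML"/, a)) {
        if (fn != "") close(fn)         # release the previous output file
        fn = a[1] ".xml"; print $0 ORS dt ORS typedoc > fn; next
    }
}
fn != "" { print > fn }                 # all other lines go to the current file
' input.xml 2> err.log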

Result:

cat 1001.xml 
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE type-of-doc SYSTEM "file.dtd" [ ]>
<type-of-doc lang="EN" dtd-version="v1" file="1001-20170101.XML" date="20170101">
</type-of-doc>
cat 1002.xml 
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE type-of-doc SYSTEM "file.dtd" [ ]>
<type-of-doc lang="EN" dtd-version="v1" file="1002-20170101.XML" date="20170101">
</type-of-doc>
cat 1008.xml 
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE type-of-doc SYSTEM "file.dtd" [ ]>
<type-of-doc lang="EN" dtd-version="v1" file="1008-20170101.XML" date="20170101">
</type-of-doc>
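
The question also asks for both stdout and stderr in a log file, whereas the command above redirects stderr only. The following is a minimal sketch under a few assumptions: the log name split.log and the "writing ..." messages are made up for illustration, and the id is cut out with POSIX match() and substr() so that the script does not depend on gawk. It builds a two-document sample input so it can be run as-is.

```shell
#!/bin/sh
# Sample input: the first two documents from the question.
cat > input.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE type-of-doc SYSTEM "file.dtd" [ ]>
<type-of-doc lang="EN" dtd-version="v1" file="1001-20170101.XML" date="20170101">
</type-of-doc>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE type-of-doc SYSTEM "file.dtd" [ ]>
<type-of-doc lang="EN" dtd-version="v1" file="1002-20170101.XML" date="20170101">
</type-of-doc>
EOF

# split.log is an assumed name; "> split.log 2>&1" sends stdout and stderr to it.
awk '
/^<\?xml version/ {
    decl = $0
    getline dt                        # DOCTYPE line
    getline root                      # root tag carrying the file attribute
    if (fn != "") close(fn)           # finish the previous output file
    if (match(root, /file="[0-9]+-/)) {
        # skip the 6 characters of file=" and drop the trailing hyphen
        fn = substr(root, RSTART + 6, RLENGTH - 7) ".xml"
        print "writing " fn           # progress message on stdout
        print decl ORS dt ORS root > fn
        next
    }
    print "line " NR ": no file attribute found" > "/dev/stderr"
    fn = ""                           # skip the rest of this document
    next
}
fn != "" { print > fn }               # remaining lines of the current document
' input.xml > split.log 2>&1
```

After a run, split.log holds the lines "writing 1001.xml" and "writing 1002.xml", and any awk error messages would land in the same file.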
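
How the command works:

  • the xml version pattern - runs the block on each line that holds the XML declaration
  • getline dt; - reads the next line (the <!DOCTYPE declaration) into dt
  • getline typedoc; - reads the following line (the type-of-doc tag) into typedoc
  • match(typedoc, regex, a) - tests the value of the file attribute against the regex; the three-argument form of match() is a gawk extension
  • the first capture group ([0-9]+) is assigned to the first array element, a[1], which becomes the base name of the output file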
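
To see what that capture group yields, here is a small sketch that pulls the same numeric prefix out of a single tag line, using POSIX sed instead of gawk. The variable names line and id are illustrative only.

```shell
# Illustrative only: one root tag line taken from the sample input.
line='<type-of-doc lang="EN" dtd-version="v1" file="1001-20170101.XML" date="20170101">'

# \([0-9][0-9]*\) plays the role of ([0-9]+) in the gawk regex;
# -n plus the p flag prints only when the substitution succeeds.
id=$(printf '%s\n' "$line" | sed -n 's/.*file="\([0-9][0-9]*\)-[^"]*\.XML".*/\1/p')

echo "${id}.xml"   # prints 1001.xml
```

If the line has no matching file attribute, sed prints nothing and id stays empty, which mirrors match() returning 0 in the gawk program.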