How can I split a large concatenated XML file into individual XML files, each named after a string inside it?
Contents of input.xml:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE type-of-doc SYSTEM "file.dtd" [ ]>
<type-of-doc lang="EN" dtd-version="v1" file="1001-20170101.XML" date="20170101">
</type-of-doc>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE type-of-doc SYSTEM "file.dtd" [ ]>
<type-of-doc lang="EN" dtd-version="v1" file="1002-20170101.XML" date="20170101">
</type-of-doc>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE type-of-doc SYSTEM "file.dtd" [ ]>
<type-of-doc lang="EN" dtd-version="v1" file="1008-20170101.XML" date="20170101">
</type-of-doc>
I want to read the string file="xxxx-yyyyyyyy.XML" from each document
and create an output file named xxxx.xml.
Expected output XML files:
1001.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE type-of-doc SYSTEM "file.dtd" [ ]>
<type-of-doc lang="EN" dtd-version="v1" file="1001-20170101.XML" date="20170101">
</type-of-doc>
1002.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE type-of-doc SYSTEM "file.dtd" [ ]>
<type-of-doc lang="EN" dtd-version="v1" file="1002-20170101.XML" date="20170101">
</type-of-doc>
1008.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE type-of-doc SYSTEM "file.dtd" [ ]>
<type-of-doc lang="EN" dtd-version="v1" file="1008-20170101.XML" date="20170101">
</type-of-doc>
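My preference is to use bash shell tools such as cat, awk, and sed, plus XML tools such as xmllint or similar, and to log stdout and stderr to a log file.
An explanation of the approach and a testable solution would be appreciated.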
Answer 0 (score: 1)
Consider the following gawk approach (it assumes your input is laid out line by line exactly as in the question):
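gawk '/<\?xml version/{ getline dt; getline typedoc;   # read the DOCTYPE and root-tag lines
    if (match(typedoc,/file="([0-9]+)-[^"]+\.XML"/,a)) {
        fn=a[1]".xml"; print $0 ORS dt ORS typedoc > fn; next;   # start the new output file
    }}{ print > fn }   # every other line goes to the current file
' input.xml 2> err.log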
Result:
cat 1001.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE type-of-doc SYSTEM "file.dtd" [ ]>
<type-of-doc lang="EN" dtd-version="v1" file="1001-20170101.XML" date="20170101">
</type-of-doc>
cat 1002.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE type-of-doc SYSTEM "file.dtd" [ ]>
<type-of-doc lang="EN" dtd-version="v1" file="1002-20170101.XML" date="20170101">
</type-of-doc>
cat 1008.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE type-of-doc SYSTEM "file.dtd" [ ]>
<type-of-doc lang="EN" dtd-version="v1" file="1008-20170101.XML" date="20170101">
</type-of-doc>
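The command above sends only stderr to err.log, while the question asks for both streams in a log file. Below is a minimal sketch of one way to do that, written for plain POSIX awk rather than gawk. The log name split.log, the "wrote ..." progress messages, and the buffering of header lines until the file attribute is seen are assumptions made for illustration and are not part of the original answer. It overwrites input.xml in the current directory, so it is best tried in an empty scratch directory.

```shell
#!/bin/sh
# Sketch: split a concatenated XML file and log stdout and stderr together.
# split.log and the "wrote ..." messages are illustrative names only.

# Sample input in the layout shown in the question (two documents for brevity).
cat > input.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE type-of-doc SYSTEM "file.dtd" [ ]>
<type-of-doc lang="EN" dtd-version="v1" file="1001-20170101.XML" date="20170101">
</type-of-doc>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE type-of-doc SYSTEM "file.dtd" [ ]>
<type-of-doc lang="EN" dtd-version="v1" file="1002-20170101.XML" date="20170101">
</type-of-doc>
EOF

awk '
/<\?xml version/ {              # a new document starts here
    if (fn != "") close(fn)     # release the previous output file
    fn = ""; buf = $0; next
}
fn == "" {                      # still looking for the file="NNNN-..." attribute
    buf = buf ORS $0
    if (match($0, /file="[0-9]+-/)) {
        # skip the 6 characters of file=" and drop the trailing hyphen
        fn = substr($0, RSTART + 6, RLENGTH - 7) ".xml"
        print "wrote " fn       # progress message on stdout
        print buf > fn          # flush the buffered header lines
    }
    next
}
{ print > fn }                  # remaining lines of the current document
' input.xml > split.log 2>&1
```

After a run, split.log holds one "wrote NNNN.xml" line per document plus anything awk printed to stderr. Closing each output file before opening the next may matter for inputs with many documents, since some awk implementations limit the number of simultaneously open files.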
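Breakdown of the gawk command:
- The leading pattern fires on each line that contains the XML declaration (<?xml version ...).
- getline dt; reads the next line, the <!DOCTYPE declaration, into dt.
- getline typedoc; reads the line after that, the opening type-of-doc tag, into typedoc.
- The match() call on typedoc matches the value of the file attribute. The first captured group, ([0-9]+), is assigned to the first array element, a[1], which then forms the output file name.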
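The three-argument form of match() that fills the array a is a gawk extension. Where only a POSIX awk is available, the same name can probably be derived from RSTART and RLENGTH instead. The following is a hedged sketch of just that extraction step; the variable names line, name, and attr are hypothetical and the sample tag is copied from the question.

```shell
#!/bin/sh
# Sketch: derive the output name without the three-argument match() of gawk.
line='<type-of-doc lang="EN" dtd-version="v1" file="1001-20170101.XML" date="20170101">'

name=$(printf '%s\n' "$line" | awk '
match($0, /file="[0-9]+-[^"]+\.XML"/) {
    attr = substr($0, RSTART, RLENGTH)   # the whole file="..." attribute
    sub(/^file="/, "", attr)             # strip the attribute name and opening quote
    sub(/-.*/, "", attr)                 # strip everything from the first hyphen on
    print attr ".xml"
}')
echo "$name"
```

For the sample line this prints 1001.xml. The two sub() calls stand in for the capture group: they trim the matched attribute down to the leading digits.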