Question

I have an .xml file that looks like this, with about a thousand more lines below it:

<note>------------------------------------------+
<to>Tove</to>                                   |
<from>Jani</from>                               |
<heading>Reminder</heading>                     |--> To 1.xml
<body>Don't forget me this weekend!</body>      |
</note>-----------------------------------------+
<note>------------------------------------------+
<to>Tove</to>                                   |
<from>Jani</from>                               |
<heading>Reminder</heading>                     |--> To 2.xml
<body>Don't forget me this weekend!</body>      |
</note>-----------------------------------------+
<note>------------------------------------------+
<to>Tove</to>                                   |
<from>Jani</from>                               |
<heading>Reminder</heading>                     |--> To 3.xml
<body>Don't forget me this weekend!</body>      |
</note>-----------------------------------------+

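I am trying to separate the data into its <note> ... </note> blocks and move each block into its own file.

I tried the code below, but it only handles the first block; I cannot get it to move the second, third, and later blocks.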

 sed -En 'H;$!d
        g;s/.*[\n](.*<note>.*\n.*<note>[^\n]*).*/\1/p
        ' sample.xml > 1.xml

Please help me solve this.

Thanks in advance...

Answer 1

Don't use regular expressions; use a proper XML/HTML parser and a powerful XPath query instead:

# Write the i-th <note> to i.xml; xmllint needs well-formed XML with a single root element.
for i in {1..3}; do
    xmllint --xpath "//note[$i]" file > "$i.xml"
done
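
The sample in the question has no single root element, so an XML parser will probably reject it as it stands. Here is a minimal sketch of one workaround, assuming it is acceptable to wrap the fragment in a made-up root element first. The names `wrap_notes`, `split_with_xmllint` and `<notes>` are hypothetical; `count(//note)` is plain XPath 1.0 and replaces the hard-coded `{1..3}`.

```shell
#!/bin/bash
# wrap_notes IN OUT: wrap the root-less fragment in a single <notes> root
# (hypothetical element name) so that an XML parser will accept it.
wrap_notes() {
    { printf '<notes>\n'; cat -- "$1"; printf '</notes>\n'; } > "$2"
}

# split_with_xmllint WRAPPED: write the i-th <note> to i.xml for every note found.
# Requires xmllint (libxml2) to be installed.
split_with_xmllint() {
    local n i
    n=$(xmllint --xpath 'count(//note)' "$1") || return 1
    for ((i = 1; i <= n; i++)); do
        xmllint --xpath "(//note)[$i]" "$1" > "$i.xml"
    done
}
```

Typical use would be `wrap_notes file wrapped.xml && split_with_xmllint wrapped.xml`.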

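Theory:

According to compiler theory, HTML cannot be parsed with regular expressions, which are based on finite state machines. Because HTML is hierarchical, you need a pushdown automaton, working from an LALR grammar with a tool such as YACC.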

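realLife©®™ everyday tools in a shell:

You can use one of the following:

xmllint: often installed by default with libxml2, xpath1 (check my wrapper to get newline-delimited output)

xmlstarlet: can edit, select, transform... Not installed by default, xpath1

xpath: installed via perl's module XML::XPath, xpath1

xidel: xpath3

saxon-lint: my own project, a wrapper over @Michael Kay's Saxon-HE Java library, xpath3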

Or you can use a high-level language and a proper library; I am thinking of:

python's lxml (from lxml import etree)

perl's XML::LibXML, XML::XPath, XML::Twig::XPath, HTML::TreeBuilder::XPath

ruby's nokogiri, check this example

php's DOMXpath, check this example

Check: Using regular expressions with HTML tags

Answer 2

Try this (this solution assumes that every record occupies exactly 6 lines):

c=1; while read l1 && read l2 && read l3 && read l4 && read l5 && read l6; do echo -e "$l1\n$l2\n$l3\n$l4\n$l5\n$l6\n" > ${c}.xml; ((c++)); done < big.xml; echo; find . | grep "[1-9]*.xml$"; echo; grep . [1-9]*.xml

./3.xml
./2.xml
./1.xml

1.xml:<note>------------------------------------------+
1.xml:<to>Tove</to>                                   |
1.xml:<from>Jani</from>                               |
1.xml:<heading>Reminder</heading>                     |--> To 1.xml
1.xml:<body>Don't forget me this weekend!</body>      |
1.xml:</note>-----------------------------------------+
2.xml:<note>------------------------------------------+
2.xml:<to>Tove</to>                                   |
2.xml:<from>Jani</from>                               |
2.xml:<heading>Reminder</heading>                     |--> To 2.xml
2.xml:<body>Don't forget me this weekend!</body>      |
2.xml:</note>-----------------------------------------+
3.xml:<note>------------------------------------------+
3.xml:<to>Tove</to>                                   |
3.xml:<from>Jani</from>                               |
3.xml:<heading>Reminder</heading>                     |--> To 3.xml
3.xml:<body>Don't forget me this weekend!</body>      |
3.xml:</note>-----------------------------------------+
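
The same fixed-size idea can be written without six chained `read` calls. This is a rough sketch, not the answerer's method: it still assumes every record has exactly N lines, and `split_every` is a hypothetical helper name.

```shell
#!/bin/bash
# split_every N FILE: write lines 1..N to 1.xml, N+1..2N to 2.xml, and so on.
# Assumes every record occupies exactly N lines.
split_every() {
    awk -v n="$1" '
        (NR - 1) % n == 0 { if (out) close(out); out = (++c ".xml") }  # start a new chunk
        { print > out }                                                 # copy the line
    ' "$2"
}
```

For the file in the question this would be called as `split_every 6 big.xml`. Closing each file before opening the next keeps the number of open files at one, which matters once there are thousands of records.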

Answer 3

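In general, you should not do this without a proper parser. Since the example is already not a valid XML file, though, you can split out the <note> ... </note> blocks.

If the file has that structure, you can use this awk to separate the <note> ... </note> blocks and write them to 1.xml, 2.xml, ...: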

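awk '/^<note>/   {f=1; s=$0 ORS; next}              # opening tag: start a new block
     /^<\/note>/ {out=(++i ".xml")                  # closing tag: pick the next file name
                  printf "%s", s $0 ORS > out; close(out); f=0; next}
     f           {s=s $0 ORS}' file.xml             # inside a block: accumulate lines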

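This does not support any form of nesting of <note> ... </note> blocks. This approach, like regular expressions in general, is a fragile way to handle XML.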
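
If nested <note> blocks do occur, a depth counter is one way to keep each nested block inside its parent's file. This is only a sketch under the same assumption that every tag starts its own line; `split_nested` is a hypothetical name, and it is still no substitute for a real XML parser.

```shell
#!/bin/bash
# split_nested FILE: write each top-level <note>...</note> block to 1.xml, 2.xml, ...
# A depth counter keeps nested <note> blocks inside their parent file.
# Assumes each tag starts its own line; lines outside any block are dropped.
split_nested() {
    awk '
        /^<note>/   { if (depth++ == 0) out = (++i ".xml") }        # top-level open
        depth > 0   { print > out }                                   # inside a block
        /^<\/note>/ { if (depth > 0 && --depth == 0) close(out) }   # top-level close
    ' "$1"
}
```

Called as `split_nested file.xml`, it produces one numbered file per outermost block.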

Answer 4

Supports any structure and a differing number of lines per block:

[gigauser@loriServer giga]$ cat big.xml
<note>------------------------------------------+
<to>Tove</to>                                   |
<from>Jani</from>                               |
<heading>Reminder</heading>                     |--> To 1.xml
<body>Don't forget me this weekend!</body>      |
</note>-----------------------------------------+
<note>------------------------------------------+
<to>Tove</to>                                   |
<from>Jani</from>                               |
<heading>Reminder</heading>                     |--> To 2.xml
<body>Don't forget me this weekend!</body>      |
</note>-----------------------------------------+
<note>------------------------------------------+
<to>Tove</to>                                   |
<from>Jani</from>                               |
<heading>Reminder</heading>                     |--> To 3.xml
<body>Don't forget me this weekend!</body>      |
</note>-----------------------------------------+
[gigauser@loriServer giga]$ cat -n big.xml | sed "s/[ \t][ \t]*/ /g;s/^ //;s/ /:/"|egrep ":<note>|:<\/note>"|cut -d':' -f1 > lines.txt; c=1; while read lfrom; read lto; do sed -n "${lfrom},${lto}p" big.xml > ${c}.xml; ((c++)); done < lines.txt
[gigauser@loriServer giga]$
[gigauser@loriServer giga]$ ls -1 [1-9]*.xml
1.xml
2.xml
3.xml
[gigauser@loriServer giga]$
[gigauser@loriServer giga]$ cat -n 1.xml
     1  <note>------------------------------------------+
     2  <to>Tove</to>                                   |
     3  <from>Jani</from>                               |
     4  <heading>Reminder</heading>                     |--> To 1.xml
     5  <body>Don't forget me this weekend!</body>      |
     6  </note>-----------------------------------------+
[gigauser@loriServer giga]$
[gigauser@loriServer giga]$ cat 2.xml
<note>------------------------------------------+
<to>Tove</to>                                   |
<from>Jani</from>                               |
<heading>Reminder</heading>                     |--> To 2.xml
<body>Don't forget me this weekend!</body>      |
</note>-----------------------------------------+
[gigauser@loriServer giga]$
[gigauser@loriServer giga]$ cat 3.xml; rm lines.txt
<note>------------------------------------------+
<to>Tove</to>                                   |
<from>Jani</from>                               |
<heading>Reminder</heading>                     |--> To 3.xml
<body>Don't forget me this weekend!</body>      |
</note>-----------------------------------------+
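
The line-number approach in the transcript above can be condensed. The sketch below swaps `cat -n` plus `sed` for `grep -n`, and pairs the numbers with `paste` instead of a temporary lines.txt. It assumes tags start their lines and blocks are not nested; `split_by_line_numbers` is a hypothetical name.

```shell
#!/bin/bash
# split_by_line_numbers FILE: find the line numbers of the opening and closing
# tags, pair them up, and print each range into 1.xml, 2.xml, ...
# Assumes tags start their lines and blocks are not nested.
split_by_line_numbers() {
    local file=$1 c=1 from to
    grep -n -E '^</?note>' "$file" | cut -d: -f1 | paste - - |
    while read -r from to; do
        sed -n "${from},${to}p" "$file" > "$c.xml"
        c=$((c + 1))
    done
}
```

Like the original, this rereads the input once per block, so it is simple rather than fast on very large files.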

Answer 5

Here is my more efficient solution, which reads the file only once:

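#!/bin/bash
# Read stdin once; start a new note-N.xml whenever a line begins with <note>.
i=0
while IFS= read -r line   # IFS= and -r keep whitespace and backslashes intact
do
    if [[ "$line" == '<note>'* ]]
    then
        ((i++))
    fi
    echo "$line" >> "note-$i.xml"
done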

Invocation:

./notes-xml.sh < notes.xml

(Delete the old output files first, since the script appends to them.)
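
A variant sketch that avoids the clean-up step: each output file is opened once on a spare file descriptor and truncated, so files left over from an earlier run are overwritten rather than appended to. The function name `split_notes_stdin` and the choice of descriptor 3 are assumptions made for illustration.

```shell
#!/bin/bash
# split_notes_stdin: read XML from stdin and write each block to note-N.xml.
# Each output file is opened once and truncated, so stale files from an
# earlier run are overwritten rather than appended to.
split_notes_stdin() {
    local i=0 line
    while IFS= read -r line; do
        if [[ $line == '<note>'* ]]; then
            i=$((i + 1))
            exec 3> "note-$i.xml"      # open (and truncate) the next file on fd 3
        fi
        if (( i > 0 )); then           # skip anything before the first <note>
            printf '%s\n' "$line" >&3
        fi
    done
    if (( i > 0 )); then exec 3>&-; fi  # close the last output file
}
```

After sourcing the file it can be used as `split_notes_stdin < notes.xml`. Stale files with a higher number than the last block are not removed.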

bash: get the content between a pair of XML tags and store each block in a separate file
