RegEx for capturing the attribute values of the first parent's children

Time: 2019-05-17 14:52:32

Tags: javascript node.js regex xml regex-group

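I am parsing an XML file with Node.js and RegExp, but I can't find a way to extract all the children of a given parent. For example, I need every FormalName="(.+)" under the parent PARENT1: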

<TopicSet FormalName="PARENT1">
    <Topic>
      <TopicType FormalName="Child1" />
    </Topic>
    <Topic>
      <TopicType FormalName="Child2" />
    </Topic>
    <Topic>
      <TopicType FormalName="Child3" />
    </Topic>
</TopicSet>
<TopicSet FormalName="PARENT2">
    <Topic>
      <TopicType FormalName="Child1" />
    </Topic>
    <Topic>
      <TopicType FormalName="Child2" />
    </Topic>
    <Topic>
      <TopicType FormalName="Child3" />
    </Topic>
</TopicSet>

I tried this:

<TopicSet FormalName="PARENT1">(?:(?:\s|\S)*?)TopicType FormalName="(.+)"(?:(?:\s|\S)*?)<\/TopicSet>

But it returns only the first match for PARENT1 (Child1), not Child1, Child2 and Child3.
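
A likely explanation, sketched below on a shortened input: the pattern consumes the whole PARENT1 block in a single match, and a capture group holds only one value per match, so only the first child is captured. The variable names (`xml`, `attempt`, `captured`) are illustrative.

```javascript
// Shortened version of the input from the question.
const xml = `<TopicSet FormalName="PARENT1">
    <Topic>
      <TopicType FormalName="Child1" />
    </Topic>
    <Topic>
      <TopicType FormalName="Child2" />
    </Topic>
</TopicSet>`;

// The expression from the question, with the global flag so matchAll can be used.
const attempt = /<TopicSet FormalName="PARENT1">(?:(?:\s|\S)*?)TopicType FormalName="(.+)"(?:(?:\s|\S)*?)<\/TopicSet>/g;

// One match covers the entire block, so group 1 is filled exactly once.
const matches = [...xml.matchAll(attempt)];
const captured = matches.map(m => m[1]);

console.log(matches.length, captured);
```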

https://regex101.com/r/3ESH29/2/

2 answers:

Answer 0: (score: 4)

Parsing XML with a regular expression is not advisable.
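
As a rough illustration of why, the sketch below runs a plausible attribute pattern (`childName`, a hypothetical expression, not one from the question) against XML that is still well-formed but written slightly differently: one attribute uses single quotes, and one element carries an extra attribute (`Scheme`, invented for this example) before `FormalName`.

```javascript
// Hypothetical pattern: expects FormalName first and in double quotes.
const childName = /<TopicType\s+FormalName="([^"]+)"/g;

const xml = `<TopicSet FormalName="PARENT1">
  <Topic><TopicType FormalName="Child1" /></Topic>
  <Topic><TopicType FormalName='Child2' /></Topic>
  <Topic><TopicType Scheme="x" FormalName="Child3" /></Topic>
</TopicSet>`;

// Only the first child fits the pattern; the other two are silently skipped.
const found = [...xml.matchAll(childName)].map(m => m[1]);

console.log(found);
```

A real XML parser treats all three elements the same way, which is the main argument for using one.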

Instead of a regex, you could use DOMParser, for example with querySelectorAll to get the FormalName values inside PARENT1.

Example using jsdom:

let xml = `<TopicSet FormalName="PARENT1">
    <Topic>
      <TopicType FormalName="Child1" />
    </Topic>
    <Topic>
      <TopicType FormalName="Child2" />
    </Topic>
    <Topic>
      <TopicType FormalName="Child3" />
    </Topic>
</TopicSet>
<TopicSet FormalName="PARENT2">
    <Topic>
      <TopicType FormalName="Child1" />
    </Topic>
    <Topic>
      <TopicType FormalName="Child2" />
    </Topic>
    <Topic>
      <TopicType FormalName="Child3" />
    </Topic>
</TopicSet>`;

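// DOMParser is not a Node.js global; with jsdom it is available on a JSDOM window.
const { JSDOM } = require("jsdom");
const { DOMParser } = new JSDOM("").window;

// Well-formed XML allows only one root element, so wrap the two TopicSet blocks.
let parser = new DOMParser();
let doc = parser.parseFromString(`<root>${xml}</root>`, "text/xml");
let res = doc.querySelectorAll("TopicSet[FormalName='PARENT1'] Topic TopicType");
res.forEach(e => console.log(e.getAttribute("FormalName")));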

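Answer 1: (score: 0)

Doing this with a regular expression may not be the best idea. If you have to, though, you could create three capturing groups, using the parent's opening and closing tags as the left and right boundaries, and swipe everything in between: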

(<TopicSet.*?>)([\s\S]*?)(<\/TopicSet>)
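
To make the three groups concrete, here is a minimal sketch on a shortened input that runs the expression and reads each group per match. The names `sample` and `blocks`, and the child name `Child9`, are illustrative.

```javascript
const regex = /(<TopicSet.*?>)([\s\S]*?)(<\/TopicSet>)/g;

const sample = `<TopicSet FormalName="PARENT1">
  <Topic><TopicType FormalName="Child1" /></Topic>
</TopicSet>
<TopicSet FormalName="PARENT2">
  <Topic><TopicType FormalName="Child9" /></Topic>
</TopicSet>`;

// Each match is one TopicSet block: group 1 is the opening tag,
// group 2 everything in between, group 3 the closing tag.
const blocks = [...sample.matchAll(regex)].map(m => ({
  open: m[1],
  inner: m[2],
  close: m[3],
}));

console.log(blocks.length);
console.log(blocks[0].open);
console.log(blocks[1].open);
```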


RegEx

If this is not the expression you want, you can modify or change it at regex101.com.

RegEx Circuit

You can also visualize your expressions at jex.im.


JavaScript Demo

const regex = /(<TopicSet.*?>)([\s\S]*?)(<\/TopicSet>)/mg;
const str = `<TopicSet FormalName="PARENT1">
	<Topic>
	  <TopicType FormalName="Child1" />
	</Topic>
	<Topic>
	  <TopicType FormalName="Child2" />
	</Topic>
	<Topic>
	  <TopicType FormalName="Child3" />
	</Topic>
</TopicSet>
<TopicSet FormalName="PARENT2">
	<Topic>
	  <TopicType FormalName="Child1" />
	</Topic>
	<Topic>
	  <TopicType FormalName="Child2" />
	</Topic>
	<Topic>
	  <TopicType FormalName="Child3" />
	</Topic>
</TopicSet>`;
const subst = `$2`;

// The substituted value will be contained in the result variable
const result = str.replace(regex, subst);

console.log('Substitution result: ', result);

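JavaScript Demo 2

If you also want to print the parent tags, simply replace with $1$2$3 instead of $2. The extra groups are there so that each part is easy to reference: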

const regex = /(<TopicSet.*?>)([\s\S]*?)(<\/TopicSet>)/mg;
const str = `<TopicSet FormalName="PARENT1">
	<Topic>
	  <TopicType FormalName="Child1" />
	</Topic>
	<Topic>
	  <TopicType FormalName="Child2" />
	</Topic>
	<Topic>
	  <TopicType FormalName="Child3" />
	</Topic>
</TopicSet>
<TopicSet FormalName="PARENT2">
	<Topic>
	  <TopicType FormalName="Child1" />
	</Topic>
	<Topic>
	  <TopicType FormalName="Child2" />
	</Topic>
	<Topic>
	  <TopicType FormalName="Child3" />
	</Topic>
</TopicSet>`;
const subst = `$1$2$3`;

// The substituted value will be contained in the result variable
const result = str.replace(regex, subst);

console.log('Substitution result: ', result);



If you only want to extract the first parent, you can add another boundary:

(<TopicSet FormalName="PARENT1">)([\s\S]*?)(<\/TopicSet>)
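
Building on that boundary, one possible way to get the child values the question asks for is a second pass over group 2. This is only a sketch: `childName` is a hypothetical second expression, not part of the answer above, and it assumes every child is written as `<TopicType FormalName="...">` with double quotes.

```javascript
const source = `<TopicSet FormalName="PARENT1">
    <Topic>
      <TopicType FormalName="Child1" />
    </Topic>
    <Topic>
      <TopicType FormalName="Child2" />
    </Topic>
    <Topic>
      <TopicType FormalName="Child3" />
    </Topic>
</TopicSet>
<TopicSet FormalName="PARENT2">
    <Topic>
      <TopicType FormalName="Child1" />
    </Topic>
</TopicSet>`;

// Pass 1: isolate the PARENT1 block; group 2 is its inner content.
const parentBlock = /(<TopicSet FormalName="PARENT1">)([\s\S]*?)(<\/TopicSet>)/;
// Pass 2 (assumed pattern): collect every FormalName inside that content.
const childName = /<TopicType\s+FormalName="([^"]+)"/g;

const parentMatch = source.match(parentBlock);
const inner = parentMatch ? parentMatch[2] : "";
const children = [...inner.matchAll(childName)].map(m => m[1]);

console.log(children);
```

Splitting the work into two passes keeps each expression simple and ensures that children of PARENT2 are never picked up.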
