Question

I have a text file and need to extract specific data elements from it.

Sample text:

<url>
    <loc>https://example.com/example0.html</loc>
    <lastmod>2019-01-22</lastmod>
    <priority>0.5</priority>
</url>
<url>
    <loc>https://example.com/example1.html</loc>
    <lastmod>2019-01-21</lastmod>
    <priority>0.5</priority>
</url>
<url>
    <loc>https://example.com/example2.html</loc>
    <lastmod>2019-01-21</lastmod>
    <priority>0.5</priority>
</url>
<url>
    <loc>https://example.com/example3.html</loc>
    <lastmod>2019-01-20</lastmod>
    <priority>0.5</priority>
</url>
<url>
    <loc>https://example.com/example4.html</loc>
    <lastmod>2019-01-20</lastmod>
    <priority>0.5</priority>
</url>

I want to extract:

https://example.com/example0.html
https://example.com/example1.html
https://example.com/example2.html
https://example.com/example3.html
https://example.com/example4.html

Keep in mind that the dates are not static.

Answer 1

If you just want to extract the URLs with Notepad++, use the following regular expression:

https?://[^<]+
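
Outside Notepad++, the same pattern can be tried in a short script. The following is a minimal Python sketch, not part of the original answer; the variable names and the shortened two-entry sample are assumptions made for illustration. The pattern matches http:// or https:// and then every character up to the next <.

```python
import re

# Shortened copy of the sample from the question (two entries instead of five).
text = """<url>
    <loc>https://example.com/example0.html</loc>
    <lastmod>2019-01-22</lastmod>
    <priority>0.5</priority>
</url>
<url>
    <loc>https://example.com/example1.html</loc>
    <lastmod>2019-01-21</lastmod>
    <priority>0.5</priority>
</url>"""

# Same regex as in the answer: scheme, then everything that is not "<".
urls = re.findall(r"https?://[^<]+", text)

# One URL per line, as requested in the question.
print("\n".join(urls))
```

Because the pattern never looks at <lastmod>, changing dates do not affect the result.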

Answer 2

You can try the following find and replace:

Find:    <url>\s+<loc>(.*?)<\/loc>\s+<lastmod>.*?<\/lastmod>\s+<priority>.*?<\/priority>\s+<\/url>
Replace: $1

This answer matches each <url> element in full and replaces it with the URL captured in the pattern, leaving only the URLs you want.
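
The same find and replace can be sketched in Python, assuming the input has exactly the layout shown in the question. This is an illustration rather than the answerer's own code; the variable names are hypothetical, and Python writes the replacement as \1 where Notepad++ uses $1.

```python
import re

# Shortened copy of the sample from the question (two entries instead of five).
text = """<url>
    <loc>https://example.com/example0.html</loc>
    <lastmod>2019-01-22</lastmod>
    <priority>0.5</priority>
</url>
<url>
    <loc>https://example.com/example1.html</loc>
    <lastmod>2019-01-21</lastmod>
    <priority>0.5</priority>
</url>"""

# Same pattern as the Find field: match a whole <url> element, capture the <loc> text.
pattern = (
    r"<url>\s+<loc>(.*?)<\/loc>\s+<lastmod>.*?<\/lastmod>"
    r"\s+<priority>.*?<\/priority>\s+<\/url>"
)

# Replace each matched element with its captured URL (group 1).
result = re.sub(pattern, r"\1", text)
print(result)
```

The line break between consecutive <url> elements is not part of any match, so it survives the replacement and keeps each URL on its own line.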


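Note: In general, using regular expressions to parse HTML/XML content is not recommended; a parser is the better choice. The solution above is offered for Notepad++, which has no built-in XML parser.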
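
For comparison, a parser-based approach might look like the sketch below, which uses Python's standard xml.etree.ElementTree module. It rests on two assumptions: the fragment in the question has no single root element, so the code wraps it in a hypothetical <urlset> element before parsing, and the elements carry no XML namespace.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample from the question (two entries instead of five).
text = """<url>
    <loc>https://example.com/example0.html</loc>
    <lastmod>2019-01-22</lastmod>
    <priority>0.5</priority>
</url>
<url>
    <loc>https://example.com/example1.html</loc>
    <lastmod>2019-01-21</lastmod>
    <priority>0.5</priority>
</url>"""

# Assumption: wrap the fragment in a single root so it is well-formed XML.
root = ET.fromstring("<urlset>" + text + "</urlset>")

# Collect the text of every <loc> element, in document order.
urls = [loc.text for loc in root.iter("loc")]
print("\n".join(urls))
```

Unlike the regex versions, this does not depend on the order or whitespace of the child elements inside each <url>.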

Extracting specific text from a document
