Question

I want to remove every repeated occurrence of a string in a text file, leaving only the first instance.

Starting point:

<topichead navtitle="AAAA"><topicref href="____"/></topichead>
<topichead navtitle="AAAA"><topicref href="____"/></topichead>
<topichead navtitle="AAAA"><topicref href="____"/></topichead>
<topichead navtitle="AAAA"><topicref href="____"/></topichead>
<topichead navtitle="BBBB"><topicref href="____"/></topichead>
<topichead navtitle="BBBB"><topicref href="____"/></topichead>
<topichead navtitle="BBBB"><topicref href="____"/></topichead>
<topichead navtitle="CCCC"><topicref href="____"/></topichead>
<topichead navtitle="CCCC"><topicref href="____"/></topichead>
<topichead navtitle="CCCC"><topicref href="____"/></topichead>
<topichead navtitle="CCCC"><topicref href="____"/></topichead>
<topichead navtitle="CCCC"><topicref href="____"/></topichead>

Desired result:

<topichead navtitle="AAAA"><topicref href="____"/></topichead>
                           <topicref href="____"/></topichead>
                           <topicref href="____"/></topichead>
                           <topicref href="____"/></topichead>
<topichead navtitle="BBBB"><topicref href="____"/></topichead>
                           <topicref href="____"/></topichead>
                           <topicref href="____"/></topichead>
<topichead navtitle="CCCC"><topicref href="____"/></topichead>
                           <topicref href="____"/></topichead>
                           <topicref href="____"/></topichead>
                           <topicref href="____"/></topichead>
                           <topicref href="____"/></topichead>

After that I will have to get rid of most of the </topichead> occurrences, but once the first part is done, those will be easy to match and remove.

Based on what I saw on this page, I wrote this:

 <replaceregexp byline="false" flags="g">
     <regexp pattern="(&lt;topichead.*&gt;)(r?\n\1)+"/>
     <substitution expression="/1"/>
     <fileset dir=".">
     <include name="*.txt"/>
     </fileset>
   </replaceregexp>

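However, it doesn't work. As a test, if I remove (r?\n\1)+ from the regex pattern and match only the instances of (<topichead.*>), I can replace all of them with XXX or anything else. So I know the task itself is set up correctly. I also tried (\1)+ for the second group, but so far nothing has achieved the goal above. Any ideas are welcome.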

Update

Updating this with a better example, since the one I gave was a bit imprecise. What I need to do is more like this:

Starting point:

<topichead navtitle="AAAA"><topicref href="XYZ"/></topichead>
<topichead navtitle="AAAA"><topicref href="ZYX"/></topichead>
<topichead navtitle="AAAA"><topicref href="XXYYZZ"/></topichead>
<topichead navtitle="AAAA"><topicref href="YYYY"/></topichead>
<topichead navtitle="BBBB"><topicref href="ZZZYXZ"/></topichead>
<topichead navtitle="BBBB"><topicref href="yyYYZZXX"/></topichead>
<topichead navtitle="BBBB"><topicref href="XX"/></topichead>
<topichead navtitle="CCCC"><topicref href="YYZ"/></topichead>
<topichead navtitle="CCCC"><topicref href="ZZY"/></topichead>
<topichead navtitle="CCCC"><topicref href="XXZZY></topichead>
<topichead navtitle="CCCC"><topicref href="ZZZ"/></topichead>
<topichead navtitle="CCCC"><topicref href="YYYZZXX"/></topichead>

Desired result:

<topichead navtitle="AAAA">
<topicref href="XYZ"/>
<topicref href="ZYX"/>
<topicref href="XXYYZZ"/>
<topicref href="YYYY"/></topichead>
<topichead navtitle="BBBB">
<topicref href="ZZZYXZ"/>
<topicref href="yyYYZZXX"/>
<topicref href="XX"/></topichead>
<topichead navtitle="CCCC">
<topicref href="YYZ"/>
<topicref href="ZZY"/>
<topicref href="XXZZY>
<topicref href="ZZZ"/>
<topicref href="YYYZZXX"/></topichead>

The "XXYYZZ" links are different (or may be different) and need to be kept.

The hard part is getting rid of the repeats after the first instance, e.g. <topichead navtitle="AAAA">

If I could get this result as a first step:

<topichead navtitle="AAAA"><topicref href="XYZ"/></topichead>
                           <topicref href="ZYX"/></topichead>
                           <topicref href="XXYYZZ"/></topichead>
                           <topicref href="YYYY"/></topichead>
<topichead navtitle="BBBB"><topicref href="ZZZYXZ"/></topichead>
                           <topicref href="yyYYZZXX"/></topichead>
                           <topicref href="XX"/></topichead>
<topichead navtitle="CCCC"><topicref href="YYZ"/></topichead>
                           <topicref href="ZZY"/></topichead>
                           <topicref href="XXZZY></topichead>
                           <topicref href="ZZZ"/></topichead>
                           <topicref href="YYYZZXX"/></topichead>

then I could easily remove the unwanted trailing </topichead> entries with:

 <replaceregexp byline="false" flags="gs">
 <regexp pattern="&lt;/topichead&gt;\r\n&lt;topicref"/>
 <substitution expression="${line.separator}&lt;topicref"/>
 <fileset dir=".">
 <include name="*.txt"/>
 </fileset>
 </replaceregexp>

...and get the desired result shown above.

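That is what I am doing now: a manual search and replace for the first step, followed by replaceregexp. I have many of these long lists to process, so automating the whole thing would be great.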

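I have looked at many suggestions, all of which essentially use (\r?\n\1) as their core in one way or another, but I have had no luck getting anything that does what I need.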

Answer 1

After your update, I see what you mean. This appears to be a line from your original input:

<topichead navtitle="CCCC"><topicref href="XXZZY></topichead>

It most likely should be:

<topichead navtitle="CCCC"><topicref href="XXZZY"/></topichead>

In that case, the solution is as follows:

    <!-- Pass 1: wherever the same topichead opening tag occurs again later in the file,
         reduce the line to its topicref; only the last line of each group stays whole. -->
    <target name="test2">
        <replaceregexp byline="false" flags="gs">
            <regexp pattern="(&lt;topichead\s+navtitle=&quot;[^&quot;]*&quot;&gt;)(&lt;topicref\s+href=&quot;[^&quot;]*&quot;/&gt;)&lt;/topichead&gt;(?=.*\1)"/>
            <substitution expression="\2"/>
            <fileset dir=".">
                <include name="*.txt"/>
            </fileset>
        </replaceregexp>
    </target>

    <!-- Pass 2: move each group's remaining topichead opening tag in front of
         the bare topicref lines collected before it. -->
    <target name="test" depends="test2">
        <replaceregexp byline="false" flags="gs">
            <regexp pattern="(&lt;topicref.*?)(&lt;topichead\s+navtitle=&quot;[^&quot;]*&quot;&gt;)(&lt;topicref\s+href=&quot;[^&quot;]*&quot;/&gt;&lt;/topichead&gt;)"/>
            <substitution expression="\2${line.separator}\1\3"/>
            <fileset dir=".">
                <include name="*.txt"/>
            </fileset>
        </replaceregexp>
    </target>

After running ant test, you will get the result you want:

<topichead navtitle="AAAA">
<topicref href="XYZ"/>
<topicref href="ZYX"/>
<topicref href="XXYYZZ"/>
<topicref href="YYYY"/></topichead>
<topichead navtitle="BBBB">
<topicref href="ZZZYXZ"/>
<topicref href="yyYYZZXX"/>
<topicref href="XX"/></topichead>
<topichead navtitle="CCCC">
<topicref href="YYZ"/>
<topicref href="ZZY"/>
<topicref href="XXZZY"/>
<topicref href="ZZZ"/>
<topicref href="YYYZZXX"/></topichead>
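
If you want to try the two passes outside Ant first, here is a rough sketch in Python. It is only a stand-in: replaceregexp runs on a Java regex engine, and I am assuming the constructs used here (backreference, lookahead, dot matching newlines) behave the same in Python's re module. The names SAMPLE, PASS1, PASS2 and regroup are made up for this illustration, and the sample uses the corrected XXZZY line.

```python
import re

# Sample input from the question, with the malformed XXZZY line corrected.
SAMPLE = "\n".join([
    '<topichead navtitle="AAAA"><topicref href="XYZ"/></topichead>',
    '<topichead navtitle="AAAA"><topicref href="ZYX"/></topichead>',
    '<topichead navtitle="AAAA"><topicref href="XXYYZZ"/></topichead>',
    '<topichead navtitle="AAAA"><topicref href="YYYY"/></topichead>',
    '<topichead navtitle="BBBB"><topicref href="ZZZYXZ"/></topichead>',
    '<topichead navtitle="BBBB"><topicref href="yyYYZZXX"/></topichead>',
    '<topichead navtitle="BBBB"><topicref href="XX"/></topichead>',
    '<topichead navtitle="CCCC"><topicref href="YYZ"/></topichead>',
    '<topichead navtitle="CCCC"><topicref href="ZZY"/></topichead>',
    '<topichead navtitle="CCCC"><topicref href="XXZZY"/></topichead>',
    '<topichead navtitle="CCCC"><topicref href="ZZZ"/></topichead>',
    '<topichead navtitle="CCCC"><topicref href="YYYZZXX"/></topichead>',
])

HEAD = r'<topichead\s+navtitle="[^"]*">'
REF = r'<topicref\s+href="[^"]*"/>'

# Pass 1 (target "test2"): the lookahead (?=.*\1) succeeds only when the same
# opening tag appears again later, so every line except the last of its group
# is cut down to its topicref. re.DOTALL plays the role of the "s" flag.
PASS1 = re.compile(rf'({HEAD})({REF})</topichead>(?=.*\1)', re.DOTALL)

# Pass 2 (target "test"): capture the run of bare topicref lines lazily up to
# the next complete line, then emit that line's opening tag ahead of the run.
PASS2 = re.compile(rf'(<topicref.*?)({HEAD})({REF}</topichead>)', re.DOTALL)


def regroup(text, sep="\n"):
    """Apply both passes; sep stands in for Ant's ${line.separator}."""
    step1 = PASS1.sub(r'\2', text)
    return PASS2.sub(lambda m: m.group(2) + sep + m.group(1) + m.group(3), step1)


if __name__ == "__main__":
    print(regroup(SAMPLE))
```

On the sample this gives the same grouping as the Ant output above. One thing worth testing on your real files: tracing the second pattern by hand suggests that a navtitle occurring only once would be mishandled, because its <topicref can start a match that runs on into the next group.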

Answer 2

An example:

   <replaceregexp byline="false" flags="g">
     <regexp pattern="(&lt;topichead.*&gt;)(?=\r?\n\1)"/>
     <substitution expression="&lt;topicref href=&quot;____&quot;/&gt;&lt;/topichead&gt;"/>
     <fileset dir=".">
        <include name="*.txt"/>
     </fileset>
   </replaceregexp>

The output is:

<topicref href="____"/></topichead>
<topicref href="____"/></topichead>
<topicref href="____"/></topichead>
<topichead navtitle="AAAA"><topicref href="____"/></topichead>
<topicref href="____"/></topichead>
<topicref href="____"/></topichead>
<topichead navtitle="BBBB"><topicref href="____"/></topichead>
<topicref href="____"/></topichead>
<topicref href="____"/></topichead>
<topicref href="____"/></topichead>
<topicref href="____"/></topichead>
<topichead navtitle="CCCC"><topicref href="____"/></topichead>

Note that this leaves only the last instance rather than the first. FYI.
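
To see why the last instance survives, the same pattern can be tried in Python. This is only a sketch: Python's re module stands in for the Java regex engine behind replaceregexp, on the assumption that the lookahead and backreference behave the same in both. LINE, SAMPLE and strip_repeated_heads are names invented for this illustration.

```python
import re

# The first example from the question: every line within a group is identical.
LINE = '<topichead navtitle="{}"><topicref href="____"/></topichead>'
SAMPLE = "\n".join(
    [LINE.format("AAAA")] * 4
    + [LINE.format("BBBB")] * 3
    + [LINE.format("CCCC")] * 5
)

# Group 1 is a whole line (no dot-matches-newline flag is set). The lookahead
# only succeeds when the NEXT line is identical, so the final line of each
# group never matches and keeps its topichead opening tag.
PATTERN = re.compile(r'(<topichead.*>)(?=\r?\n\1)')
REPLACEMENT = '<topicref href="____"/></topichead>'


def strip_repeated_heads(text):
    """Replace every line that is followed by an identical line."""
    return PATTERN.sub(REPLACEMENT, text)


if __name__ == "__main__":
    print(strip_repeated_heads(SAMPLE))
```

Because the lookahead compares whole lines, this only fits the first example, where the href values are all the same; with the differing href values from the update, no line would be followed by an identical one.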

Ant task to remove strings occurring after the first one
