结合删除标签正则表达式并删除sed中的空行 - Unix

时间:2016-10-25 01:52:53

标签: regex xml bash unix sed

给出这样的标记文件:

<srcset setid="newstest2015" srclang="any">
<doc sysid="ref" docid="1012-bbc" genre="news" origlang="en">
<p>
<seg id="1">India and Japan prime ministers meet in Tokyo</seg>
<seg id="2">India's new prime minister, Narendra Modi, is meeting his Japanese counterpart, Shinzo Abe, in Tokyo to discuss economic and security ties, on his first major foreign visit since winning May's election.</seg>
<seg id="3">Mr Modi is on a five-day trip to Japan to strengthen economic ties with the third largest economy in the world.</seg>
<seg id="4">High on the agenda are plans for greater nuclear co-operation.</seg>
<seg id="5">India is also reportedly hoping for a deal on defence collaboration between the two nations.</seg>
</p>
</doc>
<doc sysid="ref" docid="1018-lenta.ru" genre="news" origlang="ru">
<p>
<seg id="1">FANO Russia will hold a final Expert Session</seg>
<seg id="2">The Federal Agency of Scientific Organizations (FANO Russia), in joint cooperation with RAS, will hold the third Expert Session on “Evaluating the effectiveness of activities of scientific organizations”.</seg>
<seg id="3">The gathering will be the final one in a series of meetings held by the agency over the course of the year, reports a press release delivered to the editorial offices of Lenta.ru.</seg>
<seg id="4">At the third meeting, it is planned that the results of the work conducted by the Expert Session over the past year will be presented and that a final checklist to evaluate the effectiveness of scientific organizations will be developed.</seg>
<seg id="5">In addition, participants at the event plan to discuss the rules for forming an expert panel, which is responsible for evaluating the work of scientific groups, as well as the criteria for carrying out evaluations.</seg>
<seg id="6">The third Expert Session will be the final meeting in a series of events on the formation of a unified approach for all three academies to the evaluation of the effectiveness of activities of scientific organizations.</seg>
<seg id="7">Over the past five months, we were able to achieve this, and the final version of the regulatory documents is undergoing approval.</seg>
<seg id="8">According to the plans for the upcoming session, we should complete the development of procedures for scientometric and expert analysis, and come to an agreement on the stages and timeframes for the evaluation process”, said the Head of FANO’s Expert-Analytical Department, Elena Aksenova.</seg>
<seg id="9">Representatives from more than one hundred Russian scientific institutes will take part in the event.</seg>
<seg id="10">It is expected that a resolution will be adopted based on its results.</seg>
<seg id="11">The meeting will begin at 10 am, Moscow time, on September 16, 2014, at the following address: 14 Solyanka Street, Moscow.</seg>
</p>
</doc>
</srcset>

我可以使用Sed remove tags from html file删除标记代码:

sed -e 's/<[^>]*>//g' file.txt 

这将留下我的空行输出,我必须这样做Delete empty lines using SED

sed -e 's/<[^>]*>//g' file.txt  | sed '/^\s*$/d'

我应该如何组合删除标记并删除空行正则表达式?

1 个答案:

答案 0 :(得分:2)

立即删除怎么样? :

sed -e 's/<[^>]*>//g;/^\s*$/d' file.txt