I am trying to clean up wikitext. Specifically, I want to remove every {{.....}} template and every <..>...</..> tag pair from the wikitext. For example, given this wikitext:
"{{Infobox UK place\n| country = England\n| official_name = Morcombelake\n| static_image_name = Morecombelake from Golden Cap - geograph.org.uk - 1184424.jpg\n| static_image_caption = Morcombelake as seen from Golden Cap\n| coordinates = {{coord|50.74361|-2.85153|display=inline,title}}\n| map_type = Dorset\n| population = \n| population_ref = \n| shire_district = [[West Dorset]]\n| shire_county = [[Dorset]]\n| region = South West England\n| constituency_westminster = West Dorset\n| post_town = \n| postcode_district = \n| postcode_area = DT\n| os_grid_reference = SY405938\n| website = \n}}\n'''Morcombelake''' (also spelled '''Morecombelake'') is a small village near [[Bridport]] in [[Dorset]], [[England]], within the ancient parish of [[Whitchurch Canonicorum]]. [[Golden Cap]], part of the [[Jurassic Coast]] World Heritage Site, is nearby.{{cite web|url=http://www.nationaltrust.org.uk/golden-cap/|title=Golden Cap|publisher=National Trust|accessdate=2014-05-04}}\n\n== References ==\n{{reflist}}\n\n{{West Dorset}}\n\n\n{{Dorset-geo-stub}}\n[[Category:Villages in Dorset]]\n\n== External links ==\n\n* [http://www.goldencapteamofchurches.org.uk/morcombelakechurch.html Parish Church of St Gabriel]\n\n"
How can I use regular expressions in Python to produce output like the following:
\n'''Morcombelake''' (also spelled ''Morecombelake'') is a small village near [[Bridport]] in [[Dorset]], [[England]], within the ancient parish of [[Whitchurch Canonicorum]]. [[Golden Cap]], part of the [[Jurassic Coast]] World Heritage Site, is nearby.\n\n== References ==\n\n\n\n\n\n\n\n[[Category:Villages in Dorset]]\n\n== External links ==\n\n* [http://www.goldencapteamofchurches.org.uk/morcombelakechurch.html Parish Church of St Gabriel]\n\n
Answer 0 (score: 0)
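Because the templates are nested inside one another, a single pass is not enough. You can find and remove the innermost ones in a loop until nothing is left to replace: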
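import re
n = 1
while n > 0:
    # delete innermost {{...}} templates and <...> tags; n is the number of replacements made
    s, n = re.subn(r'{{(?!{)(?:(?!{{).)*?}}|<[^<]*?>', '', s, flags=re.DOTALL)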
Here s is the string containing the wikitext.
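As a rough sketch, the loop can be wrapped in a function so it is easy to try on a small input. The names strip_templates and TEMPLATE_OR_TAG and the shortened sample string are made up for illustration; the pattern is the same idea as the one in this answer, with the braces escaped.

```python
import re

# Matches an innermost {{...}} template (one with no "{{" inside it)
# or a single <...> tag. Hypothetical name, chosen for this sketch.
TEMPLATE_OR_TAG = re.compile(
    r"\{\{(?!\{)(?:(?!\{\{).)*?\}\}|<[^<]*?>", re.DOTALL
)


def strip_templates(text):
    """Delete innermost templates and tags repeatedly until none remain."""
    count = 1
    while count > 0:
        # subn returns the new string and the number of replacements made
        text, count = TEMPLATE_OR_TAG.subn("", text)
    return text


# Shortened, made-up sample with one template nested inside another.
sample = (
    "{{Infobox\n| coordinates = {{coord|50.7|-2.8}}\n}}\n"
    "'''Morcombelake''' is a village.<ref>{{cite web|url=x}}</ref>\n"
    "{{reflist}}"
)
print(repr(strip_templates(sample)))
```

The first pass removes {{coord ...}}, the tags, {{cite web ...}} and {{reflist}}; only then does the outer {{Infobox ...}} become an innermost template, which is why a second pass is needed.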
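Your example contains no <...> tags, but this pattern should remove them as well. Note that it strips only the tags themselves, not the text between an opening and a closing tag.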
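If the text between an opening and a closing tag should be removed too, as the <..>...</..> in the question suggests, something like the following might work. This is only a sketch: strip_tag_elements, PAIRED_TAG and SELF_CLOSING_TAG are hypothetical names, and it assumes well-formed tags where an element never contains another element of the same name, which real wikitext does not guarantee.

```python
import re

# Assumption: tags are well formed and an element does not contain
# another element with the same name.
SELF_CLOSING_TAG = re.compile(r"<[^<>]*/>")
PAIRED_TAG = re.compile(r"<(\w+)[^<>]*>.*?</\1\s*>", re.DOTALL)


def strip_tag_elements(text):
    """Remove self-closing tags first, then <tag>...</tag> elements with their content."""
    # Self-closing tags go first so that <ref name="a" /> is not
    # mistaken for the opening half of a <ref>...</ref> pair.
    text = SELF_CLOSING_TAG.sub("", text)
    return PAIRED_TAG.sub("", text)


print(strip_tag_elements("is nearby.<ref>{{cite web|url=x}}</ref> End<br/>"))
```

For anything beyond simple cases, a dedicated wikitext parser is likely to be more robust than regular expressions.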