Stripping MediaWiki markup from text

Time: 2012-09-17 00:56:49

Tags: c# mediawiki text-processing strip-tags

Is there a way to remove all MediaWiki markup "code" from a piece of text using C#?

For example, I have the following text:

<h2><span class="editsection">[<a href="/w/index.php?title=Roger_Zelazny&amp;action=edit&amp;section=1" title="Edit section: Biography">edit</a>]</span> <span class="mw-headline" id="Biography">Biography</span></h2>
<p>Roger Zelazny was born in <a href="/wiki/Euclid,_Ohio" title="Euclid, Ohio">Euclid, Ohio</a>, the only child of Polish immigrant Joseph Frank Zelazny and <a href="/wiki/Irish-American" title="Irish-American" class="mw-redirect">Irish-American</a> Josephine Flora Sweet. In high school, he became the editor of the school newspaper and joined the Creative Writing Club.<sup id="cite_ref-Roger_Zelazny_2009_0-0" class="reference">
<a href="#cite_note-Roger_Zelazny_2009-0"><span>[</span>1<span>]</span></a></sup> In the fall of 1955, he began attending <a href="/wiki/Case_Western_Reserve_University" title="Case Western Reserve University">Western Reserve University</a> and graduated with a B.A. in English in 1959.<sup id="cite_ref-Roger_Zelazny_2009_0-1" class="reference"><a href="#cite_note-Roger_Zelazny_2009-0"><span>[</span>1<span>]</span></a></sup> He was accepted to <a href="/wiki/Columbia_University" title="Columbia University">Columbia University</a> in New York and specialized in Elizabethan and Jacobean drama, graduating with an M.A. in 1962.<sup id="cite_ref-Roger_Zelazny_2009_0-2" class="reference">
<a href="#cite_note-Roger_Zelazny_2009-0"><span>[</span>1<span>]</span></a></sup> His M.A. thesis was entitled <i>Two traditions and <a href="/wiki/Cyril_Tourneur" title="Cyril Tourneur">Cyril Tourneur</a>: an examination of morality and humor comedy conventions in</i> <a href="/wiki/The_Revenger%27s_Tragedy" title="The Revenger's Tragedy">The Revenger's Tragedy</a>. Between 1962 and 1969 he worked for the U.S. <a href="/wiki/Social_Security_Administration" title="Social Security Administration">Social Security Administration</a> in <a href="/wiki/Cleveland,_Ohio" title="Cleveland, Ohio" class="mw-redirect">Cleveland, Ohio</a> and then in <a href="/wiki/Baltimore,_Maryland" title="Baltimore, Maryland" class="mw-redirect">Baltimore, Maryland</a> spending his evenings writing science fiction.<sup id="cite_ref-Roger_Zelazny_2009_0-3" class="reference"><a href="#cite_note-Roger_Zelazny_2009-0"><span>[</span>1<span>]</span></a></sup><sup id="cite_ref-AndCall_1-0" class="reference"><a href="#cite_note-AndCall-1"><span>[</span>2<span>]</span></a></sup> 
He deliberately progressed from short-shorts to novelettes to novellas and finally to novel-length works by 1965.<sup id="cite_ref-Roger_Zelazny_2009_0-4" class="reference"><a href="#cite_note-Roger_Zelazny_2009-0"><span>[</span>1<span>]</span></a></sup> On May 1, 1969, he quit to become a full-time writer, and thereafter concentrated on writing novels in order to maintain his income.<sup id="cite_ref-AndCall_1-1" class="reference"><a href="#cite_note-AndCall-1"><span>[</span>2<span>]</span></a></sup>
During this period, he was an active and vocal member of the Baltimore Science Fiction Society, whose members included writers <a href="/wiki/Jack_Chalker" title="Jack Chalker" class="mw-redirect">Jack Chalker</a> and <a href="/wiki/Joe_Haldeman" title="Joe Haldeman">Joe</a> and <a href="/wiki/Jack_Haldeman" title="Jack Haldeman" class="mw-redirect">Jack Haldeman</a> among others.</p>

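which renders as the following HTML:

[edit] Biography

Roger Zelazny was born in Euclid, Ohio, the only child of Polish immigrant Joseph Frank Zelazny and Irish-American Josephine Flora Sweet. In high school, he became the editor of the school newspaper and joined the Creative Writing Club.[1] In the fall of 1955, he began attending Western Reserve University and graduated with a B.A. in English in 1959.[1] He was accepted to Columbia University in New York and specialized in Elizabethan and Jacobean drama, graduating with an M.A. in 1962.[1] His M.A. thesis was entitled Two traditions and Cyril Tourneur: an examination of morality and humor comedy conventions in The Revenger's Tragedy. Between 1962 and 1969 he worked for the U.S. Social Security Administration in Cleveland, Ohio and then in Baltimore, Maryland spending his evenings writing science fiction.[1][2] He deliberately progressed from short-shorts to novelettes to novellas and finally to novel-length works by 1965.[1] On May 1, 1969, he quit to become a full-time writer, and thereafter concentrated on writing novels in order to maintain his income.[2] During this period, he was an active and vocal member of the Baltimore Science Fiction Society, whose members included writers Jack Chalker and Joe and Jack Haldeman among others.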

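I'm looking for a way to strip not only the HTML tags but also the references, wiki "links", and so on. I want to remove all the formatting and "processing" done by Wikipedia and keep only the text.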

1 answer:

Answer 0: (score: 0)

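Parsing the HTML won't get you very far, because it is almost impossible to tell what is "content" and what isn't. What you need is a MediaWiki markup parser. There are dozens of them, but the canonical list at mediawiki.org doesn't (at the time of writing) seem to include any written in C#.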

If you end up calling out to an external library, mwlib is probably the most mature.
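
If a real parser is not an option and you only need to handle rendered HTML shaped like the sample in the question, a rough regex pass can remove the pieces that were mentioned. This is only a sketch under stated assumptions, not a robust solution: the class names `editsection` and `reference` are taken from the sample HTML and may differ between MediaWiki versions or skins, `WikiHtmlStripper` and `ToPlainText` are hypothetical names made up for this example, and regular expressions will break on nested or unusual markup.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of any MediaWiki library.
public static class WikiHtmlStripper
{
    private const RegexOptions Opts =
        RegexOptions.Singleline | RegexOptions.IgnoreCase;

    public static string ToPlainText(string html)
    {
        // Assumption: "[edit]" links are wrapped in <span class="editsection">
        // with no nested <span>, as in the sample from the question.
        html = Regex.Replace(
            html, @"<span class=""editsection"">.*?</span>", "", Opts);

        // Assumption: citation markers such as [1] are <sup> elements
        // carrying class="reference".
        html = Regex.Replace(
            html, @"<sup\b[^>]*class=""reference""[^>]*>.*?</sup>", "", Opts);

        // Remove every remaining tag, keeping the text between tags.
        html = Regex.Replace(html, @"<[^>]+>", "");

        // Turn entities such as &amp; back into plain characters.
        html = WebUtility.HtmlDecode(html);

        // Collapse runs of whitespace (including line breaks) to one space.
        return Regex.Replace(html, @"\s+", " ").Trim();
    }

    public static void Main()
    {
        string html =
            "<h2><span class=\"editsection\">[<a href=\"#\">edit</a>]</span> " +
            "<span class=\"mw-headline\" id=\"Biography\">Biography</span></h2>\n" +
            "<p>Roger Zelazny was born in " +
            "<a href=\"/wiki/Euclid,_Ohio\" title=\"Euclid, Ohio\">Euclid, Ohio</a>." +
            "<sup id=\"cite_ref-0\" class=\"reference\"><a href=\"#cite_note-0\">" +
            "<span>[</span>1<span>]</span></a></sup></p>";

        // Prints: Biography Roger Zelazny was born in Euclid, Ohio.
        Console.WriteLine(ToPlainText(html));
    }
}
```

Note that the heading and the paragraph end up on one line because all whitespace is collapsed. Anything the two class-based patterns do not recognise (infoboxes, tables, navigation boxes) is kept as text, which is exactly the "what is content" problem described above.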