Help with regular expressions in PHP (parsing Wikipedia markup)

Date: 2010-01-19 16:44:38

Tags: php regex wikipedia wikitext

I want to remove this block of text from a page I pulled from Wikipedia.

{{Historical populations|type=USA
| 1698|4937
| 1712|5840
| 1723|7248
| 1737|10664
| 1746|11717
| 1756|13046
| 1771|21863
| 1790|33131
| 1800|60515
| 1810|96373
| 1820|123706
| 1830|202589
| 1840|312710
| 1850|515547
| 1860|813669
| 1870|942292
| 1880|1206299
| 1890|1515301
| 1900|3437202
| 1910|4766883
| 1920|5620048
| 1930|6930446
| 1940|7454995
| 1950|7891957
| 1960|7781984
| 1970|7894862
| 1980|7071639
| 1990|7322564
| 2000|8008288
| 2008*|8363710
|footnote=Beginning 1900, figures are for consolidated city of five boroughs. Sources: 1698–1771,{{cite book|last=Greene and Harrington|first=|title=American Population Before the Federal Census of 1790|publisher=|location=New York|year=1932|isbn=|pages=}}, as cited in: {{cite book|last=Rosenwaike|first=Ira|title=Population History of New York City|publisher=Syracuse University Press|location=Syracuse, N.Y.|year=1972|isbn=0815621558|page=8}} 1790–1990,Gibson, Campbell.[http://www.census.gov/population/www/documentation/twps0027.html Population of the 100 Largest Cities and Other Urban Places in the United States:1790 to 1990], [[United States Census Bureau]], June 1998. Retrieved June 12, 2007. *2008 est[http://factfinder.census.gov/servlet/SAFFPopulation?_event=Search&geo_id=16000US3403940&_geoContext=01000US%7C04000US34%7C16000US3403940&_street=&_county=new+york+city&_cityTown=new+york+city&_state=04000US36&_zip=&_lang=en&_sse=on&ActiveGeoDiv=geoSelect&_useEV=&pctxt=fph&pgsl=160&_submenuId=population_0&ds_name=null&_ci_nbr=null&qr_name=null&reg=null%3Anull&_keyword=&_industry=Census Data for New York city, New York], [[United States Census Bureau]]. Retrieved June 12, 2007.
}}

I also want to keep the following section as plain text (but without the parts wrapped in "{{" and "}}"):

New York is the most populous city in the United States, with an estimated 2008 population of 8,363,710(up from 7.3 million in 1990). This amounts to about 40.0% of New York State's population and a similar percentage of the metropolitan regional population. Over the last decade the city's population has been increasing and demographers estimate New York's population will reach between 9.2 and 9.5 million by 2030.{{cite web |title=New York City Population Projections by Age/Sex and Borough, 2000-2030 |publisher=[[New York City Department of City Planning]] |month=December | year=2006 |url=http://www.nyc.gov/html/dcp/pdf/census/projections_report.pdf |format=PDF |accessdate=2008-09-01}} See also {{cite news |last=Roberts, Sam |title=By 2025, Planners See a Million New Stories in the Crowded City |publisher=New York Times |date=February 19, 2006 |url=http://www.nytimes.com/2006/02/19/nyregion/19population.html?ex=1298005200&en=c586d38abbd16541&ei=5090&partner=rssuserland&emc=rss |accessdate=2008-09-01}}

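Thanks.

3 answers:

Answer 0 (score: 2)

The code I currently use to clean up wiki pages is shown below. For example, for this page:

http://en.wikipedia.org/wiki/Tel_Aviv (you can click "edit this page" to see the markup)

I get this: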

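"and is renowned as the "Mediterranean metropolis that never sleeps". Haaretz It is the country's financial capital and a major performing arts and business center. Tel Aviv's urban area is the second-largest city economy in the Middle East, and it is ranked 42nd among global cities by Foreign Policy's 2008 Global Cities Index. It is also the most expensive city in the region and the 17th most expensive city in the world. The cost of living in Israel is high, and Tel Aviv is its most expensive city to live in. According to Mercer, a human resources consulting firm based in New York, as of 2008 Tel Aviv is the most expensive city in the Middle East and one of the most expensive in the world, behind Singapore and Paris and just ahead of Sydney and Dublin. By comparison, New York City is ranked 22nd."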

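That is incorrect. The expected result should be:

Tel Aviv-Yafo (Hebrew: תֵּל-אָבִיב-יָפוֹ; Arabic: تلأبيب, Tall'Abīb), usually called Tel Aviv, is the second-largest city in Israel, with an estimated population of 393,900. The city is located on the Israeli Mediterranean coast, with a land area of 51.8 square kilometres (20.0 sq mi). It is the largest and most populous city in the Gush Dan metropolitan area, home to 3.15 million people as of 2008. The city is governed by the Tel Aviv-Yafo municipality, headed by Ron Huldai.

This is the PHP code: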

function clean_wiki_text($text)
  {
    // first get rid of UGC HTML tags
    $text = strip_tags($text);

    // keep convert tag
    $text = preg_replace("/\{\{convert\|([^\|]+)\|([^\|]+)\|[^\}]+\}\}/", "$1$2", $text);

    // remove large blocks (treat as tags)
    $text = preg_replace("/(<![^>]+>)/", '', $text);
    $text = preg_replace('/\{\{\s?/', '<', $text);
    $text = str_replace('}}', ' />', $text);

    $text = str_replace('<! />', '', $text);

    // more wiki formatting
    $text = preg_replace("/'{2,6}/", '', $text);
    $text = preg_replace("/[=\s]+External [lL]inks[\s=]+/", '', $text);
    $text = preg_replace("/[=\s]+See [aA]lso[\s=]+/", '', $text);
    $text = preg_replace("/[=\s]+References[\s=]+/", '', $text);
    $text = preg_replace("/[=\s]+Notes[\s=]+/", '', $text);
    $text = preg_replace('/\{\{([^\}]+)\}\}/', '', $text);

    // drop page link text
    $text = preg_replace('/\[\[([^:\|\]]+)\|([^:\]]+)\]\]/', "$2", $text);
    // or keep it with preg_replace('/\[\[([^:\|\]]+)\|([^:\]]+)\]\]/', "$1 ($2)", $text);

    $text = preg_replace('/\(\[[^\]]+\]\)/', '', $text);
    $text = preg_replace('/\[\[([^:\]]+)\]\]/', "$1", $text);
    $text = preg_replace('/\*?\s?\[\[([^\]]+)\]\]/', '', $text);
    $text = preg_replace('/\*\s?\[([^\s]+)\s([^\]]+)\]/', "$2", $text);
    $text = preg_replace('/\n(\*+\s?)/', '', $text);
    $text = preg_replace('/\n{3,}/', "\n\n", $text);
    $text = preg_replace('/<ref[^>]?>[^>]+>/', '', $text);
    $text = preg_replace('/<cite[^>]?>[^>]+>/', '', $text);

    $text = preg_replace('/={2,}/', '', $text);
    $text = preg_replace('/{?class="[^"]+"/', "", $text);
    $text = preg_replace('/!?\s?width="[^"]+"/', "", $text);
    $text = preg_replace('/!?\s?height="[^"]+"/', "", $text);
    $text = preg_replace('/!?\s?style="[^"]+"/', "", $text);
    $text = preg_replace('/!?\s?rowspan="[^"]+"/', "", $text);
    $text = preg_replace('/!?\s?bgcolor="[^"]+"/', "", $text);

    $text = trim($text);

    $text = preg_replace('/\n\n/', "<br />\n<br />\n", $text);
    $text = preg_replace('/\r\n\r\n/', "<br />\r\n<br />\r\n", $text);
/*
    $config = array(
      'show-body-only' => true,
      'clean'          => false, 
      'wrap'           => 0, 
      'show-warnings'  => 0,
      'show-errors'    => 0,
      'enclose-block-text'   => false,
      'vertical-space' => true,
      'output-html'    => true
    );

    // Tidy
    $tidy = new tidy;
    $tidy->parseString($text, $config, 'utf8');
    $tidy->cleanRepair();

    $text = $tidy->value;
*/
    $extras = array(
  //  "/\((.*?)\)/is" => "",
      "/\[(.*?)\]/is" => ""
    );
    $text = preg_replace(array_keys($extras), array_values($extras), $text);

    $text = str_replace(" ,", ',', $text);
    $text = str_replace(", ", ',', $text);
    $text = str_replace(",", ', ', $text);
    $text = str_replace("(, ", '(', $text);
    $text = str_replace(";,", ',', $text);

    // lets keep it plain plain plain
    $text = strip_tags($text);
//    $text = preg_replace('/\s\s+/', ' ', $text);

    $text = str_replace("|-", '', $text);
    $text = str_replace("|}", '', $text);
    $text = str_replace("|", '', $text);
    $text = str_replace('()', '', $text);
    $text = str_replace('&nbsp;', ' ', $text);

    $text = trim($text);

    $text_arr = preg_split('/[\r\n]+/', $text, -1, PREG_SPLIT_NO_EMPTY);
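    // keep only paragraphs longer than 30 characters;
    // start from an array, because [] cannot append to a string
    $result = array();
    foreach ($text_arr as $paragraph) {
      if ( mb_strlen(trim($paragraph)) > 30 ) {
        $result[] = $paragraph;
      }
    }
    return $result;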
  }

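Answer 1 (score: 1)

Just a guess, but wouldn't it be easier and safer to use Wikipedia's markup library (bundled with MediaWiki), convert the markup to HTML, and then parse that with whatever XML library you happen to be familiar with?

The API documentation can be found at http://svn.wikimedia.org/doc/ (in the Parser module), and it doesn't look complicated. Basically, all you need to do is the following: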

<?php

require_once '/path/to/mediawiki/Parser.php';
// also include whatever classes Parser depends on, or use MediaWiki's autoload
// mechanism if it has one

// retrieve the content of your page in $content

$parser = new Parser();
$html   = $parser->parse($content);

$simplexml = simplexml_load_string($html);

Now you have a very handy SimpleXML object to work with. Of course, this only works if MediaWiki's parser produces valid XML (which I'd bet it does).
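Once HTML is available, pulling plain text out of it could look roughly like the sketch below. It swaps SimpleXML for PHP's DOM extension, because `DOMDocument::loadHTML` tolerates markup that is not strictly valid XML. The `$html` string and the `sup.reference` footnote markup are assumptions made for illustration, not actual parser output, and non-ASCII text may need an explicit encoding hint that this sketch leaves out.

```php
<?php
// Hypothetical HTML, standing in for whatever the wiki parser returned.
$html = '<p><b>Tel Aviv-Yafo</b> is the second-largest city in '
      . '<a href="/wiki/Israel">Israel</a>.<sup class="reference">[1]</sup></p>';

// Collect parse warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

// Drop footnote markers (assumed to be <sup class="reference">) before reading the text.
$xpath = new DOMXPath($doc);
foreach (iterator_to_array($xpath->query('//sup[@class="reference"]')) as $node) {
    $node->parentNode->removeChild($node);
}

// textContent concatenates all remaining text nodes, ignoring the tags.
$plain = trim($doc->documentElement->textContent);
echo $plain, "\n";
```

The design choice here is to let a parser deal with the nesting and to do the cleanup on the resulting tree, rather than on the raw wikitext.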

Also, if MediaWiki includes some kind of autoload mechanism, it should be easy to find by searching MediaWiki's codebase for __autoload or spl_autoload_register.

Hope it helps!

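Answer 2 (score: 0)

It is really hard to craft a regular expression when only one example is given. From my own experience cleaning up Wikipedia pages, I know that other pages can look a bit different. Just to match your example: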

{{.+?}}\n

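This only works if the part to remove is followed by a newline and you specify DOTALL and MULTILINE. To match all double curly braces and whatever is inside them: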

{{[^}]+}}
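In PHP, the two patterns above might be applied as in the following minimal sketch. The `$wikitext` string is a shortened, hypothetical stand-in for the markup in the question, and the braces are escaped to keep the patterns unambiguous. DOTALL corresponds to PCRE's `s` modifier.

```php
<?php
// Hypothetical sample, shortened from the markup in the question.
$wikitext = "{{Historical populations|type=USA\n| 1698|4937\n| 2000|8008288\n}}\n"
          . "New York is the most populous city.{{cite web |title=Projections}}"
          . " See also {{cite news |last=Roberts, Sam}}";

// Pattern 1: lazy match from "{{" up to the first "}}" that is followed by a newline.
// The "s" modifier (DOTALL) lets "." match line breaks as well.
$withoutBlock = preg_replace('/\{\{.+?\}\}\n/s', '', $wikitext);

// Pattern 2: any "{{...}}" pair that has no "}" between the delimiters.
$plain = preg_replace('/\{\{[^}]+\}\}/', '', $withoutBlock);

echo $plain, "\n";
```

The `m` modifier only changes how `^` and `$` behave, so it makes no difference for a pattern without anchors.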

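You could try several passes, removing another unwanted part in each one. I doubt that matching everything you need in a single regular expression is feasible.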
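One way to read the multi-pass idea is to strip the innermost templates first and repeat until nothing is left, which also copes with templates nested inside other templates (such as the `{{cite book}}` calls inside the footnote in the question). This is only a sketch: `strip_templates()` is a hypothetical helper name, and the sample string is made up.

```php
<?php
/**
 * Remove {{...}} templates, innermost first, until none remain.
 * strip_templates() is a hypothetical helper, not part of any library.
 */
function strip_templates(string $text): string
{
    do {
        // Only match templates that contain no further braces (the innermost ones).
        $text = preg_replace('/\{\{[^{}]*\}\}/', '', $text, -1, $count);
    } while ($count > 0);

    return $text;
}

// Made-up sample with two templates nested inside an outer one.
$sample = "Sources: {{outer|footnote=see {{cite book|title=A}} and {{cite book|title=B}}|x=1}} Done.";
echo strip_templates($sample), "\n";
```

Each pass removes one level of nesting, so the loop runs once per nesting level plus one final pass that finds nothing. A lone `{` or `}` inside a template would stop that template from matching, so real pages may need extra handling.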