Get part of a string after a space

Time: 2014-05-15 13:55:06

Tags: php regex wikipedia wikipedia-api

I receive a string from the Wikipedia API that looks like this:

{{Wikibooks|Wikijunior:Countries A-Z|France}} {{Sister project links|France}} * [http://www.bbc.co.uk/news/world-europe-17298730 France] from the [[BBC News]] * [http://ucblibraries.colorado.edu/govpubs/for/france.htm France] at ''UCB Libraries GovPubs'' *{{dmoz|Regional/Europe/France}} * [http://www.britannica.com/EBchecked/topic/215768/France France] ''Encyclopædia Britannica'' entry * [http://europa.eu/about-eu/countries/member-countries/france/index_en.htm France] at the [[European Union|EU]] *{{Wikiatlas|France}} *{{osmrelation-inline|1403916}} * [http://www.ifs.du.edu/ifs/frm_CountryProfile.aspx?Country=FR Key Development Forecasts for France] from [[International Futures]] ;Economy *{{INSEE|National Institute of Statistics and Economic Studies}} * [http://stats.oecd.org/Index.aspx?QueryId=14594 OECD France statistics] 

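I need to get both the actual URL and the description of the URL. For example, for [http://www.bbc.co.uk/news/world-europe-17298730 France] from the [[BBC News]] I need "http://www.bbc.co.uk/news/world-europe-17298730" and also "France] from the [[BBC News]]", but without the brackets, i.e. "France from the BBC News".

I managed to get the first part by doing the following: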

if (preg_match_all('/\[http(.*?)\s/', $result, $extmatch)) {
    $mt = str_replace("[[", "", $extmatch[1]);
}

But I don't know how to get the second part (unfortunately, I'm very weak at regex :-( ).

Any ideas?

2 answers:

Answer 0 (score: 1)

PHP:

$input = "{{Wikibooks|Wikijunior:Countries A-Z|France}} {{Sister project links|France}} * [http://www.bbc.co.uk/news/world-europe-17298730 France] from the [[BBC News]] * [http://ucblibraries.colorado.edu/govpubs/for/france.htm France] at ''UCB Libraries GovPubs'' *{{dmoz|Regional/Europe/France}} * [http://www.britannica.com/EBchecked/topic/215768/France France] ''Encyclopædia Britannica'' entry * [http://europa.eu/about-eu/countries/member-countries/france/index_en.htm France] at the [[European Union|EU]] *{{Wikiatlas|France}} *{{osmrelation-inline|1403916}} * [http://www.ifs.du.edu/ifs/frm_CountryProfile.aspx?Country=FR Key Development Forecasts for France] from [[International Futures]] ;Economy *{{INSEE|National Institute of Statistics and Economic Studies}} * [http://stats.oecd.org/Index.aspx?QueryId=14594 OECD France statistics]";
$regex = '/\[(http\S+)\s+([^\]]+)\](?:\s+from(?:\s+the)?\s+\[\[(.*?)\]\])?/';

preg_match_all($regex, $input, $matches, PREG_SET_ORDER);
var_dump($matches);

Output:

array(6) {
  [0]=>
  array(4) {
    [0]=>
    string(78) "[http://www.bbc.co.uk/news/world-europe-17298730 France] from the [[BBC News]]"
    [1]=>
    string(47) "http://www.bbc.co.uk/news/world-europe-17298730"
    [2]=>
    string(6) "France"
    [3]=>
    string(8) "BBC News"
  }
  ...
  ...
  ...
  ...
  ...
}

Explanation:

\[       (?# match [ literally)
(        (?# start capture group)
  http   (?# match http literally)
  \S+    (?# match 1+ non-whitespace characters)
)        (?# end capture group)
\s+      (?# match 1+ whitespace characters)
(        (?# start capture group)
  [^\]]+ (?# match 1+ non-] characters)
)        (?# end capture group)
\]       (?# match ] literally)
(?:      (?# start non-capturing group)
  \s+    (?# match 1+ whitespace characters)
  from   (?# match from literally)
  (?:    (?# start non-capturing group)
    \s+  (?# match 1+ whitespace characters)
    the  (?# match the literally)
  )?     (?# end optional non-capturing group)
  \s+    (?# match 1+ whitespace characters)
  \[\[   (?# match [[ literally)
  (      (?# start capturing group)
    .*?  (?# lazily match 0+ characters)
  )      (?# end capturing group)
  \]\]   (?# match ]] literally)
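)?       (?# end optional non-capturing group)

Let me know if you need a more thorough explanation, but my comments above should help. If you have any specific questions, I'm happy to help. The link below will help you visualize what the expression is doing.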

Regex101
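
To show how the matches might be consumed, here is a minimal sketch that loops over the `PREG_SET_ORDER` result and builds one record per link. The shortened `$input` and the array keys `url`, `label` and `source` are illustrative choices, not part of the answer above, and the `??` operator assumes PHP 7 or later.

```php
<?php
// Shortened sample taken from the question's input.
$input = "* [http://www.bbc.co.uk/news/world-europe-17298730 France] from the [[BBC News]] "
       . "* [http://stats.oecd.org/Index.aspx?QueryId=14594 OECD France statistics]";
$regex = '/\[(http\S+)\s+([^\]]+)\](?:\s+from(?:\s+the)?\s+\[\[(.*?)\]\])?/';

preg_match_all($regex, $input, $matches, PREG_SET_ORDER);

$links = [];
foreach ($matches as $m) {
    $links[] = [
        'url'    => $m[1],        // group 1: the URL
        'label'  => $m[2],        // group 2: the link text
        // Group 3 is optional. PHP leaves trailing unmatched groups out of
        // the match array, so fall back to null when it is missing.
        'source' => $m[3] ?? null,
    ];
}

print_r($links);
```

Note that the pattern only picks up a source introduced by "from" or "from the"; entries such as "at the [[European Union|EU]]" yield a null `source` with this expression.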

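Answer 1 (score: 1)

A solution that does not use regex:

  1. Explode the string on '*'
  2. Discard the parts that start with '{' (the code keeps only the parts that start with '[')
  3. Remove all brackets
  4. Explode the string on ' ' (space)
  5. The first part is the link
  6. Glue the rest back together to get the description

Code: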

    $parts=explode('*',$str);
    $links=array();
    foreach($parts as $k=>$v){
        $parts[$k]=ltrim($v);
        if(substr($parts[$k],0,1)!=='['){
            unset($parts[$k]);
            continue;
            }
        $parts[$k]=preg_replace('/\[|\]/','',$parts[$k]);
        $subparts=explode(' ',$parts[$k]);
        $links[$k][0]=$subparts[0];
            unset($subparts[0]);
        $links[$k][1]=implode(' ',$subparts);
        }
    
    echo '<pre>'.print_r($links,true).'</pre>';
    

Result:

    Array
    (
        [1] => Array
            (
                [0] => http://www.bbc.co.uk/news/world-europe-17298730
                [1] => France from the BBC News 
            )
    
        [2] => Array
            (
                [0] => http://ucblibraries.colorado.edu/govpubs/for/france.htm
                [1] => France at ''UCB Libraries GovPubs'' 
            )
    
        [4] => Array
            (
                [0] => http://www.britannica.com/EBchecked/topic/215768/France
                [1] => France ''Encyclopædia Britannica'' entry 
            )
    
        [5] => Array
            (
                [0] => http://europa.eu/about-eu/countries/member-countries/france/index_en.htm
                [1] => France at the European Union|EU 
            )
    
        [8] => Array
            (
                [0] => http://www.ifs.du.edu/ifs/frm_CountryProfile.aspx?Country=FR
                [1] => Key Development Forecasts for France from International Futures ;Economy 
            )
    
        [10] => Array
            (
                [0] => http://stats.oecd.org/Index.aspx?QueryId=14594
                [1] => OECD France statistics 
            )
    
    )
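
The descriptions in the result still contain wiki markup such as `European Union|EU` and the `''` italic markers. If plain text like "France from the BBC News" is wanted, the description could be cleaned before the brackets are stripped. The sketch below is one possible way to do that; `cleanDescription` is a hypothetical helper, not part of the answer, and it assumes piped links should be reduced to their display text.

```php
<?php
// Hypothetical helper (not part of the answer above): turn the wiki markup
// that follows a link into plain text.
function cleanDescription(string $text): string
{
    // [[Target|Label]] -> Label, [[Target]] -> Target
    $text = preg_replace('/\[\[(?:[^\]|]*\|)?([^\]]*)\]\]/', '$1', $text);
    // Drop any remaining single brackets and the '' italic markers.
    $text = str_replace(['[', ']', "''"], '', $text);
    // Collapse runs of whitespace and trim both ends.
    return trim(preg_replace('/\s+/', ' ', $text));
}

echo cleanDescription("France] from the [[BBC News]] "), "\n";
echo cleanDescription("France] at the [[European Union|EU]] "), "\n";
echo cleanDescription("France] at ''UCB Libraries GovPubs'' "), "\n";
```

In the loop above, this would have to be applied to each part before the `preg_replace` call that removes all brackets, since the piped-link rule depends on the `[[` and `]]` delimiters still being present.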