Question

I have finally got as far as parsing Wikipedia's wiki text. I have the following kind of text here:

{{Airport-list|the Solomon Islands}}

* '''AGAF''' (AFT) &ndash; [[Afutara Airport]] &ndash; [[Afutara]]
* '''AGAR''' (RNA) &ndash; [[Ulawa Airport]] &ndash; [[Arona]], [[Ulawa Island]]
* '''AGAT''' (ATD) &ndash; [[Uru Harbour]] &ndash; [[Atoifi]], [[Malaita]]
* '''AGBA''' &ndash; [[Barakoma Airport]] &ndash; [[Barakoma]]

I need to retrieve, into a single array, all lines that start with the pattern

* '''

I think a regular expression is called for here, but I really mess up the regex part.

In another example, I have the following text:

{{otheruses}}
{{Infobox Settlement
|official_name          = Doha
|native_name        = {{rtl-lang|ar|الدوحة}} ''ad-Dawḥa''
|image_skyline          = Doha Sheraton.jpg
|imagesize              = 
|image_caption          = West Bay at night
|image_map              = QA-01.svg
|mapsize                = 100px
|map_caption            = Location of the municipality of Doha within [[Qatar]].
|pushpin_map            =
|pushpin_label_position = 
|pushpin_mapsize        = 
|subdivision_type       = [[Countries of the world|Country]]
|subdivision_name       = [[Qatar]]
|subdivision_type1      = [[Municipalities of Qatar|Municipality]]
|subdivision_name1      = [[Ad Dawhah]]
|established_title      = Established
|established_date       = 1850
|area_total_km2         = 132
|area_total_sq_mi       = 51
|area_land_km2          = 
|area_land_sq_mi        = 
|area_water_km2         = 
|area_water_sq_mi       = 
|area_water_percent     = 
|area_urban_km2         = 
|area_urban_sq_mi       =
|area_metro_km2         = 
|area_metro_sq_mi       = 
|population_as_of       = 2004
|population_note        = 
|population_footnotes = <ref name=poptotal>[http://www.planning.gov.qa/Qatar-Census-2004/Flash/introduction.html Qatar 2004 Census]</ref>
|population_total       = 339847
|population_metro       = 998651
|population_density_km2 = 2574
|population_density_sq_mi = 6690
|latd=25 |latm=17 | lats=12 |latNS=N 
|longd=51|longm=32 | longs=0| longEW=E 
|coordinates_display    = inline,title
|coordinates_type       = type:city_region:QA
|timezone               = [[Arab Standard Time|AST]]
|utc_offset             = +3
|website                = 
|footnotes              = 
}} <!-- Infobox ends -->
'''Doha''' ({{lang-ar|الدوحة}}, ''{{transl|ar|ad-Dawḥa}}'' or ''{{unicode|ad-Dōḥa}}'') is the [[capital city]] of [[Qatar]].  It has a population of 400,051 according to the 2005 census,<ref name="autogenerated1">[http://www.hotelrentalgroup.com/Qatar/Sheraton%20Doha%20Hotel%20&%20Resort.htm Sheraton Doha Hotel & Resort | Hotel discount bookings in Qatar<!-- Bot generated title -->]</ref> and is located in the [[Ad Dawhah]] municipality on the [[Persian Gulf]].  Doha is Qatar's largest city, with over 80% of the nation's population residing in Doha or its surrounding [[suburbs]], and is also the economic center of the country. 
It is also the seat of government of Qatar, which is ruled by [[Sheikh Hamad bin Khalifa Al Thani]]–the current ruling Emir of Qatar.

I need to extract the infobox here. The infobox is all the text between, and including, the first occurrence of

{{Infobox Settlement

and the first occurrence of

}} <!-- Infobox ends -->

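I am completely lost when it comes to regular expressions, and I could use some help here. I am using PHP.

EDIT! HELP!

I have been fighting with this for 40 hours and I cannot get the stupid regex to work :( So far I only have this: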

{{Infobox[^\b(\r|\n)}}(\r|\n)\b]*[\b(\r|\n)}}(\r|\n)(\r|\n)\b]

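But it does not work. I want it to read all the string data that starts with {{Infobox and ends with \n}}\n

I am using PHP and cannot get it to work :( It just returns everything up to the first occurrence of }}, ignoring the fact that I want it to stop only at a }} that is preceded by a line break. Please help before I lose any more of my sanity over this :'(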

Answer 1

MediaWiki is open source. Have a look at their source code... ;-)

Answer 2

I need to extract the infobox...

Try this, and this time make sure dotall mode is enabled:

\{\{Infobox.*?(?=\}\} <!-- Infobox ends -->)

Again, the explanation:

(?xs)    # x=comment mode, s=dotall mode
\{\{     # two opening braces (special char, so needs escaping here.)
Infobox  # literal text
.*?      # any char (including newlines), non-greedily match zero or more times.
(?=      # begin positive lookahead
\}\}[ ]  # two closing braces and a space (a bare space is ignored in comment mode)
<!--[ ]Infobox[ ]ends[ ]--> # literal text, with each space bracketed
)        # end positive lookahead

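This matches up to (but not including) the end expression. If you want the match to include the ending, remove the lookahead wrapper and keep only its contents.

Update, in response to the comments: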

\{\{Infobox.*?(?=\n\}\}\n)

Same as above, but the lookahead looks for the two closing braces at the start of their own line.

To optionally allow the comment as well, use:

\{\{Infobox.*?(?=\n\}\}(?: <!-- Infobox ends -->)?\n)
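Since the question is about PHP, here is a minimal sketch of how the last pattern might be applied with preg_match. The sample $wikitext is a shortened, hypothetical stand-in for the real page source. The use of \R instead of \n is my own assumption, added so that \r\n line endings are also accepted, and the comment text is written with the same spacing as in the question.

```php
<?php
// Hypothetical, shortened sample; in practice $wikitext holds the full page source.
$wikitext = "{{otheruses}}\n"
    . "{{Infobox Settlement\n"
    . "|official_name = Doha\n"
    . "|native_name = {{rtl-lang|ar|الدوحة}} ''ad-Dawḥa''\n"
    . "|utc_offset = +3\n"
    . "}} <!-- Infobox ends -->\n"
    . "'''Doha''' is the [[capital city]] of [[Qatar]].\n";

// The "s" modifier (dotall) lets "." match newlines.
// "\R" matches any line break (\n, \r\n or \r); this is an assumption, not part of the answer.
// The lookahead stops the lazy match at the first "}}" that starts a line,
// optionally followed by the trailing comment.
$pattern = '/\{\{Infobox.*?(?=\R\}\}(?: <!-- Infobox ends -->)?\R)/s';

$infobox = null;
if (preg_match($pattern, $wikitext, $m) === 1) {
    $infobox = $m[0]; // everything from "{{Infobox" up to, but not including, the closing line
}
```

The }} inside the native_name line does not end the match, because it is not preceded by a line break. To include the closing braces in the result, the lookahead could be turned into an ordinary group.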

Answer 3

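I think the best approach is to merge all the lines into a single string, especially for the infobox.

Then something like

$reg = "\n(\* '''[^\n]*)";

for the first part (everything after a newline that starts with * ''' and is not itself a newline).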

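For the second part I am not sure right now, but this is a good place to play around: http://www.solmetra.com/scripts/regex/index.php

And here is a short reference for regular expression syntax: http://www.regular-expressions.info/reference.html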

Answer 4

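I need to retrieve, into a single array, all lines that start with the pattern * '''

Enable multiline mode, make sure dotall mode is disabled, and use this: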

^\* '''.*$

Dissected, the expression is:

(?xm-s) # Flags:
        # x enables comment mode (spaces ignored, hashes start comments)
        # m enables multiline mode (^ and $ match at line boundaries)
        # -s disables dotall (so . does not match newline)
^       # start of line
\*      # literal asterisk
[ ]     # literal space (needs brackets in comment mode, but not otherwise)
'''     # three literal apostrophes
.*      # any character (excluding newline), greedily matched zero or many times.
$       # end of line
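As a rough illustration in PHP, the expression might be used with preg_match_all to collect the matching lines into one array. The sample text is cut down from the question, and the variable names are hypothetical.

```php
<?php
// Sample cut down from the airport list in the question.
$wikitext = "{{Airport-list|the Solomon Islands}}\n"
    . "\n"
    . "* '''AGAF''' (AFT) &ndash; [[Afutara Airport]] &ndash; [[Afutara]]\n"
    . "* '''AGAR''' (RNA) &ndash; [[Ulawa Airport]] &ndash; [[Arona]], [[Ulawa Island]]\n"
    . "* '''AGBA''' &ndash; [[Barakoma Airport]] &ndash; [[Barakoma]]\n";

// The "m" modifier makes ^ and $ match at line boundaries.
// The "s" modifier is left off, so "." stops at the end of each line.
$count = preg_match_all("/^\\* '''.*$/m", $wikitext, $matches);

// $matches[0] holds the full text of every matching line.
$lines = $matches[0];
```

If the input uses \r\n line endings, each captured line would still carry a trailing \r, so passing the results through rtrim may be needed.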
