Question

我试图从字符串中提取并返回子字符串，如下所示：

Array ( [query] => Array ( [normalized] => Array ( [0] => Array ( [from] => lion [to] => Lion ) ) [pages] => Array ( [36896] => Array ( [pageid] => 36896 [ns] => 0 [title] => Lion [extract] => The lion (Panthera leo) is one of the four big cats in the genus Panthera and a member of the family Felidae. With some males exceeding 250 kg (550 lb) in weight, it is the second-largest living cat after the tiger. ) ) ) )

我需要的特定子字符串始终位于 [extract] =＆gt; 和））））之间。我在正则表达式上非常糟糕，并且非常感谢任何帮助！

我试过

preg_match('/\[extract\] =>(.*?)\)\)\)\)', $c,$hits);

和其他一些东西，但没有任何效果......

编辑：以下是我使用的完整代码：

$url = 'http://en.wikipedia.org/w/api.php?action=query&prop=extracts&format=txt&exsentences=2&exlimit=10&exintro=&explaintext=&iwurl=&titles=lion';

$ch = curl_init($url);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt ($ch, CURLOPT_USERAGENT, "TestScript"); // required by wikipedia.org server; use YOUR user agent with YOUR contact information. (otherwise your IP might get blocked)
$c = curl_exec($ch);

preg_match('^.*\[extract\].?\=\>.?(.*).?\).?\).?\).?\)$', $c,$hits); 

print_r ($hits);

Answer 1

考虑到在你的两个例子中你都缺少模式分隔符，实际上它不会按预期工作。您还需要匹配空白的多个帐户才能获得所需的结果。

preg_match('/\[extract\]\s*=>\s*(.*?)(?:\s*\)){4}/i', $c, $hits); 
echo $hits[1];

输出

The lion (Panthera leo) is one of the four big cats in the genus Panthera and a member of the family Felidae. With some males exceeding 250 kg (550 lb) in weight, it is the second-largest living cat after the tiger.

请参阅live demo

正则表达式：

Answer 2

只是因为你想要解决这个问题真的很有趣，这里是正则表达式（编辑所以可能有空格或不存在））：

^.*\[extract\].?\=\>.?(.*).?\).?\).?\).?\)$

但你应该定义使用数组函数来到[extract]字段，你会得到它的内容，这是你想要的部分

在此测试：http://regex101.com/r/qO5vO0

for php：

preg_match('/.*?\[extract\]\s\=\>\s(.*?)\s\)\s\)\s\)\s\)/i', $c, $hits);

echo $hits[1]; //outputs captured string

Answer 3

如果你真的必须使用正则表达式（并且正如@JavierDiaz所说它似乎有点矫枉过正）那么你可以使用它：

\[extract\](.*?)\)\s\)\s\)\s\)

你的例子似乎在字符串末尾的每个右括号之间有一个空格 - 我不确定这是否有意。如果没有，请删除\s位。

基本解释如下：

\[         A literal '['
extract    A literal string
\]         A literal ']'
(          Start of capturing group
.*?        Any characters, any number of repetitions but as few as possible (non-greedy)
)          End of capturing group
\)         A literal ')'
\s         Any whitespace
\)         A literal ')'
\s         Any whitespace
\)         A literal ')'
\s         Any whitespace
\)         A literal ')'

你可以使用Expresso或任何其他免费的RegEx编辑器（很多关于它们）来尝试。

编辑：我在编辑OP的问题之前将其搞砸了，这个问题补充说不应该包含=>。将开头更改为\[extract\].+?\s(.*?)可以解决问题，但在另一个答案中Michael Perrenoud更好地涵盖了这一点。

Answer 4

这样就可以了。您首先要查找字符串extract，然后想要使用.+=>\s找到文本的开头，然后您将获取中的所有文字(.*?)非贪婪的时尚，直到找到带有\s\)\s\)\s\)\s\)的字符串结尾：

extract.+=>\s(.*?)\s\)\s\)\s\)\s\)

Regular expression visualization

Debuggex Demo

正如Javier在Steve的帖子中所述，你也可以这样做：

extract.+=>\s(.*?)(?:\s\)){4}

Regular expression visualization

Debuggex Demo

如何使用正则表达式获取特定子字符串

4 个答案: