Question

Goal: I want to scrape the word "Paris" from inside an iframe using cURL.

Say you have a simple page that contains an iframe:

<html>
<head>
<title>Curl into this page</title>
</head>
<body>

<iframe src="france.html" title="test" name="test"></iframe>

</body>
</html>

The iframe page:

<html>
<head>
<title>France</title>
</head>
<body>

<p>The Capital of France is: Paris</p>

</body>
</html>

My cURL script:

<?php

// 1. initialize

$ch = curl_init();

// 2. The URL containing the iframe

$url = "http://localhost/test/index.html";

// 3. set the options, including the url

curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 2);
curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);

// 4. execute and fetch the resulting HTML output by putting into $output

$output = curl_exec($ch);

// 5. free up the curl handle

curl_close($ch);

// 6. Scrape for a single string/word ("Paris") 

preg_match("'The Capital of France is:(.*?). </p>'si", $output, $match);
if($match) 

// 7. Display the scraped string 

echo "The Capital of France is: ".$match[1];

?>

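Result = nothing!

Can someone help me find out the capital of France?! ;)

I need an example of:

Parsing/grabbing the iframe URL
cURLing that URL (as I've done with the index.html page)
Parsing for the string "Paris"

Thanks!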

Answer 1

--Edit-- You could load the page contents into a string, parse that string for the iframe, and then load the iframe source into another string.

$wrapperUrl = 'http://localhost/test/index.html';
$wrapperPage = file_get_contents($wrapperUrl);

// Capture the value of the iframe's src attribute
$pattern = '/<iframe[^>]*\ssrc="([^"]+)"/i';

if (!preg_match($pattern, $wrapperPage, $matches)) {
    throw new Exception('No match found!');
}

$src = trim($matches[1]);

// The src here is relative ("france.html"), so resolve it against the wrapper page's directory
if (!preg_match('#^https?://#i', $src)) {
    $src = dirname($wrapperUrl) . '/' . $src;
}

$iframeContents = file_get_contents($src);

var_dump($iframeContents);
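
As an alternative to pulling the src attribute out with a regular expression, here is a minimal sketch that uses PHP's DOM extension. The helper names extractIframeSrc and resolveUrl are made up for this illustration, and the URL resolution is deliberately simplistic (it assumes an http(s) base URL and does not handle "../" segments or protocol-relative "//host/..." links).

```php
<?php
// Hypothetical helpers; the names are illustrative and not part of any library.

// Return the src of the first <iframe> that has a non-empty src, or null if there is none.
function extractIframeSrc(string $html): ?string
{
    if (trim($html) === '') {
        return null; // DOMDocument::loadHTML() rejects empty input
    }

    $previous = libxml_use_internal_errors(true); // silence warnings about sloppy HTML
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('iframe') as $iframe) {
        $src = trim($iframe->getAttribute('src'));
        if ($src !== '') {
            return $src;
        }
    }
    return null;
}

// Resolve $relative against $base (simplified: absolute URLs, "/path" and "file.html" only).
function resolveUrl(string $base, string $relative): string
{
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $relative)) {
        return $relative; // already absolute
    }

    $parts  = parse_url($base);
    $origin = $parts['scheme'] . '://' . $parts['host']
            . (isset($parts['port']) ? ':' . $parts['port'] : '');

    if (strpos($relative, '/') === 0) {
        return $origin . $relative; // root-relative path
    }

    $path = $parts['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1); // keep everything up to the last slash
    return $origin . $dir . $relative;
}
```

With the pages from the question, extractIframeSrc() applied to the contents of index.html should give "france.html", and resolveUrl('http://localhost/test/index.html', 'france.html') turns that into the URL you actually need to fetch.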

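--Original--

First, improve your accept rate (accept answers to your previously answered questions).

The URL you are giving your cURL handle is the file that wraps the iframe; try setting it to the URL of the iframe itself: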

$url = "http://localhost/test/france.html";

Answer 2

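Note that occasionally, for various reasons, an iframe page cannot be read by cURL outside the context of its own site, and fetching it directly returns some kind of "cannot be read directly or externally" error message.

In those cases you can use curl_setopt($ch, CURLOPT_REFERER, $fullpageurl); (assuming you're in PHP and reading the text with curl_exec). The request then looks to the server as if it came from the original page, and you can read the source.

So if, for whatever reason, france.html cannot be read outside the context of the larger page that contains it as an iframe, you can still get the source with the method above by setting CURLOPT_REFERER to the main page (test/index.html in the original question).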
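
A minimal sketch of the Referer approach described above. The function name fetchWithReferer and the 10-second timeout are my own choices for illustration; only the curl_* calls and the CURLOPT_* constants are real.

```php
<?php
// Hypothetical helper: fetch $url with cURL, optionally sending a Referer header.
function fetchWithReferer(string $url, ?string $referer = null): string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);           // assumed value, adjust as needed

    if ($referer !== null) {
        // Make the request look as if it was triggered from the wrapper page
        curl_setopt($ch, CURLOPT_REFERER, $referer);
    }

    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('cURL request failed: ' . curl_error($ch));
    }

    // The handle is released when $ch goes out of scope.
    return $body;
}
```

For the pages in the question, the call would be fetchWithReferer('http://localhost/test/france.html', 'http://localhost/test/index.html'). Whether the Referer header is enough depends entirely on how the server decides to block direct requests, so treat this as something to try rather than a guaranteed fix.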

Answer 3

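To answer your regex question, your pattern doesn't match the input text:

          <p>The Capital of France is: Paris</p>

The pattern requires one extra character and a space before the closing paragraph tag, and the input has neither, so it can never match:

preg_match("'The Capital of France is:(.*?). </p>'si", $output, $match);

You should put the space before the capture group and remove the extra . after it:

preg_match("'The Capital of France is: (.*?)</p>'si", $output, $match);

To allow optional whitespace in either of those two positions, use \s* instead:

preg_match("'The Capital of France is:\s*(.*?)\s*</p>'si", $output, $match);

You could also make the capture group more specific with (\w+), which matches only word characters (letters, digits, and underscores).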
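
To make the difference concrete, here is a small sketch that runs the asker's original pattern and a (\w+) variant against the iframe markup from the question. The function name extractCapital is hypothetical, and the variant uses / as the delimiter instead of the single quote used above.

```php
<?php
// Hypothetical helper: pull the capital out of the France page, or return null.
function extractCapital(string $html): ?string
{
    // \s* tolerates optional whitespace on both sides; (\w+) captures one word
    $pattern = '/The Capital of France is:\s*(\w+)\s*<\/p>/i';

    if (preg_match($pattern, $html, $match) === 1) {
        return $match[1];
    }
    return null;
}

$html = '<p>The Capital of France is: Paris</p>';

// The pattern from the question: 0 matches, because there is no " </p>" in the input
$originalMatches = preg_match("'The Capital of France is:(.*?). </p>'si", $html);

// The corrected pattern finds the word
$capital = extractCapital($html);
```

Keep in mind that (\w+) stops at the first non-word character, so it would only capture "New" from a value like "New Delhi"; the (.*?) version with \s* on both sides is the safer choice if the value can contain spaces.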

How to scrape iframe content using cURL
