I have a URL grabber set up, and it works fine. It grabs the URL of a document from the response header, for example:
<script type='text/javascript' language='JavaScript'>
document.location.href = 'http\x3a\x2f\x2fcms.example.com\x2fd\x2fd\x2fworkspace\x2fSpacesStore\x2f61d96949-b8fb-43f1-adaf-0233368984e0\x2fFinancial\x2520Agility\x2520Report.pdf\x3fguest\x3dtrue'
</script>
Here is my grabber script.
<?php
set_time_limit(0);
$target_url = $_POST['to'];
$html =file_get_contents($target_url);
$pattern = "/document.location.href = '([^']*)'/";
preg_match($pattern, $html, $matches, PREG_OFFSET_CAPTURE, 3);
$raw_url = $matches[1][0];
$eval_url = '$url = "'.$raw_url.'";';
eval($eval_url);
echo $url;
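We had to add a variable to our document management system, so every document URL now needs ?guest=true at the end. Once we did this, my grabber returned the full URL and appended the parameter to the file name. So I tried to make it grab the URL only up to the ?guest=true part, using this code: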
<?php
set_time_limit(0);
$target_url = $_POST['to'];
$html =file_get_contents($target_url);
$pattern = "/document.location.href = '([^']*)\x3fguest\x3dtrue'/";
preg_match($pattern, $html, $matches, PREG_OFFSET_CAPTURE, 3);
$raw_url = $matches[1][0];
$eval_url = '$url = "'.$raw_url.'";';
eval($eval_url);
echo $url;
Why does it not return the URL up to the ?guest=true part? In other words, why does this not work, and what is the fix?
Answer 0 (score: 1)
Here is the solution. You get the match directly, rather than through a capture group.
set_time_limit(0);
$target_url = $_POST['to'];
$html = file_get_contents($target_url);
$pattern = '/(?<=document\.location\.href = \').*?(?=\\\\x3fguest\\\\x3dtrue)/';
preg_match($pattern, $html, $matches);
$raw_url = $matches[0];
$eval_url = '$url = "'.$raw_url.'";';
eval($eval_url);
echo $url;
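You can see the result here.
The problem with your regex is that you did not escape certain characters in the string (. and \) that you want to match literally. Also, you do not need PREG_OFFSET_CAPTURE or the offset of 3. I guess you copied those values from the example on this page.
Here is an explanation of the regex pattern: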
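To see why the second pattern in the question finds nothing, it can help to print what PHP actually hands to the regex engine. The sketch below is an illustration rather than part of the original answer: the sample line is a shortened, made-up version of the question's response, and the variable names are hypothetical. Inside a double-quoted PHP string, \x3f and \x3d are decoded to ? and = before preg_match sees them, so the pattern looks for guest=true instead of the literal text \x3fguest\x3dtrue.

```php
<?php
// Sample modelled on the question's response (shortened, illustrative path).
// A nowdoc keeps every backslash exactly as written.
$html = <<<'HTML'
document.location.href = 'http\x3a\x2f\x2fcms.example.com\x2fReport.pdf\x3fguest\x3dtrue'
HTML;

// Pattern from the question: PHP decodes \x3f and \x3d inside double quotes.
$questionPattern = "/document.location.href = '([^']*)\x3fguest\x3dtrue'/";
echo $questionPattern, PHP_EOL; // prints /document.location.href = '([^']*)?guest=true'/

// The decoded pattern needs the text "guest=true'", which the response does not contain.
$questionMatched = preg_match($questionPattern, $html, $m1);

// In single quotes, \\\\ reaches the regex engine as \\, i.e. one literal backslash.
$fixedPattern = '/document\.location\.href = \'([^\']*?)\\\\x3fguest\\\\x3dtrue\'/';
$fixedMatched = preg_match($fixedPattern, $html, $m2);

// Group 1 is the still-escaped URL without the guest parameter.
$captured = $fixedMatched === 1 ? $m2[1] : null;
echo $captured, PHP_EOL;
```

The same double escaping applies to any backslash that has to survive both PHP string parsing and regex parsing.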
# (?<=document\.location\.href = ').*?(?=\\x3fguest\\x3dtrue)
#
# Assert that the regex below can be matched, with the match ending at this position (positive lookbehind) «(?<=document\.location\.href = ')»
#    Match the characters “document” literally «document»
#    Match the character “.” literally «\.»
#    Match the characters “location” literally «location»
#    Match the character “.” literally «\.»
#    Match the characters “href = '” literally «href = '»
# Match any single character that is not a line break character «.*?»
#    Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
# Assert that the regex below can be matched, starting at this position (positive lookahead) «(?=\\x3fguest\\x3dtrue)»
#    Match the character “\” literally «\\»
#    Match the characters “x3fguest” literally «x3fguest»
#    Match the character “\” literally «\\»
#    Match the characters “x3dtrue” literally «x3dtrue»
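The pattern explained above can be tried end to end on a sample line. The following is a minimal sketch under stated assumptions: the sample input is a shortened, made-up version of the question's response, and the \xNN escapes are decoded with preg_replace_callback instead of eval. That decoding step is a swapped-in technique, not what the answer uses; it assumes only two-digit \xNN escapes occur, and it avoids executing text fetched from a remote page.

```php
<?php
// Sample line modelled on the question's response (shortened, illustrative path).
$html = <<<'HTML'
document.location.href = 'http\x3a\x2f\x2fcms.example.com\x2fFinancial\x2520Agility\x2520Report.pdf\x3fguest\x3dtrue'
HTML;

// The lookbehind anchors the start; the lookahead stops before the literal \x3fguest\x3dtrue.
$pattern = '/(?<=document\.location\.href = \').*?(?=\\\\x3fguest\\\\x3dtrue)/';

$url = null;
if (preg_match($pattern, $html, $matches) === 1) {
    // Assumption: only two-digit \xNN escapes occur, so they can be decoded without eval().
    $url = preg_replace_callback(
        '/\\\\x([0-9a-fA-F]{2})/',          // a literal backslash, "x", two hex digits
        function (array $m): string {
            return chr(hexdec($m[1]));      // turn the hex pair into its character
        },
        $matches[0]
    );
}
echo $url, PHP_EOL; // prints http://cms.example.com/Financial%20Agility%20Report.pdf
```

Checking the return value of preg_match before reading $matches also avoids an undefined-index notice when the page layout changes.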
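This answer has been edited to reflect the update to the question.
Answer 1 (score: 0)
It looks like your regex is wrong. You added \?guest=true to the regex, which matches ?guest=true literally.
In your example response header, the URL ends with \x3fguest\x3dtrue, which is different.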
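Try:
$pattern="/document.location.href = '([^']*)(\?|(\\\\x3f))guest(=|(\\\\x3d))true'/";
I just replaced the following subexpressions:
\? is now (\?|(\\x3f)), which matches either ? or the literal text \x3f
= is now (=|(\\x3d)), which matches either = or the literal text \x3d
That way, the pattern still matches whether ? and = appear as themselves or in their escaped hex representation. Note that inside the double-quoted PHP string each literal backslash is written as \\\\, because PHP and the regex engine each consume one level of escaping. With only \\x3f in the PHP source, the regex engine receives \x3f and treats it as a hex escape for ? itself.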
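An alternation of this kind can be checked by running it against both spellings of the URL. The sketch below is illustrative only: the two sample strings are made up, non-capturing groups are used for brevity, and the literal backslash is written as \\\\ in the PHP source on the assumption that the goal is to match the text \x3f itself rather than the character it encodes.

```php
<?php
// Accept either the plain characters or their escaped \xNN spelling.
$pattern = "/document\.location\.href = '([^']*)(?:\?|\\\\x3f)guest(?:=|\\\\x3d)true'/";

// Illustrative inputs: one escaped as in the question, one with plain ? and =.
$escaped = <<<'HTML'
document.location.href = 'http\x3a\x2f\x2fcms.example.com\x2fReport.pdf\x3fguest\x3dtrue'
HTML;
$plain = "document.location.href = 'http://cms.example.com/Report.pdf?guest=true'";

$results = [];
foreach (['escaped' => $escaped, 'plain' => $plain] as $label => $html) {
    // Group 1 holds everything before the guest parameter, or null when nothing matches.
    $results[$label] = preg_match($pattern, $html, $m) === 1 ? $m[1] : null;
}
print_r($results);
```

Both inputs should yield the URL without the guest parameter; the escaped one still needs its \xNN sequences decoded afterwards.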