Scraping a <script> tag from a web page's source code

Time: 2017-09-11 14:35:18

Tags: php screen-scraping

I am looking for a way to scrape part of a page's source code. The information I need is inside a tag like the one below.

<script>
.......
var playerIdMap = {};
playerIdMap['4'] = '614';
playerIdMap['5'] = '84';
playerIdMap['6'] = '65';
playerIdMap['7'] = '701';
getPlayerIdMap = function() { return playerIdMap; };   // global
}
enclosePlayerMap();
</script>

I am trying to get the contents of the playerIdMap entries, for example 4 and 614, or the whole line.


1 answer:

Answer 0: (score: 1)

Edit-2

Full PHP code, inspired by the code in How to get data from API - php - curl:
<?php
/**
 * Handles making a cURL request
 *
 * @param string $url         URL to call out to for information.
 * @param bool   $callDetails Optional condition to allow for extended
 *   information return including error and getinfo details.
 *
 * @return array $returnGroup cURL response and optional details.
 */
function makeRequest($url, $callDetails = false)
{
  // Set handle
  $ch = curl_init($url);

  // Set options
  curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

  // Execute curl handle add results to data return array.
  $result = curl_exec($ch);
  $returnGroup = ['curlResult' => $result,];

  // If details of curl execution are asked for add them to return group.
  if ($callDetails) {
    $returnGroup['info'] = curl_getinfo($ch);
    $returnGroup['errno'] = curl_errno($ch);
    $returnGroup['error'] = curl_error($ch);
  }

  // Close cURL and return response.
  curl_close($ch);
  return $returnGroup;
}

$url = "http://www.bullshooterlive.com/my-stats/999/";
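$response = makeRequest($url, true);

// curl_exec() returns false on failure; stop before running the regex.
if ($response['curlResult'] === false) {
  die('cURL error: ' . $response['error']);
}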

$re = '/playerIdMap\[\'(?P<id>\d+)\']\s+=\s+\'(?P<value>\d+)\'/';

preg_match_all($re, $response['curlResult'], $matches, PREG_SET_ORDER, 0);

// Print the entire match result
var_dump($matches);

//var_dump($response);

Edit-1

Sorry, I didn't realize you had asked a PHP question. I'm not sure why I assumed Scrapy here. In any case, the PHP code below should help.

$re = '/playerIdMap\[\'(?P<id>\d+)\']\s+=\s+\'(?P<value>\d+)\'/';
$str = '<script>
.......
var playerIdMap = {};
playerIdMap[\'4\'] = \'614\';
playerIdMap[\'5\'] = \'84\';
playerIdMap[\'6\'] = \'65\';
playerIdMap[\'7\'] = \'701\';
getPlayerIdMap = function() { return playerIdMap; };   // global
}
enclosePlayerMap();
</script>';

preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);

// Print the entire match result
var_dump($matches);

Previous answer

You can use something like the following:

>>> data = """
... <script>
... .......
... var playerIdMap = {};
... playerIdMap['4'] = '614';
... playerIdMap['5'] = '84';
... playerIdMap['6'] = '65';
... playerIdMap['7'] = '701';
... getPlayerIdMap = function() { return playerIdMap; };   // global
... }
... enclosePlayerMap();
... </script>
... """
>>> import re
>>>
>>> regex = r"playerIdMap\['(?P<id>\d+)']\s+=\s+'(?P<value>\d+)'"
>>> re.findall(regex, data)
[('4', '614'), ('5', '84'), ('6', '65'), ('7', '701')]
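
If a lookup table is more convenient than a list of pairs, the same pattern can feed a dictionary. The following is a minimal sketch, not part of the original answer: the helper name `parse_player_id_map` is hypothetical, and loosening `\s+` to `\s*` is an assumption that the spacing around the bracket and the equals sign may vary between pages.

```python
import re

# Sample input copied from the question; a real page would supply this text.
data = """
<script>
var playerIdMap = {};
playerIdMap['4'] = '614';
playerIdMap['5'] = '84';
playerIdMap['6'] = '65';
playerIdMap['7'] = '701';
</script>
"""

# Same idea as the pattern above, but \s* tolerates missing or extra spaces
# around the bracket and the equals sign (an assumption, see the lead-in).
PATTERN = re.compile(r"playerIdMap\s*\['(?P<id>\d+)'\]\s*=\s*'(?P<value>\d+)'")


def parse_player_id_map(text):
    """Return a dict mapping each id string to its value string."""
    return {m.group("id"): m.group("value") for m in PATTERN.finditer(text)}


player_ids = parse_player_id_map(data)
print(player_ids)
```

Keys and values stay as strings here; convert them with `int()` if numeric ids are needed.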

To access the script tag, you need to use the following:
data = response.xpath("//script[contains(text(),'getPlayerIdMap')]").extract_first() 

import re
regex = r"playerIdMap\['(?P<id>\d+)']\s+=\s+'(?P<value>\d+)'"
print(re.findall(regex, data))
[('4', '614'), ('5', '84'), ('6', '65'), ('7', '701')]
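
The XPath call above relies on a Scrapy `response` object. Where Scrapy is not available, a rough equivalent can be sketched with only the standard library's `html.parser`. The class `ScriptCollector` and the function `find_script` are hypothetical names introduced for this illustration, and the sample HTML is made up. Unlike `extract_first()`, which returns the whole element, this sketch returns only the text inside the tag.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collects the text content of every <script> element."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script text may arrive in several chunks, so append to the last entry.
        if self._in_script:
            self.scripts[-1] += data


def find_script(html, marker):
    """Return the body of the first <script> containing marker, or None."""
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()
    for body in collector.scripts:
        if marker in body:
            return body
    return None


# Made-up sample page with two script tags; only the second has the marker.
html = (
    "<html><head><script>var a = 1;</script></head><body>"
    "<script>playerIdMap['4'] = '614'; "
    "getPlayerIdMap = function() { return playerIdMap; };</script>"
    "</body></html>"
)
script = find_script(html, "getPlayerIdMap")
print(script)
```

The returned string can then be passed to `re.findall` with the same regex as before.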