如何从网页的源代码中获取字符串?

时间:2014-08-09 08:06:22

标签: javascript php

如何从网页的源代码中抓取一个字符串?我已经浏览了整个PHP.net,我无法弄清楚PHP是否有一个函数或一组函数可以从中获取字符串。

例如,这就是我目前拥有的内容(我希望从"wgCategories"中获取"wgMonthNamesShort"$html中的所有内容):

<?php
error_reporting(E_ALL);
$html = file_get_contents('http://en.wikipedia.org/wiki/Los_Angeles');
$string = <>;
?>

首先,我将网页的源码抓取到$ html变量中。现在我需要一个函数或一组函数来捕获从"wgCategories""wgMonthNamesShort"的所有内容并将其存储到$ string中。

期望的结果:

$string = "wgCategories":["All articles with dead external links","Articles with dead external links from March 2013","Articles with dead external links from March 2014","Pages with broken reference names","Articles with dead external links from January 2014","Articles with dead external links from September 2011","Articles with dead external links from October 2011","CS1 errors: dates","Use mdy dates from May 2014","Wikipedia indefinitely semi-protected pages","Wikipedia indefinitely move-protected pages","Coordinates on Wikidata","Articles including recorded pronunciations","Articles containing Spanish-language text","All articles with unsourced statements","Articles with unsourced statements from December 2013","Spoken articles","Articles with hAudio microformats","Los Angeles, California","Cities in Los Angeles County, California","Communities on U.S. Route 66","County seats in California","Incorporated cities and towns in California","Populated coastal places in California","Populated places established in 1781","Port cities and towns of the United States Pacific coast","Butterfield Overland Mail in California","Stockton - Los Angeles Road"],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgMonthNamesShort";

最后请注意,从"wgCategories""wgMonthNamesShort"的所有内容都存储在<script>个标签之间(不确定这是否重要,但有人告诉我值得一提)。< / p>

如果需要澄清,请告诉我。

1 个答案:

答案 0 :(得分:0)

您可以使用preg_match s标记(DOTALL)来标记2个关键字之间的字符串:

error_reporting(E_ALL);
$html = file_get_contents('http://en.wikipedia.org/wiki/Los_Angeles');
if (preg_match('/wgCategories.*?wgMonthNamesShort/is', $html, $matches))
   echo $matches[0];

您可以使用PHP字符串函数(例如stristr

)来避免使用正则表达式

以上代码打印:

  

wgCategories":["All articles with dead external links","Articles with dead external links from March 2013","Articles with dead external links from March 2014","Pages with broken reference names","Articles with dead external links from January 2014","Articles with dead external links from September 2011","Articles with dead external links from October 2011","CS1 errors: dates","Use mdy dates from May 2014","Wikipedia indefinitely semi-protected pages","Wikipedia indefinitely move-protected pages","Coordinates on Wikidata","Articles including recorded pronunciations","Articles containing Spanish-language text","All articles with unsourced statements","Articles with unsourced statements from December 2013","Spoken articles","Articles with hAudio microformats","Los Angeles, California","Cities in Los Angeles County, California","Communities on U.S. Route 66","County seats in California","Incorporated cities and towns in California","Populated coastal places in California","Populated places established in 1781","Port cities and towns of the United States Pacific coast","Butterfield Overland Mail in California","Stockton - Los Angeles Road"],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgMonthNamesShort