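I'm trying to get good at PHP web scraping. As a test, I have information being scraped from one site and echoed on another, but I can't get the output to include the original links from the source markup, which is what I'd ideally like. Any ideas on how to achieve this with what I have? (I'm pretty new to PHP, btw.)
Here is the PHP code: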
// news
$doc = new DOMDocument;
// We don't want to bother with white spaces
$doc->preserveWhiteSpace = false;
// Most HTML Developers are chimps and produce invalid markup...
$doc->strictErrorChecking = false;
$doc->recover = true;
$doc->loadHTMLFile('https://www.usatoday.com/');
$xpath = new DOMXPath($doc);
$query = "//ul[@class='hfwmm-list hfwmm-4uphp-list hfwmm-light-list']";
$entries = $xpath->query($query);
foreach ($entries as $entry) {
echo trim($entry->textContent); // use `trim` to eliminate spaces
}
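The code is spitting out this: NBA Cavs win record-breaking Game 4 behind Irving's 40 ENTERTAINMENT Watch: 'Black Panther' trailer unleashes a fearsome king NEWS Police: London Bridge terrorists plotted even more bloodshed How Trump accentuates the divide .........
What I really want is for these to come out as working links, which is what they are in the original markup. This is the source code for that information: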
<div class="partner-heroflip-ad partner-placement ui-flip-panel size-xxs"><a
href="#" class="partner-close"></a></div></div><p class="hfwmm-tertiary-
list-title hfwmm-light-tertiary-list-title">TOP STORIES</p><ul class="hfwmm-
list hfwmm-4uphp-list hfwmm-light-list"
data-track-prefix="flex4uphphero"><li class="hfwmm-item hfwmm-secondary-item
hfwmm-item-2 sports-theme-bg hfwmm-first-secondary-item hfwmm-4uphp-
secondary-item"
data-asset-position="1"
data-asset-id="102694848"
><a class="js-asset-link hfwmm-list-link hfwmm-light-list-link hfwmm-image-
link hfwmm-secondary-link
href="/story/sports/nba/2017/06/10/kyrie-irving-lebron-james-cavs-win-game-
4/102694848/"
data-track-display-type="thumb"
data-ht="flex4uphpherostack1"
data-asset-id="102694848"
><span class="hfwmm-image-gradient hfwmm-secondary-image-gradient"></span>
<span class="js-asset-section theme-bg-ssts-label hfwmm-ssts-label-top-left
hfwmm-ssts-label-secondary sports-theme-bg">NBA</span><img
src="https://www.gannett-cdn.com/-
mm-/cd17823b265aa373c83094fc75525710f645ec90/c=0-178-4072-
81338209183-USP-NBA-FINALS-GOLDEN-STATE-WARRIORS-AT-CLEVELAND-91573076.JPG"
class="hfwmm-image hfwmm-secondary-image js-asset-image placeholder-hide"
alt="Kyrie Irving reacts after making a basket against the"
data-id="102695338"
data-crop="16_9"
width="239"
height="135" /><span class="hfwmm-secondary-hed-wrap hfwmm-secondary-text-
hed-wrap"><span class="hfwmm-text-hed-icon js-asset-disposable"></span><span
title="Cavs win record-breaking Game 4 behind Irving's 40"
class="js-asset-headline hfwmm-list-hed hfwmm-secondary-hed placeholder-
hide">
Cavs win record-breaking Game 4 behind Irving's 40
hfwmm-item-3 life-theme-bg hfwmm-4uphp-secondary-item"
data-asset-position="2"
For sanity's sake, the href above is href="/story/sports/nba/2017/06/10/kyrie-irving-lebron-james-cavs-win-game-4/102694848/"
Any ideas on how to make this happen in this test scenario would be very helpful. Thank you so much. -wilson
Answer 0 (score: 1)
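You need to output the element itself as a string. Right now you are only extracting the element's text, which is not the same thing as its markup. An element might be <a>some text</a>, whereas its text is just some text. The textContent property returns only the text, so the tag and its href attribute are dropped.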
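To make the difference concrete, here is a minimal sketch that compares the two on a tiny, made-up HTML string. The markup, the class name `stories`, and the variable names are assumptions for illustration only; `DOMDocument::saveHTML()` accepts a node argument as of PHP 5.3.6.

```php
<?php
// Hypothetical sample markup standing in for the scraped page.
$html = '<ul class="stories"><li><a href="/story/1/">First headline</a></li></ul>';

$doc = new DOMDocument();
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);
$link = $xpath->query("//ul[@class='stories']//a")->item(0);

// textContent: only the text inside the element, no tag, no attributes.
$textOnly = trim($link->textContent);

// saveHTML($node): the element serialized as markup, href included.
$markup = $doc->saveHTML($link);

echo $textOnly, "\n"; // First headline
echo $markup, "\n";   // <a href="/story/1/">First headline</a>
```

Passing the node to `saveHTML()` on the document it already belongs to is one possible way to serialize it; the answer's code below achieves the same result by importing each node into a fresh document.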
To output the tags, use...
// Select only the <a> elements inside the target list
$query = "//ul[@class='hfwmm-list hfwmm-4uphp-list hfwmm-light-list']//a";
$entries = $xpath->query($query);
foreach ($entries as $entry) {
// Copy the link (with its children) into a fresh document so that
// saveHTML() serializes just this element
$newdoc = new DOMDocument();
$cloned = $entry->cloneNode(TRUE);
$newdoc->appendChild($newdoc->importNode($cloned,TRUE));
echo $newdoc->saveHTML();
}
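Also note that I added //a to the end of the XPath expression to restrict the selection to the links inside the section you are fetching. This may or may not be what you want, but check the results and see.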
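One thing to watch with the expression above: @class='...' is an exact string comparison, so it stops matching if the site reorders or adds a class. A possible variation, sketched here on made-up markup, matches a single class token instead using the standard XPath 1.0 contains/concat/normalize-space idiom. The sample HTML and the choice of hfwmm-list as the token to match are assumptions for illustration.

```php
<?php
// Hypothetical sample markup; the class names mirror the ones in the question.
$html = <<<HTML
<ul class="hfwmm-list hfwmm-4uphp-list hfwmm-light-list">
  <li><a href="/story/a/">Story A</a></li>
  <li><a href="/story/b/">Story B</a></li>
</ul>
<ul class="other-list"><li><a href="/elsewhere/">Elsewhere</a></li></ul>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Match any <ul> whose class list contains the token "hfwmm-list",
// regardless of the order or number of other classes, then take its links.
$query = "//ul[contains(concat(' ', normalize-space(@class), ' '), ' hfwmm-list ')]//a";

$links = [];
foreach ($xpath->query($query) as $a) {
    // Serialize each matched <a> element as markup.
    $links[] = $doc->saveHTML($a);
}

echo implode("\n", $links), "\n";
```

Only the two links inside the first list are collected; the link in other-list is skipped because its class attribute does not contain the token.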
Edit:
To manipulate the href, use something like...
foreach ($entries as $entry) {
// The scraped href is site-relative, so prefix it with the host
// ("http://someserver.com" is a placeholder for the real site)
$oldHref = $entry->getAttribute("href");
$entry->setAttribute("href", "http://someserver.com".$oldHref);
$newdoc = new DOMDocument();
$cloned = $entry->cloneNode(TRUE);
$newdoc->appendChild($newdoc->importNode($cloned,TRUE));
echo $newdoc->saveHTML();
}
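Blindly prepending the host would also mangle any href that is already absolute. As a rough sketch, the prefixing could be wrapped in a small helper that only touches root-relative paths. The function name absolutizeHref and the sample markup are hypothetical; the base URL is the site from the question. It deliberately does not handle "../" paths, fragments, or other relative forms.

```php
<?php
// Hypothetical helper: prefix root-relative hrefs with a base URL.
function absolutizeHref(string $href, string $base): string
{
    // Only root-relative paths ("/story/...") are rewritten; protocol-relative
    // URLs ("//cdn...") and everything else are returned unchanged.
    if (strncmp($href, '/', 1) === 0 && strncmp($href, '//', 2) !== 0) {
        return rtrim($base, '/') . $href;
    }
    return $href;
}

// Made-up markup with one relative and one absolute link.
$html = '<ul><li><a href="/story/sports/">Sports</a></li>'
      . '<li><a href="https://example.com/x">Ext</a></li></ul>';

$doc = new DOMDocument();
$doc->loadHTML($html);

$out = [];
foreach ($doc->getElementsByTagName('a') as $a) {
    $a->setAttribute('href', absolutizeHref($a->getAttribute('href'), 'https://www.usatoday.com'));
    $out[] = $doc->saveHTML($a);
}

echo implode("\n", $out), "\n";
// <a href="https://www.usatoday.com/story/sports/">Sports</a>
// <a href="https://example.com/x">Ext</a>
```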