PHP: include any href links contained in the text echoed from an XPath query

Time: 2017-06-10 06:29:28

Tags: php xpath hyperlink

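I'm trying to get good at PHP web scraping. As a test, I've managed to scrape this information from one site and echo it on another, but I can't get the original links from the source included, which is what I'd ideally like. Any ideas on how to achieve this with what I have? (I'm very new to PHP, btw.)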

Here is the PHP code:

// news
$doc = new DOMDocument;

// We don't want to bother with white spaces
$doc->preserveWhiteSpace = false;

// Most HTML Developers are chimps and produce invalid markup...
$doc->strictErrorChecking = false;
$doc->recover = true;

$doc->loadHTMLFile('https://www.usatoday.com/');

$xpath = new DOMXPath($doc);

$query = "//ul[@class='hfwmm-list hfwmm-4uphp-list hfwmm-light-list']";

$entries = $xpath->query($query);
foreach ($entries as $entry) {
    echo trim($entry->textContent);  // use `trim` to eliminate spaces
}

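The code is spitting out this: NBA Cavs win record-breaking Game 4 behind Irving's 40 ENTERTAINMENT Watch: 'Black Panther' trailer unleashes a fearsome king NEWS Police: London Bridge terrorists planned more bloodshed How Trump has accentuated the divide. .........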

What I really want is for these to be working links, as they are in the original markup. Here is the source for this information:

<div class="partner-heroflip-ad partner-placement ui-flip-panel size-xxs"><a 
href="#" class="partner-close"></a></div></div><p class="hfwmm-tertiary-
list-title hfwmm-light-tertiary-list-title">TOP STORIES</p><ul class="hfwmm-
list hfwmm-4uphp-list hfwmm-light-list"
data-track-prefix="flex4uphphero"><li class="hfwmm-item hfwmm-secondary-item 
hfwmm-item-2 sports-theme-bg hfwmm-first-secondary-item hfwmm-4uphp-
secondary-item"
data-asset-position="1"
data-asset-id="102694848"
 ><a class="js-asset-link hfwmm-list-link hfwmm-light-list-link hfwmm-image-
link hfwmm-secondary-link
href="/story/sports/nba/2017/06/10/kyrie-irving-lebron-james-cavs-win-game-
4/102694848/"
data-track-display-type="thumb"
data-ht="flex4uphpherostack1"
data-asset-id="102694848"                 
><span class="hfwmm-image-gradient hfwmm-secondary-image-gradient"></span>
<span class="js-asset-section theme-bg-ssts-label hfwmm-ssts-label-top-left 
hfwmm-ssts-label-secondary sports-theme-bg">NBA</span><img 
src="https://www.gannett-cdn.com/-
mm-/cd17823b265aa373c83094fc75525710f645ec90/c=0-178-4072-
81338209183-USP-NBA-FINALS-GOLDEN-STATE-WARRIORS-AT-CLEVELAND-91573076.JPG"
 class="hfwmm-image hfwmm-secondary-image js-asset-image placeholder-hide"
  alt="Kyrie Irving reacts after making a basket against the"
  data-id="102695338"
  data-crop="16_9"
  width="239"
  height="135" /><span class="hfwmm-secondary-hed-wrap hfwmm-secondary-text-
hed-wrap"><span class="hfwmm-text-hed-icon js-asset-disposable"></span><span
  title="Cavs win record-breaking Game 4 behind Irving&#39;s 40"
  class="js-asset-headline hfwmm-list-hed hfwmm-secondary-hed placeholder-
hide">
      Cavs win record-breaking Game 4 behind Irving&#39;s 40
     hfwmm-item-3 life-theme-bg hfwmm-4uphp-secondary-item"
   data-asset-position="2"

For clarity, the href in the above is href="/story/sports/nba/2017/06/10/kyrie-irving-lebron-james-cavs-win-game-4/102694848/"

Any ideas on how to achieve this in this test scenario would be very helpful. Thank you very much. -wilson

1 Answer:

Answer 0: (score: 1)

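You need to output the element as a string. At the moment you are extracting only the element's text, which is not the same as its markup. For an element such as <a>some text</a>, the text is just some text.

To output the tag itself, use...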

$query = "//ul[@class='hfwmm-list hfwmm-4uphp-list hfwmm-light-list']//a";

$entries = $xpath->query($query);
foreach ($entries as $entry) {
    $newdoc = new DOMDocument();
    $cloned = $entry->cloneNode(TRUE);
    $newdoc->appendChild($newdoc->importNode($cloned,TRUE));
    echo $newdoc->saveHTML();
    //echo trim((string)$entry);  // use `trim` to eliminate spaces
}
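
As a shorter variant of the loop above, DOMDocument::saveHTML() also accepts a node argument, so each link can be serialized without building a second document. The sketch below is only an illustration: the inline $html string and the "stories" class name are made up, standing in for the scraped page.

```php
<?php
// Hypothetical inline markup standing in for the scraped page.
$html = '<ul class="stories"><li><a href="/story/1/">First headline</a></li></ul>';

libxml_use_internal_errors(true);   // keep warnings about imperfect HTML quiet
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$link  = $xpath->query("//ul[@class='stories']//a")->item(0);

// textContent drops the tag and its attributes.
$text = trim($link->textContent);

// saveHTML() with a node argument serializes that node, tag included.
$markup = $doc->saveHTML($link);

echo $text, "\n";
echo $markup, "\n";
```

The first echo prints only the headline text, while the second prints the whole <a> element with its href.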

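Also note that I added //a to the end of the XPath expression to restrict the selection to the links inside the section you are fetching. That may or may not be what you want, so check the results.

Edit:

To manipulate the href, use something like...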
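
To see what the trailing //a changes, here is a small sketch on made-up markup (the HTML string is hypothetical, only the class names mimic the question). It also matches the list by a single class token using the common contains(concat(...)) idiom rather than the full attribute string. That is an assumption about what is convenient here, not something the original query requires.

```php
<?php
// Hypothetical markup; the class names mimic the ones in the question.
$html = '<ul class="hfwmm-list hfwmm-light-list">'
      . '<li><a href="/a/">One</a></li>'
      . '<li><span>no link</span><a href="/b/">Two</a></li>'
      . '</ul>'
      . '<p><a href="/c/">Outside the list</a></p>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Match the list by one class token instead of the exact class string.
$list = "//ul[contains(concat(' ', normalize-space(@class), ' '), ' hfwmm-list ')]";

$lists = $xpath->query($list);          // the <ul> element itself
$links = $xpath->query($list . '//a');  // only <a> descendants of that <ul>

$hrefs = [];
foreach ($links as $a) {
    $hrefs[] = $a->getAttribute('href');
}

echo $lists->length, " list, ", $links->length, " links\n";
echo implode(', ', $hrefs), "\n";
```

The link outside the list is not selected, because //a is evaluated relative to the matched <ul>.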

foreach ($entries as $entry) {
    $oldHref = (string)$entry->getAttribute("href");
    $entry->setAttribute("href", "http://someserver.com".$oldHref);
    $newdoc = new DOMDocument();
    $cloned = $entry->cloneNode(TRUE);
    $newdoc->appendChild($newdoc->importNode($cloned,TRUE));
    echo $newdoc->saveHTML();
}
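
Prefixing every href works when all links are site-relative, but the page in the question also contains links such as href="#", and a page may contain absolute URLs too. A minimal sketch of a more selective rewrite follows. absolutizeHref() is a hypothetical helper, the sample markup is made up, and the base URL is assumed to be the site root from the question.

```php
<?php
// Hypothetical helper: prefix site-relative paths, leave other hrefs untouched.
function absolutizeHref(string $href, string $base): string
{
    // Already absolute (has a scheme) or protocol-relative: keep as-is.
    if (preg_match('#^([a-z][a-z0-9+.-]*:|//)#i', $href)) {
        return $href;
    }
    // Empty or fragment-only links such as "#" are kept as-is.
    if ($href === '' || $href[0] === '#') {
        return $href;
    }
    // Site-relative path: join to the base without doubling the slash.
    if ($href[0] === '/') {
        return rtrim($base, '/') . $href;
    }
    // Simplification: treat other relative paths as relative to the site root.
    return rtrim($base, '/') . '/' . $href;
}

// Made-up markup standing in for the scraped list.
$html = '<ul>'
      . '<li><a href="/story/1/">One</a></li>'
      . '<li><a href="https://example.org/x">Two</a></li>'
      . '<li><a href="#">Three</a></li>'
      . '</ul>';
$base = 'https://www.usatoday.com'; // assumed base: the site from the question

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$out = [];
foreach ($xpath->query('//ul//a') as $a) {
    $a->setAttribute('href', absolutizeHref($a->getAttribute('href'), $base));
    $out[] = $doc->saveHTML($a);   // serialize the rewritten link
}

echo implode("\n", $out), "\n";
```

Only the first link gets the base prepended. The absolute URL and the "#" link pass through unchanged.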