How to scrape links that share the same CSS selector

Time: 2019-04-12 20:39:53

Tags: php symfony curl domcrawler

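I use this code to crawl a website, but I want to get the links as separate results.

I want the tag results separated from the artists, so that I can put each into its own variable.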

<?php
    require 'vendor/autoload.php';
    use Symfony\Component\DomCrawler\Crawler;
    $client = new \GuzzleHttp\Client();
    $url = 'https://hentaifox.com/gallery/58091/';
    $res = $client->request('GET', $url);
    $html = (string) $res->getBody();
    $crawler = new Crawler($html);
    foreach ($crawler->filter('#content .left_content .info .artists') as $domElement) 
    {
        $domElement = new Crawler($domElement);
        $manga_tag = $domElement->html();
        print_r($manga_tag);
        echo "<br>";
    };

1 answer:

Answer 0 (score: 0)

I don't know how to do this with Symfony's DomCrawler, but PHP has decent built-in tools for parsing HTML, namely DOMDocument and DOMXPath. With DOMDocument it looks like this:

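// loadHTML() is an instance method; calling it statically fails on PHP 8.
$domd = new DOMDocument();
@$domd->loadHTML($html); // "@" silences warnings about malformed real-world HTML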
$xp = new DOMXPath($domd);
$tags = array();
$artists = array();
foreach ($xp->query("//a[contains(@href,'/tag/')]/span[1]") as $tag) {
    $tags[trim($tag->textContent)] = merge_relative_absolute_urls('https://hentaifox.com/gallery/58091/', $tag->parentNode->getAttribute("href"));
}
foreach ($xp->query("//a[contains(@href,'/artist/')]/span[1]") as $artist) {
    $artists[trim($artist->textContent)] = merge_relative_absolute_urls('https://hentaifox.com/gallery/58091/', $artist->parentNode->getAttribute("href"));
}
print_r([
    'artists' => $artists,
    'tags' => $tags
]);


function merge_relative_absolute_urls(string $base_url, string $relative_url): string
{
    // strip ?whatever in base url (the browser does this too, i think)
    $pos = strpos($base_url, "?");
    if (false !== $pos) {
        $base_url = substr($base_url, 0, $pos);
    }
    // strip file.php from /file.php if present
    $pos = strrpos($base_url, "/");
    if (false !== $pos) {
        $base_url = substr($base_url, 0, $pos + 1);
    }
    if (0 === stripos($relative_url, "http://") || 0 === stripos($relative_url, "https://") || 0 === strpos($relative_url, "//") || 0 === strpos($relative_url, "://")) {
        return $relative_url;
    }
    if (substr($relative_url, 0, 1) === "/") {
        $info = parse_url($base_url);
        $url = ($info['scheme'] ?? "") . "://" . $info['host'];
        if (isset($info['port'])) {
            $url .= ":" . $info['port'];
        }
        $url .= $relative_url;
        return $url;
    }
    $url = $base_url . $relative_url;
    return $url;
}

Output:

$ php wtf3.php
Array
(
    [artists] => Array
        (
            [Sahara-wataru] => https://hentaifox.com/artist/sahara-wataru/
        )

    [tags] => Array
        (
            [Big-breasts] => https://hentaifox.com/tag/big-breasts/
            [Sole-male] => https://hentaifox.com/tag/sole-male/
            [Nakadashi] => https://hentaifox.com/tag/nakadashi/
            [Blowjob] => https://hentaifox.com/tag/blowjob/
            [Full-color] => https://hentaifox.com/tag/full-color/
            [Big-ass] => https://hentaifox.com/tag/big-ass/
            [Blowjob-face] => https://hentaifox.com/tag/blowjob-face/
        )

)
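
To try the XPath approach without hitting the network, the same idea can be sketched against an inline HTML sample. The markup below is hypothetical (it only mimics the anchor-with-span structure that the queries above assume), and `collect_links` is a helper name made up for this illustration. The hrefs are left relative here; they could be passed through `merge_relative_absolute_urls` as in the answer.

```php
<?php
// Hypothetical markup: NOT copied from the real site, only shaped so that
// each link is an <a> whose first <span> holds the label.
$html = <<<'HTML'
<ul class="artists">
  <li><a href="/artist/sahara-wataru/"><span>Sahara-wataru</span></a></li>
</ul>
<ul class="tags">
  <li><a href="/tag/big-breasts/"><span>Big-breasts</span><span>123</span></a></li>
  <li><a href="/tag/sole-male/"><span>Sole-male</span></a></li>
</ul>
HTML;

// Illustrative helper (assumed name): maps label => href for every <a>
// whose href contains the given path fragment.
function collect_links(DOMXPath $xp, string $hrefPart): array
{
    $links = [];
    $query = "//a[contains(@href,'" . $hrefPart . "')]/span[1]";
    foreach ($xp->query($query) as $span) {
        // The first <span> carries the label; its parent <a> carries the link.
        $links[trim($span->textContent)] = $span->parentNode->getAttribute('href');
    }
    return $links;
}

$domd = new DOMDocument();
@$domd->loadHTML($html); // suppress libxml warnings for imperfect HTML
$xp = new DOMXPath($domd);

// Artists and tags end up in separate variables, as the question asks.
$artists = collect_links($xp, '/artist/');
$tags = collect_links($xp, '/tag/');

print_r(['artists' => $artists, 'tags' => $tags]);
```

Filtering on the href path rather than on a CSS class is what keeps the two groups apart even when both lists share the same styling.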