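Unable to get only the relevant links and discard the others

Time: 2019-02-15 11:15:25

Tags: python python-3.x selenium web-scraping beautifulsoup

I've written a script in Python, using Selenium in combination with BeautifulSoup, to get the links leading to property details from a webpage. Because the content is highly dynamic, I use Selenium to obtain the page source. When I run the script, I get a lot of links, including the ones I need.

How can I get only the relevant links from each of the three containers?

My attempt: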

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def fetch_info(link):
    driver.get(link)
    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "#community-search-homes .propertyWrapper > a")))
    soup = BeautifulSoup(driver.page_source, "lxml")
    linklist = [item.get("href") for item in soup.select("#community-search-homes .propertyWrapper > a")]
    return linklist

if __name__ == '__main__':
    url = "https://www.khov.com/find-new-homes/arizona/buckeye"
    driver = webdriver.Chrome()
    wait = WebDriverWait(driver,10)
    for newlink in fetch_info(url):
        print(newlink)
    driver.quit()

The result I get:

/find-new-homes/arizona/buckeye/85396/k-hovnanian-homes/aspire-at-sienna-hills
/find-new-homes/arizona/buckeye/85396/k-hovnanian-homes/affinity-at-verrado
/find-new-homes/arizona/buckeye/85396/four-seasons/k.-hovnanian's-four-seasons-at-victory-at-verrado
/find-new-homes/arizona/scottsdale/85255/k-hovnanian-homes/summit-at-silverstone
/find-new-homes/arizona/scottsdale/85257/k-hovnanian-homes/skye
/find-new-homes/arizona/phoenix/85020/k-hovnanian-homes/pointe-16
/find-new-homes/arizona/peoria/85383/k-hovnanian-homes/fusion-ii-at-the-meadows
/find-new-homes/arizona/scottsdale/85257/k-hovnanian-homes/aire
/find-new-homes/arizona/scottsdale/85255/k-hovnanian-homes/pinnacle-at-silverstone
/find-new-homes/arizona/peoria/85383/k-hovnanian-homes/montage-at-the-meadows
/find-new-homes/arizona/sun-city/85373/four-seasons/k.-hovnanian-s-four-seasons-at-ventana-lakes
/find-new-homes/arizona/peoria/85382/k-hovnanian-homes/park-paseo
/find-new-homes/arizona/laveen/85339/k-hovnanian-homes/affinity-at-montana-vista
/find-new-homes/arizona/laveen/85339/k-hovnanian-homes/aspire-at-montana-vista
/find-new-homes/arizona/scottsdale/85255/k-hovnanian-homes/pinnacle-ii-at-silverstone
/find-new-homes/arizona/scottsdale/85255/k-hovnanian-homes/summit-ii-at-silverstone

The result I want:

/find-new-homes/arizona/buckeye/85396/k-hovnanian-homes/aspire-at-sienna-hills
/find-new-homes/arizona/buckeye/85396/k-hovnanian-homes/affinity-at-verrado
/find-new-homes/arizona/buckeye/85396/four-seasons/k.-hovnanian's-four-seasons-at-victory-at-verrado

The relevant chunk of HTML (the link I'm after is in the second line of the following elements):

<div class="propertyWrapper clear">
        <a href="/find-new-homes/arizona/buckeye/85396/k-hovnanian-homes/aspire-at-sienna-hills"><span class="link-outside"></span></a>
        <div class="propertyCarouselWrapper">
            <div class="responsiveImageCarousel enabled" style="touch-action: pan-y; user-select: none; -webkit-user-drag: none; -webkit-tap-highlight-color: rgba(0, 0, 0, 0);">
                <div class="prevBtn"></div>
                <div class="nextBtn"></div>
                <div class="images" data-detail-url="/find-new-homes/arizona/buckeye/85396/k-hovnanian-homes/aspire-at-sienna-hills">
                    <ul style="width: 960px; left: 0px;">
                        <li style="width: 320px;"><img alt="holiday exterior new homes sienna hills usp" src="https://khovcachecdn.azureedge.net/azure/sitefinitylibraries/images/default-source/images/az/aspire-at-sienna-hills/community-thumbnails/holiday-exterior-new-homes-sienna-hills-usp.jpg?sfvrsn=4&amp;build=1019&amp;encoder=wic&amp;useresizingpipeline=true&amp;w=450&amp;h=280&amp;mode=crop"></li>
                        <li style="width: 320px;"><img alt="carnival exterior new homes sienna hills usp" src="https://khovcachecdn.azureedge.net/azure/sitefinitylibraries/images/default-source/images/az/aspire-at-sienna-hills/community-thumbnails/carnival-exterior-new-homes-sienna-hills-usp.jpg?sfvrsn=4&amp;build=1019&amp;encoder=wic&amp;useresizingpipeline=true&amp;w=450&amp;h=280&amp;mode=crop"></li>
                    </ul>
                </div>
                <div class="pagination" style="width: 56px;"><ul><li class="active"></li><li></li></ul></div>
            </div>
        </div>
        <div class="propertyInfoWrapper">
            <div class="marker-details-container">
                <h3 class="marker-details">New Homes in Buckeye, Arizona</h3>
                <div class="spacer"></div>
                <h4 class="propertyListingHeader">Aspire at Sienna Hills</h4>
                <p class="marker-details">21007 West Almeria Road, Buckeye, AZ 85396</p>
                <p class="marker-details marker-status">Final Opportunities</p>
                <div class="spacer"></div>
                <p class="marker-details marker-price"><span class="bold">Priced from: </span>Mid $200s</p>
                <p class="marker-details"><span class="bold">Home type: </span>Single Family Homes</p>
                <p class="marker-details marker-amenities"><span class="bold">Amenities: </span>Pool, Hiking Trails, Park</p>
            </div>
            <div class="community-tag-container">
                <a href="/find-new-homes/arizona/buckeye/85396/k-hovnanian-homes/aspire-at-sienna-hills#quick-move-in-homes" onclick="KHOV.Analytics.trackEvent('Qmi_Icon_Qmi');">
                    <div class="community-tag">
                        <div class="ctaDesc quick-move-in-badge link-inside">Quick Move In Homes</div>
                        <div class="ctaIcon quick-move-in-badge-icon link-inside"></div>
                    </div>
                </a>
            </div>
            <a href="#request-info-form-modal" class="open-inline-modal-link" onclick="KHOV.Analytics.trackEvent('Orange_Ri_Request_Info');">
                <div class="button orange-color requestInfoButton link-inside" data-urlname="aspire-at-sienna-hills">Request Info</div>
            </a>

        </div>
    </div>

3 answers:

Answer 0: (score: 2)

You need to include the featured id as well as the results id in the selector. You can use an Or combination of the two; the latest bs4 supports this.

This can also be shortened to a more general selector, but the shortened form may be less robust.
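The "Or" combination mentioned in this answer corresponds to a CSS selector list, where alternatives are separated by commas. The following is only a minimal sketch of that idea: the ids `featured` and `results` and the inline markup are hypothetical placeholders, not the real ids or HTML of the site, and `html.parser` is used instead of `lxml` only to keep the example dependency-light.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the ids "featured", "results" and "other" are
# placeholders for illustration, not the ids used on the real page.
HTML = """
<div id="featured">
  <div class="propertyWrapper">
    <a href="/a"><span class="link-outside"></span></a>
    <div class="community-tag-container">
      <a href="/a#quick-move-in-homes">Quick Move In Homes</a>
    </div>
  </div>
</div>
<div id="results">
  <div class="propertyWrapper"><a href="/b"></a></div>
</div>
<div id="other">
  <div class="propertyWrapper"><a href="/c"></a></div>
</div>
"""


def select_links(html):
    """Return hrefs of anchors that are direct children of .propertyWrapper
    inside either of the two (hypothetical) containers."""
    soup = BeautifulSoup(html, "html.parser")
    # The comma makes this a selector list: an element is selected if it
    # matches the first selector OR the second one.
    selector = "#featured .propertyWrapper > a, #results .propertyWrapper > a"
    return [a.get("href") for a in soup.select(selector)]


if __name__ == "__main__":
    for href in select_links(HTML):
        print(href)
```

The child combinator `>` keeps the nested badge link out of the result, and the container ids keep out wrappers that sit elsewhere on the page. Dropping the ids gives a shorter selector, but it would then also match the wrapper under `other`, which is one way a shortened selector can be less robust.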

Answer 1: (score: 0)

You can simply check each link for the required keyword and print only the matches, ignoring the rest:

if __name__ == '__main__':
    url = "https://www.khov.com/find-new-homes/arizona/buckeye"
    driver = webdriver.Chrome()
    wait = WebDriverWait(driver,10)
    for newlink in fetch_info(url):
        if url.split('/')[-1] in newlink:
            print(newlink)
    driver.quit()

Output:

/find-new-homes/arizona/buckeye/85396/k-hovnanian-homes/aspire-at-sienna-hills
/find-new-homes/arizona/buckeye/85396/k-hovnanian-homes/affinity-at-verrado
/find-new-homes/arizona/buckeye/85396/four-seasons/k.-hovnanian's-four-seasons-at-victory-at-verrado
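The keyword check above is a plain substring test, so it would also match a link that merely contains the word somewhere else in its path. A slightly stricter variant, sketched here with only the standard library, compares path prefixes instead. The helper name `filter_by_page_path` is made up for this illustration, and the sample hrefs are taken from the question's output.

```python
from urllib.parse import urlparse


def filter_by_page_path(page_url, hrefs):
    """Keep only the hrefs whose path lies under the path of page_url."""
    # "/find-new-homes/arizona/buckeye" -> "/find-new-homes/arizona/buckeye/"
    prefix = urlparse(page_url).path.rstrip("/") + "/"
    return [h for h in hrefs if urlparse(h).path.startswith(prefix)]


page = "https://www.khov.com/find-new-homes/arizona/buckeye"
hrefs = [
    "/find-new-homes/arizona/buckeye/85396/k-hovnanian-homes/aspire-at-sienna-hills",
    "/find-new-homes/arizona/scottsdale/85257/k-hovnanian-homes/skye",
    "/find-new-homes/arizona/phoenix/85020/k-hovnanian-homes/pointe-16",
]

if __name__ == "__main__":
    for href in filter_by_page_path(page, hrefs):
        print(href)
```

In the original script this would replace the `if url.split('/')[-1] in newlink` test inside the loop; the rest of the Selenium code stays unchanged.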

Answer 2: (score: 0)

Would list slicing work?

def fetch_info(link):
    driver.get(link)
    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "#community-search-homes .propertyWrapper > a")))
    soup = BeautifulSoup(driver.page_source, "lxml")
    linklist = [item.get("href") for item in soup.select("#community-search-homes .propertyWrapper > a")][:3]
    return linklist