Why is the list of search results I get bigger than the web page I am scraping?

Time: 2019-04-22 14:30:57

Tags: python web-scraping beautifulsoup

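I am trying to collect the href links of all the houses for sale, but when I run my program I get a list of about 50 links, which is far more than the number of houses listed on the page (the url in the code below).

I tried looking at the page source and cross-referencing it with my program's results. Some links matched, but others could not be found on the page.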

import requests
from bs4 import BeautifulSoup as bs

url='https://www.rightmove.co.uk/property-for-sale/find.html?searchType=SALE&locationIdentifier=REGION%5E1091&insId=1&radius=0.0&minPrice=&maxPrice=&minBedrooms=&maxBedrooms=&displayPropertyType=&maxDaysSinceAdded=&_includeSSTC=on&sortByPriceDescending=&primaryDisplayPropertyType=&secondaryDisplayPropertyType=&oldDisplayPropertyType=&oldPrimaryDisplayPropertyType=&newHome=&auction=false'

Web_Page = requests.get(url)
Soup = bs(Web_Page.text,'html.parser')
Web_Section_Of_Interest= Soup.find_all('a',class_="propertyCard-link")

count=0

for item in Web_Section_Of_Interest:
    print('https://www.rightmove.co.uk'+item.get('href'))
    count+=1

print(count)

I get a list of 50 href links, but I expected a list of 25, matching the number of houses listed on the web page.

2 answers:

Answer 0 (score: 2)

I managed to solve the problem by replacing the class "propertyCard-link" with "propertyCard-img-link".

Working code:

import requests
from bs4 import BeautifulSoup as bs

url='https://www.rightmove.co.uk/property-for-sale/find.html?searchType=SALE&locationIdentifier=REGION%5E1091&insId=1&radius=0.0&minPrice=&maxPrice=&minBedrooms=&maxBedrooms=&displayPropertyType=&maxDaysSinceAdded=&_includeSSTC=on&sortByPriceDescending=&primaryDisplayPropertyType=&secondaryDisplayPropertyType=&oldDisplayPropertyType=&oldPrimaryDisplayPropertyType=&newHome=&auction=false'

Web_Page = requests.get(url)
Soup = bs(Web_Page.text,'html.parser')
Web_Section_Of_Interest= Soup.find_all('a',class_="propertyCard-img-link")

count=0

for item in Web_Section_Of_Interest:
    print('https://www.rightmove.co.uk'+item.get('href'))
    count+=1

print(count)

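Answer 1 (score: 1)

If you look at the actual URLs being printed, you will see that each one is printed twice. So technically you are only getting 25 distinct links.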

print(count)
https://www.rightmove.co.uk/property-for-sale/property-61358637.html
https://www.rightmove.co.uk/property-for-sale/property-61358637.html
https://www.rightmove.co.uk/property-for-sale/property-57044346.html
https://www.rightmove.co.uk/property-for-sale/property-57044346.html
https://www.rightmove.co.uk/commercial-property-for-sale/property-70211329.html
https://www.rightmove.co.uk/commercial-property-for-sale/property-70211329.html
https://www.rightmove.co.uk/property-for-sale/property-68319664.html
https://www.rightmove.co.uk/property-for-sale/property-68319664.html
....
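If you would rather keep the "propertyCard-link" class, the duplicates shown above can be collapsed before printing. This is only a minimal sketch: the helper name unique_in_order is made up for illustration, and it assumes you have already collected the href strings with item.get('href').

```python
def unique_in_order(hrefs):
    """Return hrefs without duplicates, keeping first-seen order."""
    # dict keys are unique and preserve insertion order (Python 3.7+);
    # the filter drops None or empty values from anchors without an href.
    return [h for h in dict.fromkeys(hrefs) if h]


# Sample hrefs taken from the output above; each one appears twice.
hrefs = [
    "/property-for-sale/property-61358637.html",
    "/property-for-sale/property-61358637.html",
    "/property-for-sale/property-57044346.html",
    "/property-for-sale/property-57044346.html",
]

for href in unique_in_order(hrefs):
    print("https://www.rightmove.co.uk" + href)
```

Unlike a plain set(), this keeps the links in the order they appear on the page.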

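Just look at the first two elements of your propertyCard-link results. They point to the same href: the first is bound to 'details' and the second to 'summary':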

Web_Section_Of_Interest[0]
Out[6]: 
<a class="propertyCard-link" data-bind="click: propertyCardClick('details'), attr: { href: computedDetailsLink() }" data-test="property-details" href="/property-for-sale/property-61358637.html">
<h2 class="propertyCard-title" data-bind="text: propertyTypeFullDescription" itemprop="name">
            2 bedroom semi-detached house for sale        </h2>
<address class="propertyCard-address" itemprop="address" itemscope="" itemtype="http://schema.org/PostalAddress">
<meta content="Auckland Road, Potters Bar" data-bind="attr: { content: displayAddress }" itemprop="streetAddress"/>
<meta content="GB" data-bind="attr: { content: countryCode }" itemprop="addressCountry"/>
<span data-bind="text: displayAddress">Auckland Road, Potters Bar</span>
</address>
</a>

Web_Section_Of_Interest[1]
Out[7]: 
<a class="propertyCard-link" data-bind="click: propertyCardClick('summary'), attr: { href: computedDetailsLink() }" href="/property-for-sale/property-61358637.html">
<span data-bind="html: summary" data-test="property-description" itemprop="description">BPM Auckland are pleased to offer this spacious Extended 2 Double bedroom 1930's built semi detached house, situated in this popular location within easy reach of good schools including Dame Alice Owens. The property benefits from a large 190' rear garden and also potential for a loft conversion...</span>
</a>
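Since the two anchors differ only in their extra attributes, another option is to match on the data-test="property-details" attribute seen in the first element. The sketch below runs against a trimmed-down copy of the markup above rather than the live site, so whether the real page still carries that attribute is an assumption you would need to check.

```python
from bs4 import BeautifulSoup

# Trimmed-down copy of the two anchors shown above (same class, same href).
html = """
<a class="propertyCard-link" data-test="property-details" href="/property-for-sale/property-61358637.html">
<h2 class="propertyCard-title">2 bedroom semi-detached house for sale</h2>
</a>
<a class="propertyCard-link" href="/property-for-sale/property-61358637.html">
<span data-test="property-description">summary text</span>
</a>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep only anchors that have both the class and the data-test attribute.
details_links = soup.find_all(
    "a",
    attrs={"class": "propertyCard-link", "data-test": "property-details"},
)

links = ["https://www.rightmove.co.uk" + a["href"] for a in details_links]
print(len(links))
print(links)
```

Here only the 'details' anchor is matched, so each property contributes one link.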