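I've recently started looking into buying some land, and I'm writing a small application to help me organize the details in Jira/Confluence, so I can keep track of who I've talked to and what we discussed for each parcel individually.
So I wrote this little scraper for landwatch(dot)com: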
[url is just a listing on the site]
from bs4 import BeautifulSoup
import requests

def get_property_data(url):
    headers = ({'User-Agent':
                'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'})
    response = requests.get(url, headers=headers)  # Maybe request Url with read more already gone
    soup = BeautifulSoup(response.text, 'html5lib')
    title = soup.find_all(class_='b442a')[0].text
    details = soup.find_all('p', class_='d19de')
    price = soup.find_all('div', class_='_260f0')[0].text
    deets = []
    for i in range(len(details)):
        if details[i].text != '':
            deets.append(details[i].text)
    detail = ''
    for i in deets:
        detail += '<p>' + i + '</p>'
    return [title, detail, price]
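Everything works well, except that the d19de class hides a bunch of its values behind a Read More button.
While searching Google I found "How to Scrape reviews with read more from Webpages using BeautifulSoup", but either I don't understand what they're doing well enough to implement it, or it no longer works: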
import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get("http://www.mouthshut.com/product-reviews/Lakeside-Chalet-Mumbai-reviews-925017044").text, "html.parser")
for title in soup.select("a[id^=ctl00_ctl00_ContentPlaceHolderFooter_ContentPlaceHolderBody_rptreviews_]"):
    items = title.get('href')
    if items:
        broth = BeautifulSoup(requests.get(items).text, "html.parser")
        for item in broth.select("div.user-review p.lnhgt"):
            print(item.text)
Any ideas on how to get around the Read More button? I'd really like to do this in BeautifulSoup rather than selenium.
Here is a sample URL for testing: https://www.landwatch.com/huerfano-county-colorado-recreational-property-for-sale/pid/410454403
Answer 0 (score: 1)
The data is present in a script tag. Here is an example that extracts that content, parses it with json, and outputs the land description as a list:
from bs4 import BeautifulSoup
import requests, json
url = 'https://www.landwatch.com/huerfano-county-colorado-recreational-property-for-sale/pid/410454403'
headers = ({'User-Agent':
'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'})
response = requests.get(url, headers=headers) # Maybe request Url with read more already gone
soup = BeautifulSoup(response.text, 'html5lib')
all_data = json.loads(soup.select_one('[type="application/ld+json"]').string)
details = all_data['description'].split('\r\r')
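To get back the `<p>`-wrapped string that `get_property_data` returns in the question, the description text can be split into paragraphs and joined again. This is only a minimal sketch: `description_to_html` is a hypothetical helper name, it assumes paragraphs are separated by two or more line breaks (`\r\r` on the sample page, per the split above), and the HTML escaping is an addition that the original code does not do.

```python
import html
import re


def description_to_html(description):
    """Wrap each non-empty paragraph of a description string in <p> tags."""
    # Assumption: paragraphs are separated by two or more line breaks
    # (\r\r, \n\n or \r\n\r\n).
    paragraphs = re.split(r'(?:\r\n|\r|\n){2,}', description)
    # Drop empty paragraphs, mirroring the `!= ''` check in the question.
    cleaned = [p.strip() for p in paragraphs if p.strip()]
    # Escape &, < and > so plain text stays valid inside the <p> tags.
    return ''.join('<p>' + html.escape(p) + '</p>' for p in cleaned)
```

With the variables from the snippet above, this would be called as `description_to_html(all_data['description'])`.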
You may want to inspect the other content in that script tag:
from pprint import pprint
pprint(all_data)
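If a page turns out to carry more than one `application/ld+json` block, or wraps its objects in a list, taking the first tag may not be enough. I have not checked whether landwatch does this, so treat the following as a defensive sketch under that assumption; `pick_description` is a hypothetical helper that works on the raw strings taken from the script tags.

```python
import json


def pick_description(script_texts):
    """Return the first 'description' found in a list of JSON-LD strings, or None."""
    for text in script_texts:
        if not text:  # tag.string can be None for an empty tag
            continue
        try:
            data = json.loads(text)
        except json.JSONDecodeError:
            continue  # skip blocks that are not valid JSON
        # JSON-LD may be a single object or a list of objects.
        candidates = data if isinstance(data, list) else [data]
        for item in candidates:
            if isinstance(item, dict) and item.get('description'):
                return item['description']
    return None


# Hypothetical sample strings standing in for the contents of the script tags.
samples = [
    '{"@type": "BreadcrumbList"}',
    'not json',
    '[{"@type": "Product", "description": "40 acres\\r\\rCreek access"}]',
]
description = pick_description(samples)
```

With BeautifulSoup, the input list could be built as `[tag.string for tag in soup.select('script[type="application/ld+json"]')]`.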