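For background, I am scraping a web page with Python and BeautifulSoup.
Some of the information I need is in a small box with the user's profile that pops up when the mouse hovers over a user's profile picture. The problem is that this information is not available in the HTML; instead, I get the following:
div class="username mo" span class="expand_inline scrname mbrName_1586A02614A388AEE215B4A3139A2C18" onclick="ta.trackEventOnPage('Reviews', 'show_reviewer_info_window', 'user_name_name_click')">Sapphire-Ed (I removed some of the > characters so that the HTML would show up in the question, sorry!)
Can anyone tell me how to do this? Thanks for your help!
Here is the page in question: view-source:http://www.tripadvisor.com/Attraction_Review-g143010-d108269-Reviews-Cadillac_Mountain-Acadia_National_Park_Mount_Desert_Island_Maine.html The information I am trying to access is the review distribution.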
Answer 0 (score: 1)
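Below is complete working code that outputs a dictionary in which the keys are usernames and the values are review distributions. To understand how the code works, here are the key things to take into account: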
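- The information shown in the overlay on mouse-over is loaded dynamically with an HTTP GET request that takes several user-specific parameters; the most important are uid and src.
- The uid and src values can be extracted with a regular expression from the id attribute of each user profile element.
- The response to that GET request is HTML, which also needs to be parsed with BeautifulSoup.
- The requests are made within a web-scraping session maintained with requests.Session.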
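As a small offline illustration of the uid/src extraction step, here is a minimal sketch that applies the same regular expression used in the answer's code to a sample id string and builds the query string for the overlay request. The id value is hypothetical: the hash is borrowed from the markup in the question purely as a placeholder, and the SRC number is made up. No network request is made.

```python
import re
from urllib.parse import urlencode

# Same pattern as in the answer's code.
pattern = re.compile(r"UID_(\w+)-SRC_(\w+)")

# Hypothetical `id` attribute value (assumption): the SRC number is invented,
# and the hash is reused from the question's markup only as a placeholder.
element_id = "UID_1586A02614A388AEE215B4A3139A2C18-SRC_123456789"

match = pattern.search(element_id)
uid, src = match.groups()

# Build the overlay URL with the same parameters the answer passes to session.get().
query = urlencode({
    "uid": uid,
    "src": src,
    "c": "",
    "fus": "false",
    "partner": "false",
    "LsoId": "",
})
overlay_url = "http://www.tripadvisor.com/MemberOverlay?" + query

print(uid, src)
print(overlay_url)
```

Because `\w` does not match a hyphen, the first group stops at the `-` before `SRC_`, which is what separates the two values.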
Code:
import re
from pprint import pprint

import requests
from bs4 import BeautifulSoup

data = {}

# this pattern helps us extract the uid and src needed to make a GET request
pattern = re.compile(r"UID_(\w+)-SRC_(\w+)")

# making a web-scraping session
with requests.Session() as session:
    # go through the session (not the bare requests module) so cookies carry over
    response = session.get("http://www.tripadvisor.com/Attraction_Review-g143010-d108269-Reviews-Cadillac_Mountain-Acadia_National_Park_Mount_Desert_Island_Maine.html")
    soup = BeautifulSoup(response.content, "lxml")

    # iterating over usernames on the page
    for member in soup.select("div.member_info div.memberOverlayLink"):
        # extracting uid and src from the `id` attribute
        match = pattern.search(member['id'])
        if match:
            username = member.find("div", class_="username").text.strip()
            uid, src = match.groups()

            # making a GET request for the overlay information
            response = session.get("http://www.tripadvisor.com/MemberOverlay", params={
                "uid": uid,
                "src": src,
                "c": "",
                "fus": "false",
                "partner": "false",
                "LsoId": ""
            })

            # getting the grades dictionary: each grade label is followed by a
            # sibling span holding the count in parentheses, e.g. "(47)"
            soup_overlay = BeautifulSoup(response.content, "lxml")
            data[username] = {grade_type: soup_overlay.find("span", text=grade_type).find_next_sibling("span", class_="numbersText").text.strip(" ()")
                              for grade_type in ["Excellent", "Very good", "Average", "Poor", "Terrible"]}

pprint(data)
Prints:
{'Anna T': {'Average': '2',
            'Excellent': '0',
            'Poor': '0',
            'Terrible': '0',
            'Very good': '2'},
 'Arlyss T': {'Average': '0',
              'Excellent': '6',
              'Poor': '0',
              'Terrible': '0',
              'Very good': '1'},
 'Bf B': {'Average': '1',
          'Excellent': '22',
          'Poor': '0',
          'Terrible': '0',
          'Very good': '17'},
 'Charmingnl': {'Average': '15',
                'Excellent': '109',
                'Poor': '4',
                'Terrible': '4',
                'Very good': '45'},
 'Jackie M': {'Average': '2',
              'Excellent': '10',
              'Poor': '0',
              'Terrible': '0',
              'Very good': '4'},
 'Jonathan K': {'Average': '69',
                'Excellent': '90',
                'Poor': '6',
                'Terrible': '0',
                'Very good': '154'},
 'Sapphire-Ed': {'Average': '8',
                 'Excellent': '47',
                 'Poor': '2',
                 'Terrible': '0',
                 'Very good': '49'},
 'TundraJayco': {'Average': '14',
                 'Excellent': '59',
                 'Poor': '0',
                 'Terrible': '1',
                 'Very good': '49'},
 'Versrii': {'Average': '2',
             'Excellent': '8',
             'Poor': '0',
             'Terrible': '0',
             'Very good': '10'},
 'tripavisor83': {'Average': '12',
                  'Excellent': '9',
                  'Poor': '1',
                  'Terrible': '0',
                  'Very good': '20'}}
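The counts in the dictionary are strings, so a possible follow-up step is to convert them to integers before doing any arithmetic on a distribution. The sketch below is not part of the original answer: `to_shares` is a hypothetical helper, the sample entry is copied from the printed output, and the comma stripping is only an assumption about how larger counts might be formatted.

```python
# One entry copied from the printed output above.
data = {
    "Sapphire-Ed": {
        "Average": "8",
        "Excellent": "47",
        "Poor": "2",
        "Terrible": "0",
        "Very good": "49",
    }
}


def to_shares(distribution):
    """Hypothetical helper: return (integer counts, percentage share per grade)."""
    # Assumption: a large count might contain thousands separators such as "1,234".
    counts = {grade: int(value.replace(",", "")) for grade, value in distribution.items()}
    total = sum(counts.values())
    if total == 0:
        # Avoid division by zero for a user with no reviews.
        return counts, {grade: 0.0 for grade in counts}
    shares = {grade: round(100 * n / total, 1) for grade, n in counts.items()}
    return counts, shares


counts, shares = to_shares(data["Sapphire-Ed"])
print(sum(counts.values()))
print(shares)
```

Keeping the raw counts alongside the percentages makes it easy to filter out users with too few reviews for the shares to be meaningful.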