Scraping from a hoverbox

Time: 2016-01-23 00:05:22

Tags: python html web-scraping beautifulsoup

For background, I am scraping web pages with Python and BeautifulSoup.

Some of the information I need is in a small box showing the user's profile that pops up when the mouse hovers over the user's profile picture. The problem is that this information is not available in the HTML; instead, all I get is the following:

"" div class =" username mo"  span class =" expand_inline scrname mbrName_1586A02614A388AEE215B4A3139A2C18" onclick =" ta.trackEventOnPage('评论',' show_reviewer_info_window',' user_name_name_click')"> Sapphire-Ed "" (我删除了一些> s,以便html会出现在问题中,抱歉!)

Can anyone tell me how to do this? Thanks for your help!!

Here is the relevant page: view-source:http://www.tripadvisor.com/Attraction_Review-g143010-d108269-Reviews-Cadillac_Mountain-Acadia_National_Park_Mount_Desert_Island_Maine.html The information I am trying to access is the review distribution.

1 answer:

Answer 0: (score: 1)

Below is complete working code that outputs a dictionary whose keys are usernames and whose values are the review distributions. To understand how the code works, here are the key things to take into account:

  • The information shown in the overlay that appears on hover is loaded dynamically via an HTTP GET request with a number of user-specific parameters, the most important being uid and src
  • uid and src can be extracted with a regular expression from the id attribute of each user profile element (see the sketch after this list)
  • The response to this GET request is HTML, which you need to parse with BeautifulSoup
  • You should maintain the web-scraping session with requests.Session
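
To make the extraction step concrete, here is a minimal sketch of how the regular expression pulls uid and src out of an id attribute. It uses the same pattern as the full code below; the sample_id value is made up and only illustrates the UID_...-SRC_... shape the pattern expects:

import re

# same pattern as in the full code below
pattern = re.compile(r"UID_(\w+)-SRC_(\w+)")

# hypothetical id attribute value -- on the real page this would come from
# member['id'] of each div.memberOverlayLink element
sample_id = "memberOverlayLink UID_1586A02614A388AEE215B4A3139A2C18-SRC_511527490"

match = pattern.search(sample_id)
if match:
    uid, src = match.groups()
    print(uid)  # 1586A02614A388AEE215B4A3139A2C18
    print(src)  # 511527490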

Code:

import re
from pprint import pprint

import requests
from bs4 import BeautifulSoup

data = {}

# this pattern would help us to extract uid and src needed to make a GET request
pattern = re.compile(r"UID_(\w+)-SRC_(\w+)")

# making a web-scraping session
with requests.Session() as session:
    response = session.get("http://www.tripadvisor.com/Attraction_Review-g143010-d108269-Reviews-Cadillac_Mountain-Acadia_National_Park_Mount_Desert_Island_Maine.html")
    soup = BeautifulSoup(response.content, "lxml")

    # iterating over usernames on the page
    for member in soup.select("div.member_info div.memberOverlayLink"):
        # extracting uid and src from the `id` attribute
        match = pattern.search(member['id'])
        if match:
            username = member.find("div", class_="username").text.strip()
            uid, src = match.groups()

            # making a GET request for the overlay information
            response = session.get("http://www.tripadvisor.com/MemberOverlay", params={
                "uid": uid,
                "src": src,
                "c": "",
                "fus": "false",
                "partner": "false",
                "LsoId": ""
            })

            # getting the grades dictionary
            soup_overlay = BeautifulSoup(response.content, "lxml")
            data[username] = {grade_type: soup_overlay.find("span", text=grade_type).find_next_sibling("span", class_="numbersText").text.strip(" ()")
                              for grade_type in ["Excellent", "Very good", "Average", "Poor", "Terrible"]}


pprint(data)
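
The dictionary comprehension at the end relies on the overlay HTML containing a label span (e.g. "Excellent") followed by a sibling span of class numbersText holding the count in parentheses. Here is a minimal sketch on a made-up fragment, just to illustrate the find/find_next_sibling pattern:

from bs4 import BeautifulSoup

# made-up fragment mimicking the structure the comprehension relies on
overlay_html = '<div><span>Excellent</span><span class="numbersText">(47)</span></div>'

soup_overlay = BeautifulSoup(overlay_html, "lxml")
count = (soup_overlay.find("span", text="Excellent")
                     .find_next_sibling("span", class_="numbersText")
                     .text.strip(" ()"))
print(count)  # 47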

Prints:

{'Anna T': {'Average': '2',
            'Excellent': '0',
            'Poor': '0',
            'Terrible': '0',
            'Very good': '2'},
 'Arlyss T': {'Average': '0',
              'Excellent': '6',
              'Poor': '0',
              'Terrible': '0',
              'Very good': '1'},
 'Bf B': {'Average': '1',
          'Excellent': '22',
          'Poor': '0',
          'Terrible': '0',
          'Very good': '17'},
 'Charmingnl': {'Average': '15',
                'Excellent': '109',
                'Poor': '4',
                'Terrible': '4',
                'Very good': '45'},
 'Jackie M': {'Average': '2',
              'Excellent': '10',
              'Poor': '0',
              'Terrible': '0',
              'Very good': '4'},
 'Jonathan K': {'Average': '69',
                'Excellent': '90',
                'Poor': '6',
                'Terrible': '0',
                'Very good': '154'},
 'Sapphire-Ed': {'Average': '8',
                 'Excellent': '47',
                 'Poor': '2',
                 'Terrible': '0',
                 'Very good': '49'},
 'TundraJayco': {'Average': '14',
                 'Excellent': '59',
                 'Poor': '0',
                 'Terrible': '1',
                 'Very good': '49'},
 'Versrii': {'Average': '2',
             'Excellent': '8',
             'Poor': '0',
             'Terrible': '0',
             'Very good': '10'},
 'tripavisor83': {'Average': '12',
                  'Excellent': '9',
                  'Poor': '1',
                  'Terrible': '0',
                  'Very good': '20'}}