Scraping HTML in Python when you have more than one class with the same name

Date: 2015-09-14 15:22:11

Tags: python html web-scraping beautifulsoup lxml

Maybe my terminology is a bit off here, but I hope you get the gist. I'm trying to scrape data off a food review website that has three ratings: happy, neutral, and unhappy. The count for each rating appears in the page's HTML like this:

<div class="col  PL20">
  <div class="sprite-sr2-face-smile1"></div>
  <div class="sr2_score_l">25</div>
</div>
<div class="col MR20 MT20 ML20">
  <div class="sprite-sr2-face-ok2 MT20"></div>
  <div class="sr2_score_m">17</div>
</div>
<div class="col ML10 MT20">
  <div class="sprite-sr2-face-cry2 MT20"></div>
  <div class="sr2_score_m">2</div>
</div>

So in this case the happy count is 25, the neutral count is 17, and the unhappy count is 2. The problem is that with my Python code below I cannot differentiate between the neutral count and the unhappy count, because they share the same class. Is there a way around this?

# using BeautifulSoup4 and lxml
import urllib2 
from bs4 import BeautifulSoup  
soup = BeautifulSoup(urllib2.urlopen(
    'http://www.openrice.com/en/hongkong/restaurant/central-open-kitchen/136799').read())

happy = soup.find('div', attrs={'class': 'sr2_score_l'})
print "happy rating, " + happy.string

neutral = soup.find('div', attrs={'class': 'sr2_score_m'})
print "neutral rating, " + neutral.string

unhappy = soup.find('div', attrs={'class': 'sr2_score_m'})
print "unhappy rating, " + unhappy.string

3 answers:

Answer 0 (score: 1)

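The face-smile, face-ok and face-cry parts of the class names are your indicators. Find each face div by a regular expression on its class, then read the div that follows it: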

happy = soup.find("div", class_=re.compile(r"face-smile")).find_next_sibling("div").text
ok = soup.find("div", class_=re.compile(r"face-ok")).find_next_sibling("div").text
unhappy = soup.find("div", class_=re.compile(r"face-cry")).find_next_sibling("div").text

Example code (with a nice reusable function):

import re

from bs4 import BeautifulSoup


def print_reviews_count(soup):
    indicators = {
        "happy": "face-smile",
        "ok": "face-ok",
        "unhappy": "face-cry",
    }

    for key, class_name in indicators.iteritems():
        count = soup.find("div", class_=re.compile(class_name)).find_next_sibling("div").text
        print(key, count)


source_code = """
<div class="col  PL20">
  <div class="sprite-sr2-face-smile1"></div>
  <div class="sr2_score_l">25</div>
</div>
<div class="col MR20 MT20 ML20">
  <div class="sprite-sr2-face-ok2 MT20"></div>
  <div class="sr2_score_m">17</div>
</div>
<div class="col ML10 MT20">
  <div class="sprite-sr2-face-cry2 MT20"></div>
  <div class="sr2_score_m">2</div>
</div>
"""

soup = BeautifulSoup(source_code, "lxml")
print_reviews_count(soup)

Prints:

('ok', u'17')
('unhappy', u'2')
('happy', u'25')

Answer 1 (score: 0)

I see two possible solutions:

  • Add another html class if you can.

or

  • Search for the class "sprite-sr2-face-cry2" in the line before the one where you found "sr2_score_m".

To do this you could create a list of strings from your html file using .splitlines(), then iterate over it and search for both classes.
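A minimal sketch of that line-by-line idea, assuming the markup is laid out exactly as in the question, with each score div on the line right after its face div. It is written for Python 3 using only the standard library; the names SOURCE, FACES, SCORE_RE and ratings_from_lines are made up for this illustration.

```python
import re

# Sample markup copied from the question.
SOURCE = """
<div class="col  PL20">
  <div class="sprite-sr2-face-smile1"></div>
  <div class="sr2_score_l">25</div>
</div>
<div class="col MR20 MT20 ML20">
  <div class="sprite-sr2-face-ok2 MT20"></div>
  <div class="sr2_score_m">17</div>
</div>
<div class="col ML10 MT20">
  <div class="sprite-sr2-face-cry2 MT20"></div>
  <div class="sr2_score_m">2</div>
</div>
"""

# Fragment of the face class name -> rating label (labels chosen for this sketch).
FACES = {"face-smile": "happy", "face-ok": "neutral", "face-cry": "unhappy"}

# Matches a score div such as <div class="sr2_score_m">17</div> and captures the number.
SCORE_RE = re.compile(r'class="sr2_score_\w+">(\d+)<')


def ratings_from_lines(html):
    """Pair each score line with the face class found on the line before it."""
    counts = {}
    previous = ""
    for line in html.splitlines():
        match = SCORE_RE.search(line)
        if match:
            for fragment, label in FACES.items():
                if fragment in previous:
                    counts[label] = int(match.group(1))
        previous = line
    return counts


print(ratings_from_lines(SOURCE))
```

This depends on the exact line layout of the page, so it will likely break if the site reflows or minifies its HTML. A parser-based lookup such as the sibling search in Answer 0 does not have that weakness.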

Answer 2 (score: 0)

Actually, with your help I have managed to write a nice function that I can reuse across a list of website URLs:

import re
import urllib2 
from bs4 import BeautifulSoup

website_list = [urlA, urlB....,urlX]

def ratings(website):
    soup = BeautifulSoup(urllib2.urlopen(website).read(), "lxml")
    # Each count sits in the div that follows its face icon div.
    happy = soup.find("div", class_=re.compile(r"face-smile")).find_next_sibling("div").string
    ok = soup.find("div", class_=re.compile(r"face-ok")).find_next_sibling("div").string
    unhappy = soup.find("div", class_=re.compile(r"face-cry")).find_next_sibling("div").string
    # happy, ok and unhappy are already strings, so no second .string is needed.
    print "happy rating, " + happy
    print "ok rating, " + ok
    print "unhappy rating, " + unhappy

for website in website_list:
    ratings(website)