Maybe my terminology is a bit off here, but I hope you get the gist. I'm trying to scrape data off a food review website which has three ratings: happy, neutral and unhappy. The count for each rating is written in the page like this:
<div class="col PL20">
<div class="sprite-sr2-face-smile1"></div>
<div class="sr2_score_l">25</div>
</div>
<div class="col MR20 MT20 ML20">
<div class="sprite-sr2-face-ok2 MT20"></div>
<div class="sr2_score_m">17</div>
</div>
<div class="col ML10 MT20">
<div class="sprite-sr2-face-cry2 MT20"></div>
<div class="sr2_score_m">2</div>
</div>
So in this case the happy count is 25, the neutral count is 17 and the unhappy count is 2. The problem is that with my Python code below I cannot differentiate between the neutral count and the unhappy count, because they share the same class. Is there a way around this?
# using BeautifulSoup4 and lxml
import urllib2
from bs4 import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen('http://www.openrice.com/en/hongkong/restaurant/central-open-kitchen/136799').read())
happy = soup.find('div', attrs={'class': 'sr2_score_l'})
print "happy rating, " + happy.string
neutral = soup.find('div', attrs={'class': 'sr2_score_m'})
print "neutral rating, " + neutral.string
# find() returns only the first match, so this picks up the neutral div again
unhappy = soup.find('div', attrs={'class': 'sr2_score_m'})
print "unhappy rating, " + unhappy.string
Answer 0 (score: 1)
The face-smile, face-ok and face-cry parts of the class names are your indicators:
happy = soup.find("div", class_=re.compile(r"face-smile")).find_next_sibling("div").text
ok = soup.find("div", class_=re.compile(r"face-ok")).find_next_sibling("div").text
unhappy = soup.find("div", class_=re.compile(r"face-cry")).find_next_sibling("div").text
Example code (with a nice reusable function):
import re
from bs4 import BeautifulSoup
def print_reviews_count(soup):
    indicators = {
        "happy": "face-smile",
        "ok": "face-ok",
        "unhappy": "face-cry",
    }
    for key, class_name in indicators.iteritems():
        count = soup.find("div", class_=re.compile(class_name)).find_next_sibling("div").text
        print(key, count)
source_code = """
<div class="col PL20">
<div class="sprite-sr2-face-smile1"></div>
<div class="sr2_score_l">25</div>
</div>
<div class="col MR20 MT20 ML20">
<div class="sprite-sr2-face-ok2 MT20"></div>
<div class="sr2_score_m">17</div>
</div>
<div class="col ML10 MT20">
<div class="sprite-sr2-face-cry2 MT20"></div>
<div class="sr2_score_m">2</div>
</div>
"""
soup = BeautifulSoup(source_code, "lxml")
print_reviews_count(soup)
Prints:
('ok', u'17')
('unhappy', u'2')
('happy', u'25')
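As an alternative to the regex lookup, the same pairing can be expressed with a CSS selector. This is only a sketch, written for Python 3 and a Beautiful Soup version that ships select_one() with attribute and sibling selectors; the helper name count_for and the HTML constant are made up for illustration, and the markup is the snippet from the question.

```python
from bs4 import BeautifulSoup

# Markup copied from the question.
HTML = """
<div class="col PL20">
<div class="sprite-sr2-face-smile1"></div>
<div class="sr2_score_l">25</div>
</div>
<div class="col MR20 MT20 ML20">
<div class="sprite-sr2-face-ok2 MT20"></div>
<div class="sr2_score_m">17</div>
</div>
<div class="col ML10 MT20">
<div class="sprite-sr2-face-cry2 MT20"></div>
<div class="sr2_score_m">2</div>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")


def count_for(soup, face):
    """Hypothetical helper: text of the div right after the matching face div."""
    # [class*="..."] matches a substring of the class attribute,
    # "+ div" selects the div that immediately follows it.
    node = soup.select_one('div[class*="face-%s"] + div' % face)
    return node.get_text(strip=True) if node is not None else None


counts = {face: count_for(soup, face) for face in ("smile", "ok", "cry")}
print(counts)
```

The selector keeps the "face icon, then its score" relationship in one expression, which may be easier to read than chaining find() and find_next_sibling().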
Answer 1 (score: 0)
I see two possible solutions:
- Add another html class if you can.
or
- Search for the class "sprite-sr2-face-cry2" on the line before the one where you found "sr2_score_m".
To do this you could split the HTML into a list of lines using .splitlines(), then iterate over it and check for both classes.
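A rough sketch of the line-by-line idea, using only the standard library and written for Python 3. It assumes every div sits on its own line, exactly as in the question's snippet, so it would break if the site changed its whitespace; the function name count_after_marker is invented for this example.

```python
import re

# Markup copied from the question; the approach relies on this exact line layout.
HTML = """<div class="col PL20">
<div class="sprite-sr2-face-smile1"></div>
<div class="sr2_score_l">25</div>
</div>
<div class="col MR20 MT20 ML20">
<div class="sprite-sr2-face-ok2 MT20"></div>
<div class="sr2_score_m">17</div>
</div>
<div class="col ML10 MT20">
<div class="sprite-sr2-face-cry2 MT20"></div>
<div class="sr2_score_m">2</div>
</div>"""


def count_after_marker(html, marker):
    """Return the number on the line following the first line that contains marker."""
    lines = html.splitlines()
    for index, line in enumerate(lines[:-1]):
        if marker in line:
            # Pull the digits out of e.g. <div class="sr2_score_m">2</div>
            match = re.search(r">(\d+)<", lines[index + 1])
            if match:
                return int(match.group(1))
    return None


print("unhappy rating,", count_after_marker(HTML, "face-cry"))
```

Parsing the document with BeautifulSoup is usually the more robust choice, since it does not depend on where the line breaks fall.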
Answer 2 (score: 0)
Actually, with your help I've managed to write a nice function that I should be able to reuse over a list of website URLs:
import re
import urllib2
from bs4 import BeautifulSoup
website_list = [urlA, urlB....,urlX]
def ratings(website):
    soup = BeautifulSoup(urllib2.urlopen(website).read())
    # .string already gives the text, so it is not applied a second time when printing
    happy = soup.find("div", class_=re.compile(r"face-smile")).find_next_sibling("div").string
    ok = soup.find("div", class_=re.compile(r"face-ok")).find_next_sibling("div").string
    unhappy = soup.find("div", class_=re.compile(r"face-cry")).find_next_sibling("div").string
    print "happy rating, " + happy
    print "ok rating, " + ok
    print "unhappy rating, " + unhappy

for website in website_list:
    ratings(website)
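For anyone on Python 3, here is a hedged sketch of the same function that takes the page HTML as a string and returns a dict instead of printing. The names parse_ratings, FACES and SAMPLE are assumptions made for this example, and the None handling is an addition for pages that lack a rating block; it is not something the original code did.

```python
import re
from bs4 import BeautifulSoup

# Label -> fragment of the face icon's class name (taken from the answers above).
FACES = {"happy": "face-smile", "ok": "face-ok", "unhappy": "face-cry"}

# Markup copied from the question, used here instead of a live request.
SAMPLE = """
<div class="col PL20">
<div class="sprite-sr2-face-smile1"></div>
<div class="sr2_score_l">25</div>
</div>
<div class="col MR20 MT20 ML20">
<div class="sprite-sr2-face-ok2 MT20"></div>
<div class="sr2_score_m">17</div>
</div>
<div class="col ML10 MT20">
<div class="sprite-sr2-face-cry2 MT20"></div>
<div class="sr2_score_m">2</div>
</div>
"""


def parse_ratings(html):
    """Return {label: count or None} for the three face ratings."""
    soup = BeautifulSoup(html, "html.parser")
    ratings = {}
    for label, fragment in FACES.items():
        face = soup.find("div", class_=re.compile(fragment))
        # find() returns None when nothing matches, so guard before chaining.
        score = face.find_next_sibling("div") if face is not None else None
        ratings[label] = int(score.get_text(strip=True)) if score is not None else None
    return ratings


print(parse_ratings(SAMPLE))
```

In Python 3 the page itself would be fetched with urllib.request.urlopen(url).read() rather than urllib2, and the result passed to parse_ratings().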