Check whether a specific class and value exist in HTML using BeautifulSoup in Python

Time: 2018-05-25 13:34:05

Tags: python selenium-webdriver beautifulsoup selenium-chromedriver

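I am writing a scraping script for the website "yelp.fr", and I need to scrape the number of stars, which is encoded in an auto-generated class: class="i-stars i-stars--regular-4 rating-large" ==> 4 stars; class="i-stars i-stars--regular-3-half rating-large" ==> 3.5 stars.

My question: how can I do this? How can I check whether a class exists on the HTML page or not?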

from bs4 import BeautifulSoup
from selenium import webdriver

CITIES = "la rochelle(17000)"
places = "Bars"
driver = webdriver.Chrome()
driver.get("https://www.yelp.fr/search?find_desc=" + places + "&find_loc=" + CITIES)
page = driver.page_source
soup = BeautifulSoup(page, "lxml")
etoiles = soup.find_all("div", {"class": "biz-rating biz-rating-large clearfix"})

# find_all returns a list-like ResultSet, not a single tag, so this call raises AttributeError.
etoiles.get_attribute("title")
if etoiles:
    print("ok")
else:
    print("not ")

Sometimes the class "biz-rating biz-rating-large clearfix" does not exist on the page.

3 answers:

Answer 0: (score: 0)

The title attribute of the DIV contains the number of stars (the rating), so you can read it from there.
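A minimal sketch of that idea, assuming markup shaped like the class names quoted in the question. The HTML sample, the title text, and the helper name get_rating are made up for illustration and are not taken from the live site. select_one returns None when nothing matches, which covers the "does the class exist" part of the question:

```python
from bs4 import BeautifulSoup

# Hypothetical sample modeled on the class names quoted in the question.
SAMPLE = """
<li class="regular-search-result">
  <div class="biz-rating biz-rating-large clearfix">
    <div class="i-stars i-stars--regular-3-half rating-large" title="3.5 star rating"></div>
  </div>
</li>
<li class="regular-search-result">
  <span>No reviews yet</span>
</li>
"""

def get_rating(result):
    """Return the title of the star div, or None if the rating block is absent."""
    # A CSS selector tests each class independently, so class order does not matter.
    block = result.select_one("div.biz-rating.biz-rating-large")
    if block is None:
        return None
    stars = block.select_one("div.i-stars")
    if stars is None:
        return None
    return stars.get("title")

soup = BeautifulSoup(SAMPLE, "html.parser")
ratings = [get_rating(li) for li in soup.select("li.regular-search-result")]
print(ratings)
```

Checking each search result separately, rather than calling find_all on the whole page, keeps a missing rating tied to the business it belongs to.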

Answer 1: (score: 0)

I used this to solve the problem:

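import re
import requests
from lxml import html
from unidecode import unidecode

# place, city and id (the result offset) are set elsewhere in the script.
yelp_url = "https://www.yelp.com/search?find_desc=%s&find_loc=%s&start=%s" % (place, city, str(id))

headers1 = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36'}
# Send the headers with the request; the original defined them but never used them.
response1 = requests.get(yelp_url, headers=headers1).text
parser = html.fromstring(response1)
print("Parsing the page")
listing1 = parser.xpath("//li[@class='regular-search-result']")
for results in listing1:
    # raw_ratings, cleaned_ratings, raw_price_range, raw_address,
    # is_reservation_available and is_accept_pickup are extracted from
    # results in a part of the script that is not shown here.
    if raw_ratings:
        ratings = re.findall(r"\d+[.,]?\d+", cleaned_ratings)[0]
    else:
        ratings = 0
    price_range = len(''.join(raw_price_range)) if raw_price_range else 0
    address = ' '.join(' '.join(raw_address).split())
    address = unidecode(address)
    reservation_available = True if is_reservation_available else False
    accept_pickup = True if is_accept_pickup else False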
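The snippet above does not show how raw_ratings is obtained. As a self-contained sketch of the same lxml approach, the example below runs against a small hypothetical HTML sample instead of the live site. The XPath used for the rating div is an assumption based on the class names in the question, not something the answer confirms. An XPath query returns an empty list when nothing matches, so the truthiness check doubles as the existence test:

```python
from lxml import html

# Hypothetical sample modeled on the class names quoted in the question.
SAMPLE = """
<ul>
  <li class="regular-search-result">
    <div class="i-stars i-stars--regular-4 rating-large" title="4.0 star rating"></div>
  </li>
  <li class="regular-search-result"><span>New</span></li>
</ul>
"""

tree = html.fromstring(SAMPLE)
found = []
for result in tree.xpath("//li[@class='regular-search-result']"):
    # Assumed XPath: the title attribute of any div whose class contains 'rating-large'.
    raw_ratings = result.xpath(".//div[contains(@class,'rating-large')]/@title")
    # An empty list means this result has no rating div.
    found.append(raw_ratings[0] if raw_ratings else None)
print(found)
```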

Answer 2: (score: 0)

# results is one search-result element from the loop over the listing.
raw_review_count = results.xpath(".//span[contains(@class,'review-count')]//text()")
raw_price_range = results.xpath(".//span[contains(@class,'price-range')]//text()")
if raw_ratings:
    ratings = re.findall(r"\d+[.,]?\d+", cleaned_ratings)[0]
else:
    ratings = 0
price_range = len(''.join(raw_price_range)) if raw_price_range else 0
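One caveat with the rating expression above: the pattern \d+[.,]?\d+ needs at least two digits, so a bare "4" does not match, and indexing [0] on an empty findall result raises IndexError. The helpers below are a hedged alternative using only the standard library. The names parse_rating and rating_from_class are hypothetical, and the second one assumes the "regular-N" / "regular-N-half" class naming described in the question:

```python
import re

def parse_rating(title):
    """Extract a rating such as 4, 4.0 or 3,5 from a title string; 0.0 if none."""
    match = re.search(r"\d+(?:[.,]\d+)?", title or "")
    if match is None:
        return 0.0
    # Accept a comma as the decimal separator, as used on French pages.
    return float(match.group(0).replace(",", "."))

def rating_from_class(class_attr):
    """Map an 'i-stars--regular-3-half' style class string to 3.5; None if absent."""
    match = re.search(r"regular-(\d)(-half)?", class_attr or "")
    if match is None:
        return None
    return int(match.group(1)) + (0.5 if match.group(2) else 0.0)

print(parse_rating("4.0 star rating"))
print(rating_from_class("i-stars i-stars--regular-3-half rating-large"))
```

Reading the rating from the class name is useful as a fallback when the title text is localized or missing.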