How to get a specific element from divs that share the same id and class in Python

Time: 2016-09-01 16:12:08

Tags: python html5 python-3.x beautifulsoup

I want to print "United States" and "California" on separate lines, like this:

Country is : United States
State is : California

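The problem is that every list item has the same class and ID, so when I loop over the list items, "United States" and "California" are printed together.

I hope you can understand what I am trying to say.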



<ul class="breadcrumbs" id="BREADCRUMBS">
  <li class="breadcrumb_item " itemscope="" itemtype="http://data-vocabulary.org/Breadcrumb">
    <a class="breadcrumb_link" href="/Tourism-g191-United_States-Vacations.html" itemprop="url" onclick="ta.setEvtCookie('Breadcrumbs', 'click', 'Country', 1, this.href); ">
      <span itemprop="title">United States</span>
    </a>
    <span class="separator">›</span>
  </li>
  <li class="breadcrumb_item " itemscope="" itemtype="http://data-vocabulary.org/Breadcrumb">
    <a class="breadcrumb_link" href="/Tourism-g28926-California-Vacations.html" itemprop="url" onclick="ta.setEvtCookie('Breadcrumbs', 'click', 'State', 2, this.href); ">
      <span itemprop="title">California (CA)</span>
    </a>
    <span class="separator">›</span>
  </li>
  <li class="breadcrumb_item " itemscope="" itemtype="http://data-vocabulary.org/Breadcrumb"><a class="breadcrumb_link" href="/Tourism-g32655-Los_Angeles_California-Vacations.html" itemprop="url" onclick="ta.setEvtCookie('Breadcrumbs', 'click', 'City', 3, this.href); "><span itemprop="title">Los Angeles</span></a><span class="separator">›</span>
  </li>
  <li class="breadcrumb_item " itemscope="" itemtype="http://data-vocabulary.org/Breadcrumb"><a class="breadcrumb_link" href="/Restaurants-g32655-Los_Angeles_California.html" itemprop="url" onclick="ta.setEvtCookie('Breadcrumbs', 'click', '', 4, this.href); return setOneTimeCookie('mcreset','true');"><span itemprop="title">Los Angeles Restaurants</span></a>
    <span class="separator">›</span>
  </li>
  <li class="breadcrumb_item ">Providence</li>
</ul>

Here is my scraping script, written in Python with BeautifulSoup:

import sys,io,csv,requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
url = "https://www.tripadvisor.com/Restaurant_Review-g32655-d594024-Reviews-Providence-Los_Angeles_California.html"
r = requests.get(url)
r.content
soup = BeautifulSoup(r.content, "html.parser")
maindiv = soup.find_all("body", {"class": "ltr domn_en_US lang_en globalNav2011_reset hr_tabs_placement_test tabs_below_meta scroll_tabs full_width_page content_blocks css_commerce_buttons flat_buttons sitewide xo_pin_user_review_to_top track_back"})
for div in maindiv:
	divone = soup.find_all("div", {"id": "PAGE"})
	for listitem in divone:
		div12 = soup.find_all("div", {"class": "breadCrumbBackground blue bgwhite "})
		for listitem in div12:
			ulpart = soup.find_all("ul", {"class": "breadcrumbs"})
			for unorder in ulpart[0]:
				div2 = soup.find_all("li", {"class": "breadcrumb_item "})
				for listitem in div2:
					tag = soup.find_all("a", {"class": "breadcrumb_link"})
					for spandiv in tag:
						span = soup.find_all("span", {"itemprop": "title"})
						for country_name in span:
							
							print(country_name.text)
				
				

1 answer:

Answer 0 (score: 1)

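The relevant part is in the onclick attribute, which identifies which breadcrumb is the country and which is the state. I would use partial matching with the *= CSS selector: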

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

data = u"""Your HTML"""

soup = BeautifulSoup(data, "html.parser")

country = soup.select_one("li.breadcrumb_item a[onclick*=Country]").get_text(strip=True)
state = soup.select_one("li.breadcrumb_item a[onclick*=State]").get_text(strip=True)

print("The country is: '%s'" % country)
print("The state is: '%s'" % state)

Prints:

The country is: 'United States'
The state is: 'California (CA)'
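
For anyone who wants to run this without fetching the page, here is a minimal, self-contained sketch of the same idea. It assumes the onclick attribute always has the shape shown in the question, ta.setEvtCookie('Breadcrumbs', 'click', '<label>', ...). The helper name breadcrumb_labels and the regular expression are my own illustration, not part of BeautifulSoup, and the HTML is a trimmed copy of the markup from the question. Note that the script in the question calls soup.find_all at every nesting level, so each loop searches the whole document again. The sketch searches from the ul element instead.

```python
import re
from bs4 import BeautifulSoup

# Trimmed copy of the breadcrumb markup from the question.
HTML = """
<ul class="breadcrumbs" id="BREADCRUMBS">
  <li class="breadcrumb_item "><a class="breadcrumb_link" href="/Tourism-g191-United_States-Vacations.html" onclick="ta.setEvtCookie('Breadcrumbs', 'click', 'Country', 1, this.href); "><span itemprop="title">United States</span></a></li>
  <li class="breadcrumb_item "><a class="breadcrumb_link" href="/Tourism-g28926-California-Vacations.html" onclick="ta.setEvtCookie('Breadcrumbs', 'click', 'State', 2, this.href); "><span itemprop="title">California (CA)</span></a></li>
  <li class="breadcrumb_item "><a class="breadcrumb_link" href="/Tourism-g32655-Los_Angeles_California-Vacations.html" onclick="ta.setEvtCookie('Breadcrumbs', 'click', 'City', 3, this.href); "><span itemprop="title">Los Angeles</span></a></li>
  <li class="breadcrumb_item "><a class="breadcrumb_link" href="/Restaurants-g32655-Los_Angeles_California.html" onclick="ta.setEvtCookie('Breadcrumbs', 'click', '', 4, this.href); "><span itemprop="title">Los Angeles Restaurants</span></a></li>
  <li class="breadcrumb_item ">Providence</li>
</ul>
"""

# Assumption: the label is the third quoted argument of ta.setEvtCookie(...).
LABEL_RE = re.compile(r"setEvtCookie\('Breadcrumbs',\s*'click',\s*'([^']*)'")


def breadcrumb_labels(html):
    """Map each onclick label (e.g. 'Country') to its breadcrumb title."""
    soup = BeautifulSoup(html, "html.parser")
    crumbs = soup.find("ul", id="BREADCRUMBS")
    result = {}
    if crumbs is None:
        return result
    # Search from `crumbs`, not from `soup`, so only this list is scanned.
    for link in crumbs.select("a.breadcrumb_link[onclick]"):
        match = LABEL_RE.search(link["onclick"])
        # Skip links whose label is missing or empty (the fourth breadcrumb).
        if match and match.group(1):
            result[match.group(1)] = link.get_text(strip=True)
    return result


labels = breadcrumb_labels(HTML)
print("Country is :", labels.get("Country"))
print("State is :", labels.get("State"))
```

Collecting the labels into a dict keeps each value separate, so "City" is available as well without writing a third selector. If the site changes the format of the onclick attribute, the regular expression will simply match nothing and the dict will be empty.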