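I want to print "United States" and "California" on separate lines, like this:
Country is : United States
State is : California
The problem is that every list item has the same class and ID, so when I loop over the list items, United States and California come out together.
I hope you can understand what I mean.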
<ul class="breadcrumbs" id="BREADCRUMBS">
<li class="breadcrumb_item " itemscope="" itemtype="http://data-vocabulary.org/Breadcrumb">
<a class="breadcrumb_link" href="/Tourism-g191-United_States-Vacations.html" itemprop="url" onclick="ta.setEvtCookie('Breadcrumbs', 'click', 'Country', 1, this.href); ">
<span itemprop="title">United States</span>
</a>
<span class="separator">›</span>
</li>
<li class="breadcrumb_item " itemscope="" itemtype="http://data-vocabulary.org/Breadcrumb">
<a class="breadcrumb_link" href="/Tourism-g28926-California-Vacations.html" itemprop="url" onclick="ta.setEvtCookie('Breadcrumbs', 'click', 'State', 2, this.href); ">
<span itemprop="title">California (CA)</span>
</a>
<span class="separator">›</span>
</li>
<li class="breadcrumb_item " itemscope="" itemtype="http://data-vocabulary.org/Breadcrumb"><a class="breadcrumb_link" href="/Tourism-g32655-Los_Angeles_California-Vacations.html" itemprop="url" onclick="ta.setEvtCookie('Breadcrumbs', 'click', 'City', 3, this.href); "><span itemprop="title">Los Angeles</span></a><span class="separator">›</span>
</li>
<li class="breadcrumb_item " itemscope="" itemtype="http://data-vocabulary.org/Breadcrumb"><a class="breadcrumb_link" href="/Restaurants-g32655-Los_Angeles_California.html" itemprop="url" onclick="ta.setEvtCookie('Breadcrumbs', 'click', '', 4, this.href); return setOneTimeCookie('mcreset','true');"><span itemprop="title">Los Angeles Restaurants</span></a>
<span class="separator">›</span>
</li>
<li class="breadcrumb_item ">Providence</li>
</ul>
Here is my scraping script, written in Python with BeautifulSoup:
import sys, io, csv, requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
url = "https://www.tripadvisor.com/Restaurant_Review-g32655-d594024-Reviews-Providence-Los_Angeles_California.html"
r = requests.get(url)
r.content
soup = BeautifulSoup(r.content, "html.parser")
maindiv = soup.find_all("body", {"class": "ltr domn_en_US lang_en globalNav2011_reset hr_tabs_placement_test tabs_below_meta scroll_tabs full_width_page content_blocks css_commerce_buttons flat_buttons sitewide xo_pin_user_review_to_top track_back"})
for div in maindiv:
    divone = soup.find_all("div", {"id": "PAGE"})
    for listitem in divone:
        div12 = soup.find_all("div", {"class": "breadCrumbBackground blue bgwhite "})
        for listitem in div12:
            ulpart = soup.find_all("ul", {"class": "breadcrumbs"})
            for unorder in ulpart[0]:
                div2 = soup.find_all("li", {"class": "breadcrumb_item "})
                for listitem in div2:
                    tag = soup.find_all("a", {"class": "breadcrumb_link"})
                    for spandiv in tag:
                        span = soup.find_all("span", {"itemprop": "title"})
                        for country_name in span:
                            print(country_name.text)
Answer 0 (score: 1)
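The relevant part is the onclick attribute of each link: its arguments say which breadcrumb is the country and which is the state. I would match on it with the *= CSS selector, which does a partial (substring) match on an attribute value: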
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
data = u"""Your HTML"""
soup = BeautifulSoup(data, "html.parser")
country = soup.select_one("li.breadcrumb_item a[onclick*=Country]").get_text(strip=True)
state = soup.select_one("li.breadcrumb_item a[onclick*=State]").get_text(strip=True)
print("The country is: '%s'" % country)
print("The state is: '%s'" % state)
Prints:
The country is: 'United States'
The state is: 'California (CA)'
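If you need every labeled breadcrumb rather than just these two, the same onclick idea can be generalized. The following is only a sketch under some assumptions: the markup is a trimmed copy of the HTML from the question, and LEVEL_RE and breadcrumb_levels are hypothetical names introduced here, not part of BeautifulSoup. It assumes the level label is always the quoted argument that follows 'click' in the onclick call, and it skips breadcrumbs whose label is empty.

```python
import re
from bs4 import BeautifulSoup

# Trimmed copy of the breadcrumb markup from the question.
html = """
<ul class="breadcrumbs" id="BREADCRUMBS">
<li class="breadcrumb_item "><a class="breadcrumb_link" onclick="ta.setEvtCookie('Breadcrumbs', 'click', 'Country', 1, this.href); "><span itemprop="title">United States</span></a></li>
<li class="breadcrumb_item "><a class="breadcrumb_link" onclick="ta.setEvtCookie('Breadcrumbs', 'click', 'State', 2, this.href); "><span itemprop="title">California (CA)</span></a></li>
<li class="breadcrumb_item "><a class="breadcrumb_link" onclick="ta.setEvtCookie('Breadcrumbs', 'click', 'City', 3, this.href); "><span itemprop="title">Los Angeles</span></a></li>
<li class="breadcrumb_item "><a class="breadcrumb_link" onclick="ta.setEvtCookie('Breadcrumbs', 'click', '', 4, this.href); "><span itemprop="title">Los Angeles Restaurants</span></a></li>
<li class="breadcrumb_item ">Providence</li>
</ul>
"""

# Hypothetical pattern (an assumption about the onclick format):
# captures the quoted argument that follows 'click'.
LEVEL_RE = re.compile(r"'click',\s*'([^']*)'")


def breadcrumb_levels(markup):
    """Map each non-empty onclick label (Country, State, ...) to its link text."""
    soup = BeautifulSoup(markup, "html.parser")
    levels = {}
    # Only links that actually carry an onclick attribute are considered.
    for link in soup.select("ul.breadcrumbs a.breadcrumb_link[onclick]"):
        match = LEVEL_RE.search(link["onclick"])
        if match and match.group(1):  # skip breadcrumbs with an empty label
            levels[match.group(1)] = link.get_text(strip=True)
    return levels


levels = breadcrumb_levels(html)
for level, title in levels.items():
    print("%s is : %s" % (level, title))
```

With the markup above this gives one line each for Country, State and City. The "Los Angeles Restaurants" link is skipped because its label is an empty string, and "Providence" is skipped because it has no link at all.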