Grabbing href from <a> using BeautifulSoup (in between two other tags)

Date: 2018-02-16 23:18:14

Tags: python beautifulsoup screen-scraping

Can you please help me solve a problem in Python based on this HTML code:

<h2 class="sectionTitle">One</h2>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1230559">Text1</a></div>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1648920">Text2</a></div>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1207230">Text3</a></div><div>
<h2 class="sectionTitle">Two</h2>

I'm trying to grab the strings (Text1, Text2 ...) as well as the href links in between the two h2 tags.

Grabbing the strings worked fine by jumping to the h2 tag (with string="One") and then walking through the siblings until reaching the next h2 node while grabbing everything on the way.

import urllib.request
from bs4 import BeautifulSoup

page = urllib.request.urlopen(url)  # url: the page containing the HTML shown above
soup = BeautifulSoup(page, "lxml")

education = []
edu = soup.find("h2", string="One")
for elt in edu.nextSiblingGenerator():
    if elt.name == "h2":
        break
    if hasattr(elt, "text"):
        education.append(elt.text + "\n")
print("".join(education))

I can't manage to replicate this in order to collect the links from the <a>-tag in an additional list. I was amateurishly going for stuff like education2.append(elt2.get("href")) with very limited success. Any ideas?
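Roughly, what I was attempting looked like the sketch below (simplified; I suspect it only yields None values because the siblings are the wrapping <div> tags, not the <a> tags inside them):

# a sketch of my attempt, not a working solution
education2 = []
for elt2 in edu.nextSiblingGenerator():
    if elt2.name == "h2":
        break
    # elt2 is the surrounding <div> (or a newline string), not the <a>,
    # so asking it for "href" just returns None
    if hasattr(elt2, "get"):
        education2.append(elt2.get("href"))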

Thank you!!

4 Answers:

Answer 0 (score: 2)

You can try this:

from bs4 import BeautifulSoup as soup
l = """
<h2 class="sectionTitle">One</h2>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1230559">Text1</a></div>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1648920">Text2</a></div>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1207230">Text3</a></div><div>
<h2 class="sectionTitle">Two</h2>
"""
s = soup(l, 'lxml')
final_text = [i.text for i in s.find_all('a')]

Output:

[u'Text1', u'Text2', u'Text3']
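If the href values are needed as well, a similar comprehension should do it; a small sketch reusing the same s object:

# collect the href attribute of every <a> that has one
final_links = [i['href'] for i in s.find_all('a', href=True)]
# ['../../snapshot.asp?carId=1230559', '../../snapshot.asp?carId=1648920', '../../snapshot.asp?carId=1207230']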

Answer 1 (score: 2)

Improving on @Ajax1234's answer: this will only find the tags that have the itemprop attribute. See find_all().

from bs4 import BeautifulSoup as soup
l = """
<h2 class="sectionTitle">One</h2>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1230559">Text1</a></div>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1648920">Text2</a></div>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1207230">Text3</a></div><div>
<h2 class="sectionTitle">Two</h2>
"""
s = soup(l, 'lxml')
final_text = [i.text for i in s.find_all("a", attrs={"itemprop": "affiliation"})]
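For what it's worth, the same attribute filter can also be written as a CSS selector via select(); a sketch on the same s object:

# equivalent filter expressed as a CSS selector
final_text = [i.text for i in s.select('a[itemprop="affiliation"]')]
final_links = [i.get('href') for i in s.select('a[itemprop="affiliation"]')]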

Answer 2 (score: 2)

You were very close to doing what you wanted. I made a few changes.

This will give you what you want:

from bs4 import BeautifulSoup

html = '''<h2 class="sectionTitle">One</h2>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1230559">Text1</a></div>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1648920">Text2</a></div>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1207230">Text3</a></div>
<div>dummy</div>
<h2 class="sectionTitle">Two</h2>'''

soup = BeautifulSoup(html, 'lxml')
texts = []
links = []
for tag in soup.find('h2', text='One').find_next_siblings():
    if tag.name == 'h2':
        break
    a = tag.find('a', itemprop='affiliation', href=True, text=True)
    if a:
        texts.append(a.text)
        links.append(a['href'])

print(texts, links, sep='\n')

Output:

['Text1', 'Text2', 'Text3']
['../../snapshot.asp?carId=1230559', '../../snapshot.asp?carId=1648920', '../../snapshot.asp?carId=1207230']

I added a dummy <div> tag with no child tags to show that the code doesn't break in such cases.

If the HTML has no <a> tags with itemprop="affiliation" other than the ones you want, you can directly use:

texts = [x.text for x in soup.find_all('a', itemprop='affiliation', text=True)]
links = [x['href'] for x in soup.find_all('a', itemprop='affiliation', href=True)]
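If the two lists need to stay aligned, another option is to collect (text, href) pairs in a single pass; a sketch using the same soup:

# one pass: keep each text together with its href so they cannot get out of sync
pairs = [(a.text, a['href'])
         for a in soup.find_all('a', itemprop='affiliation', href=True)]
texts = [t for t, _ in pairs]
links = [h for _, h in pairs]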

Answer 3 (score: 1)

My approach to solving the problem is as follows:

from bs4 import BeautifulSoup
html = '''
<h2 class="sectionTitle">One</h2>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1230559">Text1</a></div>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1648920">Text2</a></div>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1207230">Text3</a></div><div>
<h2 class="sectionTitle">Two</h2>
'''
soup = BeautifulSoup(html, "html.parser")

# Extract the texts
result1 = [i.text.strip('\n') for i in soup.find_all('div')]
print(result1)

# Extract the HREF links
result2 = [j['href'] for j in soup.find_all('a',href=True)]
print(result2)

The list result1 contains the text found between the <div> tags, while the list result2 contains the href links from the <a> tags.

Output:

['Text1', 'Text2', 'Text3', 'Two']
['../../snapshot.asp?carId=1230559', '../../snapshot.asp?carId=1648920', '../../snapshot.asp?carId=1207230']
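Note that result1 also contains 'Two': with html.parser the trailing unclosed <div> ends up wrapping the second <h2>, so its text is collected as well. If only the link texts are wanted, one way (a sketch) is to take the text from the <a> tags instead of the <div> tags:

# read the text from the <a> tags so the stray <div> around "Two" is ignored
result1 = [a.text for a in soup.find_all('a', href=True)]
print(result1)  # ['Text1', 'Text2', 'Text3']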

Hope this solution solves the problem!