I am trying to scrape the following HTML with BeautifulSoup.
<div …. > <div…..>
<div class="class1">Jill</div> <div class="class2">50</div>
<div class="class1">Jane</div>
<div class="class1">Joe</div> <div class="class2">12</div>
</div></div>
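Not every person has the second item to scrape, so something like soup.find_all("div", attrs={"class": "class2"}) won't work: it returns both 50 and 12, but nothing ties the 12 to the right person.
Desired result (in variables):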
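To make the problem concrete, here is a minimal sketch of the flat lookup. The two outer divs are elided in the original markup, so the bare wrappers below are an assumption, and `naive_pairs` is an illustrative name only.

```python
from bs4 import BeautifulSoup

# Assumed wrapper markup: the attributes of the outer divs are elided in the question.
html = """
<div><div>
<div class="class1">Jill</div> <div class="class2">50</div>
<div class="class1">Jane</div>
<div class="class1">Joe</div> <div class="class2">12</div>
</div></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Two independent flat lists: the link between a name and its age is lost.
names = [tag.get_text() for tag in soup.find_all("div", attrs={"class": "class1"})]
ages = [tag.get_text() for tag in soup.find_all("div", attrs={"class": "class2"})]

print(names)  # three names
print(ages)   # only two ages, with no record of whose they are

# Pairing by position hands the 12 to Jane instead of Joe.
naive_pairs = list(zip(names, ages))
print(naive_pairs)
```

Any fix therefore has to start from each name element and look at what immediately follows it, rather than collecting the two classes separately.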
Jill 50
Jane
Joe 12
Answer 0 (score: 1)
You can get all the name ('class1') elements and check whether each one has a corresponding age ('class2') element.
from bs4 import BeautifulSoup

html = """
<div class='parent'>
<div class="class1">Jill</div> <div class="class2">50</div>
<div class="class1">Jane</div>
<div class="class1">Joe</div> <div class="class2">12</div>
</div>
"""

# Name the parser explicitly to avoid the "no parser specified" warning
soup = BeautifulSoup(html, 'html.parser')
name_tags = soup.find_all('div', {'class': 'class1'})
name_age_pairs = []

# Iterate through all 'class1' elements and see if the next sibling div is 'class2'
for name_tag in name_tags:
    # find_next_sibling returns None when no div follows, so guard against it
    name_next_div = name_tag.find_next_sibling('div')
    age = None
    if name_next_div is not None and 'class2' in name_next_div.get('class', []):
        age = int(name_next_div.string)
    name_age_pairs.append((name_tag.string, age))

print(name_age_pairs)
name_age_pairs will contain:
[('Jill', 50), ('Jane', None), ('Joe', 12)]
None means that the second person (Jane) has no age.
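As a possible follow-up, the list above can be turned into the lines the question asks for. This is only a sketch: the list is copied from the output shown above, and `format_pair` is a hypothetical helper name, not part of BeautifulSoup.

```python
# Result list as shown above; None marks a person without an age.
name_age_pairs = [('Jill', 50), ('Jane', None), ('Joe', 12)]


def format_pair(name, age):
    """Return 'name age', or just 'name' when the age is missing."""
    return name if age is None else f"{name} {age}"


lines = [format_pair(name, age) for name, age in name_age_pairs]
print("\n".join(lines))
```

Keeping None in the data and dropping it only when formatting makes it possible to tell a missing age apart from an empty string later on.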
Answer 1 (score: 0)
Try this:
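# Assumes `soup` is an already-parsed BeautifulSoup object
pairs = []
for div in soup.find_all('div', {'class': 'class1'}):
    name = div.text
    item = ''
    tmp = div.find_next('div')
    # find_next returns None after the last div, so guard before checking the class
    if tmp is not None and 'class2' in tmp.get('class', []):
        item = tmp.text
    pairs.append([name, item])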
Answer 2 (score: 0)
This is what I ended up using. It works with multiple values and spaces in the class name.
# default values for vars
Item1 = Item2 = Item3 = ""

for item in soup.find_all('div'):
    # convert to str for comparison reasons
    strItem = str(item)
    if strItem.find("class1") > 0 and item.string is not None:
        if Item1 != "":  # if you have None as default change this
            print(Item1, Item2, Item3)  # or make list, dict, json, csv, sql......
            Item2 = Item3 = ""  # default values for vars
        Item1 = item.string
    elif strItem.find("class2") > 0 and item.string is not None:
        Item2 = item.string
    elif strItem.find("class3") > 0 and item.string is not None:
        Item3 = item.string
    # and so on....

# don't forget to process the last one...
print(Item1, Item2, Item3)  # or make list, dict, json, csv, sql......
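A variation on the same idea, sketched here under assumptions, matches on class tokens instead of on the string form of each tag. The markup below is hypothetical (the extra class names and the class3 value are invented for illustration), and `FIELDS` and `group_records` are illustrative names. It relies on BeautifulSoup treating `class` as a multi-valued attribute, so `class_="class1"` matches a div whose class list contains class1 alongside other values.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: extra class tokens and a third field (class3).
html = """
<div class="parent">
<div class="class1 highlight">Jill</div> <div class="class2">50</div> <div class="class3">NY</div>
<div class="class1">Jane</div>
<div class="big class1">Joe</div> <div class="class2 small">12</div>
</div>
"""

FIELDS = ("class1", "class2", "class3")


def group_records(markup):
    """Group each class1 div with the class2/class3 sibling divs that follow it."""
    soup = BeautifulSoup(markup, "html.parser")
    records = []
    for start in soup.find_all("div", class_="class1"):
        record = dict.fromkeys(FIELDS, "")
        record["class1"] = start.get_text(strip=True)
        # Walk the following sibling divs until the next record starts.
        for sibling in start.find_next_siblings("div"):
            classes = sibling.get("class", [])
            if "class1" in classes:
                break
            for field in FIELDS[1:]:
                if field in classes:
                    record[field] = sibling.get_text(strip=True)
        records.append(record)
    return records


records = group_records(html)
for record in records:
    print(record["class1"], record["class2"], record["class3"])
```

Because each record is built while walking forward from its own class1 div, there is no trailing "last one" to flush after the loop, and a missing field simply stays an empty string.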