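I'm doing some web scraping. The page has several h4 tags, each followed by a list. I want to scrape the items of each list and pair each item with the ID of the h4 tag it belongs to. Here is the HTML: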
<h4 class="dataHeaderWithBorder" id="Production" name="production">Production</h4>
<ul class="simpleList">
<li><a href="/company/co0308?ref_=xtco_co_1">Red Claw </a></li>
<li><a href="/company/co0386?ref_=xtco_co_2">Haven </a></li>
<li><a href="/company/co0487?ref_=xtco_co_3">Frame</a></li>
</ul>
<h4 class="dataHeaderWithBorder" id="Distribution" name="Distribution">Distribution</h4>
<ul class="simpleList">
<li><a href="/company/co0017?ref_=xtco_co_1">Broadside Attractions</a> </li>
<li><a href="/company/co0208?ref_=xtco_co_2"> Global Acquisitions</a></li>
</ul>
This is the data I want:
Production, Red Claw
Production, Haven
Production, Frame
Distribution, Broadside Attractions
Distribution, Global Acquisitions
I can get all the items of both lists at once, but I can't get the ID. My code is as follows:
for h4 in soup.find_all('h4', attrs={'class':'dataHeaderWithBorder'}):
    id = h4.get_text()
    #print(id)
    for ul in h4.find_all('ul', attrs={'class':'simpleList'}):
        #print(ul)
        # Find the items that mention a budget
        productionCompany = ul.find_all('a')
        for company in productionCompany:
            text = company.get_text()
            print(id, text)
            productionComps.append(id, text)
I don't know how to get the ID from each h4 tag. If I remove the first two lines and replace h4.find_all with soup.find_all, my output ends up looking like this:
Red Claw
Haven
Frame
Broadside Attractions
Global Acquisitions
Answer 0 (score: 1)
id = h4.get_text()
The id is not the element's text; it is an attribute. In BeautifulSoup, an element's attributes are accessed like dictionary keys. Try this:
item_id = h4['id']
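Reading the attribute does not by itself fix the question's loop: each ul comes after its h4 as a sibling rather than sitting inside it, so h4.find_all('ul') finds nothing. Below is a minimal sketch of one way to combine the two, assuming every header's list is the next ul sibling. The html string is a trimmed copy of the question's markup, and rows is a name chosen only for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the HTML from the question
html = """
<h4 class="dataHeaderWithBorder" id="Production" name="production">Production</h4>
<ul class="simpleList">
<li><a href="/company/co0308?ref_=xtco_co_1">Red Claw </a></li>
<li><a href="/company/co0386?ref_=xtco_co_2">Haven </a></li>
</ul>
<h4 class="dataHeaderWithBorder" id="Distribution" name="Distribution">Distribution</h4>
<ul class="simpleList">
<li><a href="/company/co0017?ref_=xtco_co_1">Broadside Attractions</a> </li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []  # illustrative name: collects (id, company) pairs
for h4 in soup.find_all("h4", class_="dataHeaderWithBorder"):
    item_id = h4["id"]  # attribute lookup; h4.get("id") would return None if absent
    # Assumption: the list that belongs to a header is its next <ul> sibling
    ul = h4.find_next_sibling("ul", class_="simpleList")
    if ul is None:
        continue
    for a in ul.find_all("a"):
        # strip=True drops the stray whitespace around the link text
        rows.append((item_id, a.get_text(strip=True)))

for item_id, name in rows:
    print(f"{item_id}, {name}")
```

If a header could appear without a list of its own, the sibling lookup would pick up the next header's list, so this sketch only fits pages shaped like the one in the question.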
Answer 1 (score: 1)
You can use itertools.groupby:
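from itertools import groupby
from bs4 import BeautifulSoup as soup

# data is assumed to hold the HTML from the question as a string
# Collect [tag name, stripped text] for every h4 and a tag, in document order
d = [[i.name, i.get_text(strip=True)] for i in soup(data, 'html.parser').find_all(['h4', 'a'])]
# Split into alternating runs: one h4 header, then the links that follow it
new_d = [list(b) for _, b in groupby(d, key=lambda x: x[0] == 'h4')]
# Pair each header text with the link texts from the run right after it
grouped = [[new_d[i][0][-1], [a for _, a in new_d[i+1]]] for i in range(0, len(new_d), 2)]
result = '\n'.join('\n'.join(f'{a}, {i}' for i in b) for a, b in grouped)
print(result)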
Output:
Production, Red Claw
Production, Haven
Production, Frame
Distribution, Broadside Attractions
Distribution, Global Acquisitions
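The grouping step is easier to follow on plain data. The sketch below shows how groupby splits a sequence into consecutive runs whenever the key changes. The pairs list is hypothetical, standing in for the [tag name, text] pairs built from the parsed page.

```python
from itertools import groupby

# Hypothetical [tag name, text] pairs, in document order
pairs = [
    ["h4", "Production"],
    ["a", "Red Claw"],
    ["a", "Haven"],
    ["h4", "Distribution"],
    ["a", "Broadside Attractions"],
]

# groupby starts a new run each time the key changes, so the runs alternate
# between a header run (key True) and a run of links (key False)
runs = [
    (is_header, list(items))
    for is_header, items in groupby(pairs, key=lambda p: p[0] == "h4")
]

for is_header, items in runs:
    print(is_header, items)
```

Because runs are formed only from consecutive items, two headers with no links between them would merge into a single run, and pairing the runs by index would then go out of step.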
Answer 2 (score: 1)
Use zip:
h4_list = soup.find_all('h4', attrs={'class':'dataHeaderWithBorder'})
ul_list = soup.find_all('ul', attrs={'class':'simpleList'})
productionComps = []
for h4, ul in zip(h4_list, ul_list):
    id_ = h4.get_text()
    productionCompany = ul.find_all('a')
    for company in productionCompany:
        text = company.get_text()
        print(id_, text)
        productionComps.append((id_, text))
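One thing to keep in mind with this approach: zip pairs items purely by position and stops at the shorter input, so it relies on the page having exactly one list per header, in the same order. A small sketch with hypothetical stand-in lists (not taken from the page) shows the behavior:

```python
# Hypothetical stand-ins for the header texts and the per-list link texts
headers = ["Production", "Distribution", "Sales"]
lists = [["Red Claw", "Haven"], ["Broadside Attractions"]]

# zip pairs items by position and stops at the shorter input
pairs = list(zip(headers, lists))
print(pairs)

# A length check makes a mismatch visible instead of silently dropping "Sales"
if len(headers) != len(lists):
    print(f"mismatch: {len(headers)} headers, {len(lists)} lists")
```

A check like this before the loop is cheap and flags pages where a header has no list.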