I have the following HTML code:
<div class="info">
<div class="left-wrap"><span class="date">DATE-1</span></div>
</div>
<div class="clients-list">
<div>
<span class="client" >client1</span>
<span class="client" >client2</span>
<span class="client" >client3</span>
</div>
</div>
<div class="info">
<div class="left-wrap"><span class="date" >DATE-2</span></div>
</div>
<div class="clients-list">
<div>
<span class="client" >client4</span>
<span class="client" >client5</span>
</div>
</div>
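I want to get the data associated with each date. This is what I have done so far:
date = []
clients = []
# Collect the text of every date span in the document
for item in soup.find_all(class_='date'):
    date.append(item.get_text().strip())
# Collect the text of every client span in the document
for item in soup.find_all(class_='client'):
    clients.append(item.get_text().strip())
print(date)
print(clients)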
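What I end up with is a date list containing "date1" and "date2", and a clients list containing client1 through client5.
My problem is that I can't map the clients to their dates: for example, client1, client2 and client3 belong to date1. I also can't find a way to tell how many clients fall under each date.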
Answer 0 (score: 2)
You can use itertools.groupby:
from bs4 import BeautifulSoup as soup
import itertools as it, re
data = soup(html, 'html.parser').find_all('span', {'class':re.compile('client|date')})
r = [[i.text for i in b] for _, b in it.groupby(data, key=lambda x:x['class'][0] == 'client')]
result = {r[i][0]:r[i+1] for i in range(0, len(r), 2)}
Output:
{'DATE-1': ['client1', 'client2', 'client3'], 'DATE-2': ['client4', 'client5']}
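To make the grouping step easier to follow, here is a minimal sketch of the same idea on plain tuples, without BeautifulSoup. The `spans` list is a hypothetical stand-in for the matched span tags in document order; it is not part of the original answer.

```python
from itertools import groupby

# Hypothetical stand-in for the matched <span> tags, in document order:
# each entry is (class name, text).
spans = [
    ("date", "DATE-1"),
    ("client", "client1"),
    ("client", "client2"),
    ("client", "client3"),
    ("date", "DATE-2"),
    ("client", "client4"),
    ("client", "client5"),
]

# groupby only merges *consecutive* items that share a key, so the
# sequence is split into alternating runs: dates, clients, dates, clients.
runs = [
    [text for _, text in group]
    for _, group in groupby(spans, key=lambda s: s[0] == "client")
]

# Pair each date run (even index) with the client run that follows it.
result = {runs[i][0]: runs[i + 1] for i in range(0, len(runs), 2)}
print(result)
```

Note that this pairing relies on every date being followed by at least one client. If a date had no clients, two dates would land in the same run and the even/odd pairing would no longer line up.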
Answer 1 (score: 2)
Try this. Use find_next() to locate the next clients-list div after each date, then use find_all() to get the client span tags inside it.
from bs4 import BeautifulSoup
html='''<div class="info">
<div class="left-wrap"><span class="date">DATE-1</span></div>
</div>
<div class="clients-list">
<div>
<span class="client" >client1</span>
<span class="client" >client2</span>
<span class="client" >client3</span>
</div>
</div>
<div class="info">
<div class="left-wrap"><span class="date" >DATE-2</span></div>
</div>
<div class="clients-list">
<div>
<span class="client" >client4</span>
<span class="client" >client5</span>
</div>
</div>'''
soup=BeautifulSoup(html,'html.parser')
dates=soup.find_all(class_='date')
for date in dates:
    print(date.text)
    for item in date.find_next(class_='clients-list').find_all(class_='client'):
        print(item.text)
Output:
DATE-1
client1
client2
client3
DATE-2
client4
client5
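Since the question also asks how many clients fall under each date, the same find_next() approach could be extended to build a dictionary instead of printing. This is only a sketch: the name `clients_by_date` and the shortened HTML are illustrative assumptions, and the `None` check is an added precaution for a date that has no clients-list after it.

```python
from bs4 import BeautifulSoup

# Shortened version of the HTML from the question.
html = """
<div class="info"><div class="left-wrap"><span class="date">DATE-1</span></div></div>
<div class="clients-list"><div>
<span class="client">client1</span>
<span class="client">client2</span>
<span class="client">client3</span>
</div></div>
<div class="info"><div class="left-wrap"><span class="date">DATE-2</span></div></div>
<div class="clients-list"><div>
<span class="client">client4</span>
<span class="client">client5</span>
</div></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Hypothetical result container: date text -> list of client names.
clients_by_date = {}
for date in soup.find_all(class_="date"):
    # First element with class "clients-list" that appears after this date.
    client_list = date.find_next(class_="clients-list")
    if client_list is None:
        names = []
    else:
        names = [c.get_text(strip=True) for c in client_list.find_all(class_="client")]
    clients_by_date[date.get_text(strip=True)] = names

# Report each date with its client count and names.
for date_text, names in clients_by_date.items():
    print(date_text, len(names), names)
```

With the mapping in hand, `len(clients_by_date[some_date])` gives the number of clients for that date.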