I'm scraping data from this website with BeautifulSoup and Python. Using find_all I've already extracted the table rows, and now I want to extract the table data from each row and append each row of data to an array, for example:
[12:55, Beverello, Ischia, http://alilauronew.forth-crs.gr/italian_b2c/npgres.exe?PM=BM;...., ....]
Any help with this would be appreciated. Here is a sample of the extracted data:
[<tr class="grey1">
<td class="orario">12:55</td>
<td class="aL">Beverello</td>
<td class="aL">Ischia</td>
<td>
<div class="circle green">
<span class="tt-container">
<span class="tt-arrow"></span>
<span class="tt-text">corsa regolare</span>
</span>
</div>
</td>
<td><a href="http://alilauronew.forth-crs.gr/italian_b2c
/npgres.exe?PM=BM" target="_blank"><img src="/templates
/frontend/images/carrello.png"/></a></td>
</tr>
[<tr class="grey1">
<td class="orario">14:45</td>
<td class="aL">Ischia</td>
<td class="aL">Beverello</td>
<td>
<div class="circle green">
<span class="tt-container">
<span class="tt-arrow"></span>
<span class="tt-text">corsa regolare</span>
</span>
</div>
</td>
<td><a href="http://alilauronew.forth-crs.gr/italian_b2c
/npgres.exe?PM=BM" target="_blank"><img src="/templates
/frontend/images/carrello.png"/></a></td>
</tr>
Answer 0 (score: 2)
from bs4 import BeautifulSoup
html="""
<tr class="grey1">
<td class="orario">12:55</td>
<td class="aL">Beverello</td>
<td class="aL">Ischia</td>
<td>
<div class="circle green">
<span class="tt-container">
<span class="tt-arrow"></span>
<span class="tt-text">corsa regolare</span>
</span>
</div>
</td>
<td><a href="http://alilauronew.forth-crs.gr/italian_b2c
/npgres.exe?PM=BM" target="_blank"><img src="/templates
/frontend/images/carrello.png"/></a></td>
</tr>
"""
soup = BeautifulSoup(html, "lxml")
# grab the stripped text of every <td> in the row
row = [td.text.strip() for td in soup.findAll('td')]
print(row)
Output:
['12:55', 'Beverello', 'Ischia', 'corsa regolare', '']
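The trailing '' comes from the last <td>, which only contains a link. If you also want that link's href, and want to process every extracted row rather than a single one, a minimal sketch along these lines should work (it assumes html holds all of the <tr> markup you extracted, not just the one row above):

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")

rows = []
for tr in soup.find_all('tr'):
    # stripped text of every cell in this row
    cells = [td.text.strip() for td in tr.find_all('td')]
    if not cells:
        continue
    # the last cell only holds a link, so use its href instead of its (empty) text
    link = tr.find('a')
    if link is not None:
        cells[-1] = link['href']
    rows.append(cells)

print(rows)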
Answer 1 (score: 1)
Here is a fully working example; the data list will contain everything you want, without the noise (empty strings, etc.):
import requests
from bs4 import BeautifulSoup

response = requests.get('http://www.alilauro.it').text
bs = BeautifulSoup(response, 'lxml')

data = []
# I don't want to scrape the headers, so I'm slicing the list, omitting the first element
no_header = list(bs.select('#partenze tr'))[1:]

for tr in no_header:
    td = tr.select('td')
    data.append({
        'ORA': td[0].text,
        'PARTENZA DA': td[1].text,
        'ARRIVO A': td[2].text,
        'ACQUISTA': td[4].select('a')[0].attrs['href']
    })

print(data)
Note:
- the requests library is only used to make the HTTP request; you can use whatever you prefer
- select is built into bs; using it here is just a personal choice
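If you would rather have the flat row format shown in the question instead of dictionaries, the loop body can be adapted like this (a sketch that assumes every data row has the same five-cell layout as above):

for tr in no_header:
    td = tr.select('td')
    if len(td) < 5:
        # skip rows that don't match the expected layout
        continue
    data.append([
        td[0].text,
        td[1].text,
        td[2].text,
        td[4].select('a')[0].attrs['href']
    ])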