Extracting scraped HTML data with BeautifulSoup

Date: 2018-12-23 12:57:13

Tags: python web-scraping beautifulsoup

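I am using BeautifulSoup and Python to scrape data from this website. Using find_all, I have already extracted the table rows from the table. Now I want to extract the table data (the td cells) for each row and add each row of data as one row in an array. For example: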

[12:55, Beverello, Ischia, http://alilauronew.forth-crs.gr/italian_b2c/npgres.exe?PM=BM;...., ....]

Any help with this would be appreciated. Here is a sample of the extracted data:

   [<tr class="grey1">
        <td class="orario">12:55</td>
        <td class="aL">Beverello</td>
        <td class="aL">Ischia</td>
        <td>
            <div class="circle green">
               <span class="tt-container">
                   <span class="tt-arrow"></span>
                   <span class="tt-text">corsa regolare</span>
               </span>
            </div>
        </td>
       <td><a href="http://alilauronew.forth-crs.gr/italian_b2c     
            /npgres.exe?PM=BM" target="_blank"><img src="/templates 
            /frontend/images/carrello.png"/></a></td>
 </tr>
   [<tr class="grey1">
        <td class="orario">14:45</td>
        <td class="aL">Ischia</td>
        <td class="aL">Beverello</td>
        <td>
            <div class="circle green">
               <span class="tt-container">
                   <span class="tt-arrow"></span>
                   <span class="tt-text">corsa regolare</span>
               </span>
            </div>
        </td>
       <td><a href="http://alilauronew.forth-crs.gr/italian_b2c     
            /npgres.exe?PM=BM" target="_blank"><img src="/templates 
            /frontend/images/carrello.png"/></a></td>
 </tr>

2 answers:

Answer 0 (score: 2)

from bs4 import BeautifulSoup
html="""
<tr class="grey1">
        <td class="orario">12:55</td>
        <td class="aL">Beverello</td>
        <td class="aL">Ischia</td>
        <td>
            <div class="circle green">
               <span class="tt-container">
                   <span class="tt-arrow"></span>
                   <span class="tt-text">corsa regolare</span>
               </span>
            </div>
        </td>
       <td><a href="http://alilauronew.forth-crs.gr/italian_b2c
            /npgres.exe?PM=BM" target="_blank"><img src="/templates
            /frontend/images/carrello.png"/></a></td>
 </tr>
"""
soup = BeautifulSoup(html, "lxml")
# Collect the stripped text of every <td> cell; find_all is the current name for findAll
row = [td.text.strip() for td in soup.find_all('td')]
print(row)

Output:

['12:55', 'Beverello', 'Ischia', 'corsa regolare', '']
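
The last element above is an empty string because the final cell holds a link rather than text, while the question asks for the URL in that position. A minimal sketch of one way to pick up the href instead follows. The helper names cell_value and row_to_list are hypothetical, the status cell is simplified, and the URL is written on a single line, which is an assumption about how it appears in the real page.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample row from the question (status cell simplified,
# href joined onto one line as an assumption).
SAMPLE_ROW = """
<table><tr class="grey1">
  <td class="orario">12:55</td>
  <td class="aL">Beverello</td>
  <td class="aL">Ischia</td>
  <td><span class="tt-text">corsa regolare</span></td>
  <td><a href="http://alilauronew.forth-crs.gr/italian_b2c/npgres.exe?PM=BM" target="_blank"><img src="/templates/frontend/images/carrello.png"/></a></td>
</tr></table>
"""


def cell_value(td):
    """Return the link target if the cell holds an <a href>, else its stripped text."""
    link = td.find("a", href=True)
    if link is not None:
        return link["href"].strip()
    return td.get_text(strip=True)


def row_to_list(tr):
    """Hypothetical helper: turn one <tr> into a flat list of cell values."""
    return [cell_value(td) for td in tr.find_all("td")]


soup = BeautifulSoup(SAMPLE_ROW, "html.parser")
row = row_to_list(soup.find("tr"))
print(row)
```

Calling row_to_list on each row returned by find_all('tr') would give one list per table row, which is the array-of-rows shape the question describes.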

Answer 1 (score: 1)

Here is a fully working example. The data list will contain everything you want, without noise (empty strings and so on).

import requests
from bs4 import BeautifulSoup

response = requests.get('http://www.alilauro.it').text
# Name the parser explicitly; otherwise bs4 picks one itself and emits a warning
bs = BeautifulSoup(response, 'html.parser')

data = []
# I don't want to scrape the header row, so I slice the list, omitting the first element
no_header = bs.select('#partenze tr')[1:]
for tr in no_header:
    td = tr.select('td')
    data.append({
        'ORA':td[0].text,
        'PARTENZA DA':td[1].text,
        'ARRIVO A':td[2].text,
        'ACQUISTA':td[4].select('a')[0].attrs['href']
    })

print(data)

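Notes:

  • I use the requests library to make the HTTP request; you can use whatever you prefer
  • I use CSS selectors through bs4's built-in select; that is just a personal preference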
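
To illustrate the second note, here is a rough offline sketch showing that select and find/find_all reach the same rows. It also guards against a row with no purchase link, where td[4].select('a')[0] would raise an IndexError. The PAGE string is a hypothetical stand-in for the live site: the id partenze and three of the header names come from the answer, while the STATO header, the link text, and the link-less second row are assumptions made for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the live page: a table with id="partenze",
# one header row and two data rows (the second one has no purchase link).
PAGE = """
<table id="partenze">
  <tr><th>ORA</th><th>PARTENZA DA</th><th>ARRIVO A</th><th>STATO</th><th>ACQUISTA</th></tr>
  <tr class="grey1">
    <td class="orario">12:55</td><td class="aL">Beverello</td><td class="aL">Ischia</td>
    <td><span class="tt-text">corsa regolare</span></td>
    <td><a href="http://alilauronew.forth-crs.gr/italian_b2c/npgres.exe?PM=BM">buy</a></td>
  </tr>
  <tr class="grey1">
    <td class="orario">14:45</td><td class="aL">Ischia</td><td class="aL">Beverello</td>
    <td><span class="tt-text">corsa regolare</span></td>
    <td></td>
  </tr>
</table>
"""

soup = BeautifulSoup(PAGE, "html.parser")

# CSS-selector form, as in the answer above.
rows_css = soup.select("#partenze tr")[1:]
# Equivalent find/find_all form.
rows_find = soup.find(id="partenze").find_all("tr")[1:]
assert rows_css == rows_find

data = []
for tr in rows_find:
    cells = tr.find_all("td")
    if len(cells) < 5:
        continue  # skip rows that do not have the expected five cells
    link = cells[4].find("a", href=True)
    data.append([
        cells[0].get_text(strip=True),
        cells[1].get_text(strip=True),
        cells[2].get_text(strip=True),
        link["href"] if link else None,  # None when the cell has no link
    ])

print(data)
```

Each entry in data is a plain list here rather than a dict, which matches the array-of-rows shape asked for in the question; switching back to dicts keyed by column name is a one-line change.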