Problem scraping specific elements from a page using BeautifulSoup in Python

Date: 2017-11-14 06:23:29

Tags: python beautifulsoup

I am new to Python and am learning to scrape HTML with the BeautifulSoup library.

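I need to get the date field value as both the day and the date, and the precipitation field value together with its unit of measurement.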

Python code



dates=[]
Precip=[]
for row in right_table.findAll("tr"):
    cells = row.findAll('td')
    th_cells=row.findAll('th') #To store second column data
    if len(cells)==5:
        Precip.append(cells[1].find(text=True))
        dates.append(th_cells[0].find(text=True))
print(dates)
print(Precip)




Code output

['Wed ', 'Thu ', 'Fri ', 'Sat ', 'Sun ', 'Mon ', 'Tue ', 'Wed ', 'Thu ', 'Fri ', 'Sat ', 'Sun ', 'Mon ', 'Tue ', 'Wed ', 'Thu ', 'Fri ', 'Sat ', 'Sun ', 'Mon ', 'Tue ', 'Wed ', 'Thu ', 'Fri ', 'Sat ', 'Sun ', 'Mon ', 'Tue ', 'Wed ', 'Thu ']
['0 ', '0 ', '0 ', '1 ', '3 ', '3 ', '13 ', '0 ', '0 ', '0 ', '0 ', '0 ', '\xa0', '1 ', '3 ', '0 ', '1 ', '4 ', '2 ', '9 ', '2 ', '0 ', '1 ', '0 ', '0 ', '0 ', '0 ', '0 ', '1 ', '2 ']

Desired output

['Wed 11/1','Thur 11/2'.......]

['0mm','0mm'....]

Below is the HTML I am trying to parse.

HTML



<class 'list'>: ['\n', <thead>
<tr>
<th>Date</th>
<th>Hi/Lo</th>
<th>Precip</th>
<th>Snow</th>
<th>Forecast</th>
<th>Avg. HI / LO</th>
</tr>
</thead>, '\n', <tbody>
<tr class="pre">
<th scope="row">Wed <time>11/1</time></th>
<td>25°/20°</td>
<td>0 <span class="small">mm</span></td>
<td>0 <span class="small">CM</span></td>
<td> </td>
<td>28°/18°</td>
</tr>
<tr class="pre">
<th scope="row">Thu <time>11/2</time></th>
<td>28°/19°</td>
<td>0 <span class="small">mm</span></td>
<td>0 <span class="small">CM</span></td>
<td> </td>
<td>27°/18°</td>
</tr>

1 answer:

Answer 0: (score: 3)

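I used .text instead of .find(text=True). What is happening at the moment is that .find(text=True) returns only the first text node of the cell, so you never get the contents of child tags such as <time>.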
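
The difference can be seen on a single row, without any network access. The sketch below is only an illustration: the row is copied from the HTML in the question and wrapped in a `<table>` tag, and `string=True` is used as the newer spelling of `text=True` (Beautiful Soup 4.4.0 and later), which is assumed to behave the same way here.

```python
from bs4 import BeautifulSoup

# One row copied from the HTML in the question (wrapped in <table> for parsing).
html = (
    '<table><tr class="pre">'
    '<th scope="row">Wed <time>11/1</time></th>'
    '<td>25°/20°</td>'
    '<td>0 <span class="small">mm</span></td>'
    '</tr></table>'
)

soup = BeautifulSoup(html, "html.parser")
th = soup.find("th")
precip_td = soup.find_all("td")[1]

# find(string=True) returns only the first text node inside the tag.
first_date_string = th.find(string=True)
first_precip_string = precip_td.find(string=True)

# .text concatenates every text node, including those inside <time> and <span>.
full_date = th.text
full_precip = precip_td.text

print(repr(first_date_string), repr(full_date))
print(repr(first_precip_string), repr(full_precip))
```

The first value in each printed pair is the truncated string from the question's output; the second is the full cell text.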

from bs4 import BeautifulSoup
import requests

html = requests.get("https://www.accuweather.com/en/in/bengaluru/204108/month/204108?view=table").text
soup = BeautifulSoup(html, 'html.parser')

right_table = soup.find("tbody")
dates = []
Precip = []
for row in right_table.find_all("tr"):
    cells = row.find_all('td')
    th_cells = row.find_all('th')  # row header cell holding the day and the date
    if len(cells) == 5:
        # .text joins every text node in the cell, including those in child tags
        Precip.append(cells[1].text)
        dates.append(th_cells[0].text)
print(dates)
print(Precip)

This gives the correct output:

['Wed 11/1', 'Thu 11/2', 'Fri 11/3', 'Sat 11/4', 'Sun 11/5', 'Mon 11/6', 'Tue 11/7', 'Wed 11/8', 'Thu 11/9', 'Fri 11/10', 'Sat 11/11', 'Sun 11/12', 'Mon 11/13', 'Tue 11/14', 'Wed 11/15', 'Thu 11/16', 'Fri 11/17', 'Sat 11/18', 'Sun 11/19', 'Mon 11/20', 'Tue 11/21', 'Wed 11/22', 'Thu 11/23', 'Fri 11/24', 'Sat 11/25', 'Sun 11/26', 'Mon 11/27', 'Tue 11/28', 'Wed 11/29', 'Thu 11/30']
['0 mm', '0 mm', '0 mm', '1 mm', '3 mm', '3 mm', '13 mm', '0 mm', '0 mm', '0 mm', '0 mm', '0 mm', '\xa0', '1 mm', '3 mm', '0 mm', '1 mm', '4 mm', '2 mm', '9 mm', '2 mm', '0 mm', '1 mm', '0 mm', '0 mm', '0 mm', '0 mm', '0 mm', '1 mm', '2 mm']
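
The output above still differs slightly from the format requested in the question ('0mm' without a space), and one cell contains only a non-breaking space ('\xa0'). A minimal sketch of one way to tidy the values follows. The helper name `normalize_precip` is hypothetical, and mapping an empty cell to `None` is an assumption about what a missing reading should become.

```python
def normalize_precip(raw):
    """Remove all whitespace from a cell's text, e.g. '0 mm' -> '0mm'.

    str.split() with no arguments also splits on the non-breaking space
    '\xa0', so a cell holding only that character becomes an empty string,
    which is reported here as None (an assumption, not part of the answer).
    """
    compact = "".join(raw.split())
    return compact or None


# Sample values taken from the output above.
precip = ['0 mm', '13 mm', '\xa0', '1 mm']
cleaned = [normalize_precip(value) for value in precip]
print(cleaned)
```

Applied to the `Precip` list from the answer, this would be `[normalize_precip(p) for p in Precip]`.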