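I am scraping a website to get some data. My code mostly works: it finds the specific table and rows I want, then selects the cells and puts them into a dict. My problem is selecting the last cell in a row.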
import urllib
import urllib.request
from bs4 import BeautifulSoup
import re
import os
import pandas as pd

theurl = "http://www.nationsonline.org/oneworld/IATA_Codes/airport_code_list.htm"
thepage = urllib.request.urlopen(theurl)
soup = BeautifulSoup(thepage, "html.parser")
air = []
init_data = open('/Users/paribaker/Desktop/air.txt', 'a')
count = 0
while count < 73:
    title = soup.find_all('table', {'class': 'tb86'})[count]
    rows = title.findAll('tr')[1:]
    data = {
        'city': [],
        'country': [],
        'code': []
    }
    for row in rows:
        col1 = row.find_all('td')[0]
        col2 = row.find_all('td')[1]
        col3 = row.find_all('td')[2]
        print(col1.text)
        print(col2.text)
        print(col3.text)
        #col3 = row.find_all('td')[1]
        #data['city'].append( col1.get_text())
        #data['country'].append( col2)
        #data['code'].append( col3)
        #dogData = pd.DataFrame(data)
        #dogData.to_csv("dog.csv")
    count += 3
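I get an error saying that td[2] is out of range. When I inspect the selector for that td, it says it is the third one, so I assumed I should use [2]. Any ideas?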
Answer 0 (score: 0)
A few debugging statements show that some rows contain only two <td> cells. In fact, this is already the case for the first row in rows:
for i, row in enumerate(rows):
    print("Row {}:\n".format(i))
    for j, td in enumerate(row.find_all('td')):
        print(" Cell {}:\n{}".format(j, td))
    try:
        col3 = row.find_all('td')[2]
    except IndexError as e:
        print("ERROR on Row {}: {}".format(i, e))
        break
Output:
Row 0:
Cell 0:
<td style="width:730px;"><script async="" src="http://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script><!-- Top-Banner 728x90, Erstellt 25.12.09 --><ins class="adsbygoogle" data-ad-client="ca-pub-7193398479241689" data-ad-slot="6570665833" style="display:inline-block;width:728px;height:90px"></ins><script>(adsbygoogle = window.adsbygoogle || []).push({});</script></td>
Cell 1:
<td class="logotd"><a href="/oneworld/first.shtml"><img alt="Nations Online Logo" class="displayed" height="60" src="/buttons/OWNO_logo06-60.png" width="60"/> </a><br><b>One World<br>Nations Online</br></b></br></td>
ERROR on Row 0: list index out of range
Perhaps there are some <td> elements on the page that you want to skip?
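One way to skip such rows is to check the number of cells before indexing into them. The sketch below is only an illustration: SAMPLE_HTML is made-up markup standing in for the real page, and extract_rows is a hypothetical helper name, not something from the original code.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the real page:
# a header row, a two-cell layout row, and one three-cell data row.
SAMPLE_HTML = """
<table class="tb86">
  <tr><td>header</td></tr>
  <tr><td>ad banner</td><td>logo</td></tr>
  <tr><td>Aarhus</td><td>Denmark</td><td>AAR</td></tr>
</table>
"""


def extract_rows(html):
    """Return (city, country, code) tuples, skipping rows without exactly three cells."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"class": "tb86"})
    results = []
    # Skip the first row, as the question's code does with [1:].
    for row in table.find_all("tr")[1:]:
        cells = row.find_all("td")
        if len(cells) != 3:
            # Layout or ad rows do not have three cells; indexing [2] would fail here.
            continue
        results.append(tuple(td.get_text(strip=True) for td in cells))
    return results


print(extract_rows(SAMPLE_HTML))
```

Checking len(cells) first avoids the IndexError entirely, so no try/except is needed around the indexing.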
Update:
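Here is one way to narrow down the scraped output. It looks like the cells you are interested in are all members of the class border1, so you can keep only the rows that contain three cells with this class: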
for row in rows:
    target_row = row.find_all('td', class_="border1")
    if len(target_row) == 3:
        city, country, code = [td.text for td in target_row]
        print("City: {}, Country: {}, Code: {}".format(city, country, code))
Output:
City: Aarhus, Country: Denmark, Code: AAR
City: Abadan, Country: Iran, Code: ABD
City: Abeche, Country: Chad, Code: AEH
...
City: Zinder, Country: Niger, Code: ZND
City: Zouerate, Country: Mauritania, Code: OUZ
City: Zurich (Zürich) - Kloten, Country: Switzerland, Code: ZRH
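Since the question wanted the values collected into a dict and written to a file, here is a rough sketch of that last step. It assumes the (city, country, code) triples have already been extracted as above; the three sample records are taken from the output shown. The helper names to_columns and columns_to_csv are hypothetical, and the standard-library csv module is used here in place of pandas only to keep the example free of dependencies.

```python
import csv
import io


def to_columns(records):
    """Collect (city, country, code) tuples into a dict of column lists."""
    data = {"city": [], "country": [], "code": []}
    for city, country, code in records:
        data["city"].append(city)
        data["country"].append(country)
        data["code"].append(code)
    return data


def columns_to_csv(data):
    """Render the column dict as CSV text, one line per airport."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(list(data))             # header row: city,country,code
    writer.writerows(zip(*data.values()))   # transpose the columns back into rows
    return buffer.getvalue()


# Sample records copied from the output above.
records = [
    ("Aarhus", "Denmark", "AAR"),
    ("Abadan", "Iran", "ABD"),
    ("Abeche", "Chad", "AEH"),
]
data = to_columns(records)
print(columns_to_csv(data))
```

The same dict could also be passed to pd.DataFrame(data) and saved with to_csv, as in the commented-out lines of the question, as long as the dict is created once before the loop over tables rather than reset on every iteration.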