In a table

Time: 2017-05-21 23:49:15

Tags: python-3.x

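I am scraping a website to get some data. My code mostly works: it finds the specific tables and rows I want, then selects the cells and puts them into a dict. My problem is selecting the last cell in a row.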

import urllib
import urllib.request
from bs4 import BeautifulSoup
import re
import os
import pandas as pd

theurl = "http://www.nationsonline.org/oneworld/IATA_Codes/airport_code_list.htm"
thepage = urllib
thepage = urllib.request.urlopen(theurl)
soup=BeautifulSoup(thepage, "html.parser")
air=[]
init_data = open('/Users/paribaker/Desktop/air.txt', 'a')
count = 0
while count <73:
    title = soup.find_all('table',{'class':'tb86'})[(count)]
    rows = title.findAll('tr')[1:]
    data = {
        'city' : [],
        'country' : [],
        'code' :[]

        }
    for row in rows:
        col1 = row.find_all('td')[0]
        col2 = row.find_all('td')[1]
        col3 = row.find_all('td')[2]
        print (col1.text)
        print(col2.text)
        print(col3.text)
        #col3 = row.find_all('td')[1]
        #data['city'].append( col1.get_text())
        #data['country'].append( col2)
        #data['code'].append( col3)
        #dogData = pd.DataFrame(data)
        #dogData.to_csv("dog.csv")
    count += 3

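I get an error saying that td[2] is out of range. When I inspect the td in the selector, it says it is the 3rd one, so I would expect to use [2]. Any ideas?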

1 answer:

Answer 0 (score: 0)

A few debugging statements show that some rows contain only two <td> cells. In fact, that is already the case for the first row in rows:

for i, row in enumerate(rows):
    print("Row {}:\n".format(i))
    for j, td in enumerate(row.find_all('td')):
        print(" Cell {}:\n{}".format(j, td))
    try:
        col3 = row.find_all('td')[2]
    except IndexError as e:
        print("ERROR on Row {}: {}".format(i, e))
        break

Output:

Row 0:

 Cell 0:
<td style="width:730px;"><script async="" src="http://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script><!-- Top-Banner 728x90, Erstellt 25.12.09 --><ins class="adsbygoogle" data-ad-client="ca-pub-7193398479241689" data-ad-slot="6570665833" style="display:inline-block;width:728px;height:90px"></ins><script>(adsbygoogle = window.adsbygoogle || []).push({});</script></td>

 Cell 1:
<td class="logotd"><a href="/oneworld/first.shtml"><img alt="Nations Online Logo" class="displayed" height="60" src="/buttons/OWNO_logo06-60.png" width="60"/>    </a><br><b>One World<br>Nations Online</br></b></br></td>

ERROR on Row 0: list index out of range

Maybe there are some <td> elements on the page that you want to skip?
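
One way to skip such rows is to check the number of cells before indexing into them. The following is only a minimal sketch on a small made-up HTML snippet: the sample markup and the helper name three_cell_rows are assumptions for illustration, not taken from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the real page: the second row
# has only two <td> cells, like the banner/logo row in the debug output.
SAMPLE_HTML = """
<table class="tb86">
  <tr><td>header</td></tr>
  <tr><td>ad banner</td><td>logo</td></tr>
  <tr><td>Aarhus</td><td>Denmark</td><td>AAR</td></tr>
  <tr><td>Abadan</td><td>Iran</td><td>ABD</td></tr>
</table>
"""


def three_cell_rows(table):
    """Return the first three cell texts of every row with at least three <td> cells."""
    result = []
    for row in table.find_all('tr')[1:]:  # skip the first row, as in the question
        cells = row.find_all('td')
        if len(cells) < 3:
            # Not a data row (for example a banner or logo row), so skip it
            continue
        result.append([cell.get_text(strip=True) for cell in cells[:3]])
    return result


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find('table', {'class': 'tb86'})
for city, country, code in three_cell_rows(table):
    print(city, country, code)
```

Checking len(cells) first avoids the IndexError without hiding other problems the way a blanket try/except might.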

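Update:
Here is one way to narrow down the scraped output you get. It looks like the cells you are interested in are all members of the class border1. You can filter for rows that contain cells with this class: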

for row in rows:
    target_row = row.find_all('td', class_="border1")
    if len(target_row) == 3:
        city, country, code = [td.text for td in target_row]
        print("City: {}, Country: {}, Code: {}".format(city, country, code))

Output:

City: Aarhus, Country: Denmark, Code: AAR
City: Abadan, Country: Iran, Code: ABD
City: Abeche, Country: Chad, Code: AEH
...
City: Zinder, Country: Niger, Code: ZND
City: Zouerate, Country: Mauritania, Code: OUZ
City: Zurich (Zürich) - Kloten, Country: Switzerland, Code: ZRH
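
To get from the filtered rows to the dict from the question, the values could be appended instead of printed. This is a rough sketch, again on made-up sample markup: the HTML snippet and the collect_airports helper are assumptions for illustration, while the tb86 and border1 class names come from the question and the answer above.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real page has many more tables and rows.
SAMPLE_HTML = """
<table class="tb86">
  <tr><td>City</td><td>Country</td><td>Code</td></tr>
  <tr><td>ad banner</td><td class="logotd">logo</td></tr>
  <tr><td class="border1">Aarhus</td><td class="border1">Denmark</td><td class="border1">AAR</td></tr>
  <tr><td class="border1">Zinder</td><td class="border1">Niger</td><td class="border1">ZND</td></tr>
</table>
"""


def collect_airports(soup):
    """Collect city, country and code from every row with three border1 cells."""
    data = {'city': [], 'country': [], 'code': []}
    for table in soup.find_all('table', {'class': 'tb86'}):
        for row in table.find_all('tr'):
            cells = row.find_all('td', class_="border1")
            if len(cells) != 3:
                # Header, banner and other non-data rows are skipped
                continue
            city, country, code = [td.get_text(strip=True) for td in cells]
            data['city'].append(city)
            data['country'].append(country)
            data['code'].append(code)
    return data


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
data = collect_airports(soup)
print(data)
```

Because the three lists always have the same length, the result can be passed to pd.DataFrame(data) as in the commented-out lines of the question. If the tb86 tables on the real page turn out to be nested inside each other, rows may be visited more than once, so the output is worth checking for duplicates.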