Question

I am trying to extract a table from Wikipedia using the following code:

import urllib2

from bs4 import BeautifulSoup

file = open('belarus_wiki.txt', 'w')

url = "http://en.wikipedia.org/wiki/Visa_requirements_for_Belarusian_citizens"
page = urllib2.urlopen(url)

soup = BeautifulSoup(page)

country = ""
visa = ""
notes = ""

table = soup.find("table", "sortable wikitable")
for row in table.findAll("tr"):
    cells = row.findAll("td")
    if len(cells) == 3:
        country = cells[0].findAll(text=True)
        visa = cells[1].findAll(text=True)
        notes = cells[2].find(text=True)

        print country[1].encode("utf-8"), visa[0].encode("utf-8"), notes[0].encode("utf-8")

        file.write(country[1].encode("utf-8") + ',' + visa[0].encode("utf-8") + '\n')

file.close()

But I get this error message:

Traceback (most recent call last):
  File "...\belarus_wiki.py", line 27, in <module>
    print country[1].encode("utf-8"), visa[0].encode("utf-8"), notes[0].encode("utf-8")
IndexError: list index out of range

How can I extract all the text from these cells?

Answer 1

You can use:

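for row in table.findAll('tr'):
    for cell in row.findAll('td'):
        # Remove the footnote marker (<sup> tag) before reading the cell text.
        if cell.find('sup'):
            cell.find('sup').extract()
        print cell.getText(), '|',
    print  # end the output line after each table row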

Here is an excerpt of what it prints:

 Romania | Visa required |  |
 Russia | Freedom of movement |  |
 Rwanda | Visa required | Visa is obtained online. |
 Saint Kitts and Nevis | Visa required | Visa obtainable online. |
 Saint Lucia | Visa required |  |
 Saint Vincent and the Grenadines | Visa not required | 1 month |
 Samoa | Visa on arrival !Entry Permit on arrival | 60 days |
 San Marino | Visa required |  |
 São Tomé and Príncipe | Visa required | Visa is obtained online. |
 Saudi Arabia | Visa required |  |
 Senegal | Visa required |  |
 Serbia | Visa not required | 30 days |
 Seychelles | Visa on arrival !Visitor's Permit on arrival | 1 month |
 Sierra Leone | Visa required |  |
 Singapore | Visa required | May obtain online. |
 Slovakia | Visa required |  |
 Slovenia | Visa required |  |
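
For readers on Python 3, the same idea could be sketched roughly as below. This is only an illustration under a few assumptions: the HTML string is made-up sample markup standing in for the Wikipedia table, the helper name rows_from_table is hypothetical, and the page is not actually fetched.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the Wikipedia table.
SAMPLE_HTML = """
<table class="sortable wikitable">
  <tr><th>Country</th><th>Visa requirement</th><th>Notes</th></tr>
  <tr><td><a href="#">Romania</a></td><td>Visa required<sup>[1]</sup></td><td></td></tr>
  <tr><td><a href="#">Serbia</a></td><td>Visa not required</td><td>30 days</td></tr>
</table>
"""


def rows_from_table(html):
    """Return one list of cell texts per data row of the first wikitable."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="wikitable")
    rows = []
    for tr in table.find_all("tr"):
        cells = tr.find_all("td")
        if not cells:
            continue  # header rows contain only <th> cells
        texts = []
        for td in cells:
            # Drop footnote markers such as [1] before reading the text.
            for sup in td.find_all("sup"):
                sup.extract()
            # Join every text node in the cell, trimming surrounding whitespace.
            texts.append(td.get_text(" ", strip=True))
        rows.append(texts)
    return rows


for row in rows_from_table(SAMPLE_HTML):
    print(" | ".join(row))
```

Because get_text collects every text node in a cell, an empty cell simply yields an empty string instead of requiring a fixed index.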

Answer 2

Wrong:

print country[1].encode("utf-8"), visa[0].encode("utf-8"), notes[0].encode("utf-8")

Correct:

if notes is None:
    print country[1].encode("utf-8"), visa[0].encode("utf-8")
else:
    print country[1].encode("utf-8"), visa[0].encode("utf-8"), notes.encode("utf-8")
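
The IndexError most likely comes from a fixed index such as country[1] or visa[0] when a cell holds fewer text nodes than expected. One way to avoid fixed indexes, sketched here with plain lists standing in for the results of findAll(text=True), is to join whatever fragments are present. The helper name join_fragments and the sample data are made up for illustration.

```python
def join_fragments(fragments):
    """Join the non-empty text fragments of one cell into a single string."""
    cleaned = (fragment.strip() for fragment in fragments)
    return " ".join(piece for piece in cleaned if piece)


# Plain lists standing in for cell.findAll(text=True) results (made-up data).
country = ["\n", "Samoa"]   # whitespace before the link text
visa = []                   # an empty cell yields no text nodes at all
notes = ["60 days"]

# No fixed index is used, so a short or empty list cannot raise IndexError.
print(join_fragments(country))
print(repr(join_fragments(visa)))
print(join_fragments(notes))
```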

Full code:

import urllib2

from bs4 import BeautifulSoup

file = open('belarus_wiki.txt', 'w')

url = "http://en.wikipedia.org/wiki/Visa_requirements_for_Belarusian_citizens"
page = urllib2.urlopen(url)

soup = BeautifulSoup(page)

country = ""
visa = ""
notes = ""

table = soup.find("table", "sortable wikitable")
for row in table.findAll("tr"):
    cells = row.findAll("td")
    if len(cells) == 3:
        country = cells[0].findAll(text=True)
        visa = cells[1].findAll(text=True)
        notes = cells[2].find(text=True)
        if notes is None:
            print country[1].encode("utf-8"), visa[0].encode("utf-8")
            file.write(country[1].encode("utf-8") + ',' + visa[0].encode("utf-8") + '\n')
        else:
            print country[1].encode("utf-8"), visa[0].encode("utf-8"), notes.encode("utf-8")
            file.write(country[1].encode("utf-8") + ',' + visa[0].encode("utf-8") + ',' + notes.encode("utf-8") + '\n')

file.close()

My environment:
OS X 10.10.1
Python 2.7.8
BeautifulSoup 4.1.3
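
One caveat with joining fields by ',': a note that itself contains a comma would make the output line ambiguous. A minimal Python 3 sketch using the standard csv module is shown below. The rows are made-up data in the (country, visa, notes) shape used above, and an in-memory buffer stands in for the output file.

```python
import csv
import io

# Made-up rows in the (country, visa, notes) shape used above.
rows = [
    ("Rwanda", "Visa required", "Visa is obtained online."),
    ("Samoa", "Visa on arrival", "60 days, extendable"),
    ("Romania", "Visa required", None),
]

# Stands in for open('belarus_wiki.txt', 'w', newline='', encoding='utf-8').
buffer = io.StringIO()
writer = csv.writer(buffer)
for country, visa, notes in rows:
    # Replace a missing note with an empty field so every row has three columns.
    writer.writerow([country, visa, notes or ""])

print(buffer.getvalue())
```

The writer quotes any field that contains a comma, so the file can be read back into three columns per row.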

Python, BeautifulSoup: extracting text from table cells
