I am trying to extract a table from Wikipedia with the following code:
import urllib2
from bs4 import BeautifulSoup

file = open('belarus_wiki.txt', 'w')
url = "http://en.wikipedia.org/wiki/Visa_requirements_for_Belarusian_citizens"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page)

country = ""
visa = ""
notes = ""

table = soup.find("table", "sortable wikitable")
for row in table.findAll("tr"):
    cells = row.findAll("td")
    if len(cells) == 3:
        country = cells[0].findAll(text=True)
        visa = cells[1].findAll(text=True)
        notes = cells[2].find(text=True)
        print country[1].encode("utf-8"), visa[0].encode("utf-8"), notes[0].encode("utf-8")
        file.write(country[1].encode("utf-8") + ',' + visa[0].encode("utf-8") + '\n')
file.close()
But I get this error message:
Traceback (most recent call last):
  File "...\belarus_wiki.py", line 27, in <module>
    print country[1].encode("utf-8"), visa[0].encode("utf-8"), notes[0].encode("utf-8")
IndexError: list index out of range
How can I extract all of the text from these cells?
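Answer 0 (score: 3)
You can use:
for line in table.findAll('tr'):
    for l in line.findAll('td'):
        # Drop the footnote marker (<sup>) before reading the cell text
        if l.find('sup'):
            l.find('sup').extract()
        print l.getText(),'|',
    # End the output line after the last cell of the row
    print
Here is an excerpt of what it prints: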
Romania | Visa required | |
Russia | Freedom of movement | |
Rwanda | Visa required | Visa is obtained online. |
Saint Kitts and Nevis | Visa required | Visa obtainable online. |
Saint Lucia | Visa required | |
Saint Vincent and the Grenadines | Visa not required | 1 month |
Samoa | Visa on arrival !Entry Permit on arrival | 60 days |
San Marino | Visa required | |
São Tomé and Príncipe | Visa required | Visa is obtained online. |
Saudi Arabia | Visa required | |
Senegal | Visa required | |
Serbia | Visa not required | 30 days |
Seychelles | Visa on arrival !Visitor's Permit on arrival | 1 month |
Sierra Leone | Visa required | |
Singapore | Visa required | May obtain online. |
Slovakia | Visa required | |
Slovenia | Visa required | |
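As a self-contained illustration of the same idea, here is a minimal sketch that runs on an inline HTML snippet instead of the live page. The sample HTML and the helper name `table_rows` are made up for this example. It assumes Python 3 with bs4 and the built-in `html.parser`, and it uses `decompose()` and `get_text()` in place of the `extract()` and `getText()` calls above.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup that mimics the shape of the Wikipedia table.
HTML = """
<table class="sortable wikitable">
<tr><th>Country</th><th>Visa requirement</th><th>Notes</th></tr>
<tr><td><a href="#">Serbia</a></td><td>Visa not required<sup>[1]</sup></td><td>30 days</td></tr>
<tr><td><a href="#">Senegal</a></td><td>Visa required</td><td></td></tr>
</table>
"""


def table_rows(html):
    """Return the text of every three-cell row as a list of strings."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="wikitable")
    rows = []
    for tr in table.find_all("tr"):
        cells = tr.find_all("td")
        if len(cells) != 3:
            continue  # skip the header row and any irregular rows
        for cell in cells:
            # Remove footnote markers so they do not end up in the text
            for sup in cell.find_all("sup"):
                sup.decompose()
        # get_text() returns "" for an empty cell, so nothing is indexed blindly
        rows.append([cell.get_text(" ", strip=True) for cell in cells])
    return rows


rows = table_rows(HTML)
for row in rows:
    print(" | ".join(row))
```

Because `get_text()` always returns a string, an empty notes cell simply becomes an empty string and no list indexing is needed.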
Answer 1 (score: 0)
Wrong:
print country[1].encode("utf-8"), visa[0].encode("utf-8"), notes[0].encode("utf-8")
Correct:
if notes is None:
    print country[1].encode("utf-8"), visa[0].encode("utf-8")
else:
    print country[1].encode("utf-8"), visa[0].encode("utf-8"), notes.encode("utf-8")
Full code:
import urllib2
from bs4 import BeautifulSoup

file = open('belarus_wiki.txt', 'w')
url = "http://en.wikipedia.org/wiki/Visa_requirements_for_Belarusian_citizens"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page)

country = ""
visa = ""
notes = ""

table = soup.find("table", "sortable wikitable")
for row in table.findAll("tr"):
    cells = row.findAll("td")
    if len(cells) == 3:
        country = cells[0].findAll(text=True)
        visa = cells[1].findAll(text=True)
        # find(text=True) returns a single string, or None for an empty cell
        notes = cells[2].find(text=True)
        if notes is None:
            print country[1].encode("utf-8"), visa[0].encode("utf-8")
            file.write(country[1].encode("utf-8") + ',' + visa[0].encode("utf-8") + '\n')
        else:
            print country[1].encode("utf-8"), visa[0].encode("utf-8"), notes.encode("utf-8")
            file.write(country[1].encode("utf-8") + ',' + visa[0].encode("utf-8") + ',' + notes.encode("utf-8") + '\n')
# Close the output file once all rows are written
file.close()
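One thing to watch when joining fields with ',' by hand: a notes cell that itself contains a comma would split into extra columns. A possible alternative, sketched here for Python 3 and not part of the code above, is the standard `csv` module, which quotes such fields. The helper name `rows_to_csv` and the sample rows are invented for illustration.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize (country, visa, notes) tuples to CSV text; notes may be None."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for country, visa, notes in rows:
        # A missing notes cell becomes an empty field instead of raising
        writer.writerow([country, visa, notes or ""])
    return buffer.getvalue()


# Made-up sample rows; the last one has a comma inside the notes field.
sample = [
    ("Rwanda", "Visa required", "Visa is obtained online."),
    ("Serbia", "Visa not required", None),
    ("Samoa", "Visa on arrival", "60 days, extendable"),
]
print(rows_to_csv(sample))
```

To write to a real file instead of a string, the same `csv.writer` can wrap a file opened with `open('belarus_wiki.txt', 'w', newline='', encoding='utf-8')`.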
My environment:
OS X 10.10.1
Python 2.7.8
BeautifulSoup 4.1.3