I am trying to scrape the data table at http://en.wikipedia.org/wiki/Hybrid_electric_vehicles_in_the_United_States
using the following code:
#!/usr/bin/env python
from mechanize import Browser
from BeautifulSoup import BeautifulSoup
mech = Browser()
url = "http://en.wikipedia.org/wiki/Hybrid_electric_vehicles_in_the_United_States"
page = mech.open(url)
html = page.read()
soup = BeautifulSoup(html)
table = soup.find("table",{ "class" : "wikitable" })
for row in table.findAll('tr')[1:]:
col = row.findAll('th')
Vehicle = col[0].string
Year1 = col[2].string
Year2 = col[3].string
Year3 = col[4].string
Year4 = col[5].string
Year5 = col[6].string
Year6 = col[7].string
Year7 = col[8].string
Year8 = col[9].string
Year9 = col[10].string
Year10 = col[11].string
Year11 = col[12].string
Year12 = col[13].string
Year13 = col[14].string
Year14 = col[15].string
Year15 = col[16].string
Year16 = col[17].string
record =(Vehicle,Year1,Year2,Year3,Year4,Year5,Year6,Year7,Year8,Year9,Year10,Year11,Year12,Year13,Year14,Year15,Year16)
print "|".join(record)
I get this error:
File "scrap1.ph", line 13
col = row.findAll('th')
^
IndentationError: expected an indented block
Can anyone tell me what I am doing wrong?
Answer 0 (score: 2)
Apart from @traceur's point about the indentation error, here is how you can simplify the code significantly:
from mechanize import Browser
from bs4 import BeautifulSoup
mech = Browser()
url = "http://en.wikipedia.org/wiki/Hybrid_electric_vehicles_in_the_United_States"
soup = BeautifulSoup(mech.open(url))
table = soup.find("table", class_="wikitable")
for row in table('tr')[1:]:
    # The loop body must be indented; print(...) with one argument works on Python 2 and 3
    print("|".join(col.text.strip() for col in row.find_all('th')))
Note that rather than from BeautifulSoup import BeautifulSoup (BeautifulSoup version 3), you should use from bs4 import BeautifulSoup (version 4), because version 3 is no longer maintained.
Also note that you can pass mech.open(url) directly to the BeautifulSoup constructor instead of reading it manually.
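To illustrate that last point without hitting the network, here is a minimal Python 3 sketch. It assumes a small made-up table (SAMPLE_HTML) in place of the Wikipedia page and a hypothetical helper, table_rows; neither comes from the original code. It also names the "html.parser" parser explicitly and accepts both th and td cells, which is an assumption about how the real table is marked up.

```python
import io

from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia table; the real page is not fetched here.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Vehicle</th><th>2011</th><th>2012</th></tr>
  <tr><th>Model A</th><th>100</th><th>150</th></tr>
  <tr><th>Model B</th><th>20</th><th>35</th></tr>
</table>
"""


def table_rows(markup):
    """Return each data row of the first 'wikitable' as a pipe-joined string.

    `markup` may be a string or an open file-like object; BeautifulSoup
    reads file-like objects itself, so no explicit .read() is needed.
    """
    soup = BeautifulSoup(markup, "html.parser")
    table = soup.find("table", class_="wikitable")
    rows = []
    for row in table.find_all("tr")[1:]:  # skip the header row
        cells = row.find_all(["th", "td"])  # assumption: cells may be th or td
        rows.append("|".join(cell.get_text(strip=True) for cell in cells))
    return rows


# Same result whether we pass the text or a file-like object wrapping it.
rows_from_string = table_rows(SAMPLE_HTML)
rows_from_file = table_rows(io.StringIO(SAMPLE_HTML))

for line in rows_from_file:
    print(line)
```

The response object returned by mech.open(url) is file-like in the same sense as the io.StringIO object here, which is why it can be handed to the constructor as is.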
Hope that helps.
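As for the indentation error itself, the short version is that every statement belonging to the for loop has to be indented under it. The sketch below (Python 3, standard library only) checks this with the built-in compile(); the BROKEN and FIXED snippets and the compiles helper are made up for illustration and are not the code from the question.

```python
# Two tiny programs: the same loop, without and with an indented body.
BROKEN = "for n in range(3):\nprint(n)\n"
FIXED = "for n in range(3):\n    print(n)\n"


def compiles(source):
    """Return True if `source` compiles, False if it raises IndentationError."""
    try:
        compile(source, "<example>", "exec")
    except IndentationError:
        return False
    return True


broken_ok = compiles(BROKEN)  # loop body not indented
fixed_ok = compiles(FIXED)    # loop body indented by four spaces

print("broken compiles:", broken_ok)
print("fixed compiles:", fixed_ok)
```

In the question's script the same rule applies to every line from col = row.findAll('th') down to the final print: they all need to sit one level inside the for statement.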