Question

I am trying to scrape the data table from http://en.wikipedia.org/wiki/Hybrid_electric_vehicles_in_the_United_States

I used the following code:

#!/usr/bin/env python
from mechanize import Browser
from BeautifulSoup import BeautifulSoup

mech = Browser()
url = "http://en.wikipedia.org/wiki/Hybrid_electric_vehicles_in_the_United_States"
page = mech.open(url)
html = page.read()
soup = BeautifulSoup(html)
table = soup.find("table",{ "class" : "wikitable" })

for row in table.findAll('tr')[1:]:
col = row.findAll('th')
Vehicle = col[0].string
Year1 = col[2].string
Year2 = col[3].string
Year3 = col[4].string
Year4 = col[5].string
Year5 = col[6].string
Year6 = col[7].string
Year7 = col[8].string
Year8 = col[9].string
Year9 = col[10].string
Year10 = col[11].string
Year11 = col[12].string
Year12 = col[13].string
Year13 = col[14].string
Year14 = col[15].string
Year15 = col[16].string
Year16 = col[17].string
record =(Vehicle,Year1,Year2,Year3,Year4,Year5,Year6,Year7,Year8,Year9,Year10,Year11,Year12,Year13,Year14,Year15,Year16)
print "|".join(record)

I get this error:

 File "scrap1.ph", line 13
    col = row.findAll('th')
      ^
IndentationError: expected an indented block

Can anyone tell me what I am doing wrong?

Answer 1

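In addition to @traceur's point about the indentation error (the body of the for loop must be indented), here is how you can significantly simplify the code: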

from mechanize import Browser
from bs4 import BeautifulSoup

mech = Browser()
url = "http://en.wikipedia.org/wiki/Hybrid_electric_vehicles_in_the_United_States"
soup = BeautifulSoup(mech.open(url))
table = soup.find("table", class_="wikitable")

for row in table('tr')[1:]:
    print "|".join(col.text.strip() for col in row.find_all('th'))

Note that instead of from BeautifulSoup import BeautifulSoup (version 3 of BeautifulSoup), you are better off using from bs4 import BeautifulSoup (version 4), since version 3 is no longer maintained.

Also note that you can pass mech.open(url) directly to the BeautifulSoup constructor instead of reading it manually.
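
To try both points without hitting the network, here is a minimal sketch, assuming Python 3 with bs4 installed. The inline SAMPLE_HTML table, its made-up values, and the helper name table_rows are hypothetical stand-ins rather than anything from the Wikipedia page, and an io.StringIO object plays the role of the response returned by mech.open(url). It also collects td cells as well as th, on the assumption that the data cells may use either tag.

```python
import io

from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the values are made up.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Vehicle</th><th>2012</th><th>2011</th></tr>
  <tr><th>Model A</th><th>10</th><th>20</th></tr>
  <tr><th>Model B</th><th>30</th><th>40</th></tr>
</table>
"""


def table_rows(markup):
    """Return each data row of the first wikitable as a '|'-joined string.

    `markup` may be a string or an open file-like object.
    """
    soup = BeautifulSoup(markup, "html.parser")
    table = soup.find("table", class_="wikitable")
    lines = []
    for row in table.find_all("tr")[1:]:  # skip the header row
        # Everything belonging to the loop body is indented under `for`;
        # leaving it unindented raises IndentationError.
        cells = row.find_all(["th", "td"])
        lines.append("|".join(cell.get_text(strip=True) for cell in cells))
    return lines


# A file-like object is passed straight to the constructor, no .read() needed.
page = io.StringIO(SAMPLE_HTML)
for line in table_rows(page):
    print(line)
```

Collecting the cells in a list and joining them avoids the seventeen separate Year variables, so the code keeps working if the table gains or loses a column.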

Hope that helps.
