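I'm trying to parse a website with mechanize and BeautifulSoup, without any luck. I know the site's table is reachable because I can read and print the whole page (the user-agent setup is omitted here).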
html = page.read()
soup = BeautifulSoup(html)
table = soup.find("table", id="table-hover")
for row in table.findAll('tr')[1:]:
    col = row.findAll('th')
    time = col[0].string
    ais_source = col[1].string
    speed_km = col[2].string
    lat = col[3].string
    lon = col[4].string
    course = col[5].string
    record = ( time, ais_source, speed_km, lat, lon, course )
    print "|".join(record)
When I run this code, I get the error "'NoneType' object has no attribute 'findAll'". I can't find a unique identifier for the table on that page.
Answer 0 (score: 1)
You need to supply a user agent:
import requests
url = "http://www.marinetraffic.com/en/ais/index/positions/all/shipid:415660/mmsi:354975000/shipname:ADESSA%20OCEAN%20KING/_:6012a2741fdfd2213679de8a23ab60d3"
headers = {'User-agent': 'Mozilla/5.0'}
html = requests.get(url,headers=headers).content
soup = BeautifulSoup(html)
table = soup.find("table") # only one table
Then just unpack each row's items with:
for row in table.findAll('tr')[1:]:
    items = row.text.replace(u"kn","") # remove kn so items line up when unpacking
    time, ais_source, speed_km, lat, lon, course = items.split()[1:7]
    print(time,ais_source,speed_km,lat,lon,course)
(u'21:40', u'T-AIS', u'0', u'6.422732', u'3.406325', u'327')
(u'21:17', u'T-AIS', u'0.1', u'6.42272', u'3.406313', u'311')
(u'20:53', u'T-AIS', u'0', u'6.422688', u'3.406312', u'321')
(u'20:30', u'T-AIS', u'0', u'6.422668', u'3.4063', u'324')
(u'20:07', u'T-AIS', u'0.1', u'6.42266', u'3.406287', u'323')
(u'19:44', u'T-AIS', u'0', u'6.422685', u'3.406273', u'320')
(u'19:20', u'T-AIS', u'0.1', u'6.422687', u'3.406297', u'316')
(u'18:57', u'T-AIS', u'0.1', u'6.422675', u'3.406292', u'308')
(u'18:34', u'T-AIS', u'0.1', u'6.422658', u'3.406327', u'312')
(u'18:10', u'T-AIS', u'0.1', u'6.422723', u'3.406318', u'317')
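To see why the replace-then-split step lines the fields up, here is a minimal sketch that runs on a single made-up row string rather than the live page. The sample text, including the leading date token that the [1:7] slice skips, is an assumption about what row.text looks like, and unpack_row is a hypothetical helper name.

```python
def unpack_row(row_text):
    """Split one row's text into the six fields used in the loop."""
    # Drop the "kn" speed unit so every remaining token is one field.
    items = row_text.replace(u"kn", u"")
    # Skip the first token (assumed to be a date) and take the next six.
    time, ais_source, speed, lat, lon, course = items.split()[1:7]
    return time, ais_source, speed, lat, lon, course


# Hypothetical row text; the real page layout may differ.
sample = u"2015-06-20 21:40 T-AIS 0 kn 6.422732 3.406325 327"
print(unpack_row(sample))
```

Note that replace removes "kn" wherever it occurs, so a field that happens to contain those letters would be altered too; matching on the cell elements instead of the flattened text would avoid that.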
Without the user agent you get a 403 error:
<html><body><h1>403 Forbidden</h1>
Request forbidden by administrative rules.
</body></html>
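As for the "'NoneType' object has no attribute 'findAll'" error itself: find returns None when nothing matches, which happens both on a 403 page (no table at all) and when the id filter does not match. A minimal sketch, using made-up markup and assuming that "table-hover" is a CSS class rather than an id (I have not confirmed that against the real page); data_rows is a hypothetical helper:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: assumes "table-hover" is a class, not an id.
SAMPLE_HTML = """
<table class="table table-hover">
  <tr><th>Time</th><th>Speed</th></tr>
  <tr><td>21:40</td><td>0</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

by_id = soup.find("table", id="table-hover")         # no element has this id
by_class = soup.find("table", class_="table-hover")  # matches one of the classes


def data_rows(table):
    """Return the cell texts of each row after the header, or [] if table is None."""
    if table is None:
        return []
    return [[td.get_text(strip=True) for td in tr.find_all("td")]
            for tr in table.find_all("tr")[1:]]


print(by_id)
print(data_rows(by_class))
```

Checking for None before calling findAll on the result gives a clearer failure than the AttributeError, and makes it easier to tell a blocked request from a selector that simply does not match.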