List index out of range error when using soup.select('placeholder')[0].get_text() in Beautiful Soup

Time: 2015-09-11 19:09:51

Tags: python web-scraping beautifulsoup

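I'm new to scraping, and I'm trying to use Beautiful Soup to get the wheelbase value (and eventually other things) from a Wikipedia page. (I'll deal with robots.txt later.) This is the guide I've been using.

Two questions: 1.) How do I resolve the error below? 2.) How do I scrape the value in the cell that contains the wheelbase? Is it just "td #Targbase td"?

The error I get is: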

File "evscraper.py", line 25, in <module>
    wheelbase_data['Wheelbase'] = soup.select('div#Wheelbase h3')[0].get_text()
IndexError: list index out of range

Thanks for your help!

__author__ = 'KirkLazarus'
import re
import json
import gspread
from oauth2client.client import SignedJwtAssertionCredentials
import bs4
from bs4 import BeautifulSoup
import requests


response = requests.get('https://en.wikipedia.org/wiki/Tesla_Model_S')
soup = bs4.BeautifulSoup(response.text)



wheelbase_data['Wheelbase'] = soup.select('div#Wheelbase h3')[0].get_text()

print(wheelbase_data)

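2 Answers:

Answer 0: (score: 0)

Your first problem is your selector. There is no element with the ID "Wheelbase" on that page, so soup.select() returns an empty list, and indexing an empty list with [0] raises the IndexError.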
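To avoid the crash while you are still working out the right selector, you can check the result before indexing it. Here is a minimal sketch; `first_text` is a hypothetical helper name of mine, not part of Beautiful Soup, and it only assumes that each element has a `get_text()` method.

```python
def first_text(elements, default=None):
    """Return get_text() of the first element, or `default` if the list is empty.

    `elements` is expected to be the list returned by soup.select(...).
    """
    # soup.select() returns [] when nothing matches, so test before indexing.
    if not elements:
        return default
    return elements[0].get_text()
```

With this, first_text(soup.select('div#Wheelbase h3')) gives None instead of raising when the selector matches nothing.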

The following is by no means perfect, and it only gets you what you want because you already know the structure of the page:

import requests
from bs4 import BeautifulSoup

wheelbase_data = {}

response = requests.get('https://en.wikipedia.org/wiki/Tesla_Model_S')
# Name the parser explicitly so bs4 does not have to guess one.
soup = BeautifulSoup(response.text, 'html.parser')

# Find the first link that points to the Wheelbase article.
wheelbase = None
for link in soup.find_all('a'):
    if link.get('href') == "/wiki/Wheelbase":
        wheelbase = link
        break

# Go up two levels from the link to its table row, then take that row's first <td>.
if wheelbase is not None:
    wheelbase_data['Wheelbase'] = wheelbase.parent.parent.td.text
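Another way to get at the same cell is to match on the header text instead of the link. The sketch below runs against a small made-up HTML fragment (the markup and the values are invented for illustration and are not copied from Wikipedia), and it assumes the label sits in a `<th>` with the value in the `<td>` of the same row. `infobox_value` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the infobox; the real Wikipedia markup may differ.
SAMPLE_HTML = """
<table class="infobox hproduct">
  <tr><th>Example car</th></tr>
  <tr><th><a href="/wiki/Wheelbase">Wheelbase</a></th><td>1,000 mm</td></tr>
  <tr><th>Length</th><td>2,000 mm</td></tr>
</table>
"""


def infobox_value(soup, label):
    """Return the text of the <td> next to the <th> whose text equals `label`, else None."""
    for th in soup.find_all("th"):
        if th.get_text(strip=True) == label:
            # The value cell is assumed to be a following sibling of the header cell.
            td = th.find_next_sibling("td")
            if td is not None:
                return td.get_text(strip=True)
    return None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
wheelbase = infobox_value(soup, "Wheelbase")
missing = infobox_value(soup, "Curb weight")  # no such row in the sample
```

A label that is not present comes back as None, so there is nothing to index and no IndexError.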

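Answer 1: (score: 0)

It looks like you're looking in the wrong path. I've had to do something similar in the past. I'm not sure this is the best approach, but it certainly worked for me.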

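import pandas as pd
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen  # Python 3 replacement for urllib2

car_data = pd.DataFrame()

models = ['Tesla_Model_S', 'Tesla_Model_X']

for model in models:
    wiki = "https://en.wikipedia.org/wiki/{0}".format(model)
    # Send a browser-like User-Agent header with the request.
    header = {'User-Agent': 'Mozilla/5.0'}
    req = Request(wiki, headers=header)
    page = urlopen(req)
    soup = BeautifulSoup(page, 'html.parser')
    table = soup.find("table", {"class": "infobox hproduct"})

    # Skip the first two rows, then read each row as a th/td pair.
    for row in table.find_all("tr")[2:]:
        try:
            field = row.find_all("th")[0].text.strip()
            val = row.find_all("td")[0].text.strip()
            # DataFrame.set_value was removed in pandas 1.0; .loc adds the cell instead.
            car_data.loc[model, field] = val
        except IndexError:
            # Rows without both a <th> and a <td> are skipped.
            pass

car_data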
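The row loop can also be written with explicit checks instead of exception handling, which makes it clearer which rows are skipped. This is only a sketch on a made-up table (the markup and values are hypothetical, and `infobox_to_dict` is a name I picked); it collects the th/td pairs into a plain dict rather than a DataFrame.

```python
from bs4 import BeautifulSoup

# Hypothetical markup and values, for illustration only.
SAMPLE_HTML = """
<table class="infobox hproduct">
  <tr><th colspan="2">Example car</th></tr>
  <tr><td colspan="2">image placeholder</td></tr>
  <tr><th>Wheelbase</th><td>1,000 mm</td></tr>
  <tr><th colspan="2">Dimensions</th></tr>
  <tr><th>Length</th><td> 2,000 mm </td></tr>
</table>
"""


def infobox_to_dict(table):
    """Map each row's <th> text to its <td> text, skipping rows that lack either cell."""
    fields = {}
    for row in table.find_all("tr"):
        th = row.find("th")
        td = row.find("td")
        # Title, image, and section-header rows have only one kind of cell.
        if th is None or td is None:
            continue
        fields[th.get_text(strip=True)] = td.get_text(strip=True)
    return fields


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table", {"class": "infobox hproduct"})
fields = infobox_to_dict(table)
```

One dict per model can then be handed to pandas if you still want a table at the end.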