Why isn't Beautiful Soup finding this element with multiple classes?

Asked: 2016-05-04 18:40:55

Tags: python web-scraping beautifulsoup element

I'm trying to use the query soup.find_all("li", {"class" : "first_result"})

to select an element that looks like <li class="result first_result">.

The element is definitely on the page, but it doesn't show up when I run my script. For the record, I also tried soup.find_all("li", {"class" : "result first_result"}), and still nothing.

What am I doing wrong?

EDIT: At alecxe's request, I've posted the code I have so far. I'm running 64-bit Windows 7 with Python 3.4, in case you're convinced that's the culprit. The specific part I'm asking about is at the very bottom, under ###METACRITIC STUFF###:

from bs4 import BeautifulSoup
from urllib3 import poolmanager
import csv
import requests
import sys
import os
import codecs
import re
import html5lib
import math
import time
from random import randint

connectBuilder = poolmanager.PoolManager()

inputstring = sys.argv[1]   #argv string MUST use double quotes
inputarray = re.split('\s+', inputstring)

##########################KAT STUFF########################

katstring = ""
for item in inputarray:
    katstring += (item + "+")
katstring = katstring[:-1]
#kataddress = "https://kat.cr/usearch/?q=" + katstring   #ALL kat
kataddress = "https://kat.cr/usearch/" + inputstring + " category:tv/?field=seeders&sorder=desc"   #JUST TV kat
#print(kataddress)

numSeedsArray = []
numLeechArray = []

r = requests.get(kataddress)
soup = BeautifulSoup(r.content, "html5lib")
totalpages = [h2.find('span') for h2 in soup.findAll('h2')][0].text   #get a string that looks like 'house of cards results 1-25 from 178'
totalpages = int(totalpages[-4:])   #slice off everything but the total # of pages
totalpages = math.floor(totalpages/25)
#print("totalpages= "+str(totalpages))

iteration = 0
savedpage = ""

def getdata(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html5lib")
    global numSeedsArray
    global numLeechArray
    tds = soup.findAll("td", { "class" : "green center" })
    numSeedsArray += [int(td.text) for td in tds]
    tds = soup.findAll("td", { "class" : "red lasttd center"})
    numLeechArray += [int(td.text) for td in tds]
    #print(numSeedsArray)

def getnextpage(url):
    global iteration
    global savedpage
    #print("url examined= "+url)
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html5lib")
    nextpagelinks = soup.findAll("a", { "class" : "turnoverButton siteButton bigButton" })
    nextpagelinks = [link.get('href') for link in nextpagelinks]
    #print(nextpagelinks)
    activepage = soup.findAll("a", { "class" : "turnoverButton siteButton bigButton active" })
    #print("activepage= " +activepage[0].text)
    currentpagenum = activepage[0].text
    #print("currentpagenum= "+currentpagenum)
    if len(currentpagenum)==1 and iteration>1:
        nextpage = savedpage+str(int(currentpagenum)+1)+str(nextpagelinks[0][-27:])
        #print("nextpage= "+nextpage)
        nextpage = re.sub(r'(%20)', ' ', nextpage)
        nextpage = re.sub(r'(%3A)', ':', nextpage)
        nextpage = "https://kat.cr"+nextpage
        #print(nextpage)
    elif len(currentpagenum)==1 and iteration<=1:
        nextpage = str(nextpagelinks[0][:-28])+str(int(currentpagenum)+1)+str(nextpagelinks[0][-27:])
        savedpage = str(nextpagelinks[0][:-28])
        #print("savedpage= "+savedpage )
        nextpage = re.sub(r'(%20)', ' ', nextpage)
        nextpage = re.sub(r'(%3A)', ':', nextpage)
        nextpage = "https://kat.cr"+nextpage
        #print(nextpage)
    elif len(currentpagenum)==2:
        nextpage = savedpage+str(int(currentpagenum)+1)+str(nextpagelinks[0][-27:])
        #print("nextpage= "+nextpage)
        nextpage = re.sub(r'(%20)', ' ', nextpage)
        nextpage = re.sub(r'(%3A)', ':', nextpage)
        nextpage = "https://kat.cr"+nextpage
        #print(nextpage)
    return nextpage

if totalpages<2:
    while iteration < totalpages-1:   #should be totalpages-1 for max accuracy
        getdata(kataddress)
        iteration+=1
        kataddress = getnextpage(kataddress)
else:
    while iteration < 2:   #should be totalpages-1 for max accuracy
        getdata(kataddress)
        iteration+=1
        kataddress = getnextpage(kataddress)

# print(str(sum(numSeedsArray)))
# print(str(sum(numLeechArray)))
print(str(sum(numLeechArray)+sum(numSeedsArray)))

def getgoogdata(title):
    title = re.sub(r' ', '+', title)
    url = 'https://www.google.com/search?q=' + title + '&ie=utf-8&oe=utf-8'
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html5lib")
    resultnum = soup.find("div", {"id": "resultStats"}).text[:-14]
    s2 = resultnum.replace(',', '')
    resultnum = re.findall(r'\b\d+\b', s2)
    print(resultnum)

getgoogdata(inputstring)

####################METACRITIC STUFF#########################

metainputstring = ""
for item in inputarray:
    metainputstring += item + " "
metainputstring = metainputstring[:-1]
metacriticaddress = "http://www.metacritic.com/search/tv/" + metainputstring + "/results"
print (metacriticaddress)

r = requests.get(metacriticaddress)
soup = BeautifulSoup(r.content, "html5lib")
first_result = soup.find_all("li", attrs={"class" : "first_result"})
# first_result = soup.select("li.result.first_result")
print(first_result)


3 Answers:

Answer 0 (score: 2)

Quoting the documentation:

    It's handy to search for tags that match a certain CSS class, but the name of the CSS attribute, "class", is a reserved word in Python. Using class as a keyword argument will give you a syntax error. As of Beautiful Soup 4.1.2, you can search by CSS class using the keyword argument class_.

(from the section Searching by CSS class)

Hence, you need to use the following instead: soup.find_all("li", class_="first_result").

If you are using a version of BeautifulSoup prior to 4.1.2, or if you insist on passing a dictionary, you need to specify that the dictionary fills the attrs parameter: soup.find_all("li", attrs={"class" : "first_result"}).
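
As a minimal, self-contained sketch (the markup below is made up for illustration), both forms match an element that carries additional classes, because Beautiful Soup treats class as a multi-valued attribute:

from bs4 import BeautifulSoup

# Hypothetical markup standing in for the Metacritic page
html = '<ul><li class="result first_result">hit</li></ul>'
soup = BeautifulSoup(html, "html.parser")

# Keyword form, available since Beautiful Soup 4.1.2
print(soup.find_all("li", class_="first_result"))

# Dictionary form via attrs; same result
print(soup.find_all("li", attrs={"class": "first_result"}))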

Answer 1 (score: 2)

All of the other answers are not related to your actual issue.

You need to pretend to be a real browser to be able to see the search results:

r = requests.get(metacriticaddress, headers={
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36"
})

Proof (searching for Game of Thrones, of course):

>>> from bs4 import BeautifulSoup
>>> 
>>> import requests
>>> 
>>> metacriticaddress = "http://www.metacritic.com/search/tv/game%20of%20thrones/results"
>>> r = requests.get(metacriticaddress, headers={
...     "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36"
... })
>>> soup = BeautifulSoup(r.content, "html5lib")
>>> first_result = soup.find_all("li", class_="first_result")
>>> 
>>> print(first_result[0].find("h3", class_="product_title").get_text(strip=True))
Game of Thrones
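
For comparison, here is a quick sketch (assuming the site still blocks the default requests user agent) showing how to check that the same query without the header comes back empty:

r = requests.get(metacriticaddress)  # no User-Agent header this time
soup = BeautifulSoup(r.content, "html5lib")
print(soup.find_all("li", class_="first_result"))  # expected to print [] if the request is blocked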

Answer 2 (score: 1)

Your first attempt (soup.find_all("li", {"class" : "first_result"})) was very close to being correct; you just need to specify the parameter that your dictionary is being passed to (in this case, the parameter name is attrs), and call it as soup.find_all("li", attrs={"class" : "first_result"}).

However, I suggest doing this with a CSS selector, since you're matching against multiple classes. You can do that using the .select() method of the soup, like this:

results = soup.select("li.result.first_result")

Note that .select() will always return a list, so if there is only one element, don't forget to access it as results[0].
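
As a self-contained illustration (again with made-up markup), the selector requires both classes to be present on the element, and the result is a list even for a single match:

from bs4 import BeautifulSoup

# Hypothetical markup standing in for the search results page
html = '<ul><li class="result first_result">Game of Thrones</li><li class="result">Other</li></ul>'
soup = BeautifulSoup(html, "html.parser")

results = soup.select("li.result.first_result")
print(results)                  # a list with one element
print(results[0].get_text())    # Game of Thrones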