I want to scrape http://quotes.toscrape.com/search.aspx to get all the quotes. Although I save the ___VIEWSTATE parameter and pass it along with the request, I still get a 500 error.
import requests
from bs4 import BeautifulSoup

url = 'http://quotes.toscrape.com/search.aspx'
#headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:75.0) Gecko/20100101 Firefox/75.0"}
s = requests.Session()
#s.headers.update(headers)
page = s.get(url)
page.raise_for_status()
soup = BeautifulSoup(page.content)
authors = soup.find('select', id="author")
url = 'http://quotes.toscrape.com/filter.aspx'
for author in authors.stripped_strings:
    if author != '----------':
        parameters = {'tag' : '----------'}
        parameters.update({"___VIEWSTATE" : soup.select_one("#__VIEWSTATE")["value"]})
        parameters.update({"author" : author}) #autor.replace(" ", "+") ?
        print(parameters)
        page = s.post(url, data = parameters)
        page.raise_for_status() #ERROR 500
        soup = BeautifulSoup(page.content)
        tags = soup.find('select', id="tag")
        parameters.update({"submit_button" : "Search"})
        for tag in tags.stripped_strings:
            if tag != '----------':
                parameters.update({"___VIEWSTATE" : soup.select_one("#__VIEWSTATE")["value"]})
                parameters.update({"tag" : tag})
                page = s.post(url, data = parameters)
                page.raise_for_status()
                soup = BeautifulSoup(page.content)
                print(author + "-" + tag)
                print(soup.find("span", class_="content").get_text())
Answer 0 (score: 0)
import requests
from bs4 import BeautifulSoup

first = "http://quotes.toscrape.com/search.aspx"
second = "http://quotes.toscrape.com/filter.aspx"
data = {
    'tag': '----------'
}


def parse(page):
    soup = BeautifulSoup(page, 'html.parser')
    return soup


def main(url):
    with requests.Session() as req:
        r = req.get(url)
        soup = parse(r.content)
        # Author names come from the <option> elements that carry a value attribute.
        authors = [aut.get_text(strip=True)
                   for aut in soup.findAll("option", value=True)]
        # Note the two underscores: the hidden field is named __VIEWSTATE.
        data['__VIEWSTATE'] = soup.find("input", id="__VIEWSTATE")['value']
        for auth in authors:
            print(f"Extracting Author {auth}")
            data['author'] = auth
            # Posting the author returns a page whose "tag" select lists that author's tags.
            r = req.post(second, data=data)
            soup = parse(r.content)
            # [1:] skips the first entry of the select, the '----------' placeholder.
            tags = [list(tag.stripped_strings)[1:]
                    for tag in soup.findAll("select", id="tag")][0]
            data['submit_button'] = "Search"
            for tag in tags:
                data['tag'] = tag
                r = req.post(second, data=data)
                soup = parse(r.content)
                goal = soup.select_one("span.content").text
                print("Tag --> {:<20} = {}".format(tag, goal))


main(first)
Sample output:
Extracting Author Albert Einstein
Tag --> change = “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
Tag --> deep-thoughts = “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
Tag --> thinking = “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
Tag --> world = “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
Tag --> inspirational = “There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
Tag --> life = “There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
Tag --> live = “There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
Tag --> miracle = “There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
Tag --> miracles = “There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
Tag --> adulthood = “Try not to become a man of success. Rather become a man of value.”
Tag --> success = “Try not to become a man of success. Rather become a man of value.”
Tag --> value = “Try not to become a man of success. Rather become a man of value.”
Tag --> simplicity = “If you can't explain it to a six year old, you don't understand it yourself.”
Tag --> understand = “If you can't explain it to a six year old, you don't understand it yourself.”
Tag --> children = “If you want your children to be intelligent, read them fairy tales. If you want them to be more intelligent, read them more fairy tales.”
Tag --> fairy-tales = “If you want your children to be intelligent, read them fairy tales. If you want them to be more intelligent, read them more fairy tales.”
Tag --> imagination = “Logic will get you from A to Z; imagination will get you everywhere.”
Tag --> knowledge = “Any fool can know. The point is to understand.”
Tag --> learning = “Any fool can know. The point is to understand.”
Tag --> understanding = “Any fool can know. The point is to understand.”
Tag --> wisdom = “Any fool can know. The point is to understand.”
Tag --> simile = “Life is like riding a bicycle. To keep your balance, you must keep moving.”
Tag --> music = “If I were not a physicist, I would probably be a musician. I often think in music. I live my daydreams in music. I see my life in terms of music.”
Tag --> mistakes = “Anyone who has never made a mistake has never tried anything new.”
Extracting Author J.K. Rowling
Tag --> abilities = “It is our choices, Harry, that show what we truly are, far more than our abilities.”
Tag --> choices = “It is our choices, Harry, that show what we truly are, far more than our abilities.”
Tag --> courage = “It takes a great deal of bravery to stand up to our enemies, but just as much to stand up to our friends.”
Tag --> friends = “It takes a great deal of bravery to stand up to our enemies, but just as much to stand up to our friends.”
Tag --> dumbledore = “Of course it is happening inside your head, Harry, but why on earth should that mean that it is not real?”
Full output: view online
Answer 1 (score: 0)
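Just change "___VIEWSTATE" (three underscores) to "__VIEWSTATE" (two underscores), which is the name of the hidden field the form actually uses:
parameters.update({"__VIEWSTATE" : soup.select_one("#__VIEWSTATE")["value"]})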
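Because the 500 error came down to a mistyped field name, one way to avoid that class of typo is to read the hidden field names from the page itself instead of writing them by hand. The following is only a minimal sketch under stated assumptions: `HiddenFieldParser`, `hidden_fields`, and `SAMPLE_HTML` are hypothetical names, and the markup is a shortened stand-in for the real search page. It uses the standard library's `html.parser` so that it runs without network access or third-party packages.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <input ... />.
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def hidden_fields(html):
    """Return a dict of the hidden form fields found in an HTML string."""
    parser = HiddenFieldParser()
    parser.feed(html)
    parser.close()
    return parser.fields


# Hypothetical, shortened markup standing in for the real search page.
SAMPLE_HTML = """
<form action="/filter.aspx" method="post">
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="abc123" />
  <select name="author" id="author"><option>----------</option></select>
</form>
"""

# Start from the field names the page declares, then add the chosen values.
payload = hidden_fields(SAMPLE_HTML)
payload.update({"author": "Albert Einstein", "tag": "----------"})
print(payload)
```

In the scraper above, the same idea would mean passing the text of each response to `hidden_fields` and merging the result into the POST data, so the key always matches whatever name the server rendered.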