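I understand in general how to make a POST request with urllib2 (encoding the data and so on), but every tutorial online uses completely useless made-up example URLs (someserver.com, coolsite.org, etc.), so I can't see the specific HTML that corresponds to the example code. Even python.org's own tutorial is no help in this respect.
I need to make a POST request to this URL: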
https://patentscope.wipo.int/search/en/search.jsf
The relevant part of the page source is this (I think):
<form id="simpleSearchSearchForm" name="simpleSearchSearchForm" method="post" action="/search/en/search.jsf" enctype="application/x-www-form-urlencoded" style="display:inline">
<input type="hidden" name="simpleSearchSearchForm" value="simpleSearchSearchForm" />
<div class="rf-p " id="simpleSearchSearchForm:sSearchPanel" style="text-align:left;z-index:-1;"><div class="rf-p-hdr " id="simpleSearchSearchForm:sSearchPanel_header">
Or maybe this:
<input id="simpleSearchSearchForm:fpSearch" type="text" name="simpleSearchSearchForm:fpSearch" class="formInput" dir="ltr" style="width: 400px; height: 15px; text-align: left; background-image: url("https://patentscope.wipo.int/search/org.richfaces.resources/javax.faces.resource/org.richfaces.staticResource/4.5.5.Final/PackedCompressed/classic/org.richfaces.images/inputBackgroundImage.png"); background-position: 1px 1px; background-repeat: no-repeat;">
If I want to submit JP2014084003 as the search term, which values from the HTML do I use? The input id? The name? The value?
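Addendum: this answer does not answer my question, because it only repeats information I had already seen on the Python documentation page.
Update:
I found this and tried the code from there, specifically: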
import requests
headers = {'User-Agent': 'Mozilla/5.0'}
payload = {'name':'simpleSearchSearchForm:fpSearch','value':'2014084003'}
link = 'https://patentscope.wipo.int/search/en/search.jsf'
session = requests.Session()
resp = session.get(link,headers=headers)
cookies = requests.utils.cookiejar_from_dict(requests.utils.dict_from_cookiejar(session.cookies))
resp = session.post(link,headers=headers,data=payload,cookies =cookies)
r = session.get(link)
f = open('htmltext.txt','w')
f.write(r.content)
f.close()
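I get a successful response (200), but the data is again just the data from the original page. So I can't tell whether I am posting to the form correctly and need to do something else to get the search results page back, or whether I am still posting the data incorrectly.
Yes, I realize this uses requests rather than urllib2, but all I want is to get the data.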
Answer 0 (score: 3)
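This is not the most straightforward POST request. If you look in the developer tools or Firebug, you can see the form data sent by a successful browser POST.
All of it is fairly straightforward, except that some of the keys have a : embedded in them, which may be a bit confusing. For example, simpleSearchSearchForm:commandSimpleFPSearch is the key and Search is the value.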
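To make the key/value point concrete, here is a minimal sketch of how one such field is encoded. It assumes the search box is posted under its name attribute (simpleSearchSearchForm:fpSearch, taken from the HTML in the question) and uses the question's search term as the value; nothing is sent over the network.

```python
from urllib.parse import urlencode

# The dictionary key is the input's `name` attribute; the dictionary value is
# the text you would have typed into that box in the browser.
payload = {"simpleSearchSearchForm:fpSearch": "JP2014084003"}

# urlencode percent-encodes the colon inside the key as %3A, which is why the
# raw request body looks different from the field name in the HTML.
body = urlencode(payload)
print(body)  # simpleSearchSearchForm%3AfpSearch=JP2014084003
```

With requests you pass the plain dictionary as data= and it does this encoding for you, so the colon in the key needs no special handling.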
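The only thing you cannot hard-code is javax.faces.ViewState. We need to make a request to the site first and then parse that value out of the response, which we can do with BeautifulSoup: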
import requests
from bs4 import BeautifulSoup
url = "https://patentscope.wipo.int/search/en/search.jsf"
data = {"simpleSearchSearchForm": "simpleSearchSearchForm",
"simpleSearchSearchForm:j_idt341": "EN_ALLTXT",
"simpleSearchSearchForm:fpSearch": "automata",
"simpleSearchSearchForm:commandSimpleFPSearch": "Search",
"simpleSearchSearchForm:j_idt406": "workaround"}
head = {
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36"}
with requests.Session() as s:
    # Get the cookies and the source to parse the Viewstate token
    init = s.get(url)
    soup = BeautifulSoup(init.text, "lxml")
    val = soup.select_one("#j_id1:javax.faces.ViewState:0")["value"]
    # update post data dict
    data["javax.faces.ViewState"] = val
    r = s.post(url, data=data, headers=head)
    print(r.text)
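As an alternative sketch of the token-extraction step, the hidden fields can also be collected by their name attribute using only the standard library. Looking the field up by name avoids depending on how a CSS selector treats the colons in the id. The class name HiddenInputCollector and the sample markup below are invented for illustration; the real page is larger and its token differs.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of every <input type="hidden"> in a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <input ... /> also end up here.
        if tag != "input":
            return
        attributes = dict(attrs)
        if attributes.get("type") == "hidden" and "name" in attributes:
            self.fields[attributes["name"]] = attributes.get("value") or ""


# Invented sample markup standing in for the response of the initial GET.
sample_html = """
<form id="simpleSearchSearchForm" method="post" action="/search/en/search.jsf">
  <input type="hidden" name="simpleSearchSearchForm" value="simpleSearchSearchForm" />
  <input type="text" name="simpleSearchSearchForm:fpSearch" />
  <input type="hidden" name="javax.faces.ViewState" id="j_id1:javax.faces.ViewState:0" value="example-token" />
</form>
"""

collector = HiddenInputCollector()
collector.feed(sample_html)
hidden_fields = collector.fields
print(hidden_fields["javax.faces.ViewState"])
```

In a real run you would feed it the text of the first GET response and copy hidden_fields["javax.faces.ViewState"] into the POST data.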
If we run the code above:
In [13]: import requests
In [14]: from bs4 import BeautifulSoup
In [15]: url = "https://patentscope.wipo.int/search/en/search.jsf"
In [16]: data = {"simpleSearchSearchForm": "simpleSearchSearchForm",
....: "simpleSearchSearchForm:j_idt341": "EN_ALLTXT",
....: "simpleSearchSearchForm:fpSearch": "automata",
....: "simpleSearchSearchForm:commandSimpleFPSearch": "Search",
....: "simpleSearchSearchForm:j_idt406": "workaround"}
In [17]: head = {
....: "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36"}
In [18]: with requests.Session() as s:
....:     init = s.get(url)
....:     soup = BeautifulSoup(init.text, "lxml")
....:     val = soup.select_one("#j_id1:javax.faces.ViewState:0")["value"]
....:     data["javax.faces.ViewState"] = val
....:     r = s.post(url, data=data, headers=head)
....:     print("\n".join([s.text.strip() for s in BeautifulSoup(r.text,"lxml").select("span.trans-section")]))
....:
Fuzzy genetic learning automata classifier
Fuzzy genetic learning automata classifier
FINITE AUTOMATA MANAGER
CELLULAR AUTOMATA MUSIC GENERATOR
CELLULAR AUTOMATA MUSIC GENERATOR
ANALOG LOGIC AUTOMATA
Incremental automata verification
Cellular automata music generator
Analog logic automata
Symbolic finite automata
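You can see that it matches the web page. If you want to scrape websites, you need to get familiar with the developer tools, Firebug and so on, to see how the requests are made and then try to mimic them. To open Firebug, right-click the page and choose Inspect Element, click the Network tab, and submit your request. Then select the request from the list and pick whichever tab has the information you want, for example the parameters of the POST request.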
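One way to check how closely a script mimics the browser is to paste the request body from the Network tab and compare its field names with your own payload. The sketch below does that offline. The helper name missing_fields and the captured string are hypothetical (the real body has more fields and a real token); the attempt dictionary is the payload from the question.

```python
from urllib.parse import parse_qsl


def missing_fields(captured_body, payload):
    """Return field names the browser sent that the script's payload lacks."""
    browser_fields = dict(parse_qsl(captured_body, keep_blank_values=True))
    return sorted(set(browser_fields) - set(payload))


# Hypothetical body as it might be copied from the Network tab.
captured = (
    "simpleSearchSearchForm=simpleSearchSearchForm"
    "&simpleSearchSearchForm%3AfpSearch=automata"
    "&simpleSearchSearchForm%3AcommandSimpleFPSearch=Search"
    "&javax.faces.ViewState=example-token"
)

# The payload from the question's attempt: it uses the literal keys
# "name" and "value" instead of the field's name as the key.
attempt = {"name": "simpleSearchSearchForm:fpSearch", "value": "2014084003"}

print(missing_fields(captured, attempt))
```

Every field the browser sent shows up as missing here, which is consistent with the server simply returning the original search page for that attempt.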
You may also find this answer useful on how to approach posting to a website.