Problem with web scraping using Python.
Code:
from urllib.request import urlopen
from urllib.request import urlretrieve
from bs4 import BeautifulSoup
import urllib.error
import http.cookiejar, requests, pymysql, json, re
session = requests.Session()
monthurl = 'http://search.proquest.com/publication.publicationissuebrowse:drilldown/month/%E5%85%AB%E6%9C%88/08/year/2016/parentmonth082016'
payload = {"site": "news","t:ac" : "publications_105983"}
headers = {'user-agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36 SE 2.X MetaSr 1.0','Accept':'text/javascript, text/html,application/xml, text/xml, */*',\
'Accept-Encoding':'gzip, deflate','Accept-Language':'zh-CN,zh;q=0.8','Host':'search.proquest.com', 'Content-type':'application/x-www-form-urlencoded; charset=UTF-8', 'Connection':'keep-alive','Content-Length':'0','Origin':'http://search.proquest.com','Referer':'http://search.proquest.com/news/publication/105983/citation/99D2C84D41804033PQ/2?accountid=13818','X-Prototype-Version':'1.7','X-Requested-With':'XMLHttpRequest',\
'Cookie':'availability-zone=us-east-1a; mwtbid=830706AE-9389-4BB4-812D-B597683B812E; _ga=GA1.2.1201070524.1446763952; fsr.r=%7B%22d%22%3A90%2C%22i%22%3A%22de07553-78769885-bcc1-4823-67c96%22%2C%22e%22%3A1467984529571%7D; fulltextShowAll=YES; oneSearchTZ=480; authenticatedBy=IP; availability-zone=us-east-1a; _gat_UA-61126923-3=1; JSESSIONID=69A1CC852FF1123A9A78CFC18E2B6AFF.i-b86ebbb9; OS_VWO_COUNTRY=CN; OS_VWO_INSTITUTION=13818; OS_VWO_LANGUAGE=zho; OS_VWO_MY_RESEARCH=false; OS_VWO_REFERRING_URL=""; OS_VWO_REQUESTED_URL="http://search.proquest.com/news/publication/105983/citation/8558F5818C234BCFPQ/2?accountid=13818"; OS_PERSISTENT="wrPZtfJDrH0WIWT5cZZs+CwLAAUhJMHD++Vls3rVx5E="; OS_VWO_VISITOR_TYPE=returning; AWSELB=C393A78D02CA3EE2799CF8894B23627240E8CACE66D1C0BB8AD720DF21EC8ACE1D897A32BEBC089642A0472335D0E12E2E117186F0CCDBF88A5E8AB2CD9F31FA13EA9CDBB3A68FF4DB78B55F4406384017E95C9573; AppVersion=r20161.6.0.834.574; _vwo_uuid_v2=0308785C38305F47209E7EC8811AC0A2|3ec2dd2ac5e7bfcc195a554e24406f22; osTimestamp=1472090234.391; WT_FPC=id=202.120.14.195-2899434048.30480412:lv=1472043437504:ss=1472043437504; 
fsr.s=%7B%22cp%22%3A%7B%22Usage_Session%22%3A%2220160825015947140%3A312846%22%2C%22cxreplayaws%22%3A%22true%22%2C%22Error_Page%22%3A%22no%22%2C%22No_Results%22%3A%22no%22%2C%22My_Research%22%3A%22no%22%2C%22Advanced%22%3A%22no%22%2C%22Professional%22%3A%22no%22%2C%22User_IP%22%3A%22202.120.19.186%22%2C%22Session_ID%22%3A%2269A1CC852FF1123A9A78CFC18E2B6AFF.i-b86ebbb9%22%2C%22Account_ID%22%3A%2213818%22%7D%2C%22v1%22%3A-2%2C%22v2%22%3A-2%2C%22rid%22%3A%22de07553-78562942-af91-5f91-ed200%22%2C%22ru%22%3A%22http%3A%2F%2Fourex.lib.sjtu.edu.cn%2Fprimo_library%2Flibweb%2Faction%2Fdisplay.do%3Bjsessionid%3D73028D8B75DB2FF259A0E736836BAA07%3Ftabs%3DdetailsTab%26ct%3Ddisplay%26fn%3Dsearch%26doc%3Dsjtulibxw000061822%26indx%3D1%26recIds%3Dsjtulibxw000061822%26recIdxs%3D0%26elementId%3D0%26renderMode%3DpoppedOut%26displayMode%3Dfull%26frbrVersion%3D%26dscnt%3D0%26scp.scps%3Dscope%253A%2528SJT%2529%252Cscope%253A%2528sjtu_metadata%2529%252Cscope%253A%2528sjtu_sfx%2529%252Cscope%253A%2528sjtulibzw%2529%252Cscope%253A%2528sjtulibxw%2529%252CDuxiuBook%26tab%3Ddefault_tab%26dstmp%3D1472033627266%26vl(freeText0)%3Dproquest%26vid%3Dchinese%22%2C%22r%22%3A%22ourex.lib.sjtu.edu.cn%22%2C%22st%22%3A%22%22%2C%22to%22%3A5%2C%22pv%22%3A34%2C%22lc%22%3A%7B%22d0%22%3A%7B%22v%22%3A34%2C%22s%22%3Atrue%7D%7D%2C%22cd%22%3A0%2C%22f%22%3A1472090225890%2C%22pn%22%3A0%2C%22sd%22%3A0%7D; _ga=GA1.3.1201070524.1446763952'}
req = session.post(monthurl,data = payload,headers = headers)
main = BeautifulSoup(req.text,"html.parser").decode('utf-8')
print(main)
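Sample result: ['/publication.publicationissuebrowse:openissue/issueName/02016Y08Y25$23Aug+25,+2016?site=news&amp;t;:ac=publications_105983']
(this is a list; for brevity I show only one element),
whereas this is what the URL actually looks like: /publication.publicationissuebrowse:openissue/issueName/02016Y08Y25$23Aug+25,+2016?site=news&t:ac=publications_105983
with no "amp;" and no ";" after the "t".
So I actually have two questions: why does this happen, and how do I fix it? Can I just replace specific characters in the list elements?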
Answer 0 (score: 3)
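What you got is obviously meant to be injected into a web page. The & has simply been escaped as &amp; because it is used in HTML. The two forms are equivalent, but you have to unescape the string first. One way to do that is described at https://wiki.python.org/moin/EscapingHtml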
def unescape(s):
    s = s.replace("&lt;", "<")
    s = s.replace("&gt;", ">")
    # this has to be last:
    s = s.replace("&amp;", "&")
    return s
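On Python 3.4 and later, the standard library's html.unescape decodes the same entities (and all other named and numeric ones), so a hand-written helper may not be needed. The following is a minimal sketch applied to a list like the one in the question; the helper name unescape_all and the sample list are illustrative assumptions, not part of the original code.

```python
import html


def unescape_all(urls):
    """Return a new list with the HTML entities in each URL decoded."""
    return [html.unescape(u) for u in urls]


# Hypothetical sample element, shaped like the one shown in the question
escaped = [
    "/publication.publicationissuebrowse:openissue/issueName/"
    "02016Y08Y25$23Aug+25,+2016?site=news&amp;t;:ac=publications_105983"
]

unescaped = unescape_all(escaped)
print(unescaped[0])
```

Note that this only turns &amp; back into &; the ";" after "t" is not part of an entity and is left in place.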
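As for the missing ;, that is either something the site handles in JS, or both URLs work equally well. It is definitely not a bug in this code. Double-check the scripts on the site.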
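If the ";" after "t" does turn out to matter, the question's idea of replacing specific characters in each list element is workable. Here is a minimal sketch, assuming the entries have already been unescaped and that the only unwanted sequence is "&t;:ac="; the helper name fix_stray_semicolon and the sample list are hypothetical.

```python
def fix_stray_semicolon(urls):
    """Replace the '&t;:ac=' sequence with '&t:ac=' in every URL."""
    # str.replace returns a new string, so the input list is not modified
    return [u.replace("&t;:ac=", "&t:ac=") for u in urls]


# Hypothetical sample element, already unescaped (no '&amp;' left)
unescaped = [
    "/publication.publicationissuebrowse:openissue/issueName/"
    "02016Y08Y25$23Aug+25,+2016?site=news&t;:ac=publications_105983"
]

fixed = fix_stray_semicolon(unescaped)
print(fixed[0])
```

Matching the whole "&t;:ac=" sequence rather than a bare ";" keeps the replacement from touching any other semicolon that might legitimately appear in a URL.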