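So I put a few pieces of code together so that I can analyze the HTML of a page I'm interested in. However, I'm not getting the same HTML that I get when I open the page normally in my browser.
I mean, in the browser I get all the tags and content inside the <body> tag, such as <h1> and so on, whereas result.text gives me a body with nothing in it except a decodeURIComponent call. I ran that call in my browser's console, but it didn't give me the rest of the page, only a JSON-like result.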
OBS¹: I tried setting the encoding explicitly, sending Referer and User-Agent headers, and running the requests in a session. Nothing worked.
OBS²: By the way, this is the site: http://www.danielfischer.com/
This is the page I receive. Notice that there is nothing inside the <body> tag, which should contain a lot of blockquotes:
<!DOCTYPE html>
<html>
<head>
<link rel="stylesheet" type="text/css" class="__meteor-css__" href="/3645e6749a7bb15e2b7a2b598d31f70b37ebf857.css?meteor_css_resource=true">
<title>Daniel Fischer / Leader, Developer, Designer - San Francisco & Los Angeles</title>
</head>
<body>
<script type="text/javascript">__meteor_runtime_config__ = JSON.parse(decodeURIComponent("%7B%22meteorRelease%22%3A%22METEOR%401.3%22%2C%22meteorEnv%22%3A%7B%22NODE_ENV%22%3A%22production%22%2C%22TEST_METADATA%22%3A%22%7B%7D%22%7D%2C%22PUBLIC_SETTINGS%22%3A%7B%7D%2C%22ROOT_URL%22%3A%22http%3A%2F%2Fwww.danielfischer.com%22%2C%22ROOT_URL_PATH_PREFIX%22%3A%22%22%2C%22appId%22%3A%22l92idpwahzxm4o1rd1%22%2C%22autoupdateVersion%22%3A%22bc9eac9e135921bb6593bdd334fe6855bedbb06e%22%2C%22autoupdateVersionRefreshable%22%3A%2237e5fc255eafc269ecb1fa482090c66aa8d627cc%22%2C%22autoupdateVersionCordova%22%3A%22none%22%7D"));</script>
<script type="text/javascript" src="/5dc2d6c014e5a058f745b57cca61e0d242cf06b7.js?meteor_js_resource=true"></script>
</body>
</html>
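For reference, the decodeURIComponent call in that script tag can be reproduced in Python. This is only a minimal sketch: it uses a shortened copy of the encoded string (two of the fields, my own abbreviation), and urllib.parse.unquote as a rough counterpart of JavaScript's decodeURIComponent. It yields only Meteor's runtime configuration, not any page content, which matches what I saw in the browser console.

```python
import json
from urllib.parse import unquote

# Shortened copy of the percent-encoded string from the <script> tag;
# only two of the original fields are kept here for brevity.
encoded = (
    "%7B%22meteorRelease%22%3A%22METEOR%401.3%22%2C"
    "%22ROOT_URL%22%3A%22http%3A%2F%2Fwww.danielfischer.com%22%7D"
)

# unquote() reverses the percent-encoding, roughly like decodeURIComponent().
decoded = unquote(encoded)

# The decoded text is plain JSON, so it parses into a dict of settings.
config = json.loads(decoded)
print(config)
```

The result is a small settings dictionary (release, root URL, and so on) with no markup in it.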
This is my Python code. Please ignore the commented-out lines; I only left them there to show my other attempts at a fix:
import requests, bs4

url = 'http://www.danielfischer.com/'

# Fetch the page inside a session, sending custom Referer and User-Agent headers.
with requests.Session() as s:
    page = s.get(url, headers={"Referer": "https://www.facebook.com/", 'User-Agent': 'test'})
    print(page.text.encode('utf-8'))
    #page.encoding = 'utf-8'
    #pageParsed = bs4.BeautifulSoup(pageRaw, "html.parser")
    #outfile = open(path, 'w')
    #outfile.write(str(page.text))
    #print(pageRaw.text)
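To rule out an encoding problem, one option is to parse the response and collect whatever text sits inside <body> outside of script tags. The following is a rough sketch using only the standard library; BodyTextCollector, visible_body_text, and the shortened sample string are hypothetical names and data made up for illustration, and in practice page.text would be passed in instead of sample.

```python
from html.parser import HTMLParser


class BodyTextCollector(HTMLParser):
    """Collects text inside <body>, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is in the body and not inside a script.
        if self.in_body and not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_body_text(html):
    """Return the list of visible text fragments found in the body."""
    parser = BodyTextCollector()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Hypothetical, shortened stand-in for the response I receive.
sample = """<!DOCTYPE html>
<html><head><title>Daniel Fischer</title></head>
<body>
<script type="text/javascript">__meteor_runtime_config__ = {};</script>
<script type="text/javascript" src="/app.js"></script>
</body></html>"""

print(visible_body_text(sample))
```

An empty list would mean the body text is genuinely absent from the response rather than lost in decoding. requests does not run the page's JavaScript, so anything a script adds afterwards would not show up here.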
Any insight would be appreciated :)