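Python GET request returns different HTML than view source

Time: 2016-07-06 17:54:04

Tags: python selenium web-scraping beautifulsoup phantomjs

I'm trying to extract fan fiction from an Archive of Our Own URL so that I can run some linguistic analysis on it with the NLTK library. However, every attempt at scraping the HTML from the URL returns everything except the fanfic (and the comments form, which I don't need).

First I tried the built-in urllib library (with BeautifulSoup):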

from urllib import request
from bs4 import BeautifulSoup
html = request.urlopen("http://archiveofourown.org/works/6846694").read()
soup = BeautifulSoup(html,"html.parser")
soup.prettify()

Then I found out about the Requests library, and that the User-Agent could be part of the problem, so I tried this, with the same results:

import requests
headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36',
        'Content-Type': 'text/html',
}
requests.get("http://archiveofourown.org/works/6846694",headers=headers,timeout=5).text

Then I found out about Selenium and PhantomJS, so I installed those and tried this, but again got the same result:

from selenium import webdriver
from bs4 import BeautifulSoup
browser = webdriver.PhantomJS()
browser.get("http://archiveofourown.org/works/6846694")
soup = BeautifulSoup(browser.page_source, "html.parser")
soup.prettify()

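Am I doing something wrong in any of these attempts, or is this an issue with the server?

2 answers:

Answer 0: (score: 2)

If you need the complete page source, with all JavaScript executed and asynchronous requests made, the last approach is a step in the right direction. You are missing just one thing: you need to give PhantomJS time to load the page before reading the source (pun intended).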

You also need to click "Proceed" to confirm that you agree to see adult content.
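
A minimal sketch of the two steps described above, under stated assumptions: the `browser` argument is any Selenium-style driver exposing `get`, `find_elements` and `page_source`; the link text "Proceed" and the `id="chapter-1"` marker are guesses about the page (the marker is taken from the markup shown in the other answer); and `wait_for` and `fetch_rendered_source` are hypothetical helper names. The hand-rolled polling loop stands in for Selenium's own `WebDriverWait`, which would be the more usual choice in real code.

```python
import time


def wait_for(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Raises TimeoutError if `timeout` seconds pass first.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("page did not finish loading in time")
        time.sleep(poll)


def fetch_rendered_source(browser, url, link_text="Proceed",
                          marker='id="chapter-1"', timeout=10.0, poll=0.5):
    """Load url, accept the adult-content prompt if shown, wait, return HTML."""
    browser.get(url)
    # "link text" is the locator strategy string behind Selenium's By.LINK_TEXT.
    # The link text itself is an assumption about the page.
    links = browser.find_elements("link text", link_text)
    if links:
        links[0].click()
    # Give the browser time: poll until the assumed chapter markup shows up.
    wait_for(lambda: marker in browser.page_source, timeout, poll)
    return browser.page_source
```

With a real driver this could be called as `fetch_rendered_source(browser, "http://archiveofourown.org/works/6846694")`, and the returned string passed to `BeautifulSoup(..., "html.parser")` as in the question.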

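Answer 1: (score: 1)

Alexce has already explained why your code doesn't give you what you want. If all you want is the text that is available in the source, add the param view_adult=true: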

import requests
from bs4 import BeautifulSoup
url = "http://archiveofourown.org/works/6846694?view_adult=true"


r= requests.get(url)
soup = BeautifulSoup(r.content, "lxml")
chap = soup.select_one("#chapter-1")
preface = soup.select_one("div.preface.group")


print(preface)
print(chap)

That will give you:

<div class="preface group">
<h2 class="title heading">
      The Complete Works of Emmanuel Allen
    </h2>
<h3 class="byline heading">
<a href="http://archiveofourown.org/users/violue/pseuds/violue" rel="author">violue</a>
</h3>
<div class="summary module" role="complementary">
<h3 class="heading">Summary:</h3>
<blockquote class="userstuff">
<p>Dean Winchester, reluctant business owner, reluctant home owner, and reluctant cat owner, is striking up a very promising friendship with the author of his favorite book series.</p><p>And he has no idea.</p>
</blockquote>
</div>
<div class="notes module" role="complementary">
<h3 class="heading">Notes:</h3>
<blockquote class="userstuff">
<p>Oh yeah, I've got notes.</p><p>
<s>1.) This is complete, though later chapters are still being beta'd. I'll be posting a chapter at a time, whenever the hell I feel like it. Probably every day/every other day because it's hard to just SIT ON ALL THESE CHAPTERS I HAVE WHEN THEY'RE READY TO POST!!!</s>
</p><p>2.) This is of the mostly aimless domestic fluff variety, in that there's no big overarching storyline. But that's pretty common with my stories.  ¯\_(ツ)_/¯ </p><p>3.) There's a bit of <i>me</i> in this story. I am a depressed and surly cat owner living in the Pacific Northwest, and so is Dean, but most of this is just my imagination.</p><p>4.) Thanks to <a href="http://archiveofourown.org/users/Tennyo/works">TENNYO</a>, <a href="http://chiwalker.tumblr.com/">CHIWALKER</a>, <a href="http://buckysbuckhole.tumblr.com/">CASFUCKER</a>, and <a href="http://kelisab.tumblr.com">KELISAB</a> for beta'ing! If you find mistakes in the story, it's all their fault, and you should throw soggy tomatoes at them.</p><p>5.) No, I think that's it. Start reading.</p>
</blockquote>
</div>
</div>
<div class="chapter" id="chapter-1">
<!-- chapter management -->
<div class="chapter preface group" role="complementary">
<h3 class="title">
<a href="/works/6846694/chapters/15628576">Chapter 1</a>: Prologue
    </h3>
<!-- only display byline if different from the main byline -->
</div>
<!--main content-->
<div class="userstuff module" role="article">
<h3 class="landmark heading" id="work">Chapter Text</h3>
<p>“Wow, that’s beautiful!”</p><p>Dean doesn’t even have to look up from his book to know what this customer is talking about. Winchester General Store has a lot of things; food, beer, toiletries, camping gear, used books and more, but the only thing that could be considered “beautiful” in this store is the hand-carved, ornate wooden house sitting in a display case mounted on the wall behind Dean. Actually, “house” isn’t the right word. It started as a house in Dean’s mind, but by the time he was done carving, sanding, polishing, and in some places hot gluing the white oak structure, it had become a mausoleum. A beautiful, <em>inviting </em>mausoleum, but a mausoleum nonetheless. Dean had even borrowed some acrylic paints from Charlie to color the climbing ivy painstakingly carved onto the sides.</p><p>“Thanks, man,” Dean says, setting his book down. Might as well let the guy know this was <em>his </em>hard work.</p><p>The man’s eyes widen. “You <em>made </em>this?”</p><p>“Sure did. Worked on it for two months.” Dean nods toward the twelve pack of Mountain Dew the customer is holding. “You all set?”</p><p>The man puts the case on the counter by the register, and Dean rings it up. “How much?”</p><p>“Eight ninety-nine for the Dew.”</p><p>The man shakes his head. “No, I mean the sculpture. My wife and I just bought a place up in Cougar Falls, and that would look <em>great </em>in the front room.”</p><p>Dean blinks, surprised. He’s gotten a lot of compliments on the mausoleum in the past ten or so months, but no one’s ever assumed it was for sale before.</p><p>“Sorry, man, not for sale.”</p><p>“Come on. Name your price.” Dean gets all sorts of customers here. Locals, people out in the area for camping, people up here to go rafting down Filbert River, and of course, people just passing through on their way to some place bigger and better. This guy falls into the last category.</p><p>“No can do, that thing’s got something important inside. 
Can’t part with it.”</p><p>“Important? Like what?”</p><p>Dean shrugs. “My parents.”</p><p>“W… what?” the man stammers.</p><p>“Yeah. There’s an urn inside. Kinda had to glue the top of the building on to get the urn in there, but you can’t really tell unless you’re real close and looking at just the right angle.”</p><p>“<em>Both </em>of your parents?”</p><p>“Well, my mom died ages ago, and my dad kept her ashes the rest of his life.” Dean turns to look at his carving fondly. “And when my dad died, we had him cremated too. One night I got real drunk, I was still kind of in mourning, and I decided my parents should be together. So I dumped my dad’s ashes into my mom’s urn, and then I gave the urn a good shake,” Dean says, shaking an imaginary urn. “My brother was <em>pissed </em>when I told him, but he’s over it now. Anyway, I made this here structure to keep them in. Sort of an apology gift.”</p><p>The bell over the front door jingles, and Dean turns back to see the customer has taken off. “Don’t you want your Mountain Dew?” he yells, even though the guy’s already outside.</p><p>Jeez. What a wimp. Dean reaches into the display case, patting the top of the mausoleum gently. “What a baby. Am I right, guys?”</p><p>The urn full of Winchester ashes stays silent of course. Dean snickers, picks his book up off the counter, and gets back to reading.</p><p><br/>
<br/>
</p><p> </p>
</div>
<!--/main-->
</div>

Hopefully that has everything you need.
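
The `view_adult=true` parameter could also be appended programmatically rather than hard-coded. This is a minimal sketch using only the standard library; `add_view_adult` is a hypothetical helper name, not part of Requests or the site, and it assumes the parameter keeps working the way this answer describes.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_view_adult(url):
    """Return url with view_adult=true set exactly once in its query string."""
    parts = urlsplit(url)
    # Keep existing parameters, dropping any previous view_adult value.
    pairs = [(key, value)
             for key, value in parse_qsl(parts.query, keep_blank_values=True)
             if key != "view_adult"]
    pairs.append(("view_adult", "true"))
    return urlunsplit(parts._replace(query=urlencode(pairs)))


print(add_view_adult("http://archiveofourown.org/works/6846694"))
```

The resulting URL can be passed to `requests.get` just like the hard-coded one in this answer. Requests can also build the query string itself through the `params` argument of `requests.get`.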