Beautiful Soup has an extra </body> before the actual end

Time: 2016-09-19 01:57:33

Tags: python html web-scraping beautifulsoup

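I am trying to scrape poems from PoetryFoundation.org. In one of my test cases I found that when I pull the HTML for a particular poem, it contains an extra </body> before the actual end of the poem. I can view the poem's source online, and there is (as expected) no </body> in the middle of the poem. I put together an example using the URL of the specific case so that others can try to reproduce the problem: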

from bs4 import BeautifulSoup
from urllib.request import urlopen

poem_page = urlopen("https://www.poetryfoundation.org/poems-and-poets/poems/detail/57956")
poem_soup = BeautifulSoup(poem_page.read(), "html5lib")
print(poem_soup)

I am running Python 3.5.1. I have tried the default parser, html.parser, as well as html5lib and lxml.

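In the printed output, if you search for "in the poem", you will find the piece of HTML below. It makes no sense: it ends the entire HTML document with </body></html> partway through a line of the poem, and then carries on with the rest of the document: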

in the poem</div></div></div></div></div></div></div></div></div></div></div></div></div></div></div></div></div></div></body></html>. But when we met,<br/><div style="text-indent: -1em; padding-left: 1em;"><br/>

I have checked the source online, and it should be:

in the poem</em>. But when we met,<br></div><div style="text-indent: -1em; padding-left: 1em;">

I don't know why, when I scrape the page, it closes the entire HTML document in the middle of the page.
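One way to narrow this down is to check whether the stray </body> is already in the text that was downloaded or only appears after parsing. The sketch below is a rough heuristic, not part of the original post: the helper name premature_body_closes is hypothetical, it uses a regular expression rather than a real HTML parser, and it would be fooled by a "</body>" that sits inside a script or comment. The two sample strings are shortened stand-ins modelled on the snippets above.

```python
import re


def premature_body_closes(markup):
    """Return offsets of each </body> followed by more than </html> and whitespace.

    Hypothetical helper for illustration; a regex heuristic, not an HTML parser.
    """
    offsets = []
    for match in re.finditer(r"</body\s*>", markup, flags=re.IGNORECASE):
        rest = markup[match.end():]
        # Drop one optional </html> plus surrounding whitespace.
        # Anything still left is content that follows the closing tags.
        leftover = re.sub(
            r"^\s*(</html\s*>)?\s*", "", rest, count=1, flags=re.IGNORECASE
        )
        if leftover:
            offsets.append(match.start())
    return offsets


# Shortened stand-ins for the parsed output and the source seen in the browser.
parsed_like = "in the poem</div></div></body></html>. But when we met,<br/>"
source_like = "in the poem</em>. But when we met,<br></div></body></html>"

print(premature_body_closes(parsed_like))  # one offset: text follows </html>
print(premature_body_closes(source_like))  # empty list: nothing follows </html>
```

Running the helper on both the decoded response body and str(poem_soup) would show which of the two already contains the early closing tag.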

1 answer:

Answer 0 (score: 0):

When I tried to fetch the poem from your URL with html.parser, I ran into the same problem you did: the HTML is cut off right after "in the poem".

import requests
from bs4 import BeautifulSoup

poem_page = requests.get("https://www.poetryfoundation.org/poems-and-poets/poems/detail/57956")
poem_soup = BeautifulSoup(poem_page.text, "html.parser")
poem_div = poem_soup.find('div', class_='poem')
print(poem_div)

Output:

<div class="poem" data-view="ContentView">
<div style="text-indent: -1em; padding-left: 1em;">It seems a certain fear underlies everything. <br/></div><div style="text-indent: -1em; padding-left: 1em;">If I were to tell you something profound<br/></div><div style="text-indent: -1em; padding-left: 1em;"> it would be useless, as every single thing I know<br/></div><div style="text-indent: -1em; padding-left: 1em;"> is not timeless. I am particularly risk-averse.<br/></div><div style="text-indent: -1em; padding-left: 1em;"><br/></div><div style="text-indent: -1em; padding-left: 1em;">I choose someone else over me every time, <br/></div><div style="text-indent: -1em; padding-left: 1em;">as I'm sure they'll finish the task at hand, <br/></div><div style="text-indent: -1em; padding-left: 1em;">which is to say that whatever is in front of us<br/></div><div style="text-indent: -1em; padding-left: 1em;"> will get done if I'm not in charge of it.<br/></div><div style="text-indent: -1em; padding-left: 1em;"><br/></div><div style="text-indent: -1em; padding-left: 1em;">There is a limit to the number of times <br/></div><div style="text-indent: -1em; padding-left: 1em;">I can practice every single kind of mortification <br/></div><div style="text-indent: -1em; padding-left: 1em;">(of the flesh?). I can turn toward you and say <em>yes, <br/></em></div><div style="text-indent: -1em; padding-left: 1em;">it was you in the poem</div></div>

But after changing the parser to lxml, everything works.

import requests
from bs4 import BeautifulSoup

poem_page = requests.get("https://www.poetryfoundation.org/poems-and-poets/poems/detail/57956")
poem_soup = BeautifulSoup(poem_page.text, "lxml")
poem_div = poem_soup.find('div', class_='poem')
# print(poem_div)
# Print the first child of each line-level <div>: the line's text, or <br/> for a stanza break.
for s in poem_div.find_all('div'):
    print(list(s.children)[0])

Output:

It seems a certain fear underlies everything. 
If I were to tell you something profound
 it would be useless, as every single thing I know
 is not timeless. I am particularly risk-averse.
<br/>
I choose someone else over me every time, 
as I'm sure they'll finish the task at hand, 
which is to say that whatever is in front of us
 will get done if I'm not in charge of it.
<br/>
There is a limit to the number of times 
I can practice every single kind of mortification 
(of the flesh?). I can turn toward you and say 
it was you in the poem. But when we met,
<br/>
you were actually wearing a shirt, and the poem 
wasn't about you or your indecipherable tattoo. 
The poem is always about me, but that one time 
I was in love with the memory of my twenties
<br/>
so I was, for a moment, in love with you 
because you remind me of an approaching
 subway brushing hair off my face with 
its hot breath. Darkness. And then light,
<br/>
the exact goldness of dawn fingering
 that brick wall out my bedroom window 
on Smith Street mornings when I'd wake
 next to godknowswho but always someone
<br/>
who wasn't a mistake, because what kind 
of mistakes are that twitchy and joyful 
even if they're woven with a particular 
thread of regret: the guy who used
<br/>
my toothbrush without asking,
I walked to the end of a pier with him,
would have walked off anywhere with him
until one day we both landed in California
<br/>
when I was still young, and going West
meant taking a laptop and some clothes
in a hatchback and learning about produce.
I can turn toward you, whoever you are,
<br/>
and say you are my lover simply because
I say you are, and that is, I realize,
a tautology, but this is my poem. I claim
nothing other than what I write, and even that,
<br/>
I'd leave by the wayside, since the only thing
to pack would be the candlesticks, and 
even those are burned through, thoroughly
replaceable. Who am I kidding? I don't
<br/>
own anything worth packing into anything.
We are cardboard boxes, you and I, stacked
nowhere near each other and humming
different tunes. It is too late to be writing this.
<br/>
I am writing this to tell you something less
than neutral, which is to say I'm sorry.
It was never you. It was always you:
your unutterable name, this growl in my throat.
<br/>
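Since the result depends on the parser, it can help to run the same markup through each installed parser and compare. The following is a minimal sketch under stated assumptions: the helper text_survives and the SAMPLE string are hypothetical, and SAMPLE is a small well-formed stand-in, not the real page, so it will not reproduce the truncation by itself. Parsers that are not installed are reported as None instead of raising.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def text_survives(markup, parser, marker="But when we met"):
    """Return True if `marker` is still inside div.poem after parsing.

    Returns None when the requested parser is not installed.
    Hypothetical helper for illustration only.
    """
    try:
        soup = BeautifulSoup(markup, parser)
    except FeatureNotFound:
        return None
    poem_div = soup.find("div", class_="poem")
    return poem_div is not None and marker in poem_div.get_text()


# Well-formed stand-in for the poem markup (an assumption, not the live page).
SAMPLE = (
    '<html><body><div class="poem">'
    '<div>it was you <em>in the poem</em>. But when we met,<br></div>'
    '</div></body></html>'
)

for parser in ("html.parser", "lxml", "html5lib"):
    print(parser, text_survives(SAMPLE, parser))
```

To test the real page, pass poem_page.text in place of SAMPLE; a False for one parser and True for another would match the behaviour described in this answer.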