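I'm pulling lists from web pages, and to give them context I'm also pulling the text that immediately precedes them. Pulling the tag that comes right before the <ul> or <ol> tag seems to be the best approach. So say I have a bulleted list introduced by the word "Millennials":
I'd want to pull both the bullet and the word "Millennials". I use a BeautifulSoup function: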
# pull <ul> tags
def pull_ul(tag):
    return tag.name == 'ul' and tag.li and not tag.attrs and not tag.li.attrs and not tag.a
ul_tags = webpage.find_all(pull_ul)
# find text immediately preceding any <ul> tag and prepend it to the <ul> tag
ul_with_context = [str(ul.previous_sibling) + str(ul) for ul in ul_tags]
When I print ul_with_context, I get the following:
['\n<ul>\n<li>With immigration adding more numbers to its group than any other, the Millennial population is projected to peak in 2036 at 81.1 million. Thereafter the oldest Millennial will be at least 56 years of age and mortality is projected to outweigh net immigration. By 2050 there will be a projected 79.2 million Millennials.</li>\n</ul>']
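As you can see, "Millennials" was not pulled. The page I'm pulling from is http://www.pewresearch.org/fact-tank/2016/04/25/millennials-overtake-baby-boomers/ and in the section of its markup for this bullet, the <p> and <ul> tags are siblings. Any idea why it isn't pulling the tag that contains the word "Millennials"?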
Answer 0 (score: -1)
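previous_sibling returns whatever node immediately precedes the tag, which can be either an element or a string. In your case, it returns the string '\n', the whitespace between the <p> and the <ul>.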
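To see why, here is a minimal sketch using made-up markup (the `html` string is a stand-in, not the actual page source). It assumes the <p> and <ul> are separated only by a newline, which is what the printed output in the question suggests:

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical markup standing in for the page: a <p> followed by a <ul>
html = "<div><p>Millennials</p>\n<ul>\n<li>item</li>\n</ul></div>"
soup = BeautifulSoup(html, "html.parser")
ul = soup.find("ul")

# The node right before <ul> is the newline text node, not the <p>
prev = ul.previous_sibling
is_text_node = isinstance(prev, NavigableString)
print(repr(prev), is_text_node)

# Stepping back one more sibling reaches the <p> tag
print(prev.previous_sibling)
```

The newline between the two tags is parsed as its own text node, so it sits between <p> and <ul> in the sibling chain.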
Instead, you can use the find_previous method (spelled findPrevious in older code) to get the tag that precedes the selected node:
from bs4 import BeautifulSoup

doc = """
<h2>test</h2>
<ul>
<li>1</li>
<li>2</li>
</ul>
"""
soup = BeautifulSoup(doc, 'html.parser')
tags = soup.find_all('ul')
# find_previous() is the current name for findPrevious()
print([ul.find_previous() for ul in tags])
print(tags)
This outputs:
[<h2>test</h2>]
[<ul><li>1</li><li>2</li></ul>]
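One caveat: find_previous() walks backwards through the document in parse order, so if the preceding paragraph contains nested tags (a <strong>, say) it can return the inner tag rather than the <p> itself. Since the question states that the <p> and <ul> are siblings, find_previous_sibling may be a closer fit. Below is a rough sketch of how the question's list comprehension could be adapted; the markup and the helper name `with_context` are made up for illustration, and the real page may be structured differently:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the <p> and <ul> are siblings separated by a newline
html = """
<p>Millennials</p>
<ul>
<li>First bullet</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

def with_context(ul):
    """Return the nearest preceding sibling tag (if any) plus the list, as HTML."""
    # True matches any tag, so whitespace text nodes are skipped
    prev_tag = ul.find_previous_sibling(True)
    prefix = str(prev_tag) if prev_tag is not None else ""
    return prefix + str(ul)

ul_with_context = [with_context(ul) for ul in soup.find_all("ul")]
print(ul_with_context)
```

The `None` check covers lists that have no preceding sibling tag, in which case only the list itself is returned.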