Question

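I'm a beginner just getting started with BeautifulSoup and Python development, and I'd like to get a plain-text result without any HTML tags or other non-text elements.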

Here is what I wrote in Python:

#!/usr/bin/env python

import urllib2
from bs4 import BeautifulSoup

html_content = urllib2.urlopen("http://www.demo.com/index.php")

soup = BeautifulSoup(html_content, "lxml")

# COMMENTS COUNT
count_comment = soup.find("span", "sidebar-comment__label")
count_comment
count_comment_final = count_comment.find_next("meta")


# READ COUNT
count_read = soup.find("span", "sidebar-read__label js-read")
count_read
count_read_final = count_read.find_next("meta")

# PRINT RESULT
print count_comment_final
print count_read_final

My HTML looks like this:

<div class="box">
      <span class="sidebar-comment__label">Comments</span>
      <meta itemprop="interactionCount" content="Comments:115">
</div>


<div class="box">
      <span class="sidebar-read__label js-read">Read</span>
      <meta itemprop="interactionCount" content="Read:10">
</div>

This is what I get:

<meta content="Comments:115" itemprop="interactionCount"/>
<meta content="Read:10" itemprop="interactionCount"/>

This is what I would like to get:

You've 115 comments
You've 10 read

First, is this possible?

Second, is my code any good?

Third, can you help me? ;-)

Answer 1

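count_comment_final and count_read_final are tags, as the output clearly shows. You need to extract the value of the content attribute from each of the two tags. For the first tag this is done with count_comment_final['content'], which returns Comments:115. The Comments: prefix is then stripped by calling split(':') and taking the second element.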

#!/usr/bin/env python

import urllib2
from bs4 import BeautifulSoup

html_content = urllib2.urlopen("http://www.demo.com/index.php")

soup = BeautifulSoup(html_content, "lxml")

# COMMENTS COUNT
count_comment = soup.find("span", "sidebar-comment__label")
count_comment
count_comment_final = count_comment.find_next("meta")


# READ COUNT
count_read = soup.find("span", "sidebar-read__label js-read")
count_read
count_read_final = count_read.find_next("meta")

# PRINT RESULT
print count_comment_final['content'].split(':')[1]
print count_read_final['content'].split(':')[1]


Answer 2

count_comment_final and count_read_final are tag elements, so you can use:

count_comment_final.get('content')

This gives output like this:

'Comments:115'

So you can get the comment count with:

count_comment_final.get('content').split(':')[1]

The same applies to count_read_final:

count_read_final.get('content').split(':')[1]
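
To produce the exact lines asked for in the question ("You've 115 comments"), the extracted value still has to be formatted. The following is only a sketch under some assumptions: it targets Python 3, parses the HTML snippet from the question inline instead of fetching the URL, uses the built-in "html.parser" rather than lxml, and introduces a hypothetical helper named interaction_count that is not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# HTML copied from the question; a real script would download the page instead.
HTML = """
<div class="box">
      <span class="sidebar-comment__label">Comments</span>
      <meta itemprop="interactionCount" content="Comments:115">
</div>
<div class="box">
      <span class="sidebar-read__label js-read">Read</span>
      <meta itemprop="interactionCount" content="Read:10">
</div>
"""


def interaction_count(soup, label_class):
    """Hypothetical helper: return the number held by the <meta> tag that
    follows the <span> carrying label_class, or None if it cannot be found."""
    label = soup.find("span", class_=label_class)
    if label is None:
        return None
    meta = label.find_next("meta")
    if meta is None:
        return None
    # .get() returns the default instead of raising KeyError when the
    # attribute is missing, unlike meta["content"].
    content = meta.get("content", "")
    # "Comments:115" -> ("Comments", ":", "115")
    _, _, value = content.partition(":")
    return int(value) if value.isdigit() else None


soup = BeautifulSoup(HTML, "html.parser")
comments = interaction_count(soup, "sidebar-comment__label")
reads = interaction_count(soup, "sidebar-read__label")

print("You've {} comments".format(comments))
print("You've {} read".format(reads))
```

With the HTML from the question this prints "You've 115 comments" and "You've 10 read". Returning None for a missing span, a missing meta tag, or a non-numeric value is a design choice of this sketch, so that a changed page layout does not crash the script with an AttributeError or KeyError.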

Parsing with BeautifulSoup and getting a result in a specific format
