Parsing with BeautifulSoup and getting the result in a specific format

Time: 2014-09-25 04:58:03

Tags: python beautifulsoup lxml

I'm new to BeautifulSoup and Python development, and I'd like to get a plain-text result without any HTML tags or other non-text elements.

I wrote this in Python:

#!/usr/bin/env python

import urllib2
from bs4 import BeautifulSoup

html_content = urllib2.urlopen("http://www.demo.com/index.php")

soup = BeautifulSoup(html_content, "lxml")

# COMMENTS COUNT
count_comment = soup.find("span", "sidebar-comment__label")
count_comment
count_comment_final = count_comment.find_next("meta")


# READ COUNT
count_read = soup.find("span", "sidebar-read__label js-read")
count_read
count_read_final = count_read.find_next("meta")

# PRINT RESULT
print count_comment_final
print count_read_final

My HTML looks like this:

<div class="box">
      <span class="sidebar-comment__label">Comments</span>
      <meta itemprop="interactionCount" content="Comments:115">
</div>


<div class="box">
      <span class="sidebar-read__label js-read">Read</span>
      <meta itemprop="interactionCount" content="Read:10">
</div>

I get:

<meta content="Comments:115" itemprop="interactionCount"/>
<meta content="Read:10" itemprop="interactionCount"/>

I would like to get this:

You've 115 comments
You've 10 read

First, is this possible?

Second, is my code any good?

Third, can you help me? ;-)

2 answers:

Answer 0 (score: 1)

count_comment_final and count_read_final are tags, as is clear from the output. You need to extract the value of the content attribute from both tags. This is done with count_comment_final['content'], which gives Comments:115. Strip the Comments: prefix using split(':'):

#!/usr/bin/env python

import urllib2
from bs4 import BeautifulSoup

html_content = urllib2.urlopen("http://www.demo.com/index.php")

soup = BeautifulSoup(html_content, "lxml")

# COMMENTS COUNT
count_comment = soup.find("span", "sidebar-comment__label")
count_comment
count_comment_final = count_comment.find_next("meta")

# READ COUNT
count_read = soup.find("span", "sidebar-read__label js-read")
count_read
count_read_final = count_read.find_next("meta")

# PRINT RESULT
print count_comment_final['content'].split(':')[1]
print count_read_final['content'].split(':')[1]
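To get the exact sentences the question asks for, the extracted number can be dropped into a format string. The following is only a sketch, not the answerer's code: it is written in Python 3 syntax (the code above is Python 2), it parses the sample HTML from the question inline instead of fetching a URL, it uses the built-in "html.parser" rather than lxml so that no extra parser is required, and count_after is a hypothetical helper name introduced here for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question, so no network request is needed.
HTML = """
<div class="box">
      <span class="sidebar-comment__label">Comments</span>
      <meta itemprop="interactionCount" content="Comments:115">
</div>
<div class="box">
      <span class="sidebar-read__label js-read">Read</span>
      <meta itemprop="interactionCount" content="Read:10">
</div>
"""


def count_after(soup, label_class):
    """Hypothetical helper: number from the first <meta> after the labelled <span>."""
    label = soup.find("span", class_=label_class)
    meta = label.find_next("meta")
    # The content attribute looks like "Comments:115"; keep what follows the colon.
    return int(meta["content"].split(":")[1])


# Assumption: the stdlib-backed "html.parser" stands in for lxml here.
soup = BeautifulSoup(HTML, "html.parser")
comments = count_after(soup, "sidebar-comment__label")
reads = count_after(soup, "sidebar-read__label")

print("You've {} comments".format(comments))
print("You've {} read".format(reads))
```

With the sample HTML this prints "You've 115 comments" and "You've 10 read". Matching on the single class sidebar-read__label is enough, because BeautifulSoup compares the class_ argument against each class of the element.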

Answer 1 (score: 1)

count_comment_final and count_read_final are tag elements, so you can use:

count_comment_final.get('content')

This gives output like this:

'Comments:115'

So you can get the comment count with:

count_comment_final.get('content').split(':')[1]

The same applies to count_read_final:

count_read_final.get('content').split(':')[1]
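One practical difference between the two answers: tag.get('content') returns None when the attribute is missing, while tag['content'] raises a KeyError. If the page might lack the attribute, or the value might not contain a colon, the split can be wrapped defensively. The snippet below is a rough sketch in Python 3 that works on plain strings only; parse_count is a hypothetical helper name, not part of BeautifulSoup.

```python
def parse_count(content):
    """Turn a string such as 'Comments:115' into the integer 115.

    Returns None when the value is missing or has no numeric part.
    """
    if content is None:  # e.g. tag.get('content') on a tag without the attribute
        return None
    # partition splits on the first colon only and never raises
    _label, _sep, number = content.partition(":")
    number = number.strip()
    return int(number) if number.isdigit() else None


print(parse_count("Comments:115"))
print(parse_count("Read:10"))
print(parse_count(None))
print(parse_count("Comments"))
```

It would be called as parse_count(count_comment_final.get('content')), so a missing attribute yields None instead of an exception.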