I'm new to BeautifulSoup and Python development, and I would like to get a plain-text result without any HTML tags or other non-text elements.
I wrote this in Python:
#!/usr/bin/env python
import urllib2
from bs4 import BeautifulSoup
html_content = urllib2.urlopen("http://www.demo.com/index.php")
soup = BeautifulSoup(html_content, "lxml")
# COMMENTS COUNT
count_comment = soup.find("span", "sidebar-comment__label")
count_comment
count_comment_final = count_comment.find_next("meta")
# READ COUNT
count_read = soup.find("span", "sidebar-read__label js-read")
count_read
count_read_final = count_read.find_next("meta")
# PRINT RESULT
print count_comment_final
print count_read_final
My HTML looks like this:
<div class="box">
<span class="sidebar-comment__label">Comments</span>
<meta itemprop="interactionCount" content="Comments:115">
</div>
<div class="box">
<span class="sidebar-read__label js-read">Read</span>
<meta itemprop="interactionCount" content="Read:10">
</div>
I get this:
<meta content="Comments:115" itemprop="interactionCount"/>
<meta content="Read:10" itemprop="interactionCount"/>
I would like to get this instead:
You've 115 comments
You've 10 read
First, is this possible?
Second, is my code any good?
Third, can you help me? ;-)
Answer 0 (score: 1)
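count_comment_final and count_read_final are tags, as the output clearly shows. You need to extract the value of the content attribute from both tags. This is done with count_comment_final['content'], which gives 'Comments:115'. To drop the 'Comments:' prefix, call split(':') on that string and take the second element: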
#!/usr/bin/env python
import urllib2
from bs4 import BeautifulSoup
html_content = urllib2.urlopen("http://www.demo.com/index.php")
soup = BeautifulSoup(html_content, "lxml")
# COMMENTS COUNT
count_comment = soup.find("span", "sidebar-comment__label")
count_comment
count_comment_final = count_comment.find_next("meta")
# READ COUNT
count_read = soup.find("span", "sidebar-read__label js-read")
count_read
count_read_final = count_read.find_next("meta")
# PRINT RESULT
print count_comment_final['content'].split(':')[1]
print count_read_final['content'].split(':')[1]
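To get the exact sentences the question asks for, the extracted number still has to be placed into a string. The following is a minimal sketch, not the answerer's code. It assumes Python 3 and the built-in html.parser backend, and it parses the sample markup from the question inline rather than downloading a page. The helper name interaction_count is hypothetical.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question; a real script would download the page instead.
HTML = """
<div class="box">
<span class="sidebar-comment__label">Comments</span>
<meta itemprop="interactionCount" content="Comments:115">
</div>
<div class="box">
<span class="sidebar-read__label js-read">Read</span>
<meta itemprop="interactionCount" content="Read:10">
</div>
"""


def interaction_count(soup, label_class):
    """Return the number stored in the <meta> tag that follows the labelled <span>.

    Hypothetical helper: assumes the span and the meta tag both exist.
    """
    label = soup.find("span", class_=label_class)
    meta = label.find_next("meta")
    # The attribute looks like "Comments:115"; keep the part after the colon.
    return int(meta["content"].split(":")[1])


soup = BeautifulSoup(HTML, "html.parser")
comments = interaction_count(soup, "sidebar-comment__label")
reads = interaction_count(soup, "sidebar-read__label")

print("You've {} comments".format(comments))  # You've 115 comments
print("You've {} read".format(reads))         # You've 10 read
```

Converting to int is optional here; it only matters if the counts are used for arithmetic or comparisons later.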
Answer 1 (score: 1)
count_comment_final and count_read_final are tag elements, so you can use
count_comment_final.get('content')
which gives output like this:
'Comments:115'
So you can get the comment count with
count_comment_final.get('content').split(':')[1]
The same applies to count_read_final:
count_read_final.get('content').split(':')[1]
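One caveat with chaining split directly onto get: get returns None when the attribute is missing, so the chained call would raise AttributeError on a page whose meta tag has no content attribute. A minimal sketch of a more defensive version follows. The helper name parse_count is hypothetical, and returning None for malformed input is an assumption about what the caller wants.

```python
def parse_count(content):
    """Return the integer after the colon in e.g. 'Comments:115'.

    Returns None if the value is missing or not in 'Label:number' form.
    Hypothetical helper, not part of BeautifulSoup.
    """
    if content is None:
        return None
    # partition splits on the first colon only and never raises.
    _, sep, value = content.partition(":")
    value = value.strip()
    if not sep or not value.isdigit():
        return None
    return int(value)


print(parse_count("Comments:115"))  # 115
print(parse_count(None))            # None
```

With the tags from the question, this would be called as parse_count(count_comment_final.get('content')).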