<h3>
<a href="article.jsp?tp=&arnumber=16">
Granular computing based
<span class="snippet">data</span>
<span class="snippet">mining</span>
in the views of rough set and fuzzy set
</a>
</h3>
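Using Python, I want to get the text out of the anchor tag. The result should be: Granular computing based data mining in the views of rough set and fuzzy set
I tried using lxml: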
from io import StringIO
from lxml import etree

parser = etree.HTMLParser()
tree = etree.parse(StringIO(html), parser)
xpath1 = "//h3/a/child::text() | //h3/a/span/child::text()"
rawResponse = tree.xpath(xpath1)
print(rawResponse)
and got the following output:
['\r\n\t\t','\r\n\t\t\t\t\t\t\t\t\tgranular computing based','data','mining','in the view of roughset and fuzzyset\r\n\t\t\t\t\t\t']
Answer 0 (score: 3)
You can use the text_content method:
import lxml.html as LH
html = '''<h3>
<a href="article.jsp?tp=&arnumber=16">
Granular computing based
<span class="snippet">data</span>
<span class="snippet">mining</span>
in the views of rough set and fuzzy set
</a>
</h3>'''
root = LH.fromstring(html)
for elt in root.xpath('//a'):
print(elt.text_content())
which yields
Granular computing based
data
mining
in the views of rough set and fuzzy set
Or, to collapse the whitespace, you can use
print(' '.join(elt.text_content().split()))
to get
Granular computing based data mining in the views of rough set and fuzzy set
Here is another option you may find useful:
print(' '.join([elt.strip() for elt in root.xpath('//a/descendant-or-self::text()')]))
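which yields
Granular computing based data  mining in the views of rough set and fuzzy set
(Note that it leaves an extra space between data and mining, because the whitespace-only text node between the two span elements is stripped to an empty string and still joined.)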
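'//a/descendant-or-self::text()' is a more general version of "//a/child::text() | //a/span/child::text()". It walks through all children, grandchildren, and so on, rather than only the text directly under a and under its span children.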
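If the extra space matters, one possible variation is to drop the text nodes that are empty after stripping before joining them. This is only a sketch built on the same html string and XPath as above; the names pieces and title are illustrative and not part of the original answer.

```python
import lxml.html as LH

html = '''<h3>
<a href="article.jsp?tp=&arnumber=16">
Granular computing based
<span class="snippet">data</span>
<span class="snippet">mining</span>
in the views of rough set and fuzzy set
</a>
</h3>'''

root = LH.fromstring(html)

# Strip every text node found under <a>, at any depth.
pieces = [t.strip() for t in root.xpath('//a/descendant-or-self::text()')]

# Skip nodes that were whitespace only, so they do not add extra spaces.
title = ' '.join(p for p in pieces if p)
print(title)
```

Filtering before the join keeps the result the same as the ' '.join(elt.text_content().split()) version for this snippet.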
Answer 1 (score: 1)
>>> from bs4 import BeautifulSoup
>>> html = (the html you posted above)
>>> soup = BeautifulSoup(html, "html.parser")
>>> print(" ".join(soup.h3.text.split()))
Granular computing based data mining in the views of rough set and fuzzy set
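Explanation:
BeautifulSoup parses the HTML and makes it easy to navigate. soup.h3 accesses the h3 tag in the HTML.
.text simply returns all the text inside the h3 tag, dropping the markup of any nested tags such as span while keeping their text.
I use split() here to get rid of the extra whitespace and newlines, and then " ".join() to put the words back together, because split returns a list.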
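If neither lxml nor BeautifulSoup is available, the same split() and " ".join() idea can be sketched with the standard library's html.parser. This is an alternative technique, not what either answer uses; TextCollector and visible_text are hypothetical names introduced only for this illustration.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: records every run of text the parser reports."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; the tags themselves are ignored.
        self.chunks.append(data)


def visible_text(markup):
    """Return all text in markup with runs of whitespace collapsed."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()  # flush any text still buffered by the parser
    # split() drops spaces, tabs and newlines; join() rebuilds one line.
    return " ".join("".join(collector.chunks).split())


html = '''<h3>
<a href="article.jsp?tp=&arnumber=16">
Granular computing based
<span class="snippet">data</span>
<span class="snippet">mining</span>
in the views of rough set and fuzzy set
</a>
</h3>'''

print(visible_text(html))
```

Unlike soup.h3.text, this sketch collects text from the whole string passed in, so it fits best when the input is already the fragment of interest.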