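How can I extract the text 'ROYAL PYTHON' from this HTML in a clean way? I have been looking for a solution for 4 hours, but I haven't found anything that is both relevant and actually works.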
<div class="definicja"><a href="javascript: void(0);"
onclick="play('/mp3/1/81/c5ebfe33a08f776931d69857169f0442.mp3')"
class="ikona_sluchaj2"></a> <a href="/slownik/angielsko_polski/,royal+python">ROYAL
PYTHON</a></div>
Answer 0 (score: 2)
As Joel Cornett said, use BeautifulSoup like this:
from bs4 import BeautifulSoup
html = '''<div class="definicja"><a href="javascript: void(0);" onclick="play('/mp3/1/81/c5ebfe33a08f776931d69857169f0442.mp3')" class="ikona_sluchaj2"></a> <a href="/slownik/angielsko_polski/,royal+python">ROYAL PYTHON</a></div>'''
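soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly to avoid a warning
print(soup.get_text().strip())  # strip the stray space left between the two links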
Answer 1 (score: 0)
You can use lxml and XPath:
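from lxml.html.soupparser import fromstring  # soupparser needs BeautifulSoup installed as well
s = 'yourhtml'  # placeholder: put the HTML string to parse here
h = fromstring(s)
# text of the second <a> inside <div class="definicja">
print(h.xpath('//div[@class="definicja"]/a[2]/text()')[0])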
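If lxml is not installed, the same positional lookup can be sketched with the standard library's xml.etree.ElementTree, which supports a limited XPath subset. This assumes the snippet is well-formed XML, which the one in the question happens to be; the variable names below are illustrative.

```python
import xml.etree.ElementTree as ET

# The snippet from the question; it happens to be well-formed XML.
html = '''<div class="definicja"><a href="javascript: void(0);"
onclick="play('/mp3/1/81/c5ebfe33a08f776931d69857169f0442.mp3')"
class="ikona_sluchaj2"></a> <a href="/slownik/angielsko_polski/,royal+python">ROYAL
PYTHON</a></div>'''

root = ET.fromstring(html)  # root is the <div> element itself

# Positional predicates are 1-based, as in full XPath: a[2] is the second <a> child.
link = root.find("a[2]")

# Collapse the line break inside the link text into a single space.
text = " ".join(link.text.split())
print(text)
```

Unlike the lxml soupparser route, this raises a ParseError on markup that is not well-formed, so it only fits inputs you control.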
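Answer 2 (score: 0)
This assumes the following: (1) the HTML snippet is always valid XHTML, and (2) you are looking for the text inside the second anchor tag of the snippet.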
from xml.dom.minidom import parseString
htmlString = """<pre><div class="definicja"><a href="javascript: void(0);" onclick="play('/mp3/1/81/c5ebfe33a08f776931d69857169f0442.mp3')" class="ikona_sluchaj2"><img src="/images/ikona_sluchaj2.gif" alt=""/></a> <a href="/slownik/angielsko_polski/,royal+python">ROYAL PYTHON</a></div></pre>"""
xmlDoc = parseString(htmlString)
anchorNodes = xmlDoc.getElementsByTagName("a")
secondAnchorNode = anchorNodes[1]
textNode = secondAnchorNode.childNodes[0]
print(textNode.nodeValue)
The xml package is part of Python's standard library, so you don't have to worry about installing anything.
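Taking only childNodes[0] works when the anchor holds exactly one text node. A slightly more defensive sketch, still under the XHTML assumption, gathers all direct text nodes and normalizes whitespace. The helper name anchor_text and the shortened snippet (with href="#" on the first link) are made up for illustration.

```python
from xml.dom.minidom import parseString


def anchor_text(xhtml, index):
    """Return the whitespace-normalized text of the index-th <a> element (0-based)."""
    doc = parseString(xhtml)
    anchor = doc.getElementsByTagName("a")[index]
    # Gather every text node that is a direct child of the anchor;
    # text inside nested elements is deliberately ignored here.
    parts = [n.data for n in anchor.childNodes if n.nodeType == n.TEXT_NODE]
    # split() + join() collapses line breaks and runs of spaces.
    return " ".join("".join(parts).split())


# Shortened, hypothetical variant of the snippet from the question.
snippet = ('<div class="definicja"><a href="#"></a> '
           '<a href="/slownik/angielsko_polski/,royal+python">ROYAL\nPYTHON</a></div>')

result = anchor_text(snippet, 1)
print(result)
```

An anchor with no text at all yields an empty string instead of raising an IndexError.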
Answer 3 (score: 0)
There is also the standard module xml.etree.ElementTree:
import xml.etree.ElementTree as ET
fragment = '''<pre>
<div class="definicja"><a href="javascript: void(0);"
onclick="play('/mp3/1/81/c5ebfe33a08f776931d69857169f0442.mp3')"
class="ikona_sluchaj2"><img src="/images/ikona_sluchaj2.gif" alt=""
/></a> <a href="/slownik/angielsko_polski/,royal+python">ROYAL
PYTHON</a></div>
</pre>'''
frg = ET.fromstring(fragment)
for a in frg.findall('div/a'):
    # the first <a> contains only an <img>, so its .text is None
    if a.text is not None:
        print(a.text)
        print('------')
        print(' '.join(a.text.split()))  # all words on one line
On my console it prints:
ROYAL
PYTHON
------
ROYAL PYTHON
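ElementTree needs well-formed markup. If that cannot be guaranteed, a minimal sketch with the standard html.parser module tolerates sloppier HTML. The class name AnchorTextParser is made up for this example, and the fragment is a shortened variant in which the img tag is deliberately left unclosed.

```python
from html.parser import HTMLParser


class AnchorTextParser(HTMLParser):
    """Collect the text content of each <a> element, in document order."""

    def __init__(self):
        super().__init__()
        self.texts = []       # one entry per <a> element seen so far
        self._inside = False  # True while between <a> and </a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._inside = True
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == "a":
            self._inside = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current entry.
        if self._inside:
            self.texts[-1] += data


# Not well-formed XML: the <img> tag is never closed.
fragment = '''<div class="definicja"><a href="javascript: void(0);"
class="ikona_sluchaj2"><img src="/images/ikona_sluchaj2.gif" alt=""></a> <a href="/slownik/angielsko_polski/,royal+python">ROYAL
PYTHON</a></div>'''

parser = AnchorTextParser()
parser.feed(fragment)
parser.close()

# Drop anchors without text and collapse whitespace in the rest.
names = [" ".join(t.split()) for t in parser.texts if t.strip()]
print(names)
```

This sketch does not handle nested anchors or filter by the surrounding div's class; both would need extra state in the parser.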