I am trying to extract specific text from a website that uses the following HTML:
...
<tr>
<td>
<strong>
Location:
</strong>
</td>
<td colspan="3">
90 km S. of Prince Rupert
</td>
</tr>
...
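I want to extract the text that follows "Location:" (i.e. "90 km S. of Prince Rupert"). There are a lot of similar pages, and I want to loop over them and grab the text after "Location:" on each one.
I am new to Python and have not been able to find a way to extract text based on a condition like this.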
Answer 0 (score: 2)
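My understanding is that BS (BeautifulSoup) does not handle malformed HTML as well as lxml does. I may be wrong about that, but I usually use lxml for this kind of problem. Below is some code you can play with to get a better feel for working with elements. There are many ways to do this.
In my opinion, the best place to get lxml is here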
from lxml import html
ms = '''<tr>
<td>
<strong>
Location:
</strong>
</td>
<td colspan="3">
90 km S. of Prince Rupert
</td>
<mytag>
Hello World
</mytag>
</tr>'''
mytree = html.fromstring(ms)  # this creates a 'tree' in memory
for e in mytree.iter():  # iterate through the elements
    if e.tag == 'td':  # focus on the elements that are td elements
        if 'location' in e.text_content().lower():  # if location is in the text of a td
            for sib in e.itersiblings():  # find all the siblings of the td
                sib.text_content()  # print the text (echoed in the interactive interpreter)
'\n90 km S. of Prince Rupert\n'
There is a lot to learn here, but lxml is very introspective:
>>> help(e.itersiblings)
Help on built-in function itersiblings:

itersiblings(...)
    itersiblings(self, tag=None, preceding=False)

    Iterate over the following or preceding siblings of this element.

    The direction is determined by the 'preceding' keyword which
    defaults to False, i.e. forward iteration over the following
    siblings.  When True, the iterator yields the preceding
    siblings in reverse document order, i.e. starting right before
    the current element and going left.  The generated elements
    can be restricted to a specific tag name with the 'tag'
    keyword.
Note: I changed the string slightly and added mytag, so have a look at the new code, which is based on the help for itersiblings:
for e in mytree.iter():
    if e.tag == 'td':
        if 'location' in e.text_content().lower():
            for sib in e.itersiblings(tag='mytag'):  # only siblings named mytag
                sib.text_content()
'\nHello World\n'
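Since the question asks about looping over many similar pages, the same idea can be wrapped in a small function. What follows is only a sketch, not part of the original answer: the name `extract_location`, the `sample` string and the `pages` list are made up for illustration, and it assumes each page keeps the label and the value in adjacent `td` cells as in the snippet above. It uses lxml's `xpath()` rather than `itersiblings()`, and unlike the loop above the match on "Location:" is case-sensitive.

```python
from lxml import html


def extract_location(page_source):
    """Return the text of the cell after the 'Location:' label, or None."""
    tree = html.fromstring(page_source)
    # Find a <td> whose text contains "Location:", then take the first
    # <td> sibling that follows it.
    cells = tree.xpath(
        '//td[contains(normalize-space(.), "Location:")]'
        '/following-sibling::td[1]'
    )
    if not cells:
        return None  # this page has no "Location:" row
    return cells[0].text_content().strip()


# Hypothetical stand-ins for pages that would normally be downloaded.
sample = '''<table><tr>
<td><strong>Location:</strong></td>
<td colspan="3">
90 km S. of Prince Rupert
</td>
</tr></table>'''

pages = [sample, '<table><tr><td>No label here</td></tr></table>']
results = [extract_location(p) for p in pages]
print(results)
```

Returning `None` for pages without the label lets the calling loop decide whether to skip or log them instead of stopping on the first page that does not match.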