Extracting text between single quotes with Beautiful Soup findAll

Time: 2015-12-19 15:54:11

Tags: python python-2.7 beautifulsoup html-parsing findall

I am using Beautiful Soup and would like to extract the text between single quotes ('...') using the findAll method.

content = urllib.urlopen(address).read()
soup = BeautifulSoup(content, from_encoding='utf-8')
soup.prettify()
x = soup.findAll(do not know what to write)

Here is an excerpt of the soup as an example:

<td class="leftCell identityColumn snap" onclick="fundview('Schroder
European Special Situations');" title="Schroder European Special
Situations"> <a class="coreExpandArrow" href="javascript:
void(0);"></a> <span class="sigill"><a class="qtpop"
href="/vips/ska/all/sv/quicktake/redirect?perfid=0P0000XZZ3&amp;flik=Chosen">
<img
src="/vips/Content/corestyles/4pSigillGubbe.gif"/></a></span>
<span class="bluetext" style="white-space: nowrap; overflow:
hidden;">Schroder European Spe..</span>

I would like the result of soup.findAll(do not know what to write) to be: Schroder European Special Situations, and the findAll logic should be based on the text being between single quotes.

1 Answer:

Answer 0 (score: 4):

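Locate the td element and get the value of its onclick attribute; at that point BeautifulSoup's job is done. The next step is to extract the desired text from the attribute value, for which a regular expression works well. Implementation: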

import re

onclick = soup.select_one("td.identityColumn[onclick]")["onclick"]

match = re.search(r"fundview\('(.*?)'\);", onclick)
if match:
    print(match.group(1))
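
One caveat, offered as a hedged sketch: in the excerpt from the question the onclick value is wrapped across two lines. If that line break really is present in the page (it may only be an artifact of how the HTML was pasted), `.` will not match it unless `re.DOTALL` is passed. The helper name `extract_fund_name` and the `sample` string below are made up for illustration and are not part of BeautifulSoup.

```python
import re

# Hypothetical helper: pulls the text between the single quotes of a
# fundview('...') call out of an onclick attribute value.
# re.DOTALL lets "." match a line break inside the quoted text.
FUNDVIEW_RE = re.compile(r"fundview\('(.*?)'\);", re.DOTALL)

def extract_fund_name(onclick):
    match = FUNDVIEW_RE.search(onclick)
    if match is None:
        return None
    # Collapse any run of whitespace (including a line break) into one space.
    return " ".join(match.group(1).split())

# Sample value modelled on the question's excerpt, including the line break.
sample = "fundview('Schroder\nEuropean Special Situations');"
print(extract_fund_name(sample))
```

With the `soup` object from the answer, the helper would be called on the same `["onclick"]` value that the snippet above passes to `re.search`.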

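Alternatively, it looks like the span with class bluetext contains the desired text, although in the excerpt above that text is truncated ("Schroder European Spe.."):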

soup.select_one("td.identityColumn span.bluetext").get_text()
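
The td in the question's excerpt also carries the full name in its title attribute. Below is a minimal sketch of reading that attribute, assuming every such cell has a title that matches the fund name. The trimmed-down `html` string and the variable names are illustrative only, and `select_one` requires a reasonably recent bs4 release.

```python
from bs4 import BeautifulSoup

# Trimmed-down version of the excerpt from the question (illustrative only).
html = """
<table><tr>
<td class="leftCell identityColumn snap"
    onclick="fundview('Schroder European Special Situations');"
    title="Schroder European Special Situations">
  <span class="bluetext">Schroder European Spe..</span>
</td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# Select the td by class, requiring a title attribute, then read the attribute.
td = soup.select_one("td.identityColumn[title]")
full_name = td["title"]

# For comparison: the visible span text, which is shortened in this excerpt.
short_name = td.select_one("span.bluetext").get_text()

print(full_name)
print(short_name)
```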

Also, make sure you are using BeautifulSoup version 4 (the bs4 package) and that your import statement is:

from bs4 import BeautifulSoup