Question

I'm trying to scrape some scripts from a TV show. I can get the text I need using BeautifulSoup and Requests.

import requests
from bs4 import BeautifulSoup

r = requests.get('http://www.example.com')
s = BeautifulSoup(r.text, 'html.parser')

# Print the text of every paragraph on the page
for p in s.find_all('p'):
    print(p.text)

So far this works well. But I only want the paragraphs spoken by a certain character. Say his name is "stackoverflow". The text looks like this:

A: sdasd sd asda B: sdasds STACKOVERFLOW: help?

So I only want what STACKOVERFLOW says, not the rest.

I tried

s.find_all(text='STACKOVERFLOW')

but I get nothing.

What is the right way to do this? A hint in the right direction would be most welcome.

Answer 1

To match on part of the text, use either:

s.find_all(text=lambda text: text and 'STACKOVERFLOW' in text)

or:

import re

s.find_all(text=re.compile('STACKOVERFLOW'))
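For illustration, here is a minimal sketch of why the exact match comes back empty while the partial match works. It assumes each line of dialogue sits in its own `<p>` tag; the sample HTML is invented and the real page may be structured differently. A plain string has to equal the whole text node, whereas a compiled regex is searched within it. The sketch uses `string=`, which is the name Beautiful Soup 4.4.0 and later give to the `text=` argument. The results are text nodes, so `.parent` is used to get back to the enclosing tag.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real page structure may differ.
html = """
<p>A: sdasd sd asda</p>
<p>B: sdasds</p>
<p>STACKOVERFLOW: help?</p>
"""
s = BeautifulSoup(html, 'html.parser')

# A plain string must equal the entire text node, so nothing matches.
exact = s.find_all(string='STACKOVERFLOW')

# A compiled regex is searched within each text node.
partial = s.find_all(string=re.compile('STACKOVERFLOW'))

# Each match is a NavigableString; .parent is the enclosing <p> tag.
lines = [node.parent.get_text() for node in partial]

print(len(exact))
print(lines)
```

If only the spoken words are wanted, the speaker prefix can be stripped afterwards, for example with `line.split(':', 1)[1].strip()`.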

Answer 2

You can pass a custom function to find_all. The function should accept one argument (a tag) and return True for tags that meet your criteria.

def so_tags(tag):
    '''returns True if the tag has text and 'STACKOVERFLOW' is in the text'''
    return (tag.text and "STACKOVERFLOW" in tag.text)

# pass the function itself; find_all calls it once per tag
soup.find_all(so_tags)
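One thing to watch for with this approach: `tag.text` includes the text of all descendants, so a filter that only checks the text also matches every ancestor of the paragraph (such as `<html>` and `<body>`). The sketch below shows this on an invented snippet; `so_paragraphs` is a hypothetical variant, not part of the answer above, that also checks the tag name.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration only.
html = "<html><body><p>A: hi</p><p>STACKOVERFLOW: help?</p></body></html>"
soup = BeautifulSoup(html, 'html.parser')

def so_tags(tag):
    """True for any tag whose text (including descendants) contains the name."""
    return bool(tag.text and "STACKOVERFLOW" in tag.text)

def so_paragraphs(tag):
    """Hypothetical stricter filter: only <p> tags containing the name."""
    return tag.name == 'p' and "STACKOVERFLOW" in tag.text

# The text-only filter also returns the enclosing <html> and <body>.
all_names = [t.name for t in soup.find_all(so_tags)]

# Restricting by tag name keeps just the paragraph.
p_texts = [t.get_text() for t in soup.find_all(so_paragraphs)]

print(all_names)
print(p_texts)
```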

You can also create a function factory to make it more dynamic.

def user_paragraphs(user):
    '''returns a function'''
    def user_tags(tag):
        '''returns True for tags that have <user> in the text'''
        return (tag.text and user in tag.text)
    return user_tags

for user in user_list:
    user_posts = soup.find_all(user_paragraphs(user))
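As written, the loop above overwrites `user_posts` on each pass, and `user_list` is not defined in the answer. A possible way to keep the results per speaker is to collect them in a dict, sketched here with an invented `user_list` and sample markup (both are assumptions for illustration).

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; one <p> per line of dialogue is assumed.
html = "<p>A: sdasd</p><p>B: sdasds</p><p>STACKOVERFLOW: help?</p>"
soup = BeautifulSoup(html, 'html.parser')

def user_paragraphs(user):
    """Return a filter function bound to one speaker name."""
    def user_tags(tag):
        # Substring test, as in the answer above
        return bool(tag.text and user in tag.text)
    return user_tags

# Hypothetical list of speakers to extract.
user_list = ['B', 'STACKOVERFLOW']

# Map each speaker to the text of the tags that mention them.
posts_by_user = {
    user: [t.get_text() for t in soup.find_all(user_paragraphs(user))]
    for user in user_list
}

print(posts_by_user)
```

Because the test is a plain substring check, a short name such as `A` would also match `STACKOVERFLOW`; checking for `user + ':'` at the start of the text is one way to tighten it.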

BeautifulSoup - get the text inside a tag only if a certain string is found
