Question

I am using the following function to extract the title and content from a URL:

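import urllib.request
from bs4 import BeautifulSoup

def extract_title_text(url):
    # Download the page and decode it as UTF-8
    page = urllib.request.urlopen(url).read().decode('utf8')
    soup = BeautifulSoup(page, 'lxml')
    # Join the text of every <p> element into a single string
    text = ' '.join(map(lambda p: p.text, soup.find_all('p')))
    return soup.title.text, text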

URL = 'https://www.bbc.co.uk/news/business-45482461'
titletext, text = extract_title_text(URL)

I would like to ignore the content of span class="off-screen" when extracting the text. Could I get some pointers on how to set up the filter, please?

Answer 1

A very simple solution is to filter out the tags, i.e.:

text = ' '.join(p.text for p in soup.find_all('p') if "off-screen" not in p.get("class", []))

For a more general solution, soup.find_all() (as well as soup.find()) can take a function as an argument, so you can also do this:

def is_content_para(tag):
    return tag.name == "p" and "off-screen" not in tag.get("class", [])

text = ' '.join(p.text for p in soup.find_all(is_content_para))
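
If you need the same kind of filter for other tags or classes, the predicate can be built by a small factory. This is only a sketch: tag_without_class and SAMPLE_HTML are names made up for illustration (neither is part of Beautiful Soup), and the markup is invented rather than taken from the BBC page. It uses the built-in "html.parser" so it runs without lxml.

```python
from bs4 import BeautifulSoup

# Invented sample markup, used only to exercise the filter.
SAMPLE_HTML = """
<html><head><title>Demo</title></head><body>
<p>First paragraph.</p>
<p class="off-screen">Hidden paragraph.</p>
<p class="intro lead">Second paragraph.</p>
</body></html>
"""

def tag_without_class(name, excluded_class):
    """Build a find_all() predicate (hypothetical helper, not a bs4 API)."""
    def predicate(tag):
        # tag.get("class") is a list of class names; fall back to [] when
        # the tag has no class attribute at all.
        return tag.name == name and excluded_class not in tag.get("class", [])
    return predicate

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
kept = [p.text for p in soup.find_all(tag_without_class("p", "off-screen"))]
print(kept)
```

Because the check is done against the list of class names, a tag that carries "off-screen" alongside other classes is excluded as well.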

Answer 2

As far as I can tell, there are no p elements with that class, but you can filter on it in your search anyway:

soup.find_all(name='p',attrs={'class': lambda x: x != 'off-screen'})
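
To see what that filter keeps, here is a small sketch run against invented markup (SAMPLE_HTML is an assumption for illustration, not the BBC page). The callable is handed the attribute value, and a tag with no class attribute passes None, which also satisfies the != test.

```python
from bs4 import BeautifulSoup

# Invented sample markup, used only to exercise the filter.
SAMPLE_HTML = """
<p>No class here.</p>
<p class="off-screen">Hidden paragraph.</p>
<p class="story">Story paragraph.</p>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Keep <p> tags whose class value is anything other than "off-screen".
matches = soup.find_all(name="p", attrs={"class": lambda value: value != "off-screen"})
matched_text = [p.text for p in matches]
print(matched_text)
```

One caveat, as far as I understand how Beautiful Soup handles multi-valued attributes: the callable is tried against each class name in turn, so a tag such as class="off-screen extra" may still match through its other class. Checking the whole class list inside a function filter avoids that.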

The docs explain the various search options in detail.
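
If the off-screen text sits in span elements nested inside the paragraphs, as the question describes, filtering the p tags themselves will not drop it. One possible alternative, which is a different technique from the filters above, is to remove those spans from the tree before collecting the text. A minimal sketch, assuming invented markup: extract_visible_text and SAMPLE_HTML are hypothetical names, and "html.parser" is used so the example does not need lxml.

```python
from bs4 import BeautifulSoup

# Invented sample markup with a hidden span nested inside a paragraph.
SAMPLE_HTML = """
<html><head><title>Demo</title></head><body>
<p>Share this <span class="off-screen">with Facebook</span></p>
<p>Visible paragraph.</p>
</body></html>
"""

def extract_visible_text(html, hidden_class="off-screen"):
    """Return paragraph text with spans of hidden_class removed (illustrative helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # class_ matches any span whose class list contains hidden_class.
    for span in soup.find_all("span", class_=hidden_class):
        span.decompose()  # delete the span and its contents from the tree
    # strip=True trims the whitespace left behind where a span was removed.
    return " ".join(p.get_text(strip=True) for p in soup.find_all("p"))

visible_text = extract_visible_text(SAMPLE_HTML)
print(visible_text)
```

Note that decompose() modifies the parsed tree in place, so read soup.title or anything else you need from the hidden spans before removing them.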

How to omit a specific class when extracting text from a URL with Python
