Question

In my text_scraper(page_soup) function, I noticed that the end of the output contains extraneous information that has nothing to do with my article. What is a general way to get rid of the irrelevant information?

text_scraper(page_soup)

Answer 1

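If you only need the text that belongs to the article, you can narrow the search inside your text_scraper method and scrape only the <p> tags that sit inside the <section> tag: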

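def text_scraper(page_soup):
    text_body = ''
    # Find only the section that holds the article body.
    article_section = page_soup.find('section', {'class': 'body'})
    if article_section is None:
        # No matching section on this page, so there is nothing to extract.
        return text_body
    # Collect the text of each <p>, skipping any paragraph that directly
    # follows an <em> tag, since that text is not part of the article.
    for p in article_section.find_all('p'):
        # Compare strings with != rather than "is not".
        if p.previous_sibling and p.previous_sibling.name != 'em':
            text_body += p.text
    return text_body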

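The article is then returned without the text in the footer (the blurb that begins "Heather Knight is a columnist" and ends "and their struggles.").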

Edit: added a test on the parent to avoid the last part, "San Francisco Chronicle [...] Twitter: @hknightsf".
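As a rough illustration of the same idea, here is a minimal, self-contained sketch that can be run without the real page. The sample HTML and the helper name article_text are made up for this example and are not taken from the actual site. It is a variant of the function above, not a copy: it uses find_previous_sibling(), which looks only at tags, so whitespace between elements does not affect the check, and it keeps a first paragraph that has no previous sibling.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real article page.
SAMPLE_HTML = """
<html><body>
<section class="body">
  <p>First paragraph of the article.</p>
  <p>Second paragraph of the article.</p>
  <em>About the author</em>
  <p>Footer blurb about the columnist.</p>
</section>
<footer><p>Site footer text.</p></footer>
</body></html>
"""


def article_text(page_soup):
    """Return the article paragraphs, one per line (illustrative helper)."""
    section = page_soup.find('section', {'class': 'body'})
    if section is None:
        return ''
    paragraphs = []
    for p in section.find_all('p'):
        # Nearest previous sibling that is a tag (text nodes are ignored).
        prev_tag = p.find_previous_sibling()
        if prev_tag is not None and prev_tag.name == 'em':
            # Assumption: a paragraph right after <em> is author or footer text.
            continue
        paragraphs.append(p.get_text(strip=True))
    return '\n'.join(paragraphs)


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
result = article_text(soup)
print(result)
```

Whether the <em> marker is the right signal depends on the markup of the page being scraped, so the condition would likely need adjusting for other sites.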

How to extract only certain parts of the article body?
