How can I isolate the main text of a web page in Python?

Time: 2014-11-15 19:12:15

Tags: python web beautifulsoup

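I have to save the text from a web page and pass it to another function that summarizes it. The problem is that my summaries end up containing odd text from other parts of the page, such as advertisements. I use BeautifulSoup to extract the text. Here is the text-extraction code: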

import urllib.request

from bs4 import BeautifulSoup


def web_crawler():
    userinput = input("Enter a valid Web Page URL: ")
    # TODO: add exception handling here for when no internet connection is available
    html = urllib.request.urlopen(userinput).read()
    soup = BeautifulSoup(html.decode('utf8'), 'html.parser')
    # Remove JavaScript, CSS, and links before extracting the text
    for tag in soup(['script', 'style', 'a']):
        tag.extract()
    # str.strip("<title>") strips a set of characters, not the tag itself,
    # so read the title text directly instead
    title = soup.title.get_text(strip=True) if soup.title else ""
    htmlText = soup.get_text()
    htmlText = ' '.join(htmlText.split())   # collapse unnecessary whitespace
    # Save a text file to use in the memory-friendly version
    with open("textFile.txt", mode="w", encoding="utf8") as textFile:
        textFile.write(htmlText)
    # For now, return the title query and the article text
    return (title, htmlText)

For example, I want to summarize the text content of this web page:

enter image description here

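When I summarize the text, I also pick up text from the ads and features in the sidebar. Is there a way to scrape only the main body text from the web page?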
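
To make the goal concrete, here is a rough sketch of one possible heuristic, not a tested solution: drop tags that rarely hold article text, then keep the block whose direct <p> children contain the most text. The function name `extract_main_text`, the `NOISE_TAGS` list, and the scoring rule are all assumptions made for illustration; real pages vary a lot, and this will miss articles that wrap each paragraph in its own container.

```python
from bs4 import BeautifulSoup

# Tags assumed to rarely hold article text (hypothetical list; tune per site).
NOISE_TAGS = ["script", "style", "nav", "header", "footer", "aside", "form"]
# Containers considered as candidates for the main content block.
CANDIDATE_TAGS = ["article", "main", "section", "div", "body"]


def extract_main_text(html):
    """Return the paragraph text of the block that scores highest."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(NOISE_TAGS):
        tag.extract()

    best_block, best_score = None, 0
    for block in soup.find_all(CANDIDATE_TAGS):
        # Score by direct <p> children only, so a wrapper around the
        # whole page does not automatically win.
        paragraphs = block.find_all("p", recursive=False)
        score = sum(len(p.get_text(strip=True)) for p in paragraphs)
        if score > best_score:
            best_block, best_score = block, score

    if best_block is None:
        return ""
    paragraphs = best_block.find_all("p", recursive=False)
    # Collapse whitespace inside each paragraph, then join them.
    return " ".join(" ".join(p.get_text().split()) for p in paragraphs)
```

In `web_crawler` above, this could stand in for the `soup.get_text()` step by passing the downloaded HTML to `extract_main_text`. Whether the result is good enough for summarizing depends entirely on how the target page is structured.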

0 answers:

No answers