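I need to save the text from a web page and then summarize it with another function. The problem is that my summaries end up containing odd text from ads and other miscellaneous parts of the page. I am using BeautifulSoup to extract the text. Here is the text extraction code: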
import urllib.request

from bs4 import BeautifulSoup


def web_crawler():
    userinput = input("Enter a valid Web Page URL: ")
    # TODO: add exception handling here for when no internet connection is available
    # Python 3 moved urlopen to urllib.request (urllib.urlopen is Python 2 only)
    html = urllib.request.urlopen(userinput).read()
    soup = BeautifulSoup(html.decode("utf8"), "html.parser")
    for s in soup("script"):  # remove JavaScript
        s.extract()
    for s in soup("style"):  # remove CSS
        s.extract()
    for s in soup("a"):  # remove links
        s.extract()
    # str.strip("<title>") strips a set of characters, not a prefix,
    # so read the title text from the tag instead
    title = soup.title.get_text(strip=True) if soup.title else ""
    htmlText = soup.get_text()
    htmlText = " ".join(htmlText.split())  # remove unnecessary whitespace
    # save a text file to use in the memory-friendly version
    with open("textFile.txt", mode="w", encoding="utf8") as textFile:
        textFile.write(htmlText)
    # for now, return the title and the article text
    return (title, htmlText)
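For example, I want to summarize the text content of this web page:
When I summarize the text, I get text from the ads and the features in the sidebar. Is there a way to scrape only the main body text from the page?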
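One way to approach this, as a rough sketch rather than a definitive answer: keep only the text inside <p> elements, skip containers that usually hold boilerplate, and drop very short paragraphs. The sketch below uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra dependencies. The names ParagraphTextExtractor, extract_main_text, SKIP_TAGS, and the min_words threshold are all hypothetical, and the assumption that sidebars and ads live in <nav>, <aside>, <footer>, <header>, or <form> will not hold for every site.

```python
from html.parser import HTMLParser

# Assumption: these containers usually hold boilerplate, not article text.
SKIP_TAGS = {"script", "style", "nav", "aside", "footer", "header", "form"}


class ParagraphTextExtractor(HTMLParser):
    """Collect text found inside <p> elements, ignoring SKIP_TAGS containers."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0    # > 0 while inside any SKIP_TAGS element
        self.in_p = False      # True while inside a <p> element
        self.paragraphs = []   # finished paragraphs, whitespace-normalized
        self._chunks = []      # text pieces of the paragraph being read

    def flush(self):
        """Store the paragraph read so far, if it contains any text."""
        text = " ".join("".join(self._chunks).split())
        if text:
            self.paragraphs.append(text)
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "p" and self.skip_depth == 0:
            self.flush()       # a new <p> implicitly closes an unclosed one
            self.in_p = True

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == "p" and self.in_p:
            self.flush()
            self.in_p = False

    def handle_data(self, data):
        if self.in_p and self.skip_depth == 0:
            self._chunks.append(data)


def extract_main_text(html, min_words=5):
    """Return paragraph text, one paragraph per line.

    Paragraphs shorter than min_words are dropped, on the assumption that
    they are captions, bylines, or ad labels rather than body text.
    """
    parser = ParagraphTextExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()             # keep a trailing paragraph with no </p>
    return "\n".join(p for p in parser.paragraphs if len(p.split()) >= min_words)
```

In web_crawler, this would be called on the decoded page, for example extract_main_text(html.decode("utf8")), in place of soup.get_text(). The same filtering idea can be expressed with BeautifulSoup by iterating over soup.find_all("p"). Note that link text inside a paragraph is kept here, since removing every <a> element also removes words that belong to the article's sentences.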