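I am trying to extract a single line of text from an HTML file directly into a variable, but despite hours of searching I have not found a way to do it. Beautiful Soup looks helpful. How can I simply specify the string I want as input and then pull it out of the HTML source into a variable?
I have been trying to scrape the whole page with request.text and Beautiful Soup, but there does not seem to be a function that does this directly.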
from urllib.request import urlopen
from bs4 import BeautifulSoup

def extract(url):
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html)
    return [item.text for item in soup.find_all('<DIV ALIGN="justify"')]
<HMTL>
<HEAD>
<TITLE>webpage1</TITLE>
</HEAD>
<BODY BGCOLOR="FFFFFf" LINK="006666" ALINK="8B4513" VLINK="006666">
<TABLE WIDTH="75%" ALIGN="center">
<TR>
<TD>
<DIV ALIGN="center"><H1>STARTING . . . </H1></DIV>
<DIV ALIGN="justify"><P>There are lots of ways to create web pages using already coded programmes. These lessons will teach you how to use the underlying HyperText Markup Language - HTML.
<BR>
<P>HTML isn't computer code, but is a language that uses US English to enable texts (words, images, sounds) to be inserted and formatting such as colo(u)r and centre/ering to be written in. The process is fairly simple; the main difficulties often lie in small mistakes - if you slip up while word processing your reader may pick up your typos, but the page will still be legible. However, if your HTML is inaccurate the page may not appear - writing web pages is, at the least, very good practice for proof reading!</P>
When I run it, I would like it to return the string:
<P>There are lots of ways to create web pages
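
For reference, here is a minimal sketch of one way this might be done, not a definitive answer. It assumes the target is the first piece of text in the first `<P>` inside the `<DIV ALIGN="justify">` block, and that Python's built-in `html.parser` backend is acceptable. The function names `fetch_html` and `first_paragraph_text` are made up for illustration. Two things differ from the attempt above: `find` takes a tag name plus attribute filters rather than raw markup, and since the import is `from urllib.request import urlopen`, the call is `urlopen(url)` rather than `urllib.urlopen(url)`.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_html(url):
    """Download the raw bytes of a page (hypothetical helper)."""
    with urlopen(url) as response:
        return response.read()


def first_paragraph_text(html):
    """Return the first text run of the first <p> inside <div align="justify">.

    Returns None when no matching div or paragraph is found.
    """
    # html.parser lowercases tag and attribute names, so match in lowercase.
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", attrs={"align": "justify"})
    if div is None:
        return None
    paragraph = div.find("p")
    if paragraph is None:
        return None
    # The sample page never closes its first <P>, so the parser may nest
    # later paragraphs inside it. Taking only the first text node avoids
    # returning the text of those nested paragraphs as well.
    text = paragraph.find(string=True)
    return text.strip() if text is not None else None
```

With a real page this would be called as `first_paragraph_text(fetch_html(url))`. If only the opening words are wanted rather than the whole first run of text, the returned string can be sliced or split afterwards.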