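I am trying to scrape the text from here so that I can feed it straight into an Excel worksheet instead of copying and pasting. The site uses HTML to carry information about the original typeface. Here is an example of how one line of text is encoded on the page: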
<div class="line">
<span class="milestone_wrap"> </span>
<a id="tln-2212" href="index.html#tln-2212" class="milestone tln invisible" title="TLN: 2212">2212</a>
<span class="milestone_wrap">When </span>
<span class="typeform" data-setting="ſ">s</span>
<span class="milestone_wrap">uch ill dealing mu</span>
<span class="ligature" data-precomposed="ſt">
<span class="typeform" data-setting="ſ">s</span>
<span class="milestone_wrap">t</span>
</span>
<span class="milestone_wrap"> be </span>
<span class="typeform" data-setting="ſ">s</span>
<span class="milestone_wrap">eene in thought. </span>
<span class="sd exit">
<span class="space" style="padding-right:1em;" xml:space="preserve"></span>
<i>Exit</i>
<span class="milestone_wrap">.</span>
</span>
</div>
I tried using the find_all method:
import requests
from bs4 import BeautifulSoup as bs
url = 'https://internetshakespeare.uvic.ca/doc/R3_F1/scene/3.6/index.html'
page = requests.get(url)
text = bs(page.text, 'html.parser')
divs = text.find_all('div', class_="line")
for div in divs:
    for item in div.contents:
        print(item)
This is what I get:
When
<span class="typeform" data-setting="ſ">s</span>
uch ill dealing mu
<span class="ligature" data-precomposed="ſt"><span class="typeform" data-setting="ſ">s</span>t</span>
be
<span class="typeform" data-setting="ſ">s</span>
eene in thought.
<span class="sd exit"><span class="space" style="padding-right:1em;" xml:space="preserve"> </span><i>Exit</i>.</span>
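Everything wrapped in <span class="milestone_wrap"> comes back without that tag, so when I call .find_all on 'span' those strings are missing and all I have left is scattered letters. Is there a reason the class does not show up?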
Answer 0 (score: 0)
If you run the code with one small adjustment (the requests module has to be imported), you should get the content of the site.
from bs4 import BeautifulSoup as bs
import requests
url = 'https://internetshakespeare.uvic.ca/doc/R3_F1/scene/3.6/index.html'
page = requests.get(url)
text = bs(page.text, 'html.parser')
divs = text.find_all('div', class_="line")
for div in divs:
    for item in div.contents:
        print(item)
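The text can be found inside the <span class="milestone_wrap"> tags. You can confirm this with your browser's inspector. The text is delivered in small pieces, tag by tag, for example "When". You should be able to extract the text.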
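As a minimal sketch of that extraction: in the output printed in the question, the words show up as bare text nodes directly inside the div rather than as spans, which would explain why find_all on 'span' only returns single letters. The SAMPLE string below is an assumption, rebuilt from that printed output instead of being fetched from the site, and the variable names are only illustrative.

```python
from bs4 import BeautifulSoup, NavigableString

# Assumed sample, modelled on the output printed in the question: the words
# are bare text nodes and only the special letters sit inside <span> tags.
SAMPLE = (
    '<div class="line">'
    '<a id="tln-2212" href="index.html#tln-2212" '
    'class="milestone tln invisible" title="TLN: 2212">2212</a>'
    'When <span class="typeform" data-setting="ſ">s</span>uch ill dealing mu'
    '<span class="ligature" data-precomposed="ſt">'
    '<span class="typeform" data-setting="ſ">s</span>t</span>'
    ' be <span class="typeform" data-setting="ſ">s</span>eene in thought. '
    '<span class="sd exit">'
    '<span class="space" style="padding-right:1em;" xml:space="preserve"> </span>'
    '<i>Exit</i>.</span>'
    '</div>'
)

soup = BeautifulSoup(SAMPLE, 'html.parser')
line = soup.find('div', class_='line')

# Direct children that are plain strings; find_all('span') never returns these.
bare_strings = [c for c in line.contents if isinstance(c, NavigableString)]

# Text of the direct child spans only: single letters plus the stage direction.
span_texts = [s.get_text() for s in line.find_all('span', recursive=False)]

# get_text() walks every descendant, so it joins text nodes and tag contents.
full_text = line.get_text()

print(bare_strings)
print(span_texts)
print(' '.join(full_text.split()))  # collapse the extra whitespace
```

The last line still starts with the line number from the a tag, because get_text() includes every descendant of the div.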
Answer 1 (score: 0)
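Work at the level of the line class, but decompose the a tag to remove the line numbers (unless you really want them, in which case I would add a space between the line number and the text that follows).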
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://internetshakespeare.uvic.ca/doc/R3_F1/scene/3.6/index.html')
soup = bs(r.content, 'lxml')
for line in soup.select('.line'):
    # Remove the line-number anchor so the number does not run into the text.
    line.select_one('a').decompose()
    print(line.text)
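Since the goal in the question is an Excel sheet, here is a rough sketch that keeps the line number in its own column instead of throwing it away. It is only an illustration under some assumptions: line_rows, to_csv and SAMPLE are hypothetical names, the line-number anchor is assumed to carry the tln class as in the question's sample, 'html.parser' stands in for 'lxml' to avoid the extra dependency, and CSV is used as a simple format that Excel can open.

```python
import csv
import io

from bs4 import BeautifulSoup

# Assumed two-line sample; the second line has no line-number anchor.
SAMPLE = (
    '<div class="line">'
    '<a class="milestone tln invisible" title="TLN: 2212">2212</a>'
    'When <span class="typeform" data-setting="ſ">s</span>uch ill</div>'
    '<div class="line">No number here</div>'
)


def line_rows(html):
    """Return (line_number, text) pairs, one per div with class "line"."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for line in soup.find_all('div', class_='line'):
        number = ''
        anchor = line.find('a', class_='tln')
        if anchor is not None:
            number = anchor.get_text(strip=True)
            anchor.decompose()  # drop the number so it stays out of the text
        # Collapse runs of whitespace left behind by the nested spans.
        text = ' '.join(line.get_text().split())
        rows.append((number, text))
    return rows


def to_csv(rows):
    """Serialise the rows as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(['tln', 'text'])
    writer.writerows(rows)
    return buffer.getvalue()


rows = line_rows(SAMPLE)
print(to_csv(rows))
```

To get a file that Excel opens directly, the same rows could be written with open('lines.csv', 'w', newline='', encoding='utf-8-sig'); the file name is made up here, and the utf-8-sig encoding adds a byte order mark that helps Excel recognise the text as UTF-8.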