I'm trying to scrape lyrics from a website, but I can't figure out how to get rid of the surrounding div tag. Here is my code:
import requests
from bs4 import BeautifulSoup
team_urls = ['http://www.lyricsfreak.com/e/ed+sheeran/shape+of+you_21113143.html/',
             'http://www.lyricsfreak.com/e/ed+sheeran/thinking+out+loud_21083784.html',
             'http://www.lyricsfreak.com/e/ed+sheeran/photograph_21058341.html']
for url in team_urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    # Replace every <br> with a newline
    for e in soup.find_all('br'):
        e.replace_with('\n')
    lyrics = soup.find(class_='dn')
    # Printing the Tag object prints its markup, including the <div> itself
    print(lyrics)
This gives me output like:
<div class="dn" id="content_h">The club isn't the best place...
I want to remove the div tag.
Answer 0 (score: 0)
Full code:
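import requests
from bs4 import BeautifulSoup
urls = ['http://www.lyricsfreak.com/e/ed+sheeran/shape+of+you_21113143.html/',
        'http://www.lyricsfreak.com/e/ed+sheeran/thinking+out+loud_21083784.html',
        'http://www.lyricsfreak.com/e/ed+sheeran/photograph_21058341.html']
for url in urls:
    page = requests.get(url)
    # Override the encoding that requests guessed from the HTTP headers
    page.encoding = 'utf-8'
    soup = BeautifulSoup(page.text, 'html.parser')
    # Select the lyrics container by its id
    div = soup.select_one('#content_h')
    # Turn each <br> into a newline so line breaks survive text extraction
    for e in div.find_all('br'):
        e.replace_with('\n')
    # .text returns only the text content, without the surrounding <div> tag
    lyrics = div.text
    print(lyrics)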
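To try the same idea without hitting the network, here is a minimal sketch that runs against an inline HTML string. The markup in SAMPLE_HTML and the helper name extract_lyrics are made up for illustration; the real page's structure may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the lyrics page; the real markup may differ.
SAMPLE_HTML = (
    '<html><body>'
    '<div class="dn" id="content_h">First line<br/>Second line<br>Third line</div>'
    '</body></html>'
)


def extract_lyrics(html):
    """Return the text inside #content_h, with each <br> turned into a newline."""
    soup = BeautifulSoup(html, 'html.parser')
    div = soup.select_one('#content_h')
    if div is None:
        # The element is missing, e.g. the page layout changed
        return None
    for br in div.find_all('br'):
        br.replace_with('\n')
    # get_text() is the method behind the .text property
    return div.get_text()


print(extract_lyrics(SAMPLE_HTML))
```

The None check is an addition: select_one returns None when nothing matches, so the loop in the answer would raise AttributeError on a page that lacks the element.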
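Note that the wrong encoding is sometimes used, which garbles lines such as:
I may be crazy, don't mind me
That is why I set it manually: page.encoding = 'utf-8'
The requests docs have a passage about this situation:
The encoding of the response content is determined based solely on HTTP headers, following RFC 2616 to the letter. If you can take advantage of non-HTTP knowledge to make a better guess at the encoding, you should set r.encoding appropriately before accessing this property.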
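A small sketch of what a wrong encoding guess does to the text, using only the standard library. It assumes the page is really UTF-8 and the guess is ISO-8859-1 (latin-1), which is the RFC 2616 default for text responses that declare no charset; whether that is what happens on this particular site is an assumption. The helper name decode_as is hypothetical.

```python
def decode_as(raw, encoding):
    """Decode raw bytes with the given encoding, as page.text does internally."""
    return raw.decode(encoding)


# UTF-8 bytes for a line containing a typographic apostrophe (U+2019)
raw = 'don\u2019t mind me'.encode('utf-8')

wrong = decode_as(raw, 'latin-1')  # wrong guess: the apostrophe becomes three characters
right = decode_as(raw, 'utf-8')    # correct encoding: the apostrophe is preserved

print(repr(wrong))
print(repr(right))
```

Setting page.encoding = 'utf-8' before reading page.text makes requests take the second path.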
Answer 1 (score: -1)
You can use a regular expression:
import requests
import re
from bs4 import BeautifulSoup
team_urls = ['http://www.lyricsfreak.com/e/ed+sheeran/shape+of+you_21113143.html/',
             'http://www.lyricsfreak.com/e/ed+sheeran/thinking+out+loud_21083784.html',
             'http://www.lyricsfreak.com/e/ed+sheeran/photograph_21058341.html']
# Compile the tag-matching pattern once, outside the loop
cleanr = re.compile('<.*?>')
for url in team_urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    for e in soup.find_all('br'):
        e.replace_with('\n')
    lyrics = soup.find(class_='dn')
    # Apply the regex to the markup (str(lyrics)); lyrics.text already has no tags left to strip
    cleantext = re.sub(cleanr, '', str(lyrics))
    print(cleantext)
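This removes everything between < and >, using the special characters described in the Python docs:
"
. (Dot.) In the default mode, this matches any character except a newline. If the DOTALL flag has been specified, this matches any character including a newline.
* Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible. ab* will match 'a', 'ab', or 'a' followed by any number of 'b's.
? Causes the resulting RE to match 0 or 1 repetitions of the preceding RE. ab? will match either 'a' or 'ab'.
"
Note that in <.*?> the ? directly follows *, so it acts as the non-greedy qualifier: .*? matches as few characters as possible, which is what stops each match at the first >.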
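To see why the pattern needs the ? after the *, here is a short sketch comparing the greedy and non-greedy forms on a made-up HTML string (the sample markup is an assumption, not taken from the site).

```python
import re

# Hypothetical markup used only to illustrate the two patterns
html = '<div class="dn" id="content_h">The club<b>!</b></div>'

# Greedy: .* runs from the first '<' to the LAST '>', swallowing the text too
greedy = re.sub(r'<.*>', '', html)

# Non-greedy: .*? stops at the first '>' after each '<', so only the tags go
lazy = re.sub(r'<.*?>', '', html)

print(repr(greedy))
print(repr(lazy))
```

Because . does not match a newline by default, a tag that spans several lines would be left in place unless the pattern is compiled with re.DOTALL.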