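I'm writing a program that looks up song lyrics. It is almost finished, but I'm having trouble with the bs4 data types. My question is: how do I extract plain text from the lyric variable in the last lines of the code?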
import re
import requests
import bs4
from urllib import unquote  # Python 2 location; in Python 3 this is urllib.parse.unquote

def getLink(fileName):
    webFileName = unquote(fileName)
    page = requests.get("http://songmeanings.com/query/?query="+str(webFileName)+"&type=songtitles")
    match = re.search(r'songmeanings\.com\/[^image].*?\/"',page.content)
    if match:
        Mached = str("http://"+match.group())
        return(Mached[:-1:]) # this slice removes the " at the end of the match
    else:
        return(1)

def getText(link):
    page = requests.get(str(link))
    soup = bs4.BeautifulSoup(page.content ,"lxml")
    return(soup)

Soup = getText(getLink("paranoid android"))
lyric = Soup.findAll(attrs={"lyric-box"})
print (lyric)
This is the output:
[
\n\t\t\t\t\t\t\t\tPlease could you stop the noise, I'm trying to get some rest
\nFrom all the unborn chicken voices in my head
\nWhat's that?
\nWhat's that?
\n\n\nWhen I am king, you will be first against the wall
\nWith your opinion which is of no consequence at all
\nWhat's that?
\nWhat's that?
...
]
Answer 0 (score: 0)
First trim the leading and trailing [] with stringvar[1:-1], then call linevar.strip() on each line, which removes the leading and trailing whitespace.
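As a rough illustration of that suggestion, the sketch below applies it to a made-up string. Here `raw` is a hypothetical stand-in for the printed list, not real scraper output, and `clean_lines` is a helper name invented for this example.

```python
# Hypothetical stand-in for the text of the printed list; not real scraper output.
raw = "[\n\t\t\tfirst line \nsecond line\n\n  third line\t]"

def clean_lines(stringvar):
    """Drop the surrounding brackets, then strip each line."""
    inner = stringvar[1:-1]  # remove the leading '[' and the trailing ']'
    lines = [linevar.strip() for linevar in inner.splitlines()]
    return [line for line in lines if line]  # discard lines that were only whitespace

print("\n".join(clean_lines(raw)))
```

Note that this only tidies whitespace. Any HTML tags inside the string are left in place, so it probably does not solve the original problem on its own.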
Answer 1 (score: 0)
Add the following line of code:
lyric = ''.join([tag.text for tag in lyric])
after
lyric = Soup.findAll(attrs={"lyric-box"})
and you will get output like:
Please could you stop the noise,
I'm trying to get some rest
From all the unborn chicken voices in my head
What's that?
What's that?
When I am king, you will be first against the wall
With your opinion which is of no consequence at all
What's that?
What's that?
...
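To see why this works without hitting the network, here is a small sketch on made-up markup. The HTML string only imitates the shape of a lyrics container and the real page may differ; the "html.parser" backend is used so that lxml is not required.

```python
import bs4

# Made-up markup that imitates a lyrics container; the real page may differ.
html = '<div class="lyric-box">\n\t\tline one<br/>\nline two<br/>\n</div>'

soup = bs4.BeautifulSoup(html, "html.parser")
boxes = soup.find_all(class_="lyric-box")  # a ResultSet of Tag objects, not strings

# .text concatenates every text node inside the tag and leaves the markup out
text = "".join(tag.text for tag in boxes)

# strip() on each line removes the leftover indentation and blank lines
lines = [line.strip() for line in text.splitlines() if line.strip()]
print("\n".join(lines))
```

The key point is that findAll returns tags rather than strings, so printing the result shows the markup. Reading .text (or calling get_text()) on each tag gives the plain text.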
Answer 2 (score: 0)
For anyone who likes this idea, here is what my code ended up looking like, with a few changes :)
import re
import pycurl
import bs4
from urllib import unquote        # Python 2 imports
from StringIO import StringIO

def getLink(fileName):
    fileName = unquote(fileName)
    baseAddres = "https://songmeanings.com/query/?query="
    linkToPage = str(baseAddres)+str(fileName)+str("&type=songtitles")
    # fetch the search results page into an in-memory buffer
    buffer = StringIO()
    page = pycurl.Curl()
    page.setopt(page.URL,linkToPage)
    page.setopt(page.WRITEDATA,buffer)
    page.perform()
    page.close()
    pageSTR = buffer.getvalue()
    soup = bs4.BeautifulSoup(pageSTR,"lxml")
    tab_content = str(soup.find_all(attrs={"tab-content"}))
    pattern = r'\"\/\/songmeanings.com\/.+?\"'
    links = re.findall(pattern,tab_content)
    # return the first matched item without the double quote
    # at the beginning and at the end of the string
    return("http:"+links[0][1:-1:])

def getText(linkToSong):
    # fetch the song page into an in-memory buffer
    buffer = StringIO()
    page = pycurl.Curl()
    page.setopt(page.URL,linkToSong)
    page.setopt(page.WRITEDATA,buffer)
    page.perform()
    page.close()
    pageSTR = buffer.getvalue()
    soup = bs4.BeautifulSoup(pageSTR,"lxml")
    lyric_box = soup.find_all(attrs={"lyric-box"})
    # join the text of every matching tag into one plain string
    lyric_boxSTR = ''.join([tag.text for tag in lyric_box])
    return(lyric_boxSTR)

link = getLink("Anarchy In The U.K")
text = getText(link)
print(text)
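One detail worth a second look: getLink passes the title through unquote, which decodes percent-escapes rather than producing them. If the intent is to make a title with spaces or punctuation safe inside the query string, encoding is the usual direction. A minimal Python 3 sketch, assuming the URL layout from the code above; `build_search_url` is a hypothetical helper name, and whether the site handles the encoded form as expected has not been checked here.

```python
from urllib.parse import quote_plus

BASE_ADDRESS = "https://songmeanings.com/query/?query="  # copied from the code above

def build_search_url(title):
    """Return the search URL with the title percent-encoded (hypothetical helper)."""
    return BASE_ADDRESS + quote_plus(title) + "&type=songtitles"

print(build_search_url("Anarchy In The U.K"))
```

quote_plus replaces spaces with + and escapes characters such as &, so a title cannot accidentally break the query string.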