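I've written a script to parse certain information about songs from a website. The script works fine when I try this link or this one. As far as I can tell, appending a search keyword after https://www.billboard.com/music/ takes me to the page with the information I need.
However, it breaks when I try keywords such as 1 Of The Girls, Al B. Sure! or Ashford & Simpson.
I don't know how to append keywords like these to the base link https://www.billboard.com/music/ so that I land on the page containing the information.
The script I've tried: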
import requests
from bs4 import BeautifulSoup
LINK = "https://www.billboard.com/music/Adele"
res = requests.get(LINK)
soup = BeautifulSoup(res.text,"lxml")
scores = [item.text for item in soup.select("[class$='-history__stats'] > p > span")]
print(scores)
The result I get (as expected):
['4 No. 1 Hits', '6 Top 10 Hits', '13 Songs']
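On that page, these results appear right after the "chart history" heading.
How can I use search keywords to fetch this information from the web page?
Answer 0: (score: 3)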
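I don't know all the use cases, but in the ones you mention the obvious pattern I see is that special characters are stripped (leaving no space behind), the words are lowercased, and spaces are then replaced with "-". The tricky part may be deciding what counts as a special character and how to handle it.
For example: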
https://www.billboard.com/music/ashford-simpson
https://www.billboard.com/music/al-b-sure
https://www.billboard.com/music/1-of-the-girls
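You could first write something that performs those string operations and then test the response code. It may also be worth checking whether the site's JS files contain any kind of validation.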
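The response-code check mentioned above could look roughly like this. It's only a sketch: the helper name `artist_page_exists` and its `session` parameter are made up for illustration, and it assumes the site answers 200 for a real artist page and something else (e.g. 404) for a wrong slug, which I haven't verified.

```python
BASE_URL = "https://www.billboard.com/music/"


def artist_page_exists(slug, session=None, timeout=10):
    """Return True if BASE_URL + slug answers with HTTP 200.

    `session` is a hypothetical hook: anything with a requests-style
    .get(url, timeout=...) method, e.g. requests.Session().
    """
    if session is None:
        # Imported lazily so callers that pass their own session don't need it.
        import requests
        session = requests
    response = session.get(BASE_URL + slug, timeout=timeout)
    return response.status_code == 200
```

Calling `artist_page_exists("ashford-simpson")` would then tell you whether the guessed slug is worth scraping. Keep in mind that requests follows redirects by default, so a wrong slug that gets redirected to some other page would still count as a 200 here.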
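Edit:
Multiple spaces between words seem to be collapsed into a single space before being turned into "-"?
The answer worked out together with @Mithu for preparing the search terms: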
import re
keywords = ["Y?N-Vee","Ashford & Simpson","Al B. Sure!","1 Of The Girls"]
spec_char = ["!","#","$","%","&","'","(",")","*","+",",",".","/",":",";","<","=",">","?","@","[","]","^","_","`","{","|","}","~",'"',"\\"]
for elem in keywords:
    # Drop special characters, turn spaces into "-", then collapse repeated "-".
    refined_keywords = re.sub('-+', '-', ''.join(i.replace(" ", "-") for i in elem.lower() if i not in spec_char))
    print(refined_keywords)
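If you'd rather have this as a reusable function, here is one possible variant of the same idea. The names `to_slug` and `artist_url` are mine, and it assumes every ASCII punctuation mark except "-" should simply be dropped, which matches the examples in this thread but may not match every artist name on the site. Unlike the loop above, it also collapses any run of whitespace (not just spaces) and trims a leading or trailing "-".

```python
import re
import string

BASE_URL = "https://www.billboard.com/music/"

# Assumption: all ASCII punctuation except "-" is removed outright.
_DROPPED = str.maketrans("", "", string.punctuation.replace("-", ""))


def to_slug(keyword):
    """Lowercase, strip punctuation, and join the words with single hyphens."""
    cleaned = keyword.lower().translate(_DROPPED)
    # Collapse runs of whitespace and/or hyphens into one hyphen.
    return re.sub(r"[\s-]+", "-", cleaned).strip("-")


def artist_url(keyword):
    """Build the candidate artist page URL for a search keyword."""
    return BASE_URL + to_slug(keyword)


for name in ["Y?N-Vee", "Ashford & Simpson", "Al B. Sure!", "1 Of The Girls"]:
    print(artist_url(name))
```

The resulting URL can be dropped in place of LINK in the script from the question.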