Unable to parse certain information from a webpage using search keywords

Time: 2019-10-25 18:40:51

Tags: python python-3.x web-scraping

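I've created a script to parse certain information about songs from a website. When I try it with this link or this one, my script works fine. As far as I can tell, appending a search keyword after the https://www.billboard.com/music/ portion of the URL takes me to the page with the information I want.

However, it breaks when I try keywords such as 1 Of The Girls, Al B. Sure!, Ashford & Simpson, and so on.

I can't figure out how to append keywords like these to the base link https://www.billboard.com/music/ so that I land on the page containing the information.

The script I've tried: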

import requests
from bs4 import BeautifulSoup

LINK = "https://www.billboard.com/music/Adele"

res = requests.get(LINK)
soup = BeautifulSoup(res.text,"lxml")
scores = [item.text for item in soup.select("[class$='-history__stats'] > p > span")]
print(scores)

The result I get (as expected):

['4 No. 1 Hits', '6 Top 10 Hits', '13 Songs']

On that page, the results appear right after the chart history heading.

How can I use search keywords to get this information from the webpage?

1 answer:

Answer 0 (score: 3)

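I don't know all the use cases, but for the ones mentioned, the obvious pattern I see is that special characters are stripped (without leaving a space behind), the words are lowercased, and spaces are then replaced with "-". The tricky part may be deciding what counts as a special character and how to handle it.

For example: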

https://www.billboard.com/music/ashford-simpson

https://www.billboard.com/music/al-b-sure

https://www.billboard.com/music/1-of-the-girls

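You could first write something that performs those string operations and then test the response code. It may also be worth checking the site's JS files for any kind of validation.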
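The "test the response code" step might look like the sketch below. It is only an illustration: `status_code` is a hypothetical helper, the browser-like `User-Agent` value is an assumption, and it uses the standard library's `urllib` instead of `requests` purely to stay dependency-free.

```python
import urllib.error
import urllib.request


def status_code(url, timeout=10):
    """Return the HTTP status code the server sends back for `url`."""
    # Assumption: a browser-like User-Agent; some sites reject the default one.
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 4xx and 5xx answers are raised as exceptions; the code is still on the error.
        return error.code
```

Calling `status_code("https://www.billboard.com/music/ashford-simpson")` on a candidate URL would then hint at whether the guessed slug is right: a 200 suggests the page exists, while a 404 suggests the string operations need adjusting. Connection problems are not handled here and would surface as `urllib.error.URLError`.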

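Edit:

Do multiple spaces between words get collapsed into a single space before being turned into "-"?

An answer developed together with @Mithu for preparing the search terms: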

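import re

keywords = ["Y?N-Vee","Ashford & Simpson","Al B. Sure!","1 Of The Girls"]
# Characters that are dropped outright; "-" is deliberately absent so existing hyphens survive
spec_char = ["!","#","$","%","&","'","(",")","*","+",",",".","/",":",";","<","=",">","?","@","[","]","^","_","`","{","|","}","~",'"',"\\"]

for elem in keywords:
    # Lowercase, drop special characters, and turn each remaining space into "-"
    dashed = ''.join(i.replace(" ","-") for i in elem.lower() if i not in spec_char)
    # Collapse runs of "-" left behind where a special character sat between two spaces
    refined_keyword = re.sub('-+', '-', dashed)
    print(refined_keyword)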
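To plug this into the scraper from the question, the same steps could be wrapped in a function that returns the full URL. The following is a rough sketch, not the site's actual rule: `to_slug` and `artist_url` are hypothetical names, `string.punctuation` minus "-" is used as a shorthand for the `spec_char` list, and, unlike the loop above, it also collapses tabs and newlines and trims leading or trailing hyphens.

```python
import re
import string

BASE_URL = "https://www.billboard.com/music/"

# ASCII punctuation without "-", the same characters as the spec_char list.
SPECIAL_CHARS = string.punctuation.replace("-", "")
STRIP_TABLE = str.maketrans("", "", SPECIAL_CHARS)


def to_slug(keyword):
    """Lowercase the keyword, drop special characters, and hyphenate the words."""
    cleaned = keyword.lower().translate(STRIP_TABLE)
    # Any run of whitespace and/or hyphens becomes a single hyphen.
    slug = re.sub(r"[\s-]+", "-", cleaned)
    return slug.strip("-")


def artist_url(keyword):
    """Append the slug to the base link used in the question."""
    return BASE_URL + to_slug(keyword)


for keyword in ["Y?N-Vee", "Ashford & Simpson", "Al B. Sure!", "1 Of The Girls"]:
    print(artist_url(keyword))
```

The resulting string can then replace the hard-coded `LINK` in the original script. Names containing characters outside ASCII punctuation are not covered by this sketch and would need to be checked against the site.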