Retrieving a subset of hrefs from find_all() in BeautifulSoup

Time: 2017-04-06 17:21:25

Tags: python python-2.7 web-scraping beautifulsoup lxml

My goal is to write a Python script that takes an artist's name as string input and appends it to the base URL of the Genius search query. It should then retrieve all the lyrics from the links on the returned web page. (Those links are the subset this question is about; every link in that subset contains the artist's name.) I am in the initial phase right now and have only been able to retrieve all links from the web page, including the ones that I don't want in my subset. I tried to find a simple solution but kept failing.

import requests  # HTTP client (the Requests library)
from bs4 import BeautifulSoup

# Build the Genius search URL from the artist name.
user_input = input("Enter Artist Name = ").replace(" ", "+")
base_url = "https://genius.com/search?q=" + user_input

header = {'User-Agent': ''}
response = requests.get(base_url, headers=header)

# The "lxml" parser requires the lxml package to be installed.
soup = BeautifulSoup(response.content, "lxml")

# Print the href of every anchor tag on the page.
for link in soup.find_all('a', href=True):
    print(link['href'])

This returns the complete list below, while I only need the links that contain the artist's name (here, for instance, Drake) and end with "lyrics". These are the links from which I should be able to retrieve the lyrics.

https://genius.com/
/signup
/login
https://www.facebook.com/geniusdotcom/
https://twitter.com/Genius
https://www.instagram.com/genius/
https://www.youtube.com/user/RapGeniusVideo
https://genius.com/new
https://genius.com/Drake-hotline-bling-lyrics
https://genius.com/Drake-one-dance-lyrics
https://genius.com/Drake-hold-on-were-going-home-lyrics
https://genius.com/Drake-know-yourself-lyrics
https://genius.com/Drake-back-to-back-lyrics
https://genius.com/Drake-all-me-lyrics
https://genius.com/Drake-0-to-100-the-catch-up-lyrics
https://genius.com/Drake-started-from-the-bottom-lyrics
https://genius.com/Drake-from-time-lyrics
https://genius.com/Drake-the-motto-lyrics
/search?page=2&q=drake
/search?page=3&q=drake
/search?page=4&q=drake
/search?page=5&q=drake
/search?page=6&q=drake
/search?page=7&q=drake
/search?page=8&q=drake
/search?page=9&q=drake
/search?page=672&q=drake
/search?page=673&q=drake
/search?page=2&q=drake
/embed_guide
/verified-artists
/contributor_guidelines
/about
/static/press
mailto:brands@genius.com
https://eventspace.genius.com/
/static/privacy_policy
/jobs
/developers
/static/terms
/static/copyright
/feedback/new
https://genius.com/Genius-how-genius-works-annotated
https://genius.com/Genius-how-genius-works-annotated

My next step would be to use Selenium to emulate scrolling, which in the case of genius.com loads the entire set of search results. Any suggestions or resources would be appreciated. I would also like a few comments on the way I plan to proceed with this solution. Can we make it more generic?

P.S. I may not have explained my problem very lucidly, but I have tried my best. Questions about any ambiguities are welcome too. I am new to scraping, to Python, and to programming in general, so I just wanted to make sure that I am following the right path.

1 Answer:

Answer 0 (score: 3)

Use the regular expression module to match only the links you want.

import requests  # The Requests library.
import re

from bs4 import BeautifulSoup

user_input = input("Enter Artist Name = ").replace(" ", "+")
base_url = "https://genius.com/search?q=" + user_input

header = {'User-Agent': ''}
response = requests.get(base_url, headers=header)

soup = BeautifulSoup(response.content, "lxml")

# Match any href that ends in "-lyrics".
pattern = re.compile(r"\S+-lyrics$")

for link in soup.find_all('a', href=True):
    if pattern.match(link['href']):
        print(link['href'])

Output:

https://genius.com/Drake-hotline-bling-lyrics
https://genius.com/Drake-one-dance-lyrics
https://genius.com/Drake-hold-on-were-going-home-lyrics
https://genius.com/Drake-know-yourself-lyrics
https://genius.com/Drake-back-to-back-lyrics
https://genius.com/Drake-all-me-lyrics
https://genius.com/Drake-0-to-100-the-catch-up-lyrics
https://genius.com/Drake-started-from-the-bottom-lyrics
https://genius.com/Drake-from-time-lyrics
https://genius.com/Drake-the-motto-lyrics

This simply checks whether each link matches a pattern ending in -lyrics. You can use similar logic to filter on the user_input variable as well.
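Filtering on the artist's name as well could look roughly like the sketch below. It is only an illustration: the helper name filter_lyrics_links is hypothetical, and it assumes Genius URLs follow the "<Artist>-<song>-lyrics" shape seen in the output listed in the question, with spaces in the artist's name replaced by hyphens. It works on a plain list of href strings, so it can be tried without making any network request.

```python
import re


def filter_lyrics_links(hrefs, artist):
    """Keep hrefs whose last path segment starts with the artist and ends in '-lyrics'."""
    # Assumption: slugs look like "<Artist>-<song>-lyrics", with spaces in the
    # artist name replaced by hyphens (as in the Drake links in the question).
    slug = re.escape("-".join(artist.split()))
    # [^/\s]+ keeps the match inside a single path segment.
    pattern = re.compile(r"/" + slug + r"-[^/\s]+-lyrics$", re.IGNORECASE)
    return [href for href in hrefs if pattern.search(href)]


sample = [
    "https://genius.com/Drake-hotline-bling-lyrics",
    "/search?page=2&q=drake",
    "https://genius.com/Genius-how-genius-works-annotated",
    "https://genius.com/Rihanna-work-lyrics",
]
print(filter_lyrics_links(sample, "drake"))
# prints ['https://genius.com/Drake-hotline-bling-lyrics']
```

In the scraping script, the list passed in would be the link['href'] values collected from soup.find_all('a', href=True). The match is case-insensitive because the search input ("drake") and the URL slug ("Drake") can differ in case.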

Hope this helps.