BeautifulSoup / Regex: Find a specific value in an href

Time: 2018-01-26 21:53:20

Tags: javascript python html regex beautifulsoup

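I'm using the code below and trying to get the value at the end of the href. Is there a way to extract the href and find the value after page= using BeautifulSoup or a regex?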

from bs4 import BeautifulSoup
import requests
import json
import re

request = requests.get('https://www.goodreads.com/quotes/tag/fun?page=1')
soup = BeautifulSoup(request.text, 'html.parser')

findNext = soup.find("a", class_="next_page")
print(findNext)

This produces the following output:

<a class="next_page" href="/quotes/tag/fun?page=2" rel="next">next »</a>

Note: I want to extract the 2 from the output above, or whatever other number appears in its place.

5 answers:

Answer 0 (score: 1)

You can use a regex to find the page number:

from bs4 import BeautifulSoup
import re
import requests

request = requests.get('https://www.goodreads.com/quotes/tag/fun?page=1')
soup = BeautifulSoup(request.text, 'html.parser')
# The lookbehind matches only the digits that immediately follow "page="
page_nums = re.findall(r'(?<=page=)\d+', str(soup.find("a", class_="next_page")))[0]
print(page_nums)

Output:

2
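As a hedged variation on the regex approach, the sketch below applies the pattern to the href string alone rather than to the whole tag, and returns None instead of raising when there is no match. The helper name `page_from_href` is hypothetical, and it assumes the href has already been read from the tag (for example with `.get("href")`).

```python
import re

def page_from_href(href):
    """Return the page number found in href as an int, or None if absent.

    `page_from_href` is an illustrative helper, not part of BeautifulSoup.
    """
    # Require "page=" to start a query parameter, so "subpage=3" is not matched.
    match = re.search(r"[?&]page=(\d+)", href or "")
    return int(match.group(1)) if match else None

print(page_from_href("/quotes/tag/fun?page=2"))
print(page_from_href("/quotes/tag/fun"))
```

Returning None is useful on the last page, where a "next" link with an href may not exist.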

Answer 1 (score: 1)

from bs4 import BeautifulSoup
import requests    

request = requests.get('https://www.goodreads.com/quotes/tag/fun?page=1')
soup = BeautifulSoup(request.text, 'html.parser')

findNext = soup.find("a", class_="next_page").attrs['href'].split('page=')[1]
print(findNext)
#Result is 2
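Splitting on 'page=' works for this href, but it would also pick up any parameters that follow. A minimal sketch of an alternative, assuming only the standard library's urllib.parse, parses the query string instead. The helper name `page_param` is hypothetical.

```python
from urllib.parse import urlsplit, parse_qs

def page_param(href):
    """Return the "page" query parameter of href as an int, or None.

    `page_param` is an illustrative helper name, not a library function.
    """
    # urlsplit also accepts relative URLs such as "/quotes/tag/fun?page=2".
    query = urlsplit(href).query
    values = parse_qs(query).get("page")
    if values and values[0].isdecimal():
        return int(values[0])
    return None

print(page_param("/quotes/tag/fun?page=2"))
print(page_param("/quotes/tag/fun?page=7&sort=new"))
```

The argument would be the value of `soup.find("a", class_="next_page").attrs['href']` from the code above.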

Answer 2 (score: 0)

With a regex, you can do something like this:

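    function getPageParam(url) {
        // Keep only the query string, i.e. everything after '?'
        const query = url.substring(url.indexOf('?') + 1);
        // Capture the digits that follow "page="
        const matches = query.match(/(?:^|&)page=(\d+)/);
        return matches ? matches[1] : undefined;
    }

    console.log(getPageParam("/quotes/tag/fun?page=2"));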

Answer 3 (score: 0)

var text = '<a class="next_page" href="/quotes/tag/fun?page=2" rel="next">next »</a>';
var regex = /(?<=href=\")[^\?]+\?page=(\d+)(?=\")/
var match = regex.exec(text);

console.log("**href => " + match[0] + " **page => " + match[1]);


Answer 4 (score: 0)

With JavaScript, you can use the URL constructor, read the query string from .search, split it on "=" with String.prototype.split(), and take the last element with Array.prototype.pop():

var param = new URL('https://www.goodreads.com/quotes/tag/fun?page=1')
            .search.split("=").pop();

console.log(param);