Extracting a single href from a web page

Date: 2015-06-26 16:00:23

Tags: python regex web-scraping beautifulsoup

I am writing some code that has to extract a single href link. The problem I am facing is that it extracts two links that are identical except for the ID at the end. I already have one of the IDs, and I just want to extract the other one. Here is my code:

    import requests, re
    from bs4 import BeautifulSoup

    url = "http://www.barneys.com/band-of-outsiders-oxford-sport-shirt-500758921.html"
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    g_1 = soup.find_all("div", {"class": "color-scroll"})
    for item in g_1:
        a_1 = soup.find_all('a', href=re.compile('^/on/demandware.store/Sites-BNY-Site/default/Product-Variation'))
        for elem in a_1:
            print(elem['href'])

The output I get is:

         /on/demandware.store/Sites-BNY-Site/default/Product-Variation?pid=500758921
         /on/demandware.store/Sites-BNY-Site/default/Product-Variation?pid=500758910

I already have the first ID, 500758921, and I want to extract the other one. Please help. Thanks in advance!

3 Answers:

Answer 0 (score: 0)

Run this regex against each link (the dots are escaped here so they match literal dots rather than any character):

^/on/demandware\.store/Sites-BNY-Site/default/Product-Variation\?pid=([0-9]+)

and take the result from the last capturing group.
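For concreteness, here is a minimal sketch of that suggestion applied to the two hrefs from the question's output; the list of links is hard-coded for illustration:

```python
import re

# Pattern from the answer above, with the pid captured in group 1.
pattern = re.compile(
    r'^/on/demandware\.store/Sites-BNY-Site/default/Product-Variation\?pid=([0-9]+)'
)

# The two hrefs the question's code printed.
hrefs = [
    "/on/demandware.store/Sites-BNY-Site/default/Product-Variation?pid=500758921",
    "/on/demandware.store/Sites-BNY-Site/default/Product-Variation?pid=500758910",
]

known_pid = "500758921"
for href in hrefs:
    match = pattern.match(href)
    # Skip the pid the asker already has; print only the other one.
    if match and match.group(1) != known_pid:
        print(match.group(1))  # prints 500758910
```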

Answer 1 (score: 0)

If you need all the links except the first, just slice the result of find_all():
links = soup.find_all('a', href=re.compile('^/on/demandware.store/Sites-BNY-Site/default/Product-Variation'))
for link in links[1:]:
    print(link['href'])

Slicing works because find_all() returns a ResultSet instance, which subclasses the regular Python list:

class ResultSet(list):
    """A ResultSet is just a list that keeps track of the SoupStrainer
    that created it."""
    def __init__(self, source, result=()):
        super(ResultSet, self).__init__(result)
        self.source = source

To extract the pid from the links you get, you can use a regex search that saves the pid value in a *capturing group*:

import re

pattern = re.compile(r"pid=(\w+)")
for item in g_1:
    links = soup.find_all('a', href=re.compile('^/on/demandware.store/Sites-BNY-Site/default/Product-Variation'))
    for link in links[1:]:
        match = pattern.search(link["href"])
        if match:
            print(match.group(1))

Answer 2 (score: 0)

This might do it:

import requests, re
from bs4 import BeautifulSoup

def getPID(url):
    # Pull out every run of digits in the URL.
    # Note: rstrip('.html') strips any trailing characters from the set
    # {'.', 'h', 't', 'm', 'l'}, not the literal suffix ".html"; it works
    # here only because the pid is purely numeric.
    return re.findall(r'(\d+)', url.rstrip('.html'))

url = "http://www.barneys.com/band-of-outsiders-oxford-sport-shirt-500758921.html"
having_pid = getPID(url)
print(having_pid)
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
g_1 = soup.find_all("div", {"class": "color-scroll"})
for item in g_1:
    a_1 = soup.find_all('a', href=re.compile('^/on/demandware.store/Sites-BNY-Site/default/Product-Variation'))
    for elem in a_1:
        if getPID(elem['href'])[0] not in having_pid:
            print(elem['href'])
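As a side note, since the pid sits in the URL's query string, it can also be pulled out with the standard library instead of a hand-written regex. This is a sketch, not part of any answer above; pid_from_href is a hypothetical helper name:

```python
from urllib.parse import urlparse, parse_qs

def pid_from_href(href):
    """Extract the pid query parameter from a Product-Variation href."""
    query = urlparse(href).query          # e.g. "pid=500758910"
    return parse_qs(query).get("pid", [None])[0]

href = "/on/demandware.store/Sites-BNY-Site/default/Product-Variation?pid=500758910"
print(pid_from_href(href))  # prints 500758910
```

parse_qs handles URL decoding and multiple parameters for free, which makes it a bit more robust than a digits-only regex if the site ever adds other query parameters.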