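I am writing code that needs to extract a single href link. The problem is that it extracts two links that are identical except for the ID at the end. I already have one of the IDs and only want to extract the other one from the links. Here is my code: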
import requests, re
from bs4 import BeautifulSoup

url = "http://www.barneys.com/band-of-outsiders-oxford-sport-shirt-500758921.html"
r = requests.get(url)
soup = BeautifulSoup(r.content)
g_1 = soup.find_all("div", {"class": "color-scroll"})
for item in g_1:
    a_1 = soup.find_all('a', href=re.compile('^/on/demandware.store/Sites-BNY-Site/default/Product-Variation'))
    for elem in a_1:
        print(elem['href'])
The output I get is:
/on/demandware.store/Sites-BNY-Site/default/Product-Variation?pid=500758921
/on/demandware.store/Sites-BNY-Site/default/Product-Variation?pid=500758910
I have the first ID, 500758921, and I want to extract the other one. Please help. Thanks in advance!
Answer 0 (score: 0)
Run this regular expression against each link:
^/on/demandware.store/Sites-BNY-Site/default/Product-Variation\?pid=([0-9]+)
The pid is then in the last capture group (group 1, the only group in this pattern).
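As a minimal sketch of this approach, the pattern can be applied to each href and the captured pid compared against the ID you already have. The helper name other_pids is hypothetical, the two hrefs are the ones from the question's output, and the dots are escaped here (an adjustment to the pattern above) so that they only match literal dots.

```python
import re

# Pattern from the answer, with the dots escaped so they match literally.
PID_RE = re.compile(
    r'^/on/demandware\.store/Sites-BNY-Site/default/Product-Variation\?pid=([0-9]+)'
)

def other_pids(hrefs, known_pid):
    """Return the pid of every matching href except known_pid (hypothetical helper)."""
    found = []
    for href in hrefs:
        match = PID_RE.match(href)
        # group(1) is the only capture group: the digits after "pid=".
        if match and match.group(1) != known_pid:
            found.append(match.group(1))
    return found

# The two hrefs printed in the question.
hrefs = [
    '/on/demandware.store/Sites-BNY-Site/default/Product-Variation?pid=500758921',
    '/on/demandware.store/Sites-BNY-Site/default/Product-Variation?pid=500758910',
]
print(other_pids(hrefs, '500758921'))
```

Comparing against the known ID, rather than relying on position, keeps working if the page lists the links in a different order.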
Answer 1 (score: 0)
If you need all of the links except the first one, just slice the result of find_all():
links = soup.find_all('a', href=re.compile('^/on/demandware.store/Sites-BNY-Site/default/Product-Variation'))
for link in links[1:]:
    print(link['href'])
Slicing works because find_all() returns a ResultSet instance, which is based on a regular Python list:
class ResultSet(list):
    """A ResultSet is just a list that keeps track of the SoupStrainer
    that created it."""
    def __init__(self, source, result=()):
        super(ResultSet, self).__init__(result)
        self.source = source
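To see why the slice behaves like an ordinary list slice, here is a small sketch that reuses the class definition quoted above, with plain strings standing in for the tags that find_all() would return (the sample values are made up for illustration). Note that the slice itself comes back as a plain list rather than a ResultSet.

```python
class ResultSet(list):
    """Same shape as the class quoted above: a list that remembers its source."""
    def __init__(self, source, result=()):
        super(ResultSet, self).__init__(result)
        self.source = source

# Strings stand in for the <a> tags find_all() would return (illustrative values).
links = ResultSet(None, ['pid=500758921', 'pid=500758910'])

rest = links[1:]  # ordinary list slicing: everything after the first item
print(rest)
print(isinstance(links, list))  # a ResultSet is a list
print(type(rest) is list)       # the slice is a plain list
```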
To extract the pid from the links you get, you can use a regular expression search that saves the pid value in a capturing group:
import re

# soup and g_1 come from the question's code.
pattern = re.compile(r"pid=(\w+)")
for item in g_1:
    links = soup.find_all('a', href=re.compile('^/on/demandware.store/Sites-BNY-Site/default/Product-Variation'))
    for link in links[1:]:
        match = pattern.search(link["href"])
        if match:
            print(match.group(1))
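As an alternative to the regular expression (a swapped-in technique, not what this answer uses), the pid can also be read by parsing the query string with the standard library. This is a sketch assuming Python 3, where these functions live in urllib.parse; the helper name pid_from_href is hypothetical, and the href is one of the two from the question's output.

```python
from urllib.parse import urlparse, parse_qs

def pid_from_href(href):
    """Return the pid query parameter of href, or None if absent (hypothetical helper)."""
    # parse_qs maps each parameter name to a list of its values.
    values = parse_qs(urlparse(href).query).get('pid')
    return values[0] if values else None

href = '/on/demandware.store/Sites-BNY-Site/default/Product-Variation?pid=500758910'
print(pid_from_href(href))
```

Parsing the query string avoids accidental matches if "pid=" ever appears elsewhere in the link, at the cost of an extra import.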
Answer 2 (score: 0)
This might do it:
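import requests, re
from bs4 import BeautifulSoup

def getPID(url):
    # Return every run of digits in the URL. The original rstrip('.html') call
    # is dropped: rstrip strips a set of characters rather than a suffix, and
    # it never changes which digits are found.
    return re.findall(r'(\d+)', url)

url = "http://www.barneys.com/band-of-outsiders-oxford-sport-shirt-500758921.html"
having_pid = getPID(url)  # the ID(s) we already have
print(having_pid)
r = requests.get(url)
soup = BeautifulSoup(r.content)
g_1 = soup.find_all("div", {"class": "color-scroll"})
for item in g_1:
    a_1 = soup.find_all('a', href=re.compile('^/on/demandware.store/Sites-BNY-Site/default/Product-Variation'))
    for elem in a_1:
        # Print only the links whose ID is not one we already have.
        if getPID(elem['href'])[0] not in having_pid:
            print(elem['href'])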