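I want to extract the src of the img tag and the text of the anchor tag inside the div with class data. I managed to extract the img src, but I cannot get the text out of the anchor tag: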
<a class="title" href="http://www.amazon.com/Nikon-COOLPIX-Digital-Camera-NIKKOR/dp/B0073HSK0K/ref=sr_1_1?s=electronics&ie=UTF8&qid=1343628292&sr=1-1&keywords=digital+camera">Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)</a>
Here is the link to the entire HTML page.
Here is my code:
for div in soup.findAll('div', attrs={'class':'image'}):
    print "\n"
    for data in div.findNextSibling('div', attrs={'class':'data'}):
        for a in data.findAll('a', attrs={'class':'title'}):
            print a.text
    for img in div.findAll('img'):
        print img['src']
What I want to do is extract the image src (the link) and the title inside the div with class=data. For example:
<a class="title" href="http://www.amazon.com/Nikon-COOLPIX-Digital-Camera-NIKKOR/dp/B0073HSK0K/ref=sr_1_1?s=electronics&ie=UTF8&qid=1343628292&sr=1-1&keywords=digital+camera">Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)</a>
should yield:
Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)
Answer 0 (score: 46)
This should help:
from bs4 import BeautifulSoup
data = '''<div class="image">
<a href="http://www.example.com/eg1">Content1<img
src="http://image.example.com/img1.jpg" /></a>
</div>
<div class="image">
<a href="http://www.example.com/eg2">Content2<img
src="http://image.example.com/img2.jpg" /> </a>
</div>'''
# name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(data, "html.parser")
for div in soup.findAll('div', attrs={'class':'image'}):
    print(div.find('a')['href'])
    print(div.find('a').contents[0])
    print(div.find('img')['src'])
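If you are working with Amazon products, you should use the official API. There is at least one Python package that will ease your scraping issues and keep your activity within the terms of use.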
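Applied to the layout described in the question, where each div with class image is followed by a sibling div with class data that holds the title link, the same idea might look like the sketch below. The sample markup, the function name extract_products, and the choice of "html.parser" are assumptions made for illustration; the real Amazon page is not reproduced here.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the question's description of the page.
html = """
<div class="result">
  <div class="image"><a href="/dp/1"><img src="http://image.example.com/cam1.jpg" /></a></div>
  <div class="data"><h3><a class="title" href="/dp/1">Camera One (Red)</a></h3></div>
</div>
<div class="result">
  <div class="image"><a href="/dp/2"><img src="http://image.example.com/cam2.jpg" /></a></div>
  <div class="data"><h3><a class="title" href="/dp/2">Camera Two (Blue)</a></h3></div>
</div>
"""

def extract_products(markup):
    """Return (image src, title text) pairs for each image/data div pair."""
    soup = BeautifulSoup(markup, "html.parser")
    products = []
    for image_div in soup.find_all("div", class_="image"):
        img = image_div.find("img")
        # find_next_sibling returns a single Tag (or None), so it is not looped over.
        data_div = image_div.find_next_sibling("div", class_="data")
        title = data_div.find("a", class_="title") if data_div else None
        if img is None or title is None:
            continue  # skip entries that lack an image or a title
        products.append((img["src"], title.get_text(strip=True)))
    return products

products = extract_products(html)
for src, title in products:
    print(src, title)
```

The question's code loops over the result of findNextSibling, which iterates over that div's children (including bare text nodes) rather than over matching divs; calling find on the returned tag directly avoids that.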
Answer 1 (score: 18)
In my case, it worked like this:
# Python 2 with BeautifulSoup 3
import urllib
from BeautifulSoup import BeautifulSoup as bs
url="http://blabla.com"
soup = bs(urllib.urlopen(url))
for link in soup.findAll('a'):
    print link.string
Hope it helps!
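One caveat with .string, sketched here with bs4 and made-up markup: it only returns text when the tag has a single child, so an anchor that also wraps an img or other markup gives None. get_text() is one way around that.

```python
from bs4 import BeautifulSoup

# Made-up markup: one plain link, one link with nested tags.
html = '<a href="/a">Plain text</a><a href="/b">Text <b>with</b> markup<img src="x.jpg" /></a>'
soup = BeautifulSoup(html, "html.parser")

# .string is None when a tag has more than one child; get_text() joins all text nodes.
link_texts = [(link.string, link.get_text()) for link in soup.find_all("a")]
for string_value, full_text in link_texts:
    print(repr(string_value), repr(full_text))
```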
Answer 2 (score: 5)
I suggest going the lxml route and using XPath.
from lxml import etree
# data is the variable containing the html
data = etree.HTML(data)
anchor = data.xpath('//a[@class="title"]/text()')
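A slightly fuller sketch of the XPath approach, assuming lxml is installed and using made-up markup. Note that text() returns a list with one string per matching node, and that @class="title" only matches when the class attribute is exactly "title".

```python
from lxml import etree

# Made-up markup with two title links and one unrelated link.
html = '''<div class="data">
<a class="title" href="/dp/1">Camera One</a>
<a class="other" href="/x">Ignore me</a>
<a class="title" href="/dp/2">Camera Two</a>
</div>'''

tree = etree.HTML(html)
# text() yields one string per matching text node, so the result is a list.
titles = tree.xpath('//a[@class="title"]/text()')
# @href selects the attribute values of the same anchors, in document order.
hrefs = tree.xpath('//a[@class="title"]/@href')
print(list(zip(titles, hrefs)))
```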
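Answer 3 (score: 3)
All of the answers above really helped me build my own, so I upvoted every answer other users posted. In the end I am posting my own answer for the exact problem I was dealing with:
Since the problem is clearly defined, I had to access some siblings and their children in the DOM structure. This solution iterates over the images in the DOM structure, builds each image name from the product title, and saves the image to a local directory.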
# Python 2 with BeautifulSoup 3
import os
from urllib import urlretrieve
from BeautifulSoup import BeautifulSoup as bs
import requests

def getImages(url):
    # download the page
    r = requests.get(url)
    html = r.text
    soup = bs(html)
    # expand '~' to the home directory; the folder must already exist
    output_folder = os.path.expanduser('~/amazon')
    # extract the images that are inside div(s) with class "image"
    for div in soup.findAll('div', attrs={'class':'image'}):
        try:
            # get the data div using findNext
            nextDiv = div.findNext('div', attrs={'class':'data'})
            # use findNext again on that div to get to the anchor tag
            fileName = nextDiv.findNext('a').text
            modified_file_name = fileName.replace(' ','-') + '.jpg'
        except (TypeError, AttributeError):
            # no data div or anchor was found, so there is no name to save under
            print 'skip'
            continue
        imageUrl = div.find('img')['src']
        outputPath = os.path.join(output_folder, modified_file_name)
        urlretrieve(imageUrl, outputPath)

if __name__=='__main__':
    url = r'http://www.amazon.com/s/ref=sr_pg_1?rh=n%3A172282%2Ck%3Adigital+camera&keywords=digital+camera&ie=UTF8&qid=1343600585'
    getImages(url)
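Product titles can contain characters such as parentheses or slashes that are awkward in file names. The helper below is a hypothetical sketch (title_to_filename and its parameters are not part of the answer above) of one way to build a safer name from a title using only the standard library.

```python
import os
import re

def title_to_filename(title, extension=".jpg", max_length=100):
    """Turn a product title into a filesystem-friendly file name."""
    # Replace each run of characters other than letters, digits, dots or hyphens with one hyphen.
    slug = re.sub(r"[^A-Za-z0-9.-]+", "-", title).strip("-")
    # Truncate long titles so the name stays within a reasonable length.
    return slug[:max_length] + extension

title = "Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)"
file_name = title_to_filename(title)
# Only builds the path string; nothing is written to disk here.
output_path = os.path.join(os.path.expanduser("~/amazon"), file_name)
print(file_name)
```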
Answer 4 (score: 1)
>>> import bs4
>>> txt = '<a class="title" href="http://rads.stackoverflow.com/amzn/click/B0073HSK0K">Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)</a> '
>>> fragment = bs4.BeautifulSoup(txt)
>>> fragment
<a class="title" href="http://rads.stackoverflow.com/amzn/click/B0073HSK0K">Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)</a>
>>> fragment.find('a', {'class': 'title'})
<a class="title" href="http://rads.stackoverflow.com/amzn/click/B0073HSK0K">Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)</a>
>>> fragment.find('a', {'class': 'title'}).string
u'Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)'
Answer 5 (score: 0)
print(link_addres.contents[0])
This prints the content of the anchor tag.
Example:
statement_title = statement.find('h2',class_='briefing-statement__title')
statement_title_text = statement_title.a.contents[0]
Answer 6 (score: 0)
To get the href from an anchor tag, use tag.get("href"), and use tag.img.get("src") to get the img src.
Example, using this data:
data = """
<div class="image">
<a href="http://www.example.com/eg1">Content1<img src="http://image.example.com/img1.jpg" /></a>
</div>
<div class="image">
<a href="http://www.example.com/eg2">Content2<img src="http://image.example.com/img2.jpg" /> </a>
</div>
"""
Get the links and text:
import requests
from bs4 import BeautifulSoup

def get_soup(url):
    response = requests.get(url)
    # returns None when the request fails
    if response.ok:
        return BeautifulSoup(response.text, features="html.parser")

def get_links(soup):
    links = []
    for tag in soup.findAll("a", href=True):
        # img stays None when the anchor contains no <img>
        if img := tag.img:
            img = img.get("src")
        links.append(dict(url=tag.get("href"), text=tag.text, img=img))
    return links

# soup = get_soup('http://www.example.com')
soup = BeautifulSoup(data, features="html.parser")
links = get_links(soup)
Output:
[{'url': 'http://www.example.com/eg1', 'text': 'Content1', 'img': 'http://image.example.com/img1.jpg'},
{'url': 'http://www.example.com/eg2', 'text': 'Content2 ', 'img': 'http://image.example.com/img2.jpg'}]
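The second entry keeps a trailing space ('Content2 ') because of the whitespace after the img tag. If that matters, get_text(strip=True) is one option; a minimal sketch reusing that anchor:

```python
from bs4 import BeautifulSoup

html = '<a href="http://www.example.com/eg2">Content2<img src="http://image.example.com/img2.jpg" /> </a>'
tag = BeautifulSoup(html, "html.parser").a

raw_text = tag.text                    # keeps the space that follows the <img> tag
clean_text = tag.get_text(strip=True)  # strips each text fragment and drops empty ones
print(repr(raw_text), repr(clean_text))
```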