How do I build a list from a sitemap.xml file to extract URLs in Python?

Time: 2017-01-21 15:24:23

Tags: python xml python-2.7 web-scraping beautifulsoup

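I need to write code that extracts image URLs containing a specific word. Starting from a sitemap.xml page, my code has to visit every link in that XML file and, on each linked page, find the image links that contain the word.

The sitemap is the adidas one: http://www.adidas.it/on/demandware.static/-/Sites-adidas-IT-Library/it_IT/v/sitemap/product/adidas-IT-it-it-product.xml

This is the code I wrote to search a single page for images whose URL contains the word "zoom":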

import requests
from bs4 import BeautifulSoup

# Fetch one product page and parse it with Python's built-in HTML parser.
html = requests.get(
    'http://www.adidas.it/scarpe-superstar/C77124.html').text
bs = BeautifulSoup(html, 'html.parser')

# Print the src of every <img> whose URL contains "zoom".
for link in bs.find_all('img'):
    if link.has_attr('src') and 'zoom' in link['src']:
        print(link['src'])

But I am looking for a method to build the list of URLs automatically.

Thank you very much.

I tried this to get the list:

from bs4 import BeautifulSoup
import requests

url = "http://www.adidas.it/on/demandware.static/-/Sites-adidas-IT-Library/it_IT/v/sitemap/product/adidas-IT-it-it-product.xml"

r = requests.get(url)
data = r.text

soup = BeautifulSoup(data, 'html.parser')

# Print the URL stored in each <loc> element of the sitemap.
for loc in soup.find_all("loc"):
    print(loc.text)

But I can't work out how to attach the request to it.

How can I find the word "zoom" in any of the links in sitemap.xml?

Thank you very much.

1 Answer:

Answer 0: (score: 1)

import re

import requests
from bs4 import BeautifulSoup


def make_soup(url):
    """Download a URL and parse the response with the lxml parser."""
    r = requests.get(url)
    return BeautifulSoup(r.text, 'lxml')


def get_xml_urls(soup):
    """Put the URLs from the sitemap's <loc> elements in a list."""
    return [loc.string for loc in soup.find_all('loc')]


def get_src_contain_str(soup, string):
    """Get the src of every <img> whose src matches the given pattern."""
    return [img['src'] for img in soup.find_all('img', src=re.compile(string))]


if __name__ == '__main__':
    xml = 'http://www.adidas.it/on/demandware.static/-/Sites-adidas-IT-Library/it_IT/v/sitemap/product/adidas-IT-it-it-product.xml'
    soup = make_soup(xml)
    urls = get_xml_urls(soup)
    # Loop through the page URLs and print the matching image URLs for each.
    for url in urls:
        url_soup = make_soup(url)
        srcs = get_src_contain_str(url_soup, 'zoom')
        print(srcs)
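
As a complement to the answer above, here is a minimal sketch of the same two steps (list the page URLs from the sitemap, then filter each page's image URLs) using only the Python 3 standard library. This swaps BeautifulSoup for xml.etree.ElementTree and html.parser; it is not what the answer uses. It assumes the sitemap follows the sitemaps.org 0.9 schema, which has not been checked against the adidas file. The names sitemap_locs, ImgSrcCollector, img_srcs_containing and collect, and the fetch parameter, are hypothetical and chosen for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Assumption: the sitemap uses the sitemaps.org 0.9 namespace for <loc>.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_locs(xml_text):
    """Return the text of every namespaced <loc> element in a sitemap."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.iter(SITEMAP_NS + "loc") if el.text]


class ImgSrcCollector(HTMLParser):
    """Collect the src of every <img> tag whose src contains `needle`."""

    def __init__(self, needle):
        super().__init__()
        self.needle = needle
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src and self.needle in src:
            self.srcs.append(src)


def img_srcs_containing(html_text, needle):
    """Return the image URLs in html_text that contain needle."""
    parser = ImgSrcCollector(needle)
    parser.feed(html_text)
    return parser.srcs


def collect(sitemap_xml, fetch, needle="zoom"):
    """Map each page URL in the sitemap to its matching image URLs.

    `fetch` is any callable that takes a URL and returns the page's HTML
    text (hypothetical parameter, so the network call can be swapped out).
    """
    results = {}
    for url in sitemap_locs(sitemap_xml):
        try:
            html_text = fetch(url)
        except Exception as exc:  # broad on purpose: skip pages that fail
            print("skipping %s: %s" % (url, exc))
            continue
        results[url] = img_srcs_containing(html_text, needle)
    return results
```

With requests installed, fetch could be something like `lambda u: requests.get(u, timeout=10).text`, and the sitemap body can be passed as `requests.get(xml).content`, since ET.fromstring accepts bytes. Matching on the namespaced tag keeps locations from other namespaces (for example image extensions) out of the page list, and one failed download no longer stops the whole loop. The substring test is case-sensitive, like the `'zoom' in link['src']` check in the question.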