Getting URLs from an XML sitemap for scraping with Python

Asked: 2018-09-26 21:06:06

Tags: python xml python-3.x

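I am trying to build a scraper that collects data from the product pages listed in an XML sitemap file. I wrote the program below, and it works correctly when I give it a single static URL. I have downloaded the XML file that contains the URLs of all the products. Is there a way to extract those URLs and run the scraper once for each of them, so that the process is automated?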

The XML file looks like this:

<?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet href="xx" type="text/xsl"?>
<urlset xmlns:image="http://www.google.com/schemas/sitemap-image/1.1" 
    xmlns:xhtml="http://www.w3.org/1999/xhtml" 
    xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
        <loc>(URL IS HERE)</loc>
        <changefreq>daily</changefreq>
        <image:image>
            <image:loc>(URL OF PICTURE, not relevant)</image:loc>
        </image:image>
    </url>
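
One possible way to pull the page URLs out of a file shaped like the sample above is sketched here with the standard-library `xml.etree.ElementTree` module. The function name `extract_page_urls` and the `example.com` URLs are made up for illustration. The only detail taken from the question is the default namespace on `<urlset>`, which has to be supplied when searching for `<loc>`.

```python
import xml.etree.ElementTree as ET

# Default namespace copied from the xmlns attribute of <urlset> in the sample.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_page_urls(xml_text):
    """Return the text of every <url>/<loc> element.

    <image:loc> lives in a different namespace, so it is not matched.
    """
    root = ET.fromstring(xml_text)
    urls = []
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        if loc.text and loc.text.strip():
            urls.append(loc.text.strip())
    return urls


# Hypothetical sitemap with the same structure as the one in the question.
sample = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns:image="http://www.google.com/schemas/sitemap-image/1.1"
    xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
        <loc>https://example.com/product-1</loc>
        <image:image>
            <image:loc>https://example.com/product-1.jpg</image:loc>
        </image:image>
    </url>
    <url>
        <loc>https://example.com/product-2</loc>
    </url>
</urlset>"""

print(extract_page_urls(sample))
# ['https://example.com/product-1', 'https://example.com/product-2']
```

For a sitemap saved on disk, `ET.parse(path).getroot()` could replace `ET.fromstring(xml_text)`; the rest of the function would stay the same.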

The code looks like this:

from bs4 import BeautifulSoup
import requests

filename = "products.csv"
f = open(filename, "w")

headers = "Naam, prijs \n" 

f.write(headers)

print('step 1')
#get url
page_link = "<privacy>"
print('step 2')
#open page
page_response = requests.get(page_link, timeout=1)
print('step 3')
#parse page
page_content = BeautifulSoup(page_response.content, "html.parser")
print('step 4')
#name and price of the product on the page
price = page_content.find_all(class_='<privacy>')[0].decode_contents()
naam = page_content.find_all(class_='product-name')[0].decode_contents()
print('step 5')
#print the result
print("Product:", naam, "kost nu", price)

f.write(naam + "," + price.replace(",", "|") +  "\n")
f.close()
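
To run the same steps for many URLs, the per-page logic could be moved into a function and called in a loop. The sketch below shows only that loop and the CSV writing, using the standard library. `scrape_all`, `scrape_one`, and `fake_scrape_one` are hypothetical names: `scrape_one` stands for a function that wraps the `requests` and `BeautifulSoup` lines above and returns a `(naam, price)` pair, and `fake_scrape_one` is a stand-in so the example runs without network access.

```python
import csv
import io


def scrape_all(urls, scrape_one, out_file):
    """Call scrape_one(url) for each URL and write one CSV row per product.

    Returns the number of product rows written.
    """
    writer = csv.writer(out_file)
    writer.writerow(["Naam", "prijs"])
    written = 0
    for url in urls:
        try:
            naam, price = scrape_one(url)
        except Exception as exc:
            # Broad on purpose (an assumption of this sketch): one page that
            # times out or lacks an element should not stop the whole run.
            print("skipped", url, "-", exc)
            continue
        writer.writerow([naam, price])
        written += 1
    return written


def fake_scrape_one(url):
    """Stand-in for the real page scraper; returns made-up data."""
    if url.endswith("broken"):
        raise ValueError("product-name not found")
    return ("Product " + url.rsplit("-", 1)[-1], "9,99")


# In-memory buffer instead of products.csv, so the demo leaves no file behind.
buffer = io.StringIO()
count = scrape_all(
    [
        "https://example.com/product-1",
        "https://example.com/broken",
        "https://example.com/product-2",
    ],
    fake_scrape_one,
    buffer,
)
print(count)
print(buffer.getvalue())
```

With a real file, `open("products.csv", "w", newline="", encoding="utf-8")` could be passed as `out_file`. Because `csv.writer` quotes fields that contain commas, a price such as `9,99` would not need the `replace(",", "|")` workaround.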

0 answers:

No answers yet.