Printing individual strings from a file using LXML

Asked: 2016-04-06 14:36:21

Tags: python python-2.7 lxml

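I'm currently writing something that grabs the contents of a Shopify site and prints them to a text file, using LXML in Python 2.7. My only problem is that with LXML I can only dump all the names into one list and all the URLs into another, rather than listing each product name followed by its URL. I'm currently running it against store.highsnobiety.com, and this is the output: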

Sitemap Products:  ['Copper Bracelet - 3mm - Polished', 'Copper Bracelet - 5mm - Brushed', 'Copper Bracelet - 7mm - Polished', 'Highsnobiety x EARLY - Leather Pouch', u'A Bathing Ape\xae Highsnobiety 10th Anniversary Tee', 'Highsnobiety Magazine Issue 11', 'A5', 'A6', 'A7', 'A8', 'A9', 'A10', 'Highsnobiety x Stampd Snapback - New York', 'Highsnobiety x Stampd Snapback - Berlin', 'Carhartt WIP x Highsnobiety - Hooded Sweatshirt', 'Carhartt WIP x Highsnobiety - Long-Sleeve T-Shirt', 'Carhartt WIP x Highsnobiety - Sweat Pants', 'Carhartt WIP x Highsnobiety - Beanie', 'adidas Consortium x Highsnobiety UltraBOOST', 'adidas Consortium x Highsnobiety Campus 80s', 'Highsnobiety Tonal Logo Snapback - Black', 'Highsnobiety Tonal Logo Snapback - Navy', 'Highsnobiety Tonal Logo Snapback - Red', 'Highsnobiety Tonal Logo Snapback - White', 'Highsnobiety Magazine Issue 9 - Yohji Yamamoto', 'Highsnobiety Magazine Issue 10 - Kobe Bryant', 'Highsnobiety Magazine Issue 10 - Gosha Rubchinskiy', 'Ronnie Fieg x Highsnobiety x Puma RF-Blaze of Glory', 'Ronnie Fieg x Highsnobiety x Puma RF698S']


Sitemap URLs:  ['http://store.highsnobiety.com/', 'http://store.highsnobiety.com/products/highsnobiety-x-simon-me-copper-bracelet-3mm', 'https://cdn.shopify.com/s/files/1/0279/1227/products/Highsnobiety-Bracelet-II-DSC-01.jpg?v=1439215473', 'http://store.highsnobiety.com/products/copy-of-highsnobiety-x-simon-me-copper-bracelet-5mm', 'https://cdn.shopify.com/s/files/1/0279/1227/products/Highsnobiety-Bracelet-II-DSC-02.jpg?v=1439215609', 'http://store.highsnobiety.com/products/copy-of-copy-of-highsnobiety-x-simon-me-copper-bracelet-7mm', 'https://cdn.shopify.com/s/files/1/0279/1227/products/Highsnobiety-Bracelet-II-DSC-03.jpg?v=1439215704', 'http://store.highsnobiety.com/products/highsnobiety-x-early-leather-pouch', 'https://cdn.shopify.com/s/files/1/0279/1227/products/HS3755.jpg?v=1453213731', 'http://store.highsnobiety.com/products/a-bathing-ape-highsnobiety-10th-anniversary-tee', 'https://cdn.shopify.com/s/files/1/0279/1227/products/BAPE_x_Highsnobiety_10_Year_Collaboration_DSC-1390-Edit.jpg?v=1438619379', 'http://store.highsnobiety.com/products/highsnobiety-magazine-issue-11', 'https://cdn.shopify.com/s/files/1/0279/1227/products/Highsnobiety_Magazine_Issue_11_DSC-6113-Edit.jpg?v=1442837169', 'http://store.highsnobiety.com/products/anniversary-product-04', 'https://cdn.shopify.com/s/files/1/0279/1227/products/hsb-anniversary-product-6_2048x2048_53b729db-39d1-426e-9ba1-d423fb8dca87.jpeg?v=1443194927', 'http://store.highsnobiety.com/products/a6', 'https://cdn.shopify.com/s/files/1/0279/1227/products/hsb-anniversary-product-6_2048x2048_d6a8d583-8385-4af8-a918-de12282d2dfc.jpg?v=1443195496', 'http://store.highsnobiety.com/products/a7', 'https://cdn.shopify.com/s/files/1/0279/1227/products/hsb-anniversary-product-6_2048x2048_6724dbf4-6589-44ce-a1d4-adc4e1671d74.jpg?v=1443195549', 'http://store.highsnobiety.com/products/a8', 'https://cdn.shopify.com/s/files/1/0279/1227/products/hsb-anniversary-product-6_2048x2048_defa0eb3-9f4d-4a40-a034-f63ad1bec5d3.jpg?v=1443195611', 
'http://store.highsnobiety.com/products/a9', 'https://cdn.shopify.com/s/files/1/0279/1227/products/hsb-anniversary-product-6_2048x2048_4f716f1f-a0cd-4fd8-9a59-f756e1c928af.jpg?v=1443195652', 'http://store.highsnobiety.com/products/a10', 'https://cdn.shopify.com/s/files/1/0279/1227/products/hsb-anniversary-product-6_2048x2048_2a8ee68a-baa8-4cc9-b79c-ab633a410caf.jpg?v=1443195786', 'http://store.highsnobiety.com/products/highsnobiety-x-stampd', 'https://cdn.shopify.com/s/files/1/0279/1227/products/Stampd_x_Highsnobiety_10_Year_Collaboration-9418-Edit.jpg?v=1447094980', 'http://store.highsnobiety.com/products/highsnobiety-x-stampd-snapback-berlin', 'https://cdn.shopify.com/s/files/1/0279/1227/products/Stampd_x_Highsnobiety_10_Year_Collaboration-9414-Edit.jpg?v=1447096537', 'http://store.highsnobiety.com/products/carhartt-wip-x-highsnobiety-hoodie', 'https://cdn.shopify.com/s/files/1/0279/1227/products/Carhartt_x_Highsnobiety_10_Year_Collaboration_DSC-9602-Edit.jpg?v=1449071573', 'http://store.highsnobiety.com/products/carhartt-wip-x-highsnobiety-longsleeve', 'https://cdn.shopify.com/s/files/1/0279/1227/products/Carhartt_x_Highsnobiety_10_Year_Collaboration_DSC-9570-Edit_94a500f4-26b5-4178-8540-3f0541a99d7b.jpg?v=1449071483', 'http://store.highsnobiety.com/products/copy-of-carhartt-wip-x-highsnobiety-pants', 'https://cdn.shopify.com/s/files/1/0279/1227/products/Carhartt_x_Highsnobiety_10_Year_Collaboration_DSC-9723-Edit.jpg?v=1449071233', 'http://store.highsnobiety.com/products/copy-of-copy-of-carhartt-wip-x-highsnobiety-beanie', 'https://cdn.shopify.com/s/files/1/0279/1227/products/Carhartt_x_Highsnobiety_10_Year_Collaboration_DSC-9758-Edit.jpg?v=1449071637', 'http://store.highsnobiety.com/products/adidas-consortium-x-highsnobiety-ultraboost', 'https://cdn.shopify.com/s/files/1/0279/1227/products/3A5A3138.jpg?v=1459441917', 'http://store.highsnobiety.com/products/adidas-consortium-x-highsnobiety-campus-80s', 
'https://cdn.shopify.com/s/files/1/0279/1227/products/3A5A3142.jpg?v=1459438003', 'http://store.highsnobiety.com/products/highsnobiety-logo-snapback-black', 'https://cdn.shopify.com/s/files/1/0279/1227/products/Highsnobiety_New_Era_Hat_II_DSC_6762.jpg?v=1417004080', 'http://store.highsnobiety.com/products/highsnobiety-logo-snapback-navy', 'https://cdn.shopify.com/s/files/1/0279/1227/products/Highsnobiety_New_Era_Hat_II_DSC_6760.jpg?v=1417005003', 'http://store.highsnobiety.com/products/highsnobiety-logo-snapback-red', 'https://cdn.shopify.com/s/files/1/0279/1227/products/Highsnobiety_New_Era_Hat_II_DSC_6758.jpg?v=1417005281', 'http://store.highsnobiety.com/products/highsnobiety-logo-snapback-white', 'https://cdn.shopify.com/s/files/1/0279/1227/products/Highsnobiety_New_Era_Hat_II_DSC_6756.jpg?v=1417005420', 'http://store.highsnobiety.com/products/highsnobiety-magazine-issue-9-yohji-yamamoto', 'https://cdn.shopify.com/s/files/1/0279/1227/products/hs-magazine-09-01.jpg?v=1411400759', 'http://store.highsnobiety.com/products/copy-of-highsnobiety-magazine-issue-10-kobe-bryant', 'https://cdn.shopify.com/s/files/1/0279/1227/products/Highsnobiety_Magazine_Issue_10_DSC-0922.jpg?v=1427706331', 'http://store.highsnobiety.com/products/highsnobiety-magazine-issue-10-gosha-rubchinskiy', 'https://cdn.shopify.com/s/files/1/0279/1227/products/Highsnobiety_Magazine_Issue_10_DSC-0923.jpg?v=1427708119', 'http://store.highsnobiety.com/products/ronnie-fieg-x-highsnobiety-x-puma-rf-blaze-of-glory', 'https://cdn.shopify.com/s/files/1/0279/1227/products/Kith_x_Puma_x_Highsnobiety_10_Year_Collaboration_DSC-7841-Edit.jpg?v=1443118302', 'http://store.highsnobiety.com/products/copy-of-ronnie-fieg-x-highsnobiety-x-puma-rf-blaze-of-glory', 'https://cdn.shopify.com/s/files/1/0279/1227/products/Kith_x_Puma_x_Highsnobiety_10_Year_Collaboration_DSC-7840-Edit.jpg?v=1443178986']

I would like to match each product name with its URL, like this:

Product Name: Copper Bracelet - 3mm Polished
Product URL: http://store.highsnobiety.com/products/highsnobiety-x-simon-me-copper-bracelet-3mm

... and so on

The current code is:

from __future__ import print_function
from lxml import html
import requests

# Log file location, change "z://shopify_output.txt" to your location.
log = open("z:\\shopify_output.txt", "w")

# URL of Shopify website from user input (for testing, just use store.highsnobiety.com during input)
url = 'http://' + raw_input("Enter Shopify website URL (without HTTP):  ") + '/sitemap_products_1.xml'

page = requests.get(url)
tree = html.fromstring(page.content)

productNames = tree.xpath('//title/text()')
productURLS = tree.xpath('//loc/text()')

print('', file = log)
print('Sitemap Products: ', productNames, file = log)
print('', file = log)

print('', file = log)
print('Sitemap URLs: ', productURLS, file = log)
print('', file = log)

Any suggestions on what to try?

----------------------------------------------- ---------------

Current code attempting to combine the two:

from __future__ import print_function
from lxml import html
import requests
import time
import sys

reload(sys)
sys.setdefaultencoding('utf-8')

# Log file location, change "z://shopify_output.txt" to your location.
logFileLocation = "z:\shopify_output.txt"

log = open(logFileLocation, "w")

# URL of Shopify website from user input (for testing, just use store.highsnobiety.com during input)
url = 'http://' + raw_input("Enter Shopify website URL (without HTTP):  ") + '/sitemap_products_1.xml'

print ('Scraping! Check log file @ ' + logFileLocation + ' to see output.')
print ("!!! Also make sure to clear file every hour or so !!!")
while True :


    page = requests.get(url)
    tree = html.fromstring(page.content)

    url_tags =  tree.xpath("//url[image]")

    data = [(e.xpath("./image/title//text()")[0], e.xpath("./loc/text()")[0]) for e in  url_tags]

    for prod, url in data :

        productURL = [e.xpath("./loc/text()")[0] for e in  url_tags]

        productPage = requests.get(productURL)
        productTree = html.fromstring(productPage.content)

        variants = productTree.xpath("//variants[@type='array']//id[@type='integer']//text()")

        print(prod, variants)

1 Answer:

Answer 0 (score: 0)

You need to find each url tag first and pull the loc and title elements from inside each one, so that they stay associated:

from lxml import html
import requests

url = 'http://store.highsnobiety.com/sitemap_products_1.xml'

page = requests.get(url)
tree = html.fromstring(page.content)

# skip first url tag with no image:title
url_tags =  tree.xpath("//url[position() > 1]")

data = [(e.xpath("./image/title//text()")[0],e.xpath("./loc/text()")[0]) for e in  url_tags]

Data:

[('Copper Bracelet - 3mm - Polished', 'http://store.highsnobiety.com/products/highsnobiety-x-simon-me-copper-bracelet-3mm'), ('Copper Bracelet - 5mm - Brushed', 'http://store.highsnobiety.com/products/copy-of-highsnobiety-x-simon-me-copper-bracelet-5mm'), ('Copper Bracelet - 7mm - Polished', 'http://store.highsnobiety.com/products/copy-of-copy-of-highsnobiety-x-simon-me-copper-bracelet-7mm'), ('Highsnobiety x EARLY - Leather Pouch', 'http://store.highsnobiety.com/products/highsnobiety-x-early-leather-pouch'), (u'A Bathing Ape\xae Highsnobiety 10th Anniversary Tee', 'http://store.highsnobiety.com/products/a-bathing-ape-highsnobiety-10th-anniversary-tee'), ('Highsnobiety Magazine Issue 11', 'http://store.highsnobiety.com/products/highsnobiety-magazine-issue-11'), ('A5', 'http://store.highsnobiety.com/products/anniversary-product-04'), ('A6', 'http://store.highsnobiety.com/products/a6'), ('A7', 'http://store.highsnobiety.com/products/a7'), ('A8', 'http://store.highsnobiety.com/products/a8'), ('A9', 'http://store.highsnobiety.com/products/a9'), ('A10', 'http://store.highsnobiety.com/products/a10'), ('Highsnobiety x Stampd Snapback - New York', 'http://store.highsnobiety.com/products/highsnobiety-x-stampd'), ('Highsnobiety x Stampd Snapback - Berlin', 'http://store.highsnobiety.com/products/highsnobiety-x-stampd-snapback-berlin'), ('Carhartt WIP x Highsnobiety - Hooded Sweatshirt', 'http://store.highsnobiety.com/products/carhartt-wip-x-highsnobiety-hoodie'), ('Carhartt WIP x Highsnobiety - Long-Sleeve T-Shirt', 'http://store.highsnobiety.com/products/carhartt-wip-x-highsnobiety-longsleeve'), ('Carhartt WIP x Highsnobiety - Sweat Pants', 'http://store.highsnobiety.com/products/copy-of-carhartt-wip-x-highsnobiety-pants'), ('Carhartt WIP x Highsnobiety - Beanie', 'http://store.highsnobiety.com/products/copy-of-copy-of-carhartt-wip-x-highsnobiety-beanie'), ('adidas Consortium x Highsnobiety UltraBOOST', 
'http://store.highsnobiety.com/products/adidas-consortium-x-highsnobiety-ultraboost'), ('adidas Consortium x Highsnobiety Campus 80s', 'http://store.highsnobiety.com/products/adidas-consortium-x-highsnobiety-campus-80s'), ('Highsnobiety Tonal Logo Snapback - Black', 'http://store.highsnobiety.com/products/highsnobiety-logo-snapback-black'), ('Highsnobiety Tonal Logo Snapback - Navy', 'http://store.highsnobiety.com/products/highsnobiety-logo-snapback-navy'), ('Highsnobiety Tonal Logo Snapback - Red', 'http://store.highsnobiety.com/products/highsnobiety-logo-snapback-red'), ('Highsnobiety Tonal Logo Snapback - White', 'http://store.highsnobiety.com/products/highsnobiety-logo-snapback-white'), ('Highsnobiety Magazine Issue 9 - Yohji Yamamoto', 'http://store.highsnobiety.com/products/highsnobiety-magazine-issue-9-yohji-yamamoto'), ('Highsnobiety Magazine Issue 10 - Kobe Bryant', 'http://store.highsnobiety.com/products/copy-of-highsnobiety-magazine-issue-10-kobe-bryant'), ('Highsnobiety Magazine Issue 10 - Gosha Rubchinskiy', 'http://store.highsnobiety.com/products/highsnobiety-magazine-issue-10-gosha-rubchinskiy'), ('Ronnie Fieg x Highsnobiety x Puma RF-Blaze of Glory', 'http://store.highsnobiety.com/products/ronnie-fieg-x-highsnobiety-x-puma-rf-blaze-of-glory'), ('Ronnie Fieg x Highsnobiety x Puma RF698S', 'http://store.highsnobiety.com/products/copy-of-ronnie-fieg-x-highsnobiety-x-puma-rf-blaze-of-glory')]
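Once the pairs exist, the layout requested in the question is a matter of writing them out one product at a time. The following is a minimal sketch, assuming `data` is a list of (name, url) tuples like the one above. `write_pairs` is a hypothetical helper name, and an in-memory `io.StringIO` buffer stands in for the log file so the example runs anywhere.

```python
from __future__ import unicode_literals
import io


def write_pairs(data, out):
    """Write each (name, url) pair as a two-line record followed by a blank line."""
    for name, url in data:
        out.write('Product Name: {0}\n'.format(name))
        out.write('Product URL: {0}\n\n'.format(url))


# Two pairs copied from the data shown above.
data = [
    ('Copper Bracelet - 3mm - Polished',
     'http://store.highsnobiety.com/products/highsnobiety-x-simon-me-copper-bracelet-3mm'),
    ('A Bathing Ape\xae Highsnobiety 10th Anniversary Tee',
     'http://store.highsnobiety.com/products/a-bathing-ape-highsnobiety-10th-anniversary-tee'),
]

buf = io.StringIO()  # stand-in for the real log file
write_pairs(data, buf)
report = buf.getvalue()
```

For a real log file, opening it with `io.open(path, 'w', encoding='utf-8')` should handle non-ASCII names such as the one containing `\xae` without the `sys.setdefaultencoding` workaround.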

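Another problem in your code is that you are extracting the first url, which does not contain an image:title tag, so even if you zipped the two lists together you would lose data and the elements would not line up.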
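The misalignment can be seen without any parsing at all. The sketch below uses shortened copies of the two flat lists from the question's output (the first two names and the first three URLs); the variable names are only illustrative.

```python
from __future__ import print_function

# First two product names from the question's output.
names = [
    'Copper Bracelet - 3mm - Polished',
    'Copper Bracelet - 5mm - Brushed',
]

# First three entries of the flat URL list from the question's output.
locs = [
    'http://store.highsnobiety.com/',  # store root, has no product title
    'http://store.highsnobiety.com/products/highsnobiety-x-simon-me-copper-bracelet-3mm',
    'https://cdn.shopify.com/s/files/1/0279/1227/products/Highsnobiety-Bracelet-II-DSC-01.jpg?v=1439215473',
]

# zip pairs items purely by position, so every name is shifted by one
# and the surplus URLs are silently dropped.
naive_pairs = list(zip(names, locs))
for name, loc in naive_pairs:
    print(name, '->', loc)
```

The first name ends up next to the store root and the 5mm bracelet next to the 3mm URL, which is why the pairing has to happen per url node instead.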

If we did not know in advance that only the first url lacks what we want, we could instead select only the url nodes that have an image child:

# only select url nodes that have image child
url_tags =  tree.xpath("//url[image]")
data = [(e.xpath("./image/title//text()")[0], e.xpath("./loc/text()")[0]) for e in  url_tags]

This gives you exactly the same output as above.
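The same per-node idea can be tried offline. The sketch below is not the code above: it swaps in the standard library's `xml.etree.ElementTree` with explicit namespaces so that it runs without third-party packages or network access. The inline XML is a hand-written sample, the namespace URIs are the usual sitemap and image-sitemap ones (an assumption about what the store's sitemap declares), and `pair_products` is a hypothetical helper name.

```python
from __future__ import print_function
import xml.etree.ElementTree as ET

# Hand-written sample in the shape of a product sitemap (assumed layout).
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <loc>http://store.highsnobiety.com/</loc>
  </url>
  <url>
    <loc>http://store.highsnobiety.com/products/a6</loc>
    <image:image>
      <image:title>A6</image:title>
    </image:image>
  </url>
  <url>
    <loc>http://store.highsnobiety.com/products/a7</loc>
    <image:image>
      <image:title>A7</image:title>
    </image:image>
  </url>
</urlset>"""

# Prefix-to-URI map used when searching; 'sm' is an arbitrary local prefix.
NS = {
    'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9',
    'image': 'http://www.google.com/schemas/sitemap-image/1.1',
}


def pair_products(xml_text):
    """Return (title, loc) tuples, reading both values from the same url node."""
    root = ET.fromstring(xml_text)
    pairs = []
    for url_node in root.findall('sm:url', NS):
        title = url_node.find('image:image/image:title', NS)
        loc = url_node.find('sm:loc', NS)
        if title is None or loc is None:
            continue  # e.g. the store root, which has no image title
        pairs.append((title.text, loc.text))
    return pairs


pairs = pair_products(SAMPLE)
print(pairs)
```

Because the title and the loc are looked up relative to one url node at a time, a node with no title is skipped as a whole and cannot shift the remaining pairs.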

For the other URLs, you need to find the variants tag with the type='array' attribute, then select the id tags with type='integer' inside it and extract their text:

url ="http://store.highsnobiety.com/products/adidas-consortium-x-highsnobiety-ultraboost.xml"

import requests
from lxml.html import fromstring

page = requests.get(url)
tree = fromstring(page.content)
variants = tree.xpath("//variants[@type='array']//id[@type='integer']//text()")
print(variants)

Output:

['18099668803', '18100253571', '18100253699', '18100253763', '18100253827', '18100253955', '18100254019', '18100254083', '18100254147', '18100254211', '18100254275', '18100254403']
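To see why the expression is scoped to the variants tag, here is a small offline sketch. It uses the standard library's `xml.etree.ElementTree` in place of lxml, and the inline XML is a hand-written guess at the document layout inferred only from the XPath above, not a copy of the real product file. The two variant ids are taken from the output above; the product-level id `111` is a made-up placeholder.

```python
from __future__ import print_function
import xml.etree.ElementTree as ET

# Hand-written sample; the real product XML layout is assumed, not verified.
SAMPLE_PRODUCT = """<product>
  <id type="integer">111</id>
  <variants type="array">
    <variant>
      <id type="integer">18099668803</id>
    </variant>
    <variant>
      <id type="integer">18100253571</id>
    </variant>
  </variants>
</product>"""

root = ET.fromstring(SAMPLE_PRODUCT)

# Only id elements nested under the variants array are selected, so the
# product-level id is left out.
id_nodes = root.findall(".//variants[@type='array']//id[@type='integer']")
variant_ids = [node.text for node in id_nodes]
print(variant_ids)
```

A bare search for every id element would also return the product-level id, which is why the path starts from the variants array.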

So, combining the two:

url = 'http://store.highsnobiety.com/sitemap_products_1.xml'

page = requests.get(url)
tree = html.fromstring(page.content)

# skip first url tag with no image:title
url_tags =  tree.xpath("//url[position() > 1]")

data = [(e.xpath("./image/title//text()")[0],e.xpath("./loc/text()")[0]) for e in  url_tags]

for prod, url in data:
    # add the .xml extension to the product url
    page = requests.get(url + ".xml")
    tree = html.fromstring(page.content)
    variants = tree.xpath("//variants[@type='array']//id[@type='integer']//text()")
    print(prod, variants)

Output snippet:

Copper Bracelet - 3mm - Polished ['3723603267']
Copper Bracelet - 5mm - Brushed ['3726247811']
Copper Bracelet - 7mm - Polished ['3726253635']
Highsnobiety x EARLY - Leather Pouch ['14541472963', '14541473027', '14541473091']
A Bathing Ape® Highsnobiety 10th Anniversary Tee ['5279811715', '5765857347', '5765857411', '5765857475']
Highsnobiety Magazine Issue 11 ['7731814659', '7730944131', '7731801347', '7731821763', '7731652675', '7731695683', '7731831363', '7731791747']
A5 ['8133817731']
A6 ['8135296323']
A7 ['8135469443']
A8 ['8135518595']
A9 ['8135556035']