How do I get rid of the surrounding tags?

Date: 2016-12-30 02:13:55

Tags: python python-2.7 beautifulsoup python-requests

My current code is as follows:

import requests
from bs4 import BeautifulSoup
url = "http://boost-heaven.com/sitemap_products_1.xml"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
urls = soup.find_all("url")
links = soup.find_all("loc")
title = soup.find_all("image:title")
time = soup.find_all("lastmod")
image = soup.find_all("image:loc")
i = 0
while i <= len(urls) - 1:
    for item in urls:
        if "products" in str(item):
            if "products" in str(links):
                print title[i - 1]
                print links[i]
                print time[i - 1]
                print image[i -1]
        i = i + 1

It returns:

<image:title>PIN SWG</image:title>
<loc>http://boost-heaven.com/products/swg-pin</loc>
<lastmod>2016-12-29T06:13:25Z</lastmod>
<image:loc>https://cdn.shopify.com/s/files/1/1490/9704/products/swgpin2.jpg?v=1479148164</image:loc>
<image:title>BEANIE</image:title>
<loc>http://boost-heaven.com/products/bg-beanie</loc>
<lastmod>2016-12-29T00:10:45Z</lastmod>
<image:loc>https://cdn.shopify.com/s/files/1/1490/9704/products/redswg.jpg?v=1482967350</image:loc>
<image:title>BG FLOORMAT</image:title>
<loc>http://boost-heaven.com/products/bg-floormat</loc>
<lastmod>2016-12-29T09:47:00Z</lastmod>
<image:loc>https://cdn.shopify.com/s/files/1/1490/9704/products/floormatbg1.jpg?v=1482967260</image:loc>
<image:title>BG PABLO BURG</image:title>
<loc>http://boost-heaven.com/products/copy-of-bg-pablo-bn-t</loc>
<lastmod>2016-12-29T09:47:00Z</lastmod>
<image:loc>https://cdn.shopify.com/s/files/1/1490/9704/products/burgundypabloe.jpg?v=1482878401</image:loc>

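I want to get rid of the loc, lastmod and other tags and keep only the text, but I don't know how. I'd also like to remove the "Z" from the lastmod time and replace the "T" with " at ". Thanks.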

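3 answers:

Answer 0: (score: 0)

Try reinitializing each variable before it enters the loop. As written, the loop most likely picks up the first value and carries it through the other items that should be passed to the loop.

Answer 1: (score: 0)

I just get all the <url> tags, then search for the elements inside each one.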

# makes print() behave the same on Python 2.7 and Python 3
from __future__ import print_function

import requests
from bs4 import BeautifulSoup

url = "http://boost-heaven.com/sitemap_products_1.xml"

r = requests.get(url)

soup = BeautifulSoup(r.content, "html.parser")

# skip first element which has no data
all_urls = soup.find_all("url")[1:]

for url in all_urls:
    print('image:title:', url.find('image:title').get_text())
    print('        loc:', url.find('loc').get_text())
    # skip last char - "Z"
    print('    lastmod:', url.find('lastmod').get_text().replace("T", " at ")[:-1])
    print('  image:loc:', url.find('image:loc').get_text())
    print('---')

You can also write these two lines without find():

print('        loc:', url.loc.get_text())
print('    lastmod:', url.lastmod.get_text().replace("T", " at ")[:-1])

You can also use .text instead of get_text(), e.g. url.find('image:title').text
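The replace(...)[:-1] trick above assumes every lastmod value ends in "Z". If that is not guaranteed, parsing the timestamp is a bit safer. Below is a minimal sketch using only the standard library. The helper name format_lastmod is made up for illustration, and the format string assumes timestamps shaped like the ones in the question.

```python
from datetime import datetime


def format_lastmod(stamp):
    """Turn '2016-12-29T06:13:25Z' into '2016-12-29 at 06:13:25'."""
    # strptime raises ValueError when the stamp does not match the expected
    # pattern, so a value without the trailing "Z" is rejected instead of
    # silently losing its last digit the way a blind [:-1] slice would.
    parsed = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.strftime("%Y-%m-%d at %H:%M:%S")


print(format_lastmod("2016-12-29T06:13:25Z"))  # 2016-12-29 at 06:13:25
```

In the loop you would then call format_lastmod(url.lastmod.get_text()) instead of chaining replace() and the slice.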

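Answer 2: (score: 0)

Your inner loop iterates over all the same elements every time, rather than over the elements related to the current image link in the outer loop. The final values of the variables come from the last element of each list, so you get the same values every time.

You should loop over the <url> elements and then look for the specific items inside each one.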

import requests
from bs4 import BeautifulSoup

url = "http://boost-heaven.com/sitemap_products_1.xml"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")  # name the parser explicitly, as the question does

for url in soup.find_all("url"):
    titlenode = url.find("image:title")
    if titlenode:
        title = titlenode.text
        loc = url.find("loc").text
        lastmod = url.find("lastmod").text
        imageloc = url.find("image:loc").text
        print title + "\n" + loc + "\n" + lastmod + "\n" + imageloc
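
If you want to try the per-<url> lookup without hitting the live site, here is a rough sketch that runs on an inline sample and also reformats lastmod as the question asked. The sample markup is only an assumption about how the sitemap is laid out (the values are copied from the question's output), extract_products is a name made up for this example, and it uses the Python 3 style print().

```python
from bs4 import BeautifulSoup

# Inline stand-in for the sitemap; the real file's structure is assumed.
SAMPLE = """
<urlset>
  <url>
    <loc>http://boost-heaven.com/</loc>
  </url>
  <url>
    <loc>http://boost-heaven.com/products/bg-floormat</loc>
    <lastmod>2016-12-29T09:47:00Z</lastmod>
    <image:image>
      <image:loc>https://cdn.shopify.com/s/files/1/1490/9704/products/floormatbg1.jpg?v=1482967260</image:loc>
      <image:title>BG FLOORMAT</image:title>
    </image:image>
  </url>
</urlset>
"""


def extract_products(markup):
    """Return one dict per <url> entry that has an image title."""
    soup = BeautifulSoup(markup, "html.parser")
    products = []
    for url in soup.find_all("url"):
        title_node = url.find("image:title")
        if title_node is None:
            # entries such as the home page carry no product data
            continue
        lastmod = url.find("lastmod").get_text()
        products.append({
            "title": title_node.get_text(),
            "loc": url.find("loc").get_text(),
            # "T" becomes " at ", and a trailing "Z" is dropped if present
            "lastmod": lastmod.replace("T", " at ").rstrip("Z"),
            "image": url.find("image:loc").get_text(),
        })
    return products


for product in extract_products(SAMPLE):
    print(product["title"])
    print(product["loc"])
    print(product["lastmod"])
    print(product["image"])
```

Because every lookup starts from the current <url> element, the title, link, date and image always belong to the same product, which is what the index arithmetic in the question could not guarantee.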