BeautifulSoup can't find XML tags. How do I fix this?

Asked: 2018-12-31 10:15:49

Tags: python web-scraping beautifulsoup

I'm trying to scrape a Shopify site with BeautifulSoup, but findAll('url') returns an empty list. How can I retrieve the content I'm after?

import requests
from bs4 import BeautifulSoup as soupify
import lxml

webSite = requests.get('https://launch.toytokyo.com/sitemap_pages_1.xml')
pageSource = webSite.text
webSite.close()

pageSource = soupify(pageSource, "xml")
print(pageSource.findAll('url'))

The page I'm trying to scrape: https://launch.toytokyo.com/sitemap_pages_1.xml

What I'm getting: an empty list

What I should be getting: a non-empty list

Thanks everyone for the help. I found the problem in my code: I was using findAll instead of find_all.

4 Answers:

Answer 0 (score: 4):

You can do it like this:

import requests
from bs4 import BeautifulSoup as bs

url = 'https://launch.toytokyo.com/sitemap_pages_1.xml'

soup = bs(requests.get(url).content, 'html.parser')

urls = [i.text for i in soup.find_all('loc')]

So basically I get the page content, find the loc tags that hold the URLs, and then grab their text ;)
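
With this sitemap, urls should come out as the same four page URLs that the ElementTree answer below also extracts:

['https://launch.toytokyo.com/pages/about',
 'https://launch.toytokyo.com/pages/help',
 'https://launch.toytokyo.com/pages/terms',
 'https://launch.toytokyo.com/pages/visit-us']
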

Update: grabbing the url tags as requested and building dictionaries:

urls = soup.find_all('url')

s = [[{k.name: k.text} for k in url if not isinstance(k, str)] for url in urls]

Use pprint to get a nicely formatted view of s:

from pprint import pprint
pprint(s)
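
Given the values shown in the ElementTree answer below, s should look roughly like this (first entry shown; this assumes the tags appear in the usual loc / lastmod / changefreq order inside each url element):

[[{'loc': 'https://launch.toytokyo.com/pages/about'},
  {'lastmod': '2018-07-26T14:37:12-07:00'},
  {'changefreq': 'weekly'}],
 ...]
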

Note: you can use the lxml parser instead of html.parser, since it is faster.
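
For example, reusing the names from the snippet above (this assumes the lxml package is installed; 'lxml' is lxml's HTML parser, while 'lxml-xml', the same thing as features="xml", parses the sitemap as actual XML):

soup = bs(requests.get(url).content, 'lxml')      # lxml's fast HTML parser
# or, to parse the sitemap as XML rather than HTML:
soup = bs(requests.get(url).content, 'lxml-xml')  # equivalent to features="xml"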

Answer 1 (score: 2):

As an alternative to BeautifulSoup, you can always use xml.etree.ElementTree to parse the XML and pull the URLs out of the loc tags:

from requests import get
from xml.etree.ElementTree import fromstring, ElementTree
from pprint import pprint

url = 'https://launch.toytokyo.com/sitemap_pages_1.xml'

req = get(url)
tree = ElementTree(fromstring(req.text))

urls = []
for outer in tree.getroot():
    for inner in outer:
        # tags come back as '{namespace}loc', so split off the namespace part
        namespace, tag = inner.tag.split("}")
        if tag == 'loc':
            urls.append(inner.text)

pprint(urls)

Which gives the following URLs in a list:

['https://launch.toytokyo.com/pages/about',
 'https://launch.toytokyo.com/pages/help',
 'https://launch.toytokyo.com/pages/terms',
 'https://launch.toytokyo.com/pages/visit-us']

From there, you can group the information into a collections.defaultdict:

from requests import get
from xml.etree.ElementTree import fromstring, ElementTree
from collections import defaultdict
from pprint import pprint

url = 'https://launch.toytokyo.com/sitemap_pages_1.xml'

req = get(url)
tree = ElementTree(fromstring(req.text))

data = defaultdict(dict)
for i, outer in enumerate(tree.getroot()):
    for inner in outer:
        namespace, tag = inner.tag.split("}")
        data[i][tag] = inner.text

pprint(data)

Which gives the following defaultdict of dictionaries, keyed by index:

defaultdict(<class 'dict'>,
            {0: {'changefreq': 'weekly',
                 'lastmod': '2018-07-26T14:37:12-07:00',
                 'loc': 'https://launch.toytokyo.com/pages/about'},
             1: {'changefreq': 'weekly',
                 'lastmod': '2018-11-26T07:58:43-08:00',
                 'loc': 'https://launch.toytokyo.com/pages/help'},
             2: {'changefreq': 'weekly',
                 'lastmod': '2018-08-02T08:57:58-07:00',
                 'loc': 'https://launch.toytokyo.com/pages/terms'},
             3: {'changefreq': 'weekly',
                 'lastmod': '2018-05-21T15:02:36-07:00',
                 'loc': 'https://launch.toytokyo.com/pages/visit-us'}})

If you would rather group by field, you can use a defaultdict of lists:

data = defaultdict(list)
for outer in tree.getroot():
    for inner in outer:
        namespace, tag = inner.tag.split("}")
        data[tag].append(inner.text)

pprint(data)

Which gives a different structure:

defaultdict(<class 'list'>,
            {'changefreq': ['weekly', 'weekly', 'weekly', 'weekly'],
             'lastmod': ['2018-07-26T14:37:12-07:00',
                         '2018-11-26T07:58:43-08:00',
                         '2018-08-02T08:57:58-07:00',
                         '2018-05-21T15:02:36-07:00'],
             'loc': ['https://launch.toytokyo.com/pages/about',
                     'https://launch.toytokyo.com/pages/help',
                     'https://launch.toytokyo.com/pages/terms',
                     'https://launch.toytokyo.com/pages/visit-us']})

Answer 2 (score: 2):

Another way, using XPath:

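A minimal sketch of that approach with lxml's xpath, matching loc elements by local name so the sitemap namespace does not get in the way:

import requests
from lxml import etree

url = 'https://launch.toytokyo.com/sitemap_pages_1.xml'
root = etree.fromstring(requests.get(url).content)

# sitemap tags are namespaced, so match on the local tag name only
urls = root.xpath('//*[local-name()="loc"]/text()')
print(urls)
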

Answer 3 (score: 0):

I've tried to stick exactly to the approach you already attempted. The only thing you need to correct is webSite.text: use webSite.content instead and you get a valid result. Feeding the parser raw bytes lets it handle the document's encoding declaration itself, whereas lxml's XML parser refuses an already-decoded string that still carries one, which is likely why your original attempt came back empty.

Here is the corrected version of your existing attempt:

import requests
from bs4 import BeautifulSoup

webSite = requests.get('https://launch.toytokyo.com/sitemap_pages_1.xml')
pageSource = BeautifulSoup(webSite.content, "xml")  # bytes in, parsed as XML
for k in pageSource.find_all('url'):
    link = k.loc.text              # <loc> child of this <url>
    date = k.lastmod.text          # <lastmod> timestamp
    frequency = k.changefreq.text  # <changefreq> value
    print(f'{link}\n{date}\n{frequency}\n')