Trying to scrape a Shopify site with BeautifulSoup, but findAll('url') returns an empty list. How can I retrieve the content I need?
import requests
from bs4 import BeautifulSoup as soupify
import lxml
webSite = requests.get('https://launch.toytokyo.com/sitemap_pages_1.xml')
pageSource = webSite.text
webSite.close()
pageSource = soupify(pageSource, "xml")
print(pageSource.findAll('url'))
The page I'm trying to scrape: https://launch.toytokyo.com/sitemap_pages_1.xml
What I'm getting: an empty list
What I should be getting: a non-empty list
Edit: Thanks everyone for the help. I found the problem in my code: I was using findAll instead of find_all.
Answer 0 (score: 4)
You can do it like this:
import requests
from bs4 import BeautifulSoup as bs
url = 'https://launch.toytokyo.com/sitemap_pages_1.xml'
soup = bs(requests.get(url).content,'html.parser')
urls = [i.text for i in soup.find_all('loc')]
So basically I get the content, find the loc tags that contain the URLs, and then take their text ;)
Updated: using the required url tags instead and generating dictionaries:
urls = soup.find_all('url')
# Each url element holds loc, lastmod and changefreq children; skip the
# whitespace-only strings between them and map each tag name to its text.
s = [[{k.name: k.text} for k in entry if not isinstance(k, str)] for entry in urls]
Use from pprint import pprint as print to get a pretty print of s:
from pprint import pprint as print
print(s)
Note: you can use the lxml parser instead, as it is faster than html.parser.
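For instance, a minimal variation of the snippet above (assuming the lxml package is installed):
import requests
from bs4 import BeautifulSoup as bs

url = 'https://launch.toytokyo.com/sitemap_pages_1.xml'
# Same approach as before, with 'lxml' swapped in for 'html.parser'.
soup = bs(requests.get(url).content, 'lxml')
urls = [i.text for i in soup.find_all('loc')]
print(urls)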
Answer 1 (score: 2)
As an alternative to BeautifulSoup, you can always use xml.etree.ElementTree to parse the XML and pull the URLs from the loc tags:
from requests import get
from xml.etree.ElementTree import fromstring, ElementTree
from pprint import pprint
url = 'https://launch.toytokyo.com/sitemap_pages_1.xml'
req = get(url)
tree = ElementTree(fromstring(req.text))
urls = []
for outer in tree.getroot():  # each url element
    for inner in outer:  # its loc / lastmod / changefreq children
        # Tags come back as '{namespace}tag', so split off the namespace part.
        namespace, tag = inner.tag.split("}")
        if tag == 'loc':
            urls.append(inner.text)
pprint(urls)
which gives the following URLs in a list:
['https://launch.toytokyo.com/pages/about',
 'https://launch.toytokyo.com/pages/help',
 'https://launch.toytokyo.com/pages/terms',
 'https://launch.toytokyo.com/pages/visit-us']
From this, you can group the information into a collections.defaultdict:
from requests import get
from xml.etree.ElementTree import fromstring, ElementTree
from collections import defaultdict
from pprint import pprint
url = 'https://launch.toytokyo.com/sitemap_pages_1.xml'
req = get(url)
tree = ElementTree(fromstring(req.text))
data = defaultdict(dict)
for i, outer in enumerate(tree.getroot()):
    for inner in outer:
        namespace, tag = inner.tag.split("}")
        data[i][tag] = inner.text
pprint(data)
which gives the following defaultdict of dictionaries, keyed by index:
defaultdict(<class 'dict'>,
            {0: {'changefreq': 'weekly',
                 'lastmod': '2018-07-26T14:37:12-07:00',
                 'loc': 'https://launch.toytokyo.com/pages/about'},
             1: {'changefreq': 'weekly',
                 'lastmod': '2018-11-26T07:58:43-08:00',
                 'loc': 'https://launch.toytokyo.com/pages/help'},
             2: {'changefreq': 'weekly',
                 'lastmod': '2018-08-02T08:57:58-07:00',
                 'loc': 'https://launch.toytokyo.com/pages/terms'},
             3: {'changefreq': 'weekly',
                 'lastmod': '2018-05-21T15:02:36-07:00',
                 'loc': 'https://launch.toytokyo.com/pages/visit-us'}})
If you prefer to group by category instead, you can use a defaultdict of lists:
data = defaultdict(list)
for outer in tree.getroot():
    for inner in outer:
        namespace, tag = inner.tag.split("}")
        data[tag].append(inner.text)
pprint(data)
which gives a different structure:
defaultdict(<class 'list'>,
            {'changefreq': ['weekly', 'weekly', 'weekly', 'weekly'],
             'lastmod': ['2018-07-26T14:37:12-07:00',
                         '2018-11-26T07:58:43-08:00',
                         '2018-08-02T08:57:58-07:00',
                         '2018-05-21T15:02:36-07:00'],
             'loc': ['https://launch.toytokyo.com/pages/about',
                     'https://launch.toytokyo.com/pages/help',
                     'https://launch.toytokyo.com/pages/terms',
                     'https://launch.toytokyo.com/pages/visit-us']})
Answer 2 (score: 2)
Another way, using XPath:
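A minimal sketch with requests and lxml, assuming the sitemap uses the standard sitemaps.org namespace:
import requests
from lxml import etree

url = 'https://launch.toytokyo.com/sitemap_pages_1.xml'
# Parse the raw bytes: lxml rejects str input that carries an XML encoding declaration.
root = etree.fromstring(requests.get(url).content)

# Register a prefix for the sitemap namespace so the XPath query can match the loc tags.
ns = {'s': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
urls = root.xpath('//s:loc/text()', namespaces=ns)
print(urls)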
Answer 3 (score: 0)
I have tried to show exactly the approach you already attempted. The only thing you need to correct is webSite.text: if you use webSite.content instead, you get a valid response.
Here is a corrected version of your existing attempt:
import requests
from bs4 import BeautifulSoup
webSite = requests.get('https://launch.toytokyo.com/sitemap_pages_1.xml')
pageSource = BeautifulSoup(webSite.content, "xml")
for k in pageSource.find_all('url'):
    link = k.loc.text
    date = k.lastmod.text
    frequency = k.changefreq.text
    print(f'{link}\n{date}\n{frequency}\n')
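Based on the sitemap contents shown in the earlier answers, this should print something like:
https://launch.toytokyo.com/pages/about
2018-07-26T14:37:12-07:00
weekly

https://launch.toytokyo.com/pages/help
2018-11-26T07:58:43-08:00
weekly

https://launch.toytokyo.com/pages/terms
2018-08-02T08:57:58-07:00
weekly

https://launch.toytokyo.com/pages/visit-us
2018-05-21T15:02:36-07:00
weekly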