Question

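I want to use Python to extract all the URLs from the following web page: https://yeezysupply.com/pages/all. I tried other suggestions I found, but none of them seem to work on this particular site. I end up with no URLs at all.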

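import urllib.request
import lxml.html

# Fetch the raw HTML of the page
connection = urllib.request.urlopen('https://yeezysupply.com/pages/all')

# Parse the response body into an element tree
dom = lxml.html.fromstring(connection.read())

# Print the href attribute of every anchor element
for link in dom.xpath('//a/@href'):
    print(link)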

Answer 1

Perhaps you could use modules designed specifically for this. Here is a quick and dirty script that gets the relevant links on the page:

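#!/usr/bin/python3

import bs4
import requests

# Download the page
res = requests.get('https://yeezysupply.com/pages/all')

# Parse the HTML and keep only the anchors that actually have an href attribute
soup = bs4.BeautifulSoup(res.text, 'html.parser')
links = soup.find_all('a', href=True)

for link in links:
    print(link['href'])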

It produces output like this:

/pages/jewelry
/pages/clothing
/pages/footwear
/pages/all
/cart
/products/womens-boucle-dress-bleach/?back=%2Fpages%2Fall
/products/double-sleeve-sweatshirt-bleach/?back=%2Fpages%2Fall
/products/boxy-fit-zip-up-hoodie-light-sand/?back=%2Fpages%2Fall
/products/womens-boucle-skirt-cream/?back=%2Fpages%2Fall
etc...

Is this what you're looking for? requests and Beautiful Soup are great scraping tools.
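The hrefs printed above are site-relative paths rather than full URLs. If absolute URLs are wanted, a minimal sketch along these lines might help. It assumes the base URL is the page that was fetched; the helper name `to_absolute` is hypothetical, and the sample hrefs are copied from the output above.

```python
from urllib.parse import urljoin

BASE_URL = 'https://yeezysupply.com/pages/all'  # the page the links came from


def to_absolute(base_url, hrefs):
    """Resolve each (possibly relative) href against base_url."""
    return [urljoin(base_url, href) for href in hrefs]


# A few hrefs taken from the sample output above
hrefs = [
    '/pages/jewelry',
    '/cart',
    '/products/womens-boucle-dress-bleach/?back=%2Fpages%2Fall',
]

for url in to_absolute(BASE_URL, hrefs):
    print(url)
```

`urljoin` leaves hrefs that are already absolute unchanged, so the same helper can be applied to a mixed list.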

Answer 2

There are no links in the page source; they are inserted with JavaScript after the page has loaded in the browser.
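One way to check whether this applies to a given page is to count the anchors in the HTML exactly as the server returns it. The sketch below uses only the standard library; the names `LinkCollector` and `static_links` and the two HTML snippets are made up for illustration and are not taken from the site. If the downloaded source yields an empty list while the browser shows links, they are probably being added by a script, and a plain HTTP fetch will not see them.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag present in the static HTML."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)


def static_links(html_text):
    """Return the hrefs found in html_text without running any JavaScript."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


# Hypothetical snippets: one with a link in the source, one that relies on a script
static_page = '<html><body><a href="/cart">Cart</a></body></html>'
script_page = ('<html><body><div id="app"></div>'
               '<script>/* links are added at runtime */</script></body></html>')

print(static_links(static_page))  # ['/cart']
print(static_links(script_page))  # []
```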
