Get all hyperlinks in a table using Python requests and bs4

Date: 2020-05-23 04:30:30

Tags: python html beautifulsoup python-requests

There is a table on this page: https://www.investing.com/commodities/real-time-futures. The table's id is cross_rate_1.

I am trying to get the hyperlink associated with each item in the table's name column, e.g. //*[@id="cross_rate_1"]/tbody/tr[2]/td[2]/a.

Each item sits in a cell marked td class="bold left plusIconTd noWrap elp", which contains an a element with title and href attributes.

I tried the following code:

import requests
from bs4 import BeautifulSoup

urlheader = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest"
}

url = "https://www.investing.com/commodities/real-time-futures"
req = requests.get(url, headers=urlheader)
soup = BeautifulSoup(req.content, "lxml")
table = soup.find('table', id="cross_rate_1")

But this only gives me the table itself. I also tried:

links= soup.findAll("td", { "class" : "href" },)

But that returns an empty list.

How can I build a table like the one below?

Commodity  Hyperlink 
Gold       https://www.investing.com/commodities/gold
XAU/USD    https://www.investing.com/currencies/xau-usd
.....

1 answer:

Answer 0: (score: 3)

It's quite simple:

import requests
from bs4 import BeautifulSoup

urlheader = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest"
}

url = "https://www.investing.com/commodities/real-time-futures"
req = requests.get(url, headers=urlheader)
soup = BeautifulSoup(req.content, "lxml")
table = soup.find('table', id="cross_rate_1")

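# Walk every <a> inside the table and print its text and link target
for a in table.find_all('a'):
    text = a.get_text(strip=True)
    href = a.get("href")  # separate name, so the page url above is not overwritten
    print(text, href)
    # Or do whatever else you need with them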
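
As for why the second attempt came back empty: href is an attribute of the a element, not a class on the td, so searching for td elements with class "href" matches nothing. The sketch below shows one possible way to produce the Commodity / Hyperlink listing from the question. It is only an illustration under several assumptions: SAMPLE_HTML is a made-up, heavily trimmed stand-in for the real page, extract_links is a hypothetical helper name, the hrefs are assumed to be site-relative (hence urljoin), and the built-in "html.parser" is used so the example runs without lxml.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.investing.com"

# Hypothetical, trimmed stand-in for the real page markup (an assumption,
# not copied from the site). Only the class names come from the question.
SAMPLE_HTML = """
<table id="cross_rate_1">
  <tbody>
    <tr>
      <td class="bold left plusIconTd noWrap elp">
        <a title="Gold" href="/commodities/gold">Gold</a>
      </td>
      <td>(other columns)</td>
    </tr>
    <tr>
      <td class="bold left plusIconTd noWrap elp">
        <a title="XAU/USD" href="/currencies/xau-usd">XAU/USD</a>
      </td>
      <td>(other columns)</td>
    </tr>
  </tbody>
</table>
"""


def extract_links(html, base_url=BASE_URL):
    """Return (name, absolute_url) pairs for the name cells of the table."""
    soup = BeautifulSoup(html, "html.parser")
    # Select anchors that carry an href inside the name cells of the table.
    anchors = soup.select("table#cross_rate_1 td.plusIconTd a[href]")
    # urljoin turns a relative href into an absolute URL and leaves
    # already-absolute URLs unchanged.
    return [(a.get_text(strip=True), urljoin(base_url, a["href"])) for a in anchors]


rows = extract_links(SAMPLE_HTML)
print(f"{'Commodity':<10} Hyperlink")
for name, link in rows:
    print(f"{name:<10} {link}")
```

To run this against the live page, pass req.content from the request above instead of SAMPLE_HTML. If the site serves different markup to scripts, the selector may match nothing and the list will simply be empty.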