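The website https://www.investing.com/commodities/real-time-futures
has a table. The table's ID is cross_rate_1.
I am trying to get all the hyperlinks associated with each item in the table, e.g. //*[@id="cross_rate_1"]/tbody/tr[2]/td[2]/a.
Each item sits in a cell marked td class="bold left plusIconTd noWrap elp", which contains an a tag with a title and an href.
I tried the following code: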
urlheader = {
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
"X-Requested-With": "XMLHttpRequest"
}
url = "https://www.investing.com/commodities/real-time-futures"
req = requests.get(url, headers=urlheader)
soup = BeautifulSoup(req.content, "lxml")
table = soup.find('table', id="cross_rate_1")
But this only gives me the table itself. I also tried
links= soup.findAll("td", { "class" : "href" },)
but it comes back empty.
How can I create a table like the following:
Commodity Hyperlink
Gold https://www.investing.com/commodities/gold
XAU/USD https://www.investing.com/currencies/xau-usd
.....
Answer 0 (score: 3)
It's simple:
import requests
from bs4 import BeautifulSoup
urlheader = {
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
"X-Requested-With": "XMLHttpRequest"
}
url = "https://www.investing.com/commodities/real-time-futures"
req = requests.get(url, headers=urlheader)
soup = BeautifulSoup(req.content, "lxml")
table = soup.find('table', id="cross_rate_1")
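# Walk every <a> tag inside the table and print its text and link
for a in table.find_all('a'):
    text = a.text
    href = a.get("href")  # named href so it does not overwrite the page url above
    print(text, href)
    # Or do whatever you want with them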
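As a side note, the second attempt in the question returns nothing because href is an attribute of the a tag, not a class on the td. Below is a minimal sketch of building the Commodity / Hyperlink table from the question. It assumes the links may be relative, so it resolves them against the site root with urljoin. SAMPLE_HTML, BASE_URL and extract_links are hypothetical names made up for this illustration, and the sample markup only imitates the structure described in the question (it is not the real page source), so the selector may need adjusting against the live page.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Assumption: relative links on the page should resolve against the site root.
BASE_URL = "https://www.investing.com"

# Hypothetical stand-in for the downloaded page, imitating the structure
# described in the question (table id and td classes); not the real markup.
SAMPLE_HTML = """
<table id="cross_rate_1">
  <tbody>
    <tr>
      <td class="bold left plusIconTd noWrap elp">
        <a title="Gold" href="/commodities/gold">Gold</a>
      </td>
    </tr>
    <tr>
      <td class="bold left plusIconTd noWrap elp">
        <a title="XAU/USD" href="/currencies/xau-usd">XAU/USD</a>
      </td>
    </tr>
  </tbody>
</table>
"""


def extract_links(html, base_url=BASE_URL):
    """Return (name, absolute_url) pairs for the links in the table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id="cross_rate_1")
    if table is None:
        # The table was not in the HTML we received.
        return []
    rows = []
    # Match <a href=...> tags inside cells carrying the plusIconTd class.
    for a in table.select("td.plusIconTd a[href]"):
        name = a.get_text(strip=True)
        # urljoin leaves absolute URLs unchanged and resolves relative ones.
        rows.append((name, urljoin(base_url, a["href"])))
    return rows


rows = extract_links(SAMPLE_HTML)
print(f"{'Commodity':<12}Hyperlink")
for name, link in rows:
    print(f"{name:<12}{link}")
```

To run this against the live page, pass req.content from the code above to extract_links instead of SAMPLE_HTML.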