Scrape two columns from a given URL in Python

Time: 2018-04-30 06:56:27

Tags: python python-3.x web-scraping beautifulsoup

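I have to scrape data from Trending tickers - Yahoo. I only need the symbol and the name of each company from the table. I have written code that scrapes the whole table, but how can I get just the columns I need?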

My code is:

import requests
from bs4 import BeautifulSoup
import pandas


url = 'https://finance.yahoo.com/trending-tickers'
r = requests.get(url)
soup = BeautifulSoup(r.text,'html.parser')
table = soup.find("table",{"class":"yfinlist-table W(100%) BdB Bdc($tableBorderGray)"})

tableRows = [[td.text for td in row.find_all("td")] for row in table.find_all("tr")[1:]]
tableHeaders = [th.text for th in table.find_all("th")]
df = pandas.DataFrame(tableRows,columns = tableHeaders)

print(df)

1 answer:

Answer 0 (score: 0)

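You have scraped all the rows correctly. However, to get only the symbol and the name, you have to loop over the rows one at a time and extract both values in each iteration. If you right-click any symbol and inspect it, you can see that the text sits inside an <a> tag. That <a> tag has the following format: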

<a href="/quote/S?p=S" title="Sprint Corporation" data-symbol="S" class="Fw(b)" data-reactid="57">S</a>

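As you can see, the symbol is the text of the tag and the name is held in its title attribute. So to get both the symbol and the name, you only need this one tag. Because it is the first <a> tag in the row, you can get it simply with row.find('a').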
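To try this without hitting the live page, here is a minimal sketch that runs the same lookup on a small hand-written table. The table markup is an assumption modelled on the <a> tag shown above (the real page may differ), and the check for a missing link is an extra precaution that is not part of the original approach.

```python
from bs4 import BeautifulSoup

# Hypothetical table modelled on the <a> tag shown above; not the real page.
html = """
<table>
  <tr><th>Symbol</th><th>Name</th></tr>
  <tr><td><a href="/quote/S?p=S" title="Sprint Corporation" data-symbol="S" class="Fw(b)">S</a></td><td>Sprint Corporation</td></tr>
  <tr><td>row without a link</td><td></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

pairs = []
for row in soup.find_all("tr")[1:]:  # skip the header row
    a_tag = row.find("a")            # first <a> in the row, or None
    if a_tag is None:
        continue                     # no link in this row, nothing to extract
    symbol = a_tag.get_text(strip=True)
    name = a_tag.get("title", "")    # empty string if the attribute is absent
    pairs.append((symbol, name))

print(pairs)
```

Here row.find('a') returns None for the last row, so skipping it avoids an AttributeError on a_tag.text.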

Full code:

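import requests
from bs4 import BeautifulSoup

# Download the page and parse it (the 'lxml' parser must be installed)
r = requests.get('https://finance.yahoo.com/trending-tickers')
soup = BeautifulSoup(r.text, 'lxml')

# Locate the tickers table by its exact class string
table = soup.find('table', class_='yfinlist-table W(100%) BdB Bdc($tableBorderGray)')
for row in table.find_all('tr')[1:]:  # skip the header row
    a_tag = row.find('a')     # first <a> tag in the row
    symbol = a_tag.text       # link text is the ticker symbol
    name = a_tag['title']     # title attribute holds the company name
    print(symbol, name)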

Output:

S Sprint Corporation
TMUS T-Mobile US, Inc.
AAPL Apple Inc.
^HSI HANG SENG INDEX
^N225 Nikkei 225
000001.SS SSE Composite Index
WMT Walmart Inc.
NKE NIKE, Inc.
^FTSE FTSE 100
^AORD ALL ORDINARIES
BTC-USD Bitcoin USD
CL=F Crude Oil
MCD McDonald's Corporation
AUDUSD=X AUD/USD
KO The Coca-Cola Company
DIS The Walt Disney Company
GBPUSD=X GBP/USD
GERN Geron Corporation
^NSEI NIFTY 50
TSLA Tesla, Inc.
VZ Verizon Communications Inc.
EURUSD=X EUR/USD
^BSESN S&P BSE SENSEX
GC=F Gold
0700.HK Tencent Holdings Limited
^KS11 KOSPI Composite Index
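
If you would rather keep the whole-table approach from the question, another option is to pick the two columns by header name after scraping. The sketch below is only an illustration: select_columns is a hypothetical helper, and the header names and placeholder values are assumptions, not data taken from the page.

```python
def select_columns(headers, rows, wanted):
    """Return only the wanted columns from each row, looked up by header name."""
    indexes = [headers.index(name) for name in wanted]
    return [[row[i] for i in indexes] for row in rows]


# Hypothetical stand-ins for tableHeaders and tableRows from the question.
# The header names are assumed and "n/a" is a placeholder, not real page data.
table_headers = ["Symbol", "Name", "Last Price"]
table_rows = [
    ["S", "Sprint Corporation", "n/a"],
    ["AAPL", "Apple Inc.", "n/a"],
]

selected = select_columns(table_headers, table_rows, ["Symbol", "Name"])
for symbol, name in selected:
    print(symbol, name)
```

With the DataFrame from the question, the equivalent would be df[["Symbol", "Name"]], assuming the scraped headers are exactly "Symbol" and "Name".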