Reading the NASDAQ HTML table into a DataFrame

Time: 2019-01-13 20:16:57

Tags: python-3.x pandas parsing dataframe data-processing

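I am using this code to fetch the latest list of traded companies from NASDAQ, but I would like the result in a DataFrame rather than a printed list that includes lots of other information I may not need.

Any ideas on how to achieve this? Thanks.

Parse the latest NASDAQ companies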

    from bs4 import BeautifulSoup
    import requests

    # The URL is split into two adjacent string literals so it stays one valid string.
    r = requests.get(
        'https://www.nasdaq.com/screening/companies-by-industry.aspx'
        '?exchange=NASDAQ&sortname=marketcap&sorttype=1&pagesize=4000'
    )
    data = r.text
    soup = BeautifulSoup(data, "html.parser")
    table = soup.find( "table", {"id":"CompanylistResults"} )
    for row in table.findAll("tr"):
        for cell in row("td"):
            print (cell.get_text().strip())

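1 Answer:

Answer 0: (score: 2)

It looks like you are looking for the aptly named read_html, although you will need to experiment a little until you get exactly what you want. In your case: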

>>> import pandas as pd
>>> df=pd.read_html(table.prettify(),flavor='bs4')[0]
>>> df.columns = [c.strip() for c in df.columns]

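See the output below.

The first line does the actual work; the second just strips the stray whitespace and newlines from the column headers. There also appears to be a hidden ADR TSO column that seems useless, so you can drop it if you do not know what it is. It may also make sense to drop every second row (index 1, 3, 5, ...), since each one is just a continuation of the row above it and, as far as I can tell, holds nothing but useless links. In code: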

>>> df = df.drop(['ADR TSO'], axis=1) #Drop useless column
>>> df1= df[::2] #To get rid of even rows
>>> df2= df[~df['Name'].str.contains('Stock Quote')].head() #By string filtration if we are not sure about the odd/even thing
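The cleanup steps can be tried without hitting the site. Below is a minimal sketch that rebuilds by hand a few columns of the first five rows from the output shown in this answer, then compares the positional filter with the string filter. The names by_position, by_content and by_missing are made up for illustration, and the dropna variant is an extra option that is not part of the original answer.

```python
import pandas as pd

# Hand-built stand-in for the first rows of the parsed table; the values are
# copied from the output shown in this answer, None marks the empty cells.
df = pd.DataFrame({
    "Name": [
        "Amazon.com, Inc.",
        "AMZN Stock Quote  AMZN Ratings  AMZN Stock Report",
        "Microsoft Corporation",
        "MSFT Stock Quote  MSFT Ratings  MSFT Stock Report",
        "Alphabet Inc.",
    ],
    "Symbol": ["AMZN", None, "MSFT", None, "GOOGL"],
    "Market Cap": ["$802.18B", None, "$789.12B", None, "$740.3B"],
    "ADR TSO": [None, None, None, None, None],
})

df = df.drop(columns=["ADR TSO"])  # drop the empty column

# Option 1: keep rows 0, 2, 4, ... (relies on the strict alternation of rows)
by_position = df[::2]

# Option 2: keep rows whose Name is not a block of quote links
by_content = df[~df["Name"].str.contains("Stock Quote")]

# Option 3 (assumption, not in the original answer): keep rows that have a Symbol
by_missing = df.dropna(subset=["Symbol"])

print(by_position.equals(by_content))
print(list(by_missing.index))
```

The positional slice is the shortest, but it silently returns the wrong rows if the page ever stops alternating company rows and link rows; the content-based filters do not depend on row order.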

Output of the original head, shown for reference only:

>>> df.head()
                                                Name Symbol Market Cap  \
0                                   Amazon.com, Inc.   AMZN   $802.18B
1  AMZN Stock Quote  AMZN Ratings  AMZN Stock Report    NaN        NaN
2                              Microsoft Corporation   MSFT   $789.12B
3  MSFT Stock Quote  MSFT Ratings  MSFT Stock Report    NaN        NaN
4                                      Alphabet Inc.  GOOGL    $740.3B

   ADR TSO        Country IPO Year  \
0      NaN  United States     1997
1      NaN            NaN      NaN
2      NaN  United States     1986
3      NaN            NaN      NaN
4      NaN  United States      n/a

                                         Subsector
0                   Catalog/Specialty Distribution
1                                              NaN
2          Computer Software: Prepackaged Software
3                                              NaN
4  Computer Software: Programming, Data Processing

Output of the cleaned df.head():

                    Name Symbol Market Cap        Country IPO Year  \
0       Amazon.com, Inc.   AMZN   $802.18B  United States     1997
2  Microsoft Corporation   MSFT   $789.12B  United States     1986
4          Alphabet Inc.  GOOGL    $740.3B  United States      n/a
6          Alphabet Inc.   GOOG   $735.24B  United States     2004
8             Apple Inc.   AAPL    $720.3B  United States     1980

                                         Subsector
0                   Catalog/Specialty Distribution
2          Computer Software: Prepackaged Software
4  Computer Software: Programming, Data Processing
6  Computer Software: Programming, Data Processing
8                           Computer Manufacturing
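For comparison, the loop from the question can also be adapted to collect the cells instead of printing them. This is only a sketch: the inline HTML is a made-up stand-in for the live page (only the table id is taken from the question), and the names columns and rows are illustrative.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Made-up stand-in for the downloaded page; only the table id matches the question.
html = """
<table id="CompanylistResults">
  <tr><th>Name</th><th>Symbol</th></tr>
  <tr><td> Amazon.com, Inc. </td><td> AMZN </td></tr>
  <tr><td> Microsoft Corporation </td><td> MSFT </td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"id": "CompanylistResults"})

# Header cells become the column names.
columns = [th.get_text().strip() for th in table.find_all("th")]

# Each <tr> that contains <td> cells becomes one row of stripped strings.
rows = [
    [td.get_text().strip() for td in tr.find_all("td")]
    for tr in table.find_all("tr")
    if tr.find("td") is not None
]

df = pd.DataFrame(rows, columns=columns)
print(df)
```

This version gives full control over which cells are kept, at the cost of more code than read_html; on the real page the link rows would still need to be filtered out as described above.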