Finding a specific table on a website by string using Pandas

Time: 2019-07-10 19:07:28

Tags: python python-3.x pandas python-requests

[Image: picture of the table]

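I am trying to extract a table from a website only if the table contains a specific substring.

I use requests to open the URL and pandas.read_html to extract the tables. Doing it this way, however, I either extract all of the tables or extract one specific table by index, and I would like a way to extract only the table that contains my keyword.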

import requests
import pandas as pd

# url is the website, html opens the site and df_list is extracting all tables
url = 'https://www.sec.gov/Archives/edgar/data/880432/000114420415073214/v427721_def14a.htm'
html = requests.get(url).content
df_list = pd.read_html(html)

From here I can print df_list[index], but I want the table that contains the keyword. I tried the following (neither attempt returned anything):

for i in range(len(df_list)):
    if 'Fees Earned' in df_list:
        print (df_list[i])

for i in range(len(df_list)):
    if any("Fees Earned" in s for s in df_list):
        print(df_list[i])

Another attempt only gave me the output "False".

2 Answers:

Answer 0 (score: 1)

This should give you the table:

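from bs4 import BeautifulSoup

# html is the page content fetched with requests in the question
soup = BeautifulSoup(html, 'html.parser')
# ':contains' is a non-standard selector; newer soupsieve versions prefer ':-soup-contains'
table = soup.select_one('table:contains("Fees Earned")')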

To convert it to a pandas DataFrame:

# read_html returns a list of DataFrames, so take the first one
df = pd.read_html(str(table))[0]

You may need to clean up the table before exporting it to Excel.
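As a possible alternative that skips BeautifulSoup, pandas.read_html accepts a match argument (a string or regex) and returns only the tables whose text matches it. The sketch below is a minimal illustration under some assumptions: SAMPLE_HTML and the tables_matching helper are made up for the example and stand in for the real page, and an HTML parser that read_html can use (for instance lxml) is assumed to be installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the downloaded page: two tables, one with the keyword.
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>Age</th></tr>
  <tr><td>Alice</td><td>61</td></tr>
</table>
<table>
  <tr><th>Name</th><th>Fees Earned</th></tr>
  <tr><td>Alice</td><td>1000</td></tr>
</table>
"""


def tables_matching(html_text, keyword):
    """Return only the tables whose text matches keyword (string or regex)."""
    try:
        # StringIO wraps the literal HTML so it is treated as file-like input
        return pd.read_html(StringIO(html_text), match=keyword)
    except ValueError:
        # read_html raises ValueError when no table matches the pattern
        return []


matches = tables_matching(SAMPLE_HTML, "Fees Earned")
print(len(matches))
print(matches[0].columns.tolist())
```

With the real page, the same call would take the downloaded HTML text in place of SAMPLE_HTML. Note that match is applied as a regular expression, so keywords containing characters such as parentheses or dots may need escaping with re.escape.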

Answer 1 (score: 0)

Maybe this will work:

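for df in df_list:
    # drop rows that are entirely empty, then columns with any missing value
    new_df = df.dropna(how='all').dropna(axis=1, how='any')
    # to_string() renders the whole frame; str(df) can abbreviate large tables
    if "Fees Earned" in df.to_string():
        print(new_df)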
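The same idea can be wrapped in a small reusable function. This is only a sketch: tables_containing and sample_list are hypothetical names introduced for the example, and the sample frames stand in for the df_list that pd.read_html would return for the real page.

```python
import pandas as pd


def tables_containing(frames, keyword):
    """Return the DataFrames whose headers or cells contain keyword."""
    hits = []
    for frame in frames:
        # to_string() renders the full frame rather than the abbreviated repr
        if keyword in frame.to_string():
            hits.append(frame)
    return hits


# Hypothetical stand-ins for the df_list produced by pd.read_html
sample_list = [
    pd.DataFrame({"Name": ["Alice"], "Age": [61]}),
    pd.DataFrame({"Name": ["Alice"], "Fees Earned": [1000]}),
    pd.DataFrame({0: ["Name", "Alice"], 1: ["Fees Earned", "1000"]}),
]

found = tables_containing(sample_list, "Fees Earned")
print(len(found))
```

Searching the rendered text catches the keyword whether it ended up as a column header or as an ordinary cell, which can differ from table to table depending on how the page marks up its header rows.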