Question

I am trying to extract tables from a website only if they contain a specific substring.

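I use requests to open the URL and pandas.read_html to extract the tables. Doing so, however, I either get every table or pick one table by index, and I want a way to extract only the tables that contain my keyword.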

import requests
import pandas as pd

# url is the website, html fetches the page and df_list extracts all tables
url = 'https://www.sec.gov/Archives/edgar/data/880432/000114420415073214/v427721_def14a.htm'
html = requests.get(url).content
df_list = pd.read_html(html)

From here I can print df_list[index], but I want only the tables that contain the keyword. I tried the following (neither attempt returned anything):

for i in range(len(df_list)):
    if 'Fees Earned' in df_list:
        print (df_list[i])

for i in range(len(df_list)):
    if any("Fees Earned" in s for s in df_list):
        print(df_list[i])

With other variations I only receive the output "False".

Answer 1

This should get you the table:

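from bs4 import BeautifulSoup

# Parse the downloaded page, then pick the first <table> whose text contains the keyword.
soup = BeautifulSoup(html, 'html.parser')
# ':-soup-contains' is the current spelling in soupsieve; older releases use ':contains'.
table = soup.select_one('table:-soup-contains("Fees Earned")')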

To convert it to a pandas DataFrame:

df = pd.read_html(str(table))[0]  # read_html returns a list of DataFrames; take the first

You may need to clean up the table before exporting it to Excel.
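As a rough illustration of the clean-up step, here is a minimal sketch. The helper name clean_table and the sample frame are made up for this example; the real table from the filing will look different and may need more work (header rows, merged cells, and so on).

```python
import pandas as pd


def clean_table(raw: pd.DataFrame) -> pd.DataFrame:
    """Drop rows and columns that are entirely empty, then renumber the rows."""
    cleaned = raw.dropna(how="all").dropna(axis=1, how="all")
    return cleaned.reset_index(drop=True)


# Hypothetical stand-in for a table returned by pd.read_html: spacer cells
# in the HTML layout tend to show up as all-NaN rows and columns.
raw = pd.DataFrame(
    {
        0: ["Name", None, "A. Director"],
        1: [None, None, None],
        2: ["Fees Earned", None, "50,000"],
    }
)

cleaned = clean_table(raw)
print(cleaned)

# Exporting needs an Excel writer engine such as openpyxl to be installed:
# cleaned.to_excel("fees_earned.xlsx", index=False)
```

Dropping only the fully empty rows and columns is the conservative choice here, since it cannot discard a cell that holds data.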

Answer 2

Maybe this will work:

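for df in df_list:
    # Drop fully empty rows, then any column that still has a missing value.
    new_df = df.dropna(how='all').dropna(axis=1, how='any')
    # to_string() renders the whole frame, so the search is not cut short by display truncation.
    if "Fees Earned" in df.to_string():
        print(new_df)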
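The same idea can be wrapped in a small helper that checks the cells and the column labels directly instead of searching the printed form. This is only a sketch: tables_containing is a hypothetical name, and the two sample frames stand in for whatever pd.read_html returns for the real page.

```python
import pandas as pd


def tables_containing(tables, keyword):
    """Return the DataFrames whose column labels or cells contain `keyword`."""
    matches = []
    for table in tables:
        in_header = any(keyword in str(label) for label in table.columns)
        # Compare every cell as text; regex=False treats the keyword literally.
        in_cells = (
            table.astype(str)
            .apply(lambda col: col.str.contains(keyword, regex=False, na=False))
            .to_numpy()
            .any()
        )
        if in_header or in_cells:
            matches.append(table)
    return matches


# Hypothetical sample data in place of the tables scraped from the filing.
df_list = [
    pd.DataFrame({"Name": ["A", "B"], "Age": [61, 58]}),
    pd.DataFrame({0: ["Name", "A"], 1: ["Fees Earned or Paid in Cash ($)", "50,000"]}),
]

found = tables_containing(df_list, "Fees Earned")
print(len(found))
```

It may also be worth trying the match argument of pd.read_html, which takes a string or regular expression and returns only the tables containing matching text, so a separate filter might not be needed at all.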

Find specific table from website by string using Pandas
