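I am trying to extract tables from a website only when they contain a particular substring.
I use requests to open the URL and pandas.read_html to extract the tables. Doing it this way, however, I either get every table or pick one specific table by index, and I would like a way to extract only the tables that contain my keyword.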
import requests
import pandas as pd
# url is the page to scrape, html holds the raw response body, and df_list holds every table parsed from it
url = 'https://www.sec.gov/Archives/edgar/data/880432/000114420415073214/v427721_def14a.htm'
html = requests.get(url).content
df_list = pd.read_html(html)
From here I can print df_list[index], but I want the table that contains the keyword. I tried the following (neither attempt returned anything):
for i in range(len(df_list)):
    if 'Fees Earned' in df_list:
        print(df_list[i])
for i in range(len(df_list)):
    if any("Fees Earned" in s for s in df_list):
        print(df_list[i])
With another variation, the only output I get is "False".
Answer 0 (score: 1)
This should give you the table:
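from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
# select the first <table> whose text contains the keyword (older soupsieve versions spell this :contains)
table = soup.select_one('table:-soup-contains("Fees Earned")')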
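To convert it to a pandas DataFrame:
from io import StringIO
df = pd.read_html(StringIO(str(table)))[0]  # read_html returns a list, so take its first element
You may need to clean up the table before exporting it to Excel.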
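As an alternative sketch that skips the separate BeautifulSoup step: pandas.read_html accepts a match argument (a string or regular expression) and returns only the tables whose text matches it. The SAMPLE_HTML string and the tables_matching helper below are hypothetical stand-ins for illustration, and the example assumes an HTML parser such as lxml is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the downloaded page: two tables, one with the keyword.
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>Fees Earned</th></tr>
  <tr><td>A. Director</td><td>50000</td></tr>
</table>
<table>
  <tr><th>Item</th><th>Count</th></tr>
  <tr><td>Shares</td><td>10</td></tr>
</table>
"""


def tables_matching(html_text, keyword):
    """Return only the tables whose text matches keyword (string or regex)."""
    try:
        return pd.read_html(StringIO(html_text), match=keyword)
    except ValueError:
        # read_html raises ValueError when no table matches the pattern.
        return []


matched = tables_matching(SAMPLE_HTML, "Fees Earned")
print(len(matched))
```

For the page in the question, the bytes returned by requests would need decoding first (for example requests.get(url).text) before being wrapped in StringIO.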
Answer 1 (score: 0)
Maybe this will work:
for df in df_list:
    new_df = df.dropna(how='all').dropna(axis=1, how='any')
    if "Fees Earned" in str(df.iloc[:, :]):
        print(new_df)
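One caveat with searching str(df) is that pandas may truncate the printed form of a large table, so the keyword could be hidden. A minimal sketch of a stricter check follows, looking at the column labels and at every cell instead. The contains_keyword helper and the small sample frames are hypothetical, standing in for the df_list produced by pd.read_html.

```python
import pandas as pd


def contains_keyword(df, keyword):
    """Return True if keyword appears in any column label or any cell of df."""
    in_header = any(keyword in str(label) for label in df.columns)
    in_cells = any(
        # Cast each column to str so numeric and NaN cells can be searched too.
        col.astype(str).str.contains(keyword, regex=False).any()
        for _, col in df.items()
    )
    return in_header or in_cells


# Hypothetical sample data in place of the tables scraped from the page.
df_list = [
    pd.DataFrame({"Name": ["A. Director"], "Fees Earned": [50000]}),
    pd.DataFrame({0: ["Item", "Shares"], 1: ["Fees Earned or Paid in Cash", "10"]}),
    pd.DataFrame({"Item": ["Shares"], "Count": [10]}),
]

matching = [df for df in df_list if contains_keyword(df, "Fees Earned")]
print(len(matching))
```

Checking the labels separately matters because read_html sometimes parses the header row into the column index and sometimes leaves it as an ordinary data row.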