I am trying to convert the last table in this HTML page into a DataFrame.
The code is as follows:
import numpy as np
import pandas as pd
# convert series to datetime if necessary
for col in ['deadline', 'delivered']:
    df1[col] = pd.to_datetime(df1[col], dayfirst=True)
for col in ['deadline', 'delivered']:
    df2[col] = pd.to_datetime(df2[col], dayfirst=True)
# create series mapping key to delivered date in df1
s = df1.set_index('key')['delivered']
# define conditions and values
conditions = [~df2['key'].isin(s.index), df2['key'].map(s) <= df2['deadline']]
values = [np.nan, 'project delivered before deadline']
# apply conditions and values, with fallback value
df2['In Project1'] = np.select(conditions, values, 'Project delayed')
print(df2)
key name deadline delivered In Project1
0 AA1 Tom 2018-05-01 2018-04-30 Project delayed
1 AA2 Sue 2018-05-01 2018-04-30 project delivered before deadline
2 AA3 Jim 2018-05-01 2018-05-03 nan
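As you can see, the table has been read in, but it needs cleaning up. My question is for anyone with experience using this function: is it best to read the table in first and then try to clean it up? If anyone knows how to do this, please post some code. Thanks.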
Answer 0 (score: 1)
Use pd.read_html() to extract the tables from the website. Other parameters can be adjusted further depending on the table format.
Code:
# Import libraries
import pandas as pd
# Read table
link = 'https://www.sec.gov/Archives/edgar/data/1303652/000130365218000016/a991-01q12018.htm'
a=pd.read_html(link, header=None, skiprows=1)
# Save the dataframe
data = a[23]  # the table of interest is item 23 in the returned list
# Remove NaN rows/columns
col_list = data.iloc[1]
df = data.loc[4:, [0, 1, 3, 5, 7, 9, 11]]  # keep only the columns that hold data
df.columns = col_list[:len(df.columns)]
df.head(7)
Note: blank cells in the original table are replaced by NaN.
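Because blank cells come back as NaN, rows and columns that are entirely empty can be dropped after reading. The following is only a minimal sketch: `raw` and its values are a hypothetical stand-in for one table returned by pd.read_html(), not the actual SEC table.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for one table returned by pd.read_html()
raw = pd.DataFrame([
    [np.nan, np.nan, np.nan, np.nan],
    ["Worldwide employees", np.nan, "3,193", "3,305"],
    ["Deals", np.nan, "294", "372"],
])

# Drop rows, then columns, that contain nothing but NaN
clean = raw.dropna(axis=0, how="all").dropna(axis=1, how="all")

# Renumber rows and columns after the drop
clean = clean.reset_index(drop=True)
clean.columns = range(clean.shape[1])
print(clean)
```

Using how="all" keeps rows that are only partly empty, which matters here because some real rows in the table have blank cells.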
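Answer 1 (score: 1)
It is always best to clean the raw data, because any processing may introduce artifacts. Your HTML table was built using the span feature, so if you clean the DataFrame after the HTML has been parsed, there is no generic way to extract the data. I therefore suggest installing a small module written specifically for this purpose: extracting data out of HTML tables. Run on the command line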
pip install html-table-extractor
After fetching the raw HTML of the page (you will also need requests), process the table and clean up the repeated entries:
import requests
import pandas as pd
from collections import OrderedDict
from html_table_extractor.extractor import Extractor
pd.set_option('display.width', 400)
pd.set_option('display.max_colwidth', 100)
pd.set_option('display.max_rows', 30)
pd.set_option('display.max_columns', None)
# get raw html
resp = requests.get('https://www.sec.gov/Archives/edgar/data/1303652/000130365218000016/a991-01q12018.htm')
# find last table
beg = resp.text.rfind('<table')
end = resp.text.rfind('</table')
html = resp.text[beg:end+8]
# process table
ex = Extractor(html)
ex.parse()
list_of_lines = ex.return_list()
# now you have some columns with recurrent values
df_dirty = pd.DataFrame(list_of_lines)
# print(df_dirty)
## we need to consolidate some columns
# find column names
names_line = 2
col_names = OrderedDict()
# for each column find repetitions
for el in list_of_lines[names_line]:
    col_names[el] = [i for i, x in enumerate(list_of_lines[names_line]) if x == el]
# now consolidate repetitive values
storage = OrderedDict() # this will contain columns
for k in col_names:
    res = []
    for line in list_of_lines[names_line+1:]: # first 2 lines are empty, third is column names
        joined = [] # <- this list will accumulate *unique* values to become a single cell
        for idx in col_names[k]:
            el = line[idx]
            if joined and joined[-1]==el: # if value already exist, skip
                continue
            joined.append(el) # add unique value to cell
        res.append(''.join(joined)) # add cell to column
    storage[k] = res # add column to storage
df = pd.DataFrame(storage)
print(df)
This produces the following result, which is very close to the original:
Q1`17 Q2`17 Q3`17 Q4`17 FY 2017 Q1`18
0 (Dollars in thousands) (Dollars in thousands) (Dollars in thousands) (Dollars in thousands) (Dollars in thousands) (Dollars in thousands)
1 (Unaudited) (Unaudited) (Unaudited) (Unaudited) (Unaudited) (Unaudited)
2 Customer metrics
3 Customer accounts (1) 57,000+ 61,000+ 65,000+ 70,000+ 70,000+ 74,000+
4 Customer accounts added in period (1) 3,300+ 4,000+ 4,100+ 4,700+ 16,100+ 3,900+
5 Deals greater than $100,000 (2) 294 372 337 590 1,593 301
6 Customer accounts that purchased greater than $1 million during the quarter (1,2) 10 15 13 27 13
7
8 Annual recurring revenue metrics
9 Total annual recurring revenue (3) $439,001 $483,578 $526,211 $596,244 $596,244 $641,946
10 Subscription annual recurring revenue (4) $71,950 $103,538 $139,210 $195,488 $195,488 $237,533
11
12 Geographic revenue metrics - ASC 606
13 United States and Canada — — — — — $167,799
14 International — — — — — $78,408
.. ... ... ... ... ... ... ...
23
24 Additional revenue metrics - ASC 606
25 Remaining performance obligations (5) — — — — $99,580 $114,523
26
27 Additional revenue metrics - ASC 605
28 Ratable revenue as % of total revenue (6) 54% 56% 63% 60% 59% 72%
29 Ratable license revenue as % of total license revenue (7) 19% 23% 34% 34% 28% 54%
30 Services revenues as a % of maintenance and services revenue (8) 12% 13% 12% 13% 13% 11%
31
32 Bookings metrics - ASC 605
33 Ratable bookings as % of total bookings (2) 55% 61% 65% 70% 64% 72%
34 Ratable license bookings as % of total license bookings (2) 26% 37% 45% 51% 41% 59%
35
36 Other metrics
37 Worldwide employees 3,193 3,305 3,418 3,489 3,489 3,663
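The consolidation step in Answer 1 can also be tried without any network access. The sketch below follows the same idea (group columns that share a header, then skip consecutive repeated values within each row), but the `consolidate` helper and the `rows` sample are hypothetical: the rows only imitate what a span-expanding parser might return and are not taken from the real extractor output.

```python
from collections import OrderedDict

def consolidate(rows, header_index):
    """Merge columns that share a header, skipping consecutive repeats per row."""
    header = rows[header_index]
    # Map each distinct header name to every column position where it appears
    positions = OrderedDict()
    for i, name in enumerate(header):
        positions.setdefault(name, []).append(i)

    result = OrderedDict()
    for name, idxs in positions.items():
        column = []
        for row in rows[header_index + 1:]:
            cell = []
            for i in idxs:
                if cell and cell[-1] == row[i]:  # value repeated by a span
                    continue
                cell.append(row[i])
            column.append("".join(cell))  # pieces become a single cell
        result[name] = column
    return result

# Hypothetical rows imitating a parser that repeats spanned cells
rows = [
    ["", "Q1`17", "Q1`17", "Q2`17", "Q2`17"],
    ["Worldwide employees", "3,193", "3,193", "3,305", "3,305"],
    ["Total annual recurring revenue (3)", "$", "439,001", "$", "483,578"],
]
table = consolidate(rows, header_index=0)
print(table)
```

Note that only consecutive duplicates are skipped, so a cell split into different pieces (such as a currency symbol and a number) is joined rather than discarded. Passing `table` to pd.DataFrame would give one column per distinct header.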