pandas read_html: clean up before or after reading?

Time: 2018-07-15 14:11:22

Tags: python html pandas

I am trying to convert the last table in this HTML page into a DataFrame.

The code is as follows:

import numpy as np

# convert series to datetime if necessary
for col in ['deadline', 'delivered']:
    df1[col] = pd.to_datetime(df1[col], dayfirst=True)

for col in ['deadline', 'delivered']:
    df2[col] = pd.to_datetime(df2[col], dayfirst=True)

# create series mapping key to delivered date in df1
s = df1.set_index('key')['delivered']

# define conditions and values
conditions = [~df2['key'].isin(s.index), df2['key'].map(s) <= df2['deadline']]
values = [np.nan, 'project delivered before deadline']

# apply conditions and values, with fallback value
df2['In Project1'] = np.select(conditions, values, 'Project delayed')

print(df2)

   key name   deadline  delivered                        In Project1
0  AA1  Tom 2018-05-01 2018-04-30                    Project delayed
1  AA2  Sue 2018-05-01 2018-04-30  project delivered before deadline
2  AA3  Jim 2018-05-01 2018-05-03                                nan

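As you can see, the table has been read in, but it needs cleaning up. My question is for anyone with experience using this function: is it better to read the table in first and then try to clean it, or to clean it before reading? If anyone knows how to do this, please post some code. Thanks.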

2 answers:

Answer 0: (score: 1)

The code below uses pd.read_html() to extract the tables from the website. Other parameters can be adjusted further depending on the table format.

# Import libraries
import pandas as pd

# Read table
link = 'https://www.sec.gov/Archives/edgar/data/1303652/000130365218000016/a991-01q12018.htm'
a=pd.read_html(link, header=None, skiprows=1)

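# Save the dataframe (the table of interest is at index 23 in the list)
data = a[23]

# Remove NaN rows/columns
col_list = data.iloc[1]  # row 1 holds the column labels
df = data.loc[4:, [0, 1, 3, 5, 7, 9, 11]]  # keep the data rows and the columns that carry values
df.columns = col_list[:len(df.columns)]  # adjusted column names
df.head(7)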

Note: blank cells in the original table are replaced with NaN.
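If the goal is only to get rid of rows and columns that consist entirely of NaN, a minimal sketch could look like the following. The `raw` frame is a hypothetical stand-in for one element of the list returned by pd.read_html(), not the actual SEC table, and the assumption that the first remaining row holds the column labels may not hold for every table.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a table returned by pd.read_html():
# padding rows and a padding column filled with NaN.
raw = pd.DataFrame([
    [np.nan, np.nan, np.nan, np.nan],
    ["Metric", np.nan, "Q1`17", "Q2`17"],
    ["Worldwide employees", np.nan, "3,193", "3,305"],
    [np.nan, np.nan, np.nan, np.nan],
])

# Drop rows, then columns, in which every cell is NaN.
cleaned = (
    raw.dropna(axis=0, how="all")
       .dropna(axis=1, how="all")
       .reset_index(drop=True)
)

# Assumption: the first remaining row holds the column labels.
cleaned.columns = cleaned.iloc[0].tolist()
cleaned = cleaned.iloc[1:].reset_index(drop=True)

print(cleaned)
```

Using how="all" keeps rows that are only partly empty, so legitimately blank cells in the data survive as NaN.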


[Image: the first few rows of the original table on the website]

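Answer 1: (score: 1)

It is always better to clean the raw data, because any processing step can introduce artifacts. Your HTML table is built with the span feature, so if you clean the DataFrame after the HTML has been parsed, there is no generic way to extract the data. I therefore suggest installing a small module made for this specific purpose: extracting data out of HTML tables. On the command line, run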

pip install html-table-extractor 

After getting the raw HTML of the page (you will also need requests), process the table and clean up the repeated entries:

import requests
import pandas as pd
from collections import OrderedDict
from html_table_extractor.extractor import Extractor

pd.set_option('display.width', 400)
pd.set_option('display.max_colwidth', 100)
pd.set_option('display.max_rows', 30)
pd.set_option('display.max_columns', None)

# get raw html
resp = requests.get('https://www.sec.gov/Archives/edgar/data/1303652/000130365218000016/a991-01q12018.htm')

# find last table
beg = resp.text.rfind('<table')
end = resp.text.rfind('</table')
html = resp.text[beg:end+8]

# process table
ex = Extractor(html)
ex.parse()
list_of_lines = ex.return_list()

# now you have some columns with recurrent values
df_dirty = pd.DataFrame(list_of_lines)
# print(df_dirty)

## we need to consolidate some columns

# find column names
names_line = 2
col_names = OrderedDict()
# for each column find repetitions
for el in list_of_lines[names_line]:
    col_names[el] = [i for i, x in enumerate(list_of_lines[names_line]) if x == el]

# now consolidate repetitive values
storage = OrderedDict() # this will contain columns
for k in col_names:
    res = []
    for line in list_of_lines[names_line+1:]:  # first 2 lines are empty, third is column names
        joined = [] # <- this list will accumulate *unique* values to become a single cell
        for idx in col_names[k]:
            el = line[idx]
            if joined and joined[-1]==el:   # if value already exist, skip
                continue
            joined.append(el)   # add unique value to cell
        res.append(''.join(joined))   # add cell to column
    storage[k] = res   # add column to storage
df = pd.DataFrame(storage)
print(df)

This produces the following result, which is very close to the original:

                                                                                                        Q1`17                   Q2`17                   Q3`17                   Q4`17                 FY 2017                   Q1`18
0                                                                                      (Dollars in thousands)  (Dollars in thousands)  (Dollars in thousands)  (Dollars in thousands)  (Dollars in thousands)  (Dollars in thousands)
1                                                                                                 (Unaudited)             (Unaudited)             (Unaudited)             (Unaudited)             (Unaudited)             (Unaudited)
2                                                                    Customer metrics                                                                                                                                                
3                                                               Customer accounts (1)                 57,000+                 61,000+                 65,000+                 70,000+                 70,000+                 74,000+
4                                               Customer accounts added in period (1)                  3,300+                  4,000+                  4,100+                  4,700+                 16,100+                  3,900+
5                                                     Deals greater than $100,000 (2)                     294                     372                     337                     590                   1,593                     301
6   Customer accounts that purchased greater than $1 million during the quarter (1,2)                      10                      15                      13                      27                                              13
7                                                                                                                                                                                                                                    
8                                                    Annual recurring revenue metrics                                                                                                                                                
9                                                  Total annual recurring revenue (3)                $439,001                $483,578                $526,211                $596,244                $596,244                $641,946
10                                          Subscription annual recurring revenue (4)                 $71,950                $103,538                $139,210                $195,488                $195,488                $237,533
11                                                                                                                                                                                                                                   
12                                               Geographic revenue metrics - ASC 606                                                                                                                                                
13                                                           United States and Canada                       —                       —                       —                       —                       —                $167,799
14                                                                      International                       —                       —                       —                       —                       —                 $78,408
..                                                                                ...                     ...                     ...                     ...                     ...                     ...                     ...
23                                                                                                                                                                                                                                   
24                                               Additional revenue metrics - ASC 606                                                                                                                                                
25                                              Remaining performance obligations (5)                       —                       —                       —                       —                 $99,580                $114,523
26                                                                                                                                                                                                                                   
27                                               Additional revenue metrics - ASC 605                                                                                                                                                
28                                          Ratable revenue as % of total revenue (6)                     54%                     56%                     63%                     60%                     59%                     72%
29                          Ratable license revenue as % of total license revenue (7)                     19%                     23%                     34%                     34%                     28%                     54%
30                   Services revenues as a % of maintenance and services revenue (8)                     12%                     13%                     12%                     13%                     13%                     11%
31                                                                                                                                                                                                                                   
32                                                         Bookings metrics - ASC 605                                                                                                                                                
33                                        Ratable bookings as % of total bookings (2)                     55%                     61%                     65%                     70%                     64%                     72%
34                        Ratable license bookings as % of total license bookings (2)                     26%                     37%                     45%                     51%                     41%                     59%
35                                                                                                                                                                                                                                   
36                                                                      Other metrics                                                                                                                                                
37                                                                Worldwide employees                   3,193                   3,305                   3,418                   3,489                   3,489                   3,663
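
The consolidation step in the code above can be illustrated on a toy input. This is only a sketch: the `rows` list is made-up data that imitates what a span-expanding parser might return (each spanned cell repeated or split across neighbouring columns), and the helper name `merge_cells` is hypothetical, not part of html-table-extractor.

```python
# Made-up rows imitating a parsed table in which spanned cells
# are repeated (or split) across neighbouring columns.
rows = [
    ["", "Q1`17", "Q1`17", "Q2`17", "Q2`17"],
    ["Customer accounts", "57,000", "+", "61,000", "+"],
    ["Worldwide employees", "3,193", "3,193", "3,305", "3,305"],
]
header, body = rows[0], rows[1:]

# Map each distinct header label to every position where it occurs.
positions = {}
for i, label in enumerate(header):
    positions.setdefault(label, []).append(i)


def merge_cells(line, idxs):
    """Join the cells at idxs, skipping a value equal to the previous one."""
    merged = []
    for i in idxs:
        if merged and merged[-1] == line[i]:
            continue
        merged.append(line[i])
    return "".join(merged)


# One consolidated cell per distinct header label, for every body row.
table = [[merge_cells(line, idxs) for idxs in positions.values()] for line in body]

for line in table:
    print(line)
```

Only consecutive repeats are collapsed, which is why a value split over two cells such as "57,000" and "+" ends up joined into one cell while an exact duplicate is kept once.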