Adding values to specific rows in a new column. Pandas

Time: 2020-07-24 23:10:41

Tags: python-3.x pandas beautifulsoup

def main():
    bulletins = os.listdir(INPUT_DATA_DIR)

    df = pd.DataFrame(bulletins)
    df.columns = ['html']
    df['html'] = df.html.apply(read_file)
    df['id'] = df.html.apply(get_document_id)
    df['res_html'] = df.html.apply(get_resolution)
    df['type'] = df.res_html.apply(get_type)
    print(df.head())

  
if __name__ == "__main__":
    main()

This code produces the following table:

                                                html  ...   type
0  <!DOCTYPE html><html xmlns:msxsl="urn:schemas-...  ...   Text
1  <!DOCTYPE html><html xmlns:msxsl="urn:schemas-...  ...  Table
2  <!DOCTYPE html><html xmlns:msxsl="urn:schemas-...  ...  Table
3  <!DOCTYPE html><html xmlns:msxsl="urn:schemas-...  ...   Text
4  <!DOCTYPE html><html xmlns:msxsl="urn:schemas-...  ...  Table

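The 'res_html' column contains HTML code. The 'type' column records whether the code in the previous column contains a table: if it does, 'type' holds the value 'Table'; if not, it holds 'Text'. I have verified that the 'type' column is filled in correctly.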

Next, I need to add a 'row' column. A new value should be written to 'row' only for the rows where 'type' equals 'Table'.

def main():
    bulletins = os.listdir(INPUT_DATA_DIR)

    df = pd.DataFrame(bulletins)
    df.columns = ['html']
    df['html'] = df.html.apply(read_file)
    df['id'] = df.html.apply(get_document_id)
    df['res_html'] = df.html.apply(get_resolution)
    df['type'] = df.res_html.apply(get_type)
    print(df.head())

    row_index = df.index[df['type'] == 'Table'].tolist()
    df.loc[row_index, 'row'] = df.res_html.apply(get_type_table)

def get_type_table(tree):
    tbody = tree.find('tbody')
    row = tbody.find('tr')

    if row:
        return 'tr'
    return ''

if __name__ == "__main__":
    main()

At this stage I run into a problem:

Traceback (most recent call last):
  File "/home/roman/etlsrc/parsers/hp_ux/app/resolution_field.py", line 85, in <module>
    main()
  File "/home/roman/etlsrc/parsers/hp_ux/app/resolution_field.py", line 25, in main
    df.loc[row_index, 'row'] = df.res_html.apply(get_type_table)
  File "/usr/local/lib/python3.7/dist-packages/pandas/core/series.py", line 3848, in apply
    mapped = lib.map_infer(values, f, convert=convert_dtype)
  File "pandas/_libs/lib.pyx", line 2329, in pandas._libs.lib.map_infer
  File "/home/roman/etlsrc/parsers/hp_ux/app/resolution_field.py", line 63, in get_type_table
    row = tbody.find('tr')
AttributeError: 'NoneType' object has no attribute 'find'

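This happens because the get_type_table function is applied to every row of my DataFrame, including rows whose HTML has no tbody.

What do I need to do so that this function is applied only to the rows whose 'type' column contains 'Table'?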
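A minimal sketch of why every row is visited: the right-hand side, `df.res_html.apply(...)`, is evaluated over the whole column before `.loc` on the left picks the target rows. The toy DataFrame and the `record` helper below are hypothetical stand-ins (plain strings rather than parsed trees), not the data from the question.

```python
import pandas as pd

# Hypothetical stand-in data: plain strings instead of parsed HTML trees.
df = pd.DataFrame({
    "res_html": ["<p>a</p>", "<table></table>", "<table></table>", "<p>b</p>"],
    "type": ["Text", "Table", "Table", "Text"],
})

visited = []

def record(value):
    """Hypothetical helper: remember every value that apply() passes in."""
    visited.append(value)
    return "tr"

row_index = df.index[df["type"] == "Table"].tolist()

# The right-hand side runs over the whole column first ...
result = df["res_html"].apply(record)
# ... and only afterwards does .loc keep the values aligned to row_index.
df.loc[row_index, "row"] = result

# One call per row of the column, not just one per 'Table' row.
print(len(visited), "calls for", len(row_index), "Table rows")
```

So the filtering on the left-hand side comes too late to protect a function that fails on 'Text' rows.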

1 answer:

Answer 0: (score: 0)

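The problem is that the right-hand side, df.res_html.apply(get_type_table), runs get_type_table on every row, including the rows that have no tbody. Pass the function itself to apply (without calling it) and restrict the right-hand side to the same rows:

df.loc[row_index, 'row'] = df.loc[row_index, 'res_html'].apply(get_type_table)

The .loc[row_index, ...] after the = is what you were missing; with it, get_type_table is no longer called on all rows.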
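The restricted assignment can be sketched end to end as follows. This is an illustration under assumptions: the sample HTML strings are made up, and `has_table_row` is a hypothetical string-based stand-in for `get_type_table`, which in the question works on a parsed tree.

```python
import pandas as pd

# Made-up sample data for illustration only.
df = pd.DataFrame({
    "res_html": [
        "<p>no table here</p>",
        "<table><tbody><tr><td>1</td></tr></tbody></table>",
        "<table><tbody></tbody></table>",
    ],
    "type": ["Text", "Table", "Table"],
})

def has_table_row(html):
    """Hypothetical stand-in for get_type_table: 'tr' if a row tag is present."""
    return "tr" if "<tr" in html else ""

# Boolean mask selecting only the rows that hold a table.
is_table = df["type"] == "Table"

# Apply the function to the selected rows only; the other rows stay NaN.
df.loc[is_table, "row"] = df.loc[is_table, "res_html"].apply(has_table_row)

print(df[["type", "row"]])
```

A boolean mask works directly with `.loc`, so the intermediate `row_index` list is optional. If some 'Table' rows could still lack a `tbody`, a `None` check inside `get_type_table` before calling `tbody.find('tr')` would probably be worth adding as well.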