def main():
    bulletins = os.listdir(INPUT_DATA_DIR)
    df = pd.DataFrame(bulletins)
    df.columns = ['html']
    df['html'] = df.html.apply(read_file)
    df['id'] = df.html.apply(get_document_id)
    df['res_html'] = df.html.apply(get_resolution)
    df['type'] = df.res_html.apply(get_type)
    print(df.head())


if __name__ == "__main__":
    main()
This code produces the following table:
                                                html  ...   type
0  <!DOCTYPE html><html xmlns:msxsl="urn:schemas-...  ...   Text
1  <!DOCTYPE html><html xmlns:msxsl="urn:schemas-...  ...  Table
2  <!DOCTYPE html><html xmlns:msxsl="urn:schemas-...  ...  Table
3  <!DOCTYPE html><html xmlns:msxsl="urn:schemas-...  ...   Text
4  <!DOCTYPE html><html xmlns:msxsl="urn:schemas-...  ...  Table
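The 'res_html' column contains HTML code. The 'type' column records whether the code in 'res_html' contains a table: if it does, 'type' holds the value 'Table'; if not, it holds 'Text'. I have verified that the 'type' column is filled in correctly.
Next, I need to add a 'row' column. A new value should be written to 'row' only for the rows where 'type' equals 'Table'.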
def main():
    bulletins = os.listdir(INPUT_DATA_DIR)
    df = pd.DataFrame(bulletins)
    df.columns = ['html']
    df['html'] = df.html.apply(read_file)
    df['id'] = df.html.apply(get_document_id)
    df['res_html'] = df.html.apply(get_resolution)
    df['type'] = df.res_html.apply(get_type)
    print(df.head())
    row_index = df.index[df['type'] == 'Table'].tolist()
    df.loc[row_index, 'row'] = df.res_html.apply(get_type_table)


def get_type_table(tree):
    tbody = tree.find('tbody')
    row = tbody.find('tr')
    if row:
        return 'tr'
    return ''


if __name__ == "__main__":
    main()
At this stage I get an error:
Traceback (most recent call last):
  File "/home/roman/etlsrc/parsers/hp_ux/app/resolution_field.py", line 85, in <module>
    main()
  File "/home/roman/etlsrc/parsers/hp_ux/app/resolution_field.py", line 25, in main
    df.loc[row_index, 'row'] = df.res_html.apply(get_type_table)
  File "/usr/local/lib/python3.7/dist-packages/pandas/core/series.py", line 3848, in apply
    mapped = lib.map_infer(values, f, convert=convert_dtype)
  File "pandas/_libs/lib.pyx", line 2329, in pandas._libs.lib.map_infer
  File "/home/roman/etlsrc/parsers/hp_ux/app/resolution_field.py", line 63, in get_type_table
    row = tbody.find('tr')
AttributeError: 'NoneType' object has no attribute 'find'
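This happens because the get_type_table function is applied to every row of my DataFrame.
What do I need to do so that the function is applied only to the rows whose 'type' column contains 'Table'?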
Answer 0 (score: 0)
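The problem is that apply is called on the entire res_html column, so get_type_table also runs on rows that contain no table. On those rows tree.find('tbody') returns None, and the following .find call fails. Pass the function itself to apply (each cell is supplied as the tree argument) and restrict the right-hand side to the same rows:
    df.loc[row_index, 'row'] = df.loc[row_index, 'res_html'].apply(get_type_table)
The .loc[row_index, ...] selection after the = is what was missing; with it, get_type_table is no longer called on every row.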
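A minimal, self-contained sketch of restricting apply to the 'Table' rows could look like the following. It rests on several assumptions that are not in the question: the trees are taken to behave like xml.etree.ElementTree elements (the question does not say which parser is used), two made-up fragments stand in for the real bulletins, and get_type_table gets an extra None guard so it can no longer raise on input without a tbody.

```python
import xml.etree.ElementTree as ET

import pandas as pd


def get_type_table(tree):
    """Return 'tr' if the tree has a <tbody> child containing a <tr>, else ''."""
    tbody = tree.find('tbody')
    if tbody is None:  # added guard (not in the original): no <tbody> present
        return ''
    row = tbody.find('tr')
    return 'tr' if row is not None else ''


# Hypothetical stand-in data: two small fragments instead of real bulletins.
df = pd.DataFrame({
    'html': [
        '<div><p>plain text</p></div>',
        '<table><tbody><tr><td>1</td></tr></tbody></table>',
    ],
    'type': ['Text', 'Table'],
})
# Parse each fragment into a tree, mirroring the res_html column.
df['res_html'] = df['html'].apply(ET.fromstring)

# Boolean mask that selects only the rows whose type is 'Table'.
is_table = df['type'] == 'Table'

# Use the same mask on both sides, so the function only sees the selected rows.
df.loc[is_table, 'row'] = df.loc[is_table, 'res_html'].apply(get_type_table)

print(df[['type', 'row']])
```

Using the boolean mask directly avoids building an intermediate index list. Rows that are not selected are left as missing values in the new 'row' column.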