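I'm trying to create a web scraper that pulls similar tables (with different values) from multiple pages (although in my case it scrapes HTML that is stored in text documents).
The following code is a snippet from my program in which a DataFrame is created from previously built lists and headers. It then opens a new Excel workbook and writes the DataFrame to a sheet in that workbook with a specified name (which changes with each iteration).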
import os
from bs4 import BeautifulSoup # imports BeautifulSoup
import pandas # imports pandas
from pandas import ExcelWriter
#creates the lists
list_of_rows = []
list_of_lists = []
list_of_headers = []
for File in range(len(os.listdir())):
    if os.listdir()[File].endswith('.txt'):
        file = open(os.listdir()[File])
        data = file.read()
        file.close()
        #Converts the text file into something the program can use
        soup = BeautifulSoup(data,'lxml')
        tables = soup.find_all(class_="overthrow table_container") #Creates a ResultSet holding all of the tables with this class name
        #grabs from the header
        find_header = tables[2].thead
        header = find_header.find_all("th")
        #grabs from the table
        find_table = tables[2].tbody #creates a Tag element from the desired table and selects the tbody section
        rows = find_table.find_all("tr") #creates another ResultSet that singles out the elements with a tr tag
        #for loop that creates the list for the header
        header_list = []
        for i in range(len(header)):
            if i < 14:
                pass
            else:
                list_of_headers.insert(i,header[i].get_text())
        #for loop that creates the lists for data frame table
        for j in range(len(rows)):
            row_finder = rows[j]
            tag_row = row_finder.find_all("td")
            for i in range(len(tag_row)):
                list_of_rows.insert(i,tag_row[i].get_text())
            list_of_lists.append(list_of_rows)
            list_of_rows = []
        #creates the DataFrame
        df = pandas.DataFrame(list_of_lists,columns=list_of_headers)
        writer = ExcelWriter('testing document.xlsx', engine='xlsxwriter')
        df.to_excel(writer,sheet_name=os.listdir()[File])
        writer.save()
        df.drop(df.index)
        print("worked once")
    else:
        pass
I get the following error.
worked once
Traceback (most recent call last):
  File "test3.py", line 47, in <module>
    df = pandas.DataFrame(list_of_lists,columns=list_of_headers)
  File "C:\Python3.6\lib\site-packages\pandas\core\frame.py", line 314, in __init__
    arrays, columns = _to_arrays(data, columns, dtype=dtype)
  File "C:\Python3.6\lib\site-packages\pandas\core\frame.py", line 5617, in _to_arrays
    dtype=dtype)
  File "C:\Python3.6\lib\site-packages\pandas\core\frame.py", line 5696, in _list_to_arrays
    coerce_float=coerce_float)
  File "C:\Python3.6\lib\site-packages\pandas\core\frame.py", line 5755, in _convert_object_array
    'columns' % (len(columns), len(content)))
AssertionError: 56 columns passed, passed data had 28 columns
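One possible reading of the two numbers in the assertion, sketched without pandas: list_of_headers is created once before the loop and is never cleared, so after a second file it would hold 56 names while each row still has 28 cells. The header names and cell values below are made up for illustration; only the list handling mirrors the snippet.

```python
# Hypothetical stand-in for the 28 header cells kept from one page.
page_headers = ["col%d" % n for n in range(28)]
# Hypothetical stand-in for one table row with 28 cells.
page_row = ["value"] * 28

list_of_headers = []  # created once, before the loop, as in the snippet
counts = []
for page in range(2):  # pretend two .txt files are processed
    for i in range(len(page_headers)):
        # Same insert call as the snippet; the list is never reset between pages.
        list_of_headers.insert(i, page_headers[i])
    # (number of column names, number of cells per row) at DataFrame time
    counts.append((len(list_of_headers), len(page_row)))

print(counts)  # [(28, 28), (56, 28)]
```

The first pair matches, which would explain why "worked once" is printed before the failure on the second file.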
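So, given that it printed "worked once" and that it does create the Excel file, my guess is that the problem is that it isn't creating a new sheet? This code isn't doing what I thought it would at all.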
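On the worksheet guess: a new ExcelWriter on the same file name in every iteration would start the workbook over each time, so a commonly used pattern is to open one writer and write each DataFrame to its own sheet. Below is a minimal sketch of that pattern, not a drop-in fix. The helper names build_frame and write_sheets and the sample pages dictionary are hypothetical, and the headers and rows are built fresh for each page rather than shared across pages.

```python
import pandas


def build_frame(headers, rows):
    """Build a DataFrame from one page's own headers and rows."""
    return pandas.DataFrame(rows, columns=headers)


def write_sheets(frames, path):
    """Write each DataFrame in `frames` (sheet name -> frame) to its own sheet."""
    # One writer for the whole workbook; leaving the with-block saves the file.
    with pandas.ExcelWriter(path) as writer:
        for sheet_name, frame in frames.items():
            # Excel limits sheet names to 31 characters.
            frame.to_excel(writer, sheet_name=sheet_name[:31])


# Hypothetical parsed pages: file name -> (headers, rows).
pages = {
    "page1.txt": (["a", "b"], [["1", "2"], ["3", "4"]]),
    "page2.txt": (["a", "b"], [["5", "6"]]),
}
# Each page gets its own frame, so column counts cannot accumulate.
frames = {name: build_frame(h, r) for name, (h, r) in pages.items()}
```

Calling write_sheets(frames, 'testing document.xlsx') would then be expected to produce a single workbook with one sheet per file, provided an Excel engine such as openpyxl or xlsxwriter is installed.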