Question

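I am trying to create a web scraper that gets information from multiple pages containing similar tables (with different values). In my case, though, it scrapes HTML that is stored in text documents.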

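The following code is a snippet from my program, in which a DataFrame is created from previously built lists and headers. It then opens a new Excel workbook and writes the DataFrame to a sheet in that workbook under a specified name (which changes with each iteration).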

import os
from bs4 import BeautifulSoup # imports BeautifulSoup
import pandas # imports pandas
from pandas import ExcelWriter

#creates the lists
list_of_rows = []
list_of_lists = []
list_of_headers = []

for File in range(len(os.listdir())):
    if os.listdir()[File].endswith('.txt'):
        file = open(os.listdir()[File])
        data = file.read()
        file.close()

        #Converts the text file into something the program can use
        soup = BeautifulSoup(data,'lxml')
        tables = soup.find_all(class_="overthrow table_container") #Creates a ResultSet of all the tables with this class name

        #grabs from the header
        find_header = tables[2].thead
        header = find_header.find_all("th")

        #grabs from the table
        find_table = tables[2].tbody #creates a tag element from the desired table and highlights the tbody section
        rows = find_table.find_all("tr") #creates another ResultSet that singles out the elements with a tr tag

        #for loop that creates the list for the header
        header_list = []
        for i in range(len(header)):
            if i < 14:
                pass
            else:
                list_of_headers.insert(i,header[i].get_text())

        #for loop that creates the lists for data frame table
        for j in range(len(rows)):
            row_finder = rows[j]
            tag_row = row_finder.find_all("td")
            for i in range(len(tag_row)):
                list_of_rows.insert(i,tag_row[i].get_text())
            list_of_lists.append(list_of_rows)
            list_of_rows = []

        #creates the DataFrame
        df = pandas.DataFrame(list_of_lists,columns=list_of_headers)

        writer = ExcelWriter('testing document.xlsx', engine='xlsxwriter')
        df.to_excel(writer,sheet_name=os.listdir()[File])
        writer.save()
        df.drop(df.index)
        print("worked once")

    else:
        pass

I get the following error.

worked once
Traceback (most recent call last):
  File "test3.py", line 47, in <module>
    df = pandas.DataFrame(list_of_lists,columns=list_of_headers)
  File "C:\Python3.6\lib\site-packages\pandas\core\frame.py", line 314, in __init__
    arrays, columns = _to_arrays(data, columns, dtype=dtype)
  File "C:\Python3.6\lib\site-packages\pandas\core\frame.py", line 5617, in _to_arrays
    dtype=dtype)
  File "C:\Python3.6\lib\site-packages\pandas\core\frame.py", line 5696, in _list_to_arrays
    coerce_float=coerce_float)
  File "C:\Python3.6\lib\site-packages\pandas\core\frame.py", line 5755, in _convert_object_array
    'columns' % (len(columns), len(content)))
AssertionError: 56 columns passed, passed data had 28 columns

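So, given that it prints the first line "worked once" and does in fact create the Excel file, my guess is that the problem is that it isn't creating a new sheet? Or perhaps this code simply doesn't do what I think it does.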
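For reference, the 56-versus-28 mismatch in the traceback is consistent with `list_of_headers` (and `list_of_lists`) never being reset between files, so the second file would add another 28 headers to the 28 already collected. Separately, creating the `ExcelWriter` inside the loop with the same file name would overwrite the workbook on every iteration rather than add a sheet. The following is only a minimal sketch of one way to structure this, not a confirmed fix: the `parsed` dictionary is a hypothetical stand-in for the BeautifulSoup parsing step, and the helper names `sheet_name_for`, `build_frame`, and `write_workbook` are invented for illustration.

```python
import os

import pandas as pd


def sheet_name_for(filename):
    """Turn a file name into a sheet name (Excel limits these to 31 characters)."""
    return os.path.splitext(filename)[0][:31]


def build_frame(headers, rows):
    """Build one DataFrame from the headers and rows of a single file."""
    return pd.DataFrame(rows, columns=headers)


def write_workbook(frames, path):
    """Write every DataFrame in `frames` (sheet name -> DataFrame) to one workbook."""
    # One writer for the whole workbook; the context manager saves it on exit.
    with pd.ExcelWriter(path) as writer:
        for name, df in frames.items():
            df.to_excel(writer, sheet_name=name)


# Hypothetical stand-in for the parsed tables: file name -> (headers, rows).
parsed = {
    "page_one.txt": (["a", "b"], [["1", "2"], ["3", "4"]]),
    "page_two.txt": (["a", "b"], [["5", "6"]]),
}

frames = {}
for filename, (headers, rows) in parsed.items():
    # Each file gets its own header and row lists, so nothing accumulates
    # from one iteration to the next.
    frames[sheet_name_for(filename)] = build_frame(list(headers), list(rows))
```

Calling `write_workbook(frames, 'testing document.xlsx')` after the loop would then write one sheet per text file, assuming an Excel engine such as `xlsxwriter` or `openpyxl` is installed.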

Iterating new pandas DataFrames into new Excel sheets in the same workbook

0 answers: