Question

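I am trying to automatically read several hundred Excel files into a single data frame. Fortunately, the layout of the files is fairly constant. They all have the same header (the case of the header names may vary), and therefore the same number of columns, and the data I want to read is always stored in the first sheet.

However, in some files a number of rows are skipped before the actual data begins. The rows above the data may or may not contain comments. For example, in some files the header is in row 3 and the data starts in row 4 and continues downward.

I would like pandas to work out for itself how many rows to skip. At the moment I use a somewhat convoluted solution: I first read the file into a data frame and check whether the header is correct. If it is not, I search for the row that contains the header and then re-read the file, now knowing how many rows to skip.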

import pandas as pd

def find_header_row(df, my_header):
    """Return the number of rows to skip so that the header row is read as the header."""
    for idx, row in df.iterrows():
        row_header = [str(t).lower() for t in row]
        if len(set(my_header) - set(row_header)) == 0:
            # The first file row was already consumed as the header of df,
            # so row idx of df is file row idx + 1 (0-based).
            return idx + 1
    raise ValueError("Can't find header row!")

my_header = ['col_1', 'col_2',..., 'col_n']
df = pd.read_excel('my_file.xlsx')
# Make columns lower case (case may vary); str() guards against non-string labels
df.columns = [str(t).lower() for t in df.columns]

# Check if the header of the dataframe matches my_header
if len(set(my_header) - set(df.columns)) != 0:
    # If not, use my function to find the row containing the header
    n_rows_to_skip = find_header_row(df, my_header)
    # Re-read the file, skipping the right number of rows
    df = pd.read_excel('my_file.xlsx', skiprows=n_rows_to_skip)
    df.columns = [str(t).lower() for t in df.columns]

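Since I know what the header row looks like, is there a way to let pandas figure out by itself where the data starts? Or can anyone think of a better solution?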

Answer 1

Let us know if this works for you.
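One way to avoid reading each file twice is to read the sheet once with header=None, search the raw rows for the known header, and then promote that row to the column names. The following is only a minimal sketch under that assumption, not a built-in pandas feature; the function names find_header_index, promote_header, and read_excel_auto_header are hypothetical and chosen for illustration.

```python
import pandas as pd


def find_header_index(raw, expected):
    """Return the 0-based position of the first row that contains every expected name."""
    wanted = {name.lower() for name in expected}
    for pos, row in enumerate(raw.itertuples(index=False)):
        cells = {str(cell).strip().lower() for cell in row}
        if wanted <= cells:
            return pos
    raise ValueError("Can't find header row!")


def promote_header(raw, expected):
    """Use the detected header row as column names and keep only the rows below it."""
    pos = find_header_index(raw, expected)
    data = raw.iloc[pos + 1:].reset_index(drop=True)
    # Lower-case the names because the case may vary between files
    data.columns = [str(cell).strip().lower() for cell in raw.iloc[pos]]
    # A headerless read yields object columns; try to recover better dtypes
    return data.infer_objects()


def read_excel_auto_header(path, expected):
    """Read the first sheet once without a header, then locate the header row."""
    raw = pd.read_excel(path, header=None)
    return promote_header(raw, expected)
```

With this, each file would be read as read_excel_auto_header('my_file.xlsx', my_header), and the resulting frames could be combined with pd.concat. The trade-off is that column types are inferred after the fact rather than by the Excel reader, so it is worth checking the dtypes of the result.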

