Question

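I am trying to build a combined dataframe from a series of 12 individual CSVs (the 12 months of a year). All of the CSVs have the same format and column layout.

When I first ran it, it appeared to work, and I was left with a combined dataframe with 6 columns (as expected). On closer inspection, I found that the header row of every file had been read in as actual data, so I had some bad rows to eliminate. I could make these changes manually, but I would like the code to handle it automatically.

So, to that end, I updated the code so that it reads only the first CSV with a header and the remaining CSVs without a header, then concatenates everything together. This runs, BUT the combined dataframe now has 12 columns instead of 6, which is obviously not what I want (see the image below).

The code is similar; I only add the header=None argument to pd.read_csv() for the 11 CSVs after the first (for the first CSV I do not use that argument). Can anyone give me a hint as to why I get 12 columns (with the data positioned as described above) when I run this code? The layout of the CSV files is shown below.

Any help is appreciated.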

import pandas as pd
import numpy as np
import os

# Need to include the header row only for the first csv (otherwise header row will be included
# for each read csv, which places improperly formatted rows into the combined dataframe).
totrows = 0

# Get list of csv files to read.
files = os.listdir('c:/data/datasets')

# Read the first csv file, including the header row.
dfSD = pd.read_csv('c:/data/datasets/' + files[0], skip_blank_lines=True)

# Now read the remaining csv files (without header row) and concatenate their values
# into our full Sales Data dataframe.
for file in files[1:]:
    df = pd.read_csv('c:/data/datasets/' + file, skip_blank_lines=True, header=None)
    dfSD = pd.concat([dfSD, df])
    totrows += df.shape[0]
    print(file + " == " + str(df.shape[0]) + " rows")               

print()
print("TOTAL ROWS = " + str(totrows + pd.read_csv('c:/data/datasets/' + files[0]).shape[0]))

Answer 1

Here is a simple solution.

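import pandas as pd
import os

totrows = 0

# Get list of csv files to read.
files = os.listdir('c:/data/datasets')

columns = None  # column labels, taken from the first file that is read
frames = []     # one dataframe per csv file
for file in files:
    # Read every file with its header row, so the header never ends up as data.
    df = pd.read_csv('c:/data/datasets/' + file, skip_blank_lines=True)
    if columns is None:
        columns = df.columns
    # Force identical labels so that concat stacks rows instead of adding columns.
    df.columns = columns

    frames.append(df)

    totrows += df.shape[0]
    print(file + " == " + str(df.shape[0]) + " rows")

# Stack all frames vertically and rebuild a clean 0..n-1 index.
dfSD = pd.concat(frames, axis=0)

dfSD = dfSD.reset_index(drop=True)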
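
As background on why the original attempt ended up with 12 columns: pd.concat aligns frames on their column labels, and a file read with header=None gets the integer labels 0, 1, 2, ... instead of the names taken from the first file, so the two sets of labels never match. A minimal sketch of that effect, using made-up two-column data (the column names and values here are illustrative assumptions, not the asker's files):

```python
import io
import pandas as pd

# Two tiny in-memory "files" with hypothetical columns; both contain a header row.
jan = io.StringIO("Product,Qty\nA,1\nB,2\n")
feb = io.StringIO("Product,Qty\nC,3\n")

first = pd.read_csv(jan)               # labels come from the header: Product, Qty
rest = pd.read_csv(feb, header=None)   # integer labels; the header line becomes a data row

# concat takes the union of the column labels, so the column count doubles.
mismatched = pd.concat([first, rest])
print(list(mismatched.columns))

# Reading the header normally keeps the labels identical, so the rows simply stack.
feb.seek(0)
aligned = pd.concat([first, pd.read_csv(feb)], ignore_index=True)
print(aligned.shape)
```

With 6 real columns the same mismatch gives 6 named plus 6 integer-labelled columns, which is where the 12 comes from.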

Another possibility is:

import pandas as pd
import numpy as np
import os

# Need to include the header row only for the first csv (otherwise header row will be included
# for each read csv, which places improperly formatted rows into the combined dataframe).
totrows = 0

# Get list of csv files to read.
files = os.listdir('c:/data/datasets')

# Read the first csv file, including the header row.
dfSD = pd.read_csv('c:/data/datasets/' + files[0], skip_blank_lines=True)
df_comb = [dfSD]
# Now read the remaining csv files (without header row) and concatenate their values
# into our full Sales Data dataframe.
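for file in files[1:]:
    # header=None: pandas does not treat the first line as a header, so that line
    # is kept as a data row and the columns are labelled 0, 1, 2, ...
    df = pd.read_csv('c:/data/datasets/' + file, skip_blank_lines=True, header=None)

    # Relabel with the first file's column names so that concat stacks rows
    # instead of adding new columns.
    df.columns = dfSD.columns
    df_comb.append(df)
    totrows += df.shape[0]
    print(file + " == " + str(df.shape[0]) + " rows")

# Pass the list of frames itself; wrapping it in another list raises a TypeError.
dfSD = pd.concat(df_comb, axis=0).reset_index(drop=True)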
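
One more thing worth keeping in mind: os.listdir returns every entry in the folder, in arbitrary order, so "the first file" is not guaranteed to be the first month and non-CSV entries would be read as well. A more compact sketch, assuming every file carries the same header row; combine_csvs is a hypothetical helper name, and the demo files and columns are made up:

```python
import glob
import os
import tempfile

import pandas as pd


def combine_csvs(folder):
    """Stack every *.csv in `folder`; assumes each file has the same header row."""
    # sorted() gives a predictable order; the pattern skips non-CSV entries.
    paths = sorted(glob.glob(os.path.join(folder, "*.csv")))
    frames = [pd.read_csv(path, skip_blank_lines=True) for path in paths]
    return pd.concat(frames, ignore_index=True)


# Demo on throwaway files with made-up columns.
with tempfile.TemporaryDirectory() as tmp:
    samples = [("01.csv", "Product,Qty\nA,1\n"), ("02.csv", "Product,Qty\nB,2\nC,3\n")]
    for name, body in samples:
        with open(os.path.join(tmp, name), "w") as fh:
            fh.write(body)
    combined = combine_csvs(tmp)

print(combined.shape)
```

For the folder in the question this would be called as combine_csvs('c:/data/datasets').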
