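I'm trying to create a combined dataframe from a series of 12 individual CSVs (the 12 months of a year, to be merged). All of the CSVs have the same format and column layout.
When I first ran it, it appeared to work, and I was left with a combined dataframe with 6 columns (as expected). On inspection, I found that the header row of every file had been read in as actual data, so I had some bad rows to eliminate. I could make these changes manually, but I'd like the code to handle this automatically.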
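So, to that end, I updated the code so that it reads only the first CSV with its header and reads the remaining CSVs without a header, then concatenates everything together. BUT this only partly works: the combined dataframe now has 12 columns instead of 6, which is clearly not what I want (see image below).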
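The code is similar; I just pass header=None to pd.read_csv() for the 11 CSVs after the first one (for the first CSV I don't use that parameter). Can anyone give me a hint as to why I'm getting 12 columns (with the data positioned as described above) when I run this code? The CSV files are laid out as shown below.
Any help is appreciated.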
import pandas as pd
import numpy as np
import os
# Need to include the header row only for the first csv (otherwise header row will be included
# for each read csv, which places improperly formatted rows into the combined dataframe).
totrows = 0
# Get list of csv files to read.
files = os.listdir('c:/data/datasets')
# Read the first csv file, including the header row.
dfSD = pd.read_csv('c:/data/datasets/' + files[0], skip_blank_lines=True)
# Now read the remaining csv files (without header row) and concatenate their values
# into our full Sales Data dataframe.
for file in files[1:]:
    df = pd.read_csv('c:/data/datasets/' + file, skip_blank_lines=True, header=None)
    dfSD = pd.concat([dfSD, df])
    totrows += df.shape[0]
    print(file + " == " + str(df.shape[0]) + " rows")
print()
print("TOTAL ROWS = " + str(totrows + pd.read_csv('c:/data/datasets/' + files[0]).shape[0]))
Answer 0 (score: 0)
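Here is a simple solution: read every file with its header row, so that each dataframe gets the same column names, then concatenate them all at once.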
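import pandas as pd
import os
totrows = 0
# Get list of csv files to read.
files = os.listdir('c:/data/datasets')
# Column names taken from the first file; None until that file has been read.
columns = None
# Collect one dataframe per file and concatenate once at the end.
dfSD = []
for file in files:
    # Every file has a header row, so let pandas use it as the column names.
    df = pd.read_csv('c:/data/datasets/' + file, skip_blank_lines=True)
    # Compare against None: "if not columns" raises ValueError on a pandas Index.
    if columns is None:
        columns = df.columns
    # Give every dataframe identical column labels so concat stacks rows only.
    df.columns = columns
    dfSD.append(df)
    totrows += df.shape[0]
    print(file + " == " + str(df.shape[0]) + " rows")
dfSD = pd.concat(dfSD, axis = 0)
dfSD = dfSD.reset_index(drop = True)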
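The same idea can be written more compactly. The following is only a sketch under a few assumptions: the helper name combine_csvs is made up for illustration, every file in the folder that ends in .csv belongs to the data set, and every file carries its own header row. It uses glob so that stray non-CSV files are skipped, and sorted() so that the files are read in a predictable order.

```python
import glob
import os

import pandas as pd


def combine_csvs(folder):
    """Read every *.csv in folder (each with its own header row) into one dataframe."""
    # glob only matches *.csv; sorted() makes the read order deterministic.
    paths = sorted(glob.glob(os.path.join(folder, "*.csv")))
    # Each file is read with its header, so all frames share the same column names.
    frames = [pd.read_csv(path, skip_blank_lines=True) for path in paths]
    # ignore_index=True renumbers the rows from 0, like reset_index(drop=True).
    return pd.concat(frames, axis=0, ignore_index=True)
```

It would be called as dfSD = combine_csvs('c:/data/datasets'). Note that pd.concat raises a ValueError if the folder contains no CSV files at all.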
Another possibility is:
import pandas as pd
import os
# Need to include the header row only for the first csv (otherwise header row will be included
# for each read csv, which places improperly formatted rows into the combined dataframe).
totrows = 0
# Get list of csv files to read.
files = os.listdir('c:/data/datasets')
# Read the first csv file, including the header row.
dfSD = pd.read_csv('c:/data/datasets/' + files[0], skip_blank_lines=True)
df_comb = [dfSD]
# Now read the remaining csv files and collect them for a single concatenation.
# header=None keeps pandas from naming the columns after each file's first row;
# skiprows=1 assumes every file starts with a header row and drops it, so it is not read as data.
for file in files[1:]:
    df = pd.read_csv('c:/data/datasets/' + file, skip_blank_lines=True, header=None, skiprows=1)
    # Reuse the first file's column names so that concat lines the columns up.
    df.columns = dfSD.columns
    df_comb.append(df)
    totrows += df.shape[0]
    print(file + " == " + str(df.shape[0]) + " rows")
# Pass the list itself to concat, not a list wrapped in another list.
dfSD = pd.concat(df_comb, axis = 0).reset_index(drop = True)
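As for why the code in the question ends up with 12 columns: pd.concat lines dataframes up by column label. A file read with header=None gets the integer labels 0 to 5, which never match the names taken from the first file's header, so the result contains both sets of columns. The sketch below reproduces this on a small scale with made-up two-column data held in memory (the column names id and amount are invented for illustration), giving 4 columns instead of 2.

```python
import io

import pandas as pd

# Two tiny in-memory "files" with the same header row (made-up data).
csv_a = "id,amount\n1,10\n2,20\n"
csv_b = "id,amount\n3,30\n4,40\n"

# Read as in the question: header for the first file, header=None for the rest.
first = pd.read_csv(io.StringIO(csv_a))              # labels: 'id', 'amount'
rest = pd.read_csv(io.StringIO(csv_b), header=None)  # labels: 0, 1

# The labels differ, so concat keeps all four columns and pads with NaN.
# The header row of csv_b is also read as a data row here.
mismatched = pd.concat([first, rest])
print(mismatched.shape)  # (5, 4)

# Reading the second file with its header gives matching labels.
rest_fixed = pd.read_csv(io.StringIO(csv_b))
fixed = pd.concat([first, rest_fixed], ignore_index=True)
print(fixed.shape)  # (4, 2)
```

With six columns per file, the same mismatch yields the 6 + 6 = 12 columns seen in the question.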