Question

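I am splitting a large CSV file (containing stock financial data) into smaller chunks. The CSV has an unusual layout, something like an Excel pivot table. The first few rows of the first column contain some headers.

The columns that follow repeat the company name, ID, and so on, because each company has more than one attribute, so a company spans several columns rather than just one.

After the first few rows, the columns start to look like a typical data frame, where the headers are in the columns rather than the rows.

Anyway, what I am trying to do is get Pandas to allow duplicate column headers instead of appending ".1", ".2", ".3", etc. after the header. I know Pandas does not allow this natively; is there a workaround? I tried setting header=None on read_csv, but it throws a tokenizing error, which I think makes sense. I can't think of a simple way to do it.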

import pandas as pd

csv_path = "C:\\Users\\ThirdHandBD\\Desktop\\Data Splitting\\pd-split\\chunk4.csv"

#df = pd.read_csv(csv_path, header=1, dtype='unicode', sep=';', low_memory=False, error_bad_lines=False)
df = pd.read_csv(csv_path, header = 1, dtype='unicode', sep=';', index_col=False)
print("I read in a dataframe with {} columns and {} rows.".format(
    len(df.columns), len(df)
))

filename = 1

#column increment
x = 30 * 59

for column in df:
    loc = df.columns.get_loc(column)
    if loc == (x * filename) + 1:
        y = filename - 1
        a = (x * y) + 1
        b = (x * filename) + 1
        date_df = df.iloc[:, :1]
        out_df = df.iloc[:, a:b]
        final_df = pd.concat([date_df, out_df], axis=1, join='inner')
        out_path = "C:\\Users\\ThirdHandBD\\Desktop\\Data Splitting\\pd-split\\chunk4-part" + str(filename) + ".csv"
        final_df.to_csv(out_path, index=False)
        #out_df.to_csv(out_path)
        filename += 1

# This should be the same as df, but with only the first column.
# Check it with similar code to above.
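The slice boundaries computed in the loop above can also be produced directly, without iterating over every column. The following is only a sketch: `column_chunks` is a hypothetical helper, not part of pandas, and it assumes the first column is the shared date column. Note that the loop above writes a chunk only when a column exists at position `b`, so the last chunk appears to be skipped; the sketch includes it.

```python
def column_chunks(n_cols, width, first=1):
    """Yield (start, stop) column positions for chunks of at most `width`
    columns, skipping the first `first` shared columns.

    Hypothetical helper for illustration; the final chunk may be narrower.
    """
    for start in range(first, n_cols, width):
        yield start, min(start + width, n_cols)


# Small stand-in numbers; the code above would use len(df.columns) and 30 * 59.
chunks = list(column_chunks(n_cols=8, width=3))
print(chunks)

# With a DataFrame, each pair could be used as:
#   out_df = df.iloc[:, start:stop]
```

Because the pairs are plain integer positions, they keep working even if the column labels contain duplicates.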

Edit

Based on https://github.com/pandas-dev/pandas/issues/19383, I added:

        final_df.columns = final_df.iloc[0]
        final_df = final_df.reindex(final_df.index.drop(0)).reset_index(drop=True)

So, the full code:

import pandas as pd

csv_path = "C:\\Users\\ThirdHandBD\\Desktop\\Data Splitting\\pd-split\\chunk4.csv"

#df = pd.read_csv(csv_path, header=1, dtype='unicode', sep=';', low_memory=False, error_bad_lines=False)
df = pd.read_csv(csv_path, header = 1, dtype='unicode', sep=';', index_col=False)
print("I read in a dataframe with {} columns and {} rows.".format(
    len(df.columns), len(df)
))

filename = 1

#column increment
x = 30 * 59

for column in df:
    loc = df.columns.get_loc(column)
    if loc == (x * filename) + 1:
        y = filename - 1
        a = (x * y) + 1
        b = (x * filename) + 1
        date_df = df.iloc[:, :1]
        out_df = df.iloc[:, a:b]
        final_df = pd.concat([date_df, out_df], axis=1, join='inner')
        out_path = "C:\\Users\\ThirdHandBD\\Desktop\\Data Splitting\\pd-split\\chunk4-part" + str(filename) + ".csv"
        final_df.columns = final_df.iloc[0]
        final_df = final_df.reindex(final_df.index.drop(0)).reset_index(drop=True)
        final_df.to_csv(out_path, index=False)
        #out_df.to_csv(out_path)
        filename += 1

# This should be the same as df, but with only the first column.
# Check it with similar code to above.

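Now the entire first row is gone. However, the expected output is for the header row to be replaced and the index reset, without the ".1", ".2", etc.

Screenshot:

The SimFin ID row is no longer present.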
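A likely reason the row vanished: with header=1 the header row has already been consumed by read_csv, so `final_df.iloc[0]` is the first data row, and promoting it to the column names removes it from the data. The sketch below shows the renaming and one way the promote-first-row idea might be applied instead, by reading with header=None so the header row arrives as ordinary data. The sample text in `raw` is made up for illustration, and it is padded to a constant number of fields; a real file with ragged leading rows may still need `skiprows` adjusted.

```python
import io

import pandas as pd

# Made-up miniature of the file: one metadata line, then a header row with repeats.
raw = (
    "SimFin export;;;\n"
    "Date;ACME;ACME;ACME\n"
    "2018-01-01;1;2;3\n"
    "2018-01-02;4;5;6\n"
)

# header=1 uses the second line as the header, and pandas renames the repeats.
mangled = pd.read_csv(io.StringIO(raw), header=1, sep=";", dtype=str)
print(list(mangled.columns))

# skiprows=1 drops the metadata line; header=None keeps the header row as data.
df = pd.read_csv(io.StringIO(raw), header=None, skiprows=1, sep=";", dtype=str)

# Promote the first row to column names, then drop it from the data.
df.columns = list(df.iloc[0])
df = df.iloc[1:].reset_index(drop=True)
print(list(df.columns))

# Duplicate names are written out unchanged.
csv_text = df.to_csv(index=False, sep=";")
print(csv_text)
```

The data rows stay intact here because the row being promoted really is the header row.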

Answer 1

This is how I did it:

    final_df.columns = final_df.columns.str.split('.').str[0]

Reference: https://pandas.pydata.org/pandas-docs/stable/text.html
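One caveat with splitting on '.': any column name that legitimately contains a period is cut short too. A slightly narrower variant, sketched here with made-up column names, strips only a trailing ".N" suffix. It is still a heuristic, since a genuine name that happens to end in a period followed by digits would also be changed.

```python
import pandas as pd

# Made-up columns: one renamed duplicate and one name with a real period in it.
df = pd.DataFrame([[1, 2, 3, 4]], columns=["Date", "ACME", "ACME.1", "P.E ratio"])

# Splitting on '.' also truncates "P.E ratio".
split_names = list(df.columns.str.split(".").str[0])
print(split_names)

# Removing only a trailing ".<digits>" leaves "P.E ratio" alone.
stripped_names = list(df.columns.str.replace(r"\.\d+$", "", regex=True))
print(stripped_names)

df.columns = stripped_names
```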

Answer 2

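The following solution ensures that other column names in the data frame that contain a period ('.') are not modified:

import pandas as pd
from csv import DictReader

csv_file_loc = "file.csv"

# Read the CSV; pandas renames duplicate headers to "name.1", "name.2", ...
df = pd.read_csv(csv_file_loc)

# Read the original header row with DictReader, closing the file afterwards
with open(csv_file_loc, 'r', newline='') as f:
    col_names = DictReader(f).fieldnames

# Restore the original (possibly duplicated) column names
df.columns = col_names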
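The snippet above assumes a comma-separated file with the header on the first line. For a file like the one in the question (semicolon-separated, header on the second line), the same idea might look like the sketch below. The sample text in `raw` is made up, and `csv.reader` is used in place of DictReader so the leading line can be skipped explicitly.

```python
import csv
import io

import pandas as pd

# Made-up miniature of the file: a metadata line, then a header with repeats.
raw = (
    "SimFin export;;;\n"
    "Date;ACME;ACME;P.E\n"
    "2018-01-01;1;2;3\n"
)

# pandas renames the second "ACME" to "ACME.1".
df = pd.read_csv(io.StringIO(raw), header=1, sep=";", dtype=str)

# Read the same header row untouched with the csv module.
reader = csv.reader(io.StringIO(raw), delimiter=";")
next(reader)                  # skip the metadata line
original_names = next(reader)

# Guard against a mismatch before overwriting the column labels.
if len(original_names) != len(df.columns):
    raise ValueError("header row and DataFrame have different widths")
df.columns = original_names
print(list(df.columns))
```

Since the names come straight from the file, a column such as "P.E" is left exactly as written.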
