How to set specific columns to int type using pandas

Time: 2017-09-12 13:23:08

Tags: python excel python-3.x pandas csv

I have this script that writes the CSV files in a folder to an Excel workbook:

from pandas.io.excel import ExcelWriter
import pandas
import os

path = 'data/'
ordered_list = sorted(os.listdir(path), key = lambda x: int(x.split(".")[0]))


with ExcelWriter('my_excel.xlsx') as ew:
    for csv_file in ordered_list:
        pandas.read_csv(path + csv_file).to_excel(ew, index = False, sheet_name=csv_file[:-4], encoding='utf-8')

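Now my problem is that all the columns (let's say G:H) are in string format (for example '400 or '10), with a ' in front. I think they are strings because the CSV converts them to strings, and I need them to be int. How can I make G:H int? I'm using Python 3. Thanks!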

PS (this is a sample of the CSV):

ANPIS,,,,,,,
AGENTIA JUDETEANA PENTRU PLATI SI INSPECTIE SOCIALA TIMIS,,,,,,,
,,,,,,,
Macheta Comparativa CREDITORI - numai pentru Beneficiile a caror Evidenta se tine si in Contabilitate si in aplicatia SAFIR,,,,,,,
Situatie ANALITICA - NOMINAL la 30.06.2017,,,,,,,
1. ALOCATIA DE STAT PENTRU COPII,,,,,,,
Nr. Benef,Nume Prenume,CNP,Data Constituirii,Suma Contabilitate,Suma SAFIR,Differenta Suma,Explicatii daca exista diferente
1,2,3,4,5,6,7=5-6,8
1,CAZACU MIHAI,133121140,Aug 2016,84,84
2,NICOARA PETRU,143152638,"Aug 2014, Sept 2014",126,84
3,CERNEA NICOLAE DAN,143354723,Dec 2015,84,84
4,LUDWIG PETRU,144091376,Nov 2014,42,42
5,POPA REMUS,1440915363,Iun 2015,84,84
6,BOGDAN MARCEL,144154726,"Feb 2015, Apr 2015, Sept 2015, Oct 2015, Feb 2016",336,336
7,HENDRE AUGUSTIN,145054704,Feb 2015,42,42
8,COJOC VASILE,147050307,"Sept 2014, Oct 2014",84,84
9,RADULESCU VICTOR,147352628,"Sept 2014, Oct 2014, Nov 2014, Dec 2014",168,168
10,RADAU DUMITRU,148054764,"Feb 2017, Mar 2017",168,168
11,COVACIU PETRU,148054802,Iun 2016,84,84
12,BOT IOAN,14808634,"Aug 2014, Sept 2014, Oct 2014, Nov 2014",168,168

^^ The header is this:

ANPIS,,,,,,,
AGENTIA JUDETEANA PENTRU PLATI SI INSPECTIE SOCIALA TIMIS,,,,,,,
,,,,,,,
Macheta Comparativa CREDITORI - numai pentru Beneficiile a caror Evidenta se tine si in Contabilitate si in aplicatia SAFIR,,,,,,,
Situatie ANALITICA - NOMINAL la 30.06.2017,,,,,,,
1. ALOCATIA DE STAT PENTRU COPII,,,,,,,
Nr. Benef,Nume Prenume,CNP,Data Constituirii,Suma Contabilitate,Suma SAFIR,Differenta Suma,Explicatii daca exista diferente
1,2,3,4,5,6,7=5-6,8

1 answer:

Answer 0 (score: 4)

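You can read each file twice: first only the header, using the nrows parameter, and then the body, using skiprows.

Then you also need to write twice.

The solution is a bit complicated because pandas parses the data incorrectly: a MultiIndex with 8 levels is not supported. If no header is set, the data from the header is concatenated with the body and the output becomes messy.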

with ExcelWriter('my_excel.xlsx') as ew:
    for csv_file in ordered_list:
        sheet = csv_file[:-4]
        # first 8 lines: title block and column headers, kept as they are
        df1 = pandas.read_csv(path + csv_file, nrows=8, header=None)
        # remaining lines: the data rows, parsed on their own
        df2 = pandas.read_csv(path + csv_file, skiprows=8, header=None)
        df1.to_excel(ew, index=False, sheet_name=sheet, header=False)
        # write the body directly below the header block on the same sheet
        df2.to_excel(ew, index=False, sheet_name=sheet, startrow=len(df1.index), startcol=0, header=False)
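
To see the effect of the two-pass read in isolation, here is a minimal sketch. The CSV text in `raw` and the two-row preamble are made up for illustration (the question's files have eight such rows), and `io.StringIO` stands in for a file on disk. It suggests that the body, once read separately from the header text, is parsed with integer columns.

```python
import io

import pandas as pd

# Hypothetical miniature of the sample file: 2 preamble rows, then data rows.
raw = (
    "ANPIS,,,\n"
    "Nr. Benef,Nume Prenume,Suma Contabilitate,Suma SAFIR\n"
    "1,CAZACU MIHAI,84,84\n"
    "2,NICOARA PETRU,126,84\n"
)

n_header = 2  # assumption for this sketch; the answer above uses 8

# First pass: only the preamble rows, kept as text.
head = pd.read_csv(io.StringIO(raw), nrows=n_header, header=None, dtype=str)

# Second pass: only the body. No header text is mixed into the columns,
# so pandas can infer numeric dtypes for the amount columns.
body = pd.read_csv(io.StringIO(raw), skiprows=n_header, header=None)

print(head)
print(body.dtypes)
```

With `header=None` the body columns are labelled 0, 1, 2, ... rather than by name, which is why the answer writes both frames with `header=False`.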

Use apply with str.strip to remove the ', then astype to convert to int:

cols = ['G','H']  # stands for the Excel columns G:H; use the actual column labels

with ExcelWriter('my_excel.xlsx') as ew:
    for csv_file in ordered_list:
        df = pandas.read_csv(path + csv_file)
        # force strings, drop the surrounding apostrophes, then cast to int
        df[cols] = df[cols].astype(str).apply(lambda x: x.str.strip("'")).astype(int)
        print(df.head())
        df.to_excel(ew, index=False, sheet_name=csv_file[:-4])
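
The strip-then-cast step can be tried on its own with a small frame. This is only a sketch: the column names `G` and `H` and the values are invented to mirror the question's '400 and '10 examples.

```python
import pandas as pd

# Hypothetical data: numbers stored as text with a leading apostrophe.
df = pd.DataFrame({"G": ["'400", "'10"], "H": ["'84", "'42"]})
cols = ["G", "H"]

# Make sure every cell is a string, strip apostrophes from both ends,
# then convert the cleaned text to integers.
df[cols] = df[cols].astype(str).apply(lambda s: s.str.strip("'")).astype(int)

print(df.dtypes)
print(df)
```

Note that `astype(int)` raises a ValueError if any cell is still not a valid integer after stripping, for example a missing value or a piece of header text.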

Another solution is to use the converters parameter with a custom function:

cols = ['G','H']

def converter(x):
    # each cell arrives as a string; drop the apostrophes and cast
    return int(x.strip("'"))

# map every column in cols to the same converter
converters = {x: converter for x in cols}

with ExcelWriter('my_excel.xlsx') as ew:
    for csv_file in ordered_list:
        df = pandas.read_csv(path + csv_file, converters=converters)
        print(df.head())
        df.to_excel(ew, index=False, sheet_name=csv_file[:-4])
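
A small, self-contained sketch of the converters approach follows. The CSV text in `raw` and its column names are assumptions made for the example, and `io.StringIO` replaces the file path.

```python
import io

import pandas as pd

# Hypothetical CSV where two columns carry a leading apostrophe.
raw = "name,G,H\nA,'400,'10\nB,'84,'42\n"


def converter(x):
    # read_csv passes each cell of the column to the converter as a string
    return int(x.strip("'"))


# One converter per column label; other columns are parsed as usual.
df = pd.read_csv(io.StringIO(raw), converters={"G": converter, "H": converter})

print(df)
```

Compared with stripping after the read, this cleans the values while the file is being parsed, so no second pass over the frame is needed. As before, a cell that is not a number after stripping makes the converter raise a ValueError.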