Question

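I have a large CSV file with more than 210,000 rows. I am new to Python and pandas. I want to efficiently loop over the timestamp column, split it into two new columns (date and time), format the new date column as %Y%m%d, and drop the new time column, i.e. write back to the CSV file only the newly formatted date column. How would you do this?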

Sample input file:

   minit,timestamp,open,high,low,close
   0,2009-02-23 17:32:00,1.2708,1.2708,1.2706,1.2706
   1,2009-02-23 17:33:00,1.2708,1.2708,1.2705,1.2706
   2,2009-02-23 17:34:00,1.2706,1.2707,1.2702,1.2702
   3,2009-02-23 17:35:00,1.2704,1.2706,1.27,1.27
   4,2009-02-23 17:36:00,1.2701,1.2706,1.2698,1.2703
   5,2009-02-23 17:37:00,1.2703,1.2703,1.27,1.2702
   6,2009-02-23 17:38:00,1.2701,1.2701,1.2696,1.2697

Sample output file:

   minit,date,open,high,low,close
   0,20090223,1.2708,1.2708,1.2706,1.2706
   1,20090223,1.2708,1.2708,1.2705,1.2706
   2,20090223,1.2706,1.2707,1.2702,1.2702
   3,20090223,1.2704,1.2706,1.27,1.27
   4,20090223,1.2701,1.2706,1.2698,1.2703
   5,20090223,1.2703,1.2703,1.27,1.2702
   6,20090223,1.2701,1.2701,1.2696,1.2697

After some Googling, I started writing sample code to do this:

     import csv
     import itertools
     import operator
     import time
     import datetime
     import pandas as pd
     from pandas import DataFrame, Timestamp
     from numpy import *

     # Note: strftime("%s") is a platform-specific extension, not portable.

     def datestring_to_timestamp(str):
         return time.mktime(time.strptime(str, "%Y-%m-%d %H:%M:%S"))

     def timestamp_to_datestring(timestamp):
         return time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(timestamp))

     def timestamp_to_float(str):
         return float(datetime.datetime.strptime(str, '%Y-%m-%d %H:%M:%S').strftime("%s"))

     def timestamp_to_intstring(str):
         return datetime.datetime.strptime(str, '%Y-%m-%d %H:%M:%S').strftime("%s")

     def timestamp_to_int(str):
         return int(datetime.datetime.strptime(str, '%Y-%m-%d %H:%M:%S').strftime("%s"))

     # Text mode with newline='' is what the csv module expects on Python 3.
     with open("inputfile.csv", newline='') as input, open('outputfile.csv', 'w', newline='') as output:
         reader = csv.reader(input, delimiter=',')
         writer = csv.writer(output, delimiter=',')

     # Need to process loop or process the timestamp column

Answer 1

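You can specify a date format string in the arguments to to_csv, and it will write the dates in the format you want, with no need to extract, convert, or add new columns.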

So load the data using read_csv:

df = pd.read_csv('mydata.csv', parse_dates=['timestamp'])

In [15]:

df
Out[15]:
   minit           timestamp    open    high     low   close
0      0 2009-02-23 17:32:00  1.2708  1.2708  1.2706  1.2706
1      1 2009-02-23 17:33:00  1.2708  1.2708  1.2705  1.2706
2      2 2009-02-23 17:34:00  1.2706  1.2707  1.2702  1.2702
3      3 2009-02-23 17:35:00  1.2704  1.2706  1.2700  1.2700
4      4 2009-02-23 17:36:00  1.2701  1.2706  1.2698  1.2703
5      5 2009-02-23 17:37:00  1.2703  1.2703  1.2700  1.2702
6      6 2009-02-23 17:38:00  1.2701  1.2701  1.2696  1.2697
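
One way to confirm that parse_dates produced real datetimes rather than plain strings is to inspect the column dtype. The following is a minimal sketch; the sample rows are inlined through io.StringIO purely so it runs without a file, which is an assumption and not part of the original answer.

```python
import io

import pandas as pd

# Two sample rows from the question, inlined so no file is needed (assumption).
raw = """minit,timestamp,open,high,low,close
0,2009-02-23 17:32:00,1.2708,1.2708,1.2706,1.2706
1,2009-02-23 17:33:00,1.2708,1.2708,1.2705,1.2706
"""

df = pd.read_csv(io.StringIO(raw), parse_dates=["timestamp"])

# True when the column was parsed into a datetime64 dtype.
is_dt = pd.api.types.is_datetime64_any_dtype(df["timestamp"])
print(df.dtypes)
print(is_dt)
```

If the column came back as object dtype instead, the date_format argument to to_csv would have nothing to act on.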

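If you want, you can rename the column at this stage. Passing the parameter date_format='%Y%m%d' to to_csv writes only the date part to the CSV; we can then reload the file and display what was saved: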

In [19]:

df.rename(columns={'timestamp':'date'},inplace=True)
df.to_csv(r'c:\data\date.csv', date_format='%Y%m%d')
df1 = pd.read_csv(r'C:\data\date.csv', index_col=[0])
df1
Out[19]:
   minit      date    open    high     low   close
0      0  20090223  1.2708  1.2708  1.2706  1.2706
1      1  20090223  1.2708  1.2708  1.2705  1.2706
2      2  20090223  1.2706  1.2707  1.2702  1.2702
3      3  20090223  1.2704  1.2706  1.2700  1.2700
4      4  20090223  1.2701  1.2706  1.2698  1.2703
5      5  20090223  1.2703  1.2703  1.2700  1.2702
6      6  20090223  1.2701  1.2701  1.2696  1.2697
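
The session above writes the DataFrame index as an extra first column, which is why the reload uses index_col=[0]. The output file requested in the question has no such column, so passing index=False to to_csv is probably what is wanted there. A minimal sketch, again with sample rows inlined via io.StringIO as an assumption for self-containment, and with to_csv returning a string instead of writing to a path:

```python
import io

import pandas as pd

# Two sample rows from the question, inlined so no file is needed (assumption).
raw = """minit,timestamp,open,high,low,close
0,2009-02-23 17:32:00,1.2708,1.2708,1.2706,1.2706
1,2009-02-23 17:33:00,1.2708,1.2708,1.2705,1.2706
"""

df = pd.read_csv(io.StringIO(raw), parse_dates=["timestamp"])
df = df.rename(columns={"timestamp": "date"})

# index=False omits the row-index column;
# date_format controls how datetime values are written.
csv_text = df.to_csv(index=False, date_format="%Y%m%d")
print(csv_text)
```

To write to disk, pass a file path as the first argument to to_csv. If the formatted value is needed inside the DataFrame itself rather than only in the file, df['date'].dt.strftime('%Y%m%d') is another route, at the cost of turning the column into strings.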

Splitting a timestamp column into two new columns in a CSV using Python and pandas
