Question

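I have a CSV file containing a date column with year, month, day, and hour. I am trying to create a new CSV file with one column listing every date (by hour) between the minimum and maximum values in the first file, and a second column with the count of how many times each date appears. For example: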

file 1:
2016-02-18-23:19
2016-02-18-23:45
2016-01-03-05:12
2016-01-03-07:57

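I can use Counter to extract the dates and build a dictionary of dates and their occurrences. My guess is that I then have to use datetime to build an hourly list from the minimum to the maximum, and somehow assign the counts to that second list. This will be a very large data set.

Any help is greatly appreciated.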

Answer 1

Going just by the tags attached to your question, here is a solution that uses Counter, datetime, and good old csv:

from collections import Counter
from datetime import datetime
import csv


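# Open both files in one statement so the input is closed as well
with open('file1.txt') as infile, open('file2.txt', 'w') as outfile:
    csv_writer = csv.writer(outfile, delimiter="\t", lineterminator="\n")
    # Truncate each timestamp to the hour, then count the occurrences
    data = Counter(
        datetime.strptime(x.strip(), '%Y-%m-%d-%H:%M').strftime('%Y-%m-%d-%H')
        for x in infile if x.strip()
    )
    # Tuples sort by their first element, so this orders rows by date
    csv_writer.writerows(sorted(data.items()))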

This produces a file containing:

2016-01-03-05   1
2016-01-03-07   1
2016-02-18-23   2
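
To see the key step in isolation, here is a minimal sketch that truncates the four sample timestamps from the question to the hour and counts them. The names `sample` and `hour_key` are illustrative only and do not appear in the answer above.

```python
from collections import Counter
from datetime import datetime

# Sample rows taken from the question's file 1
sample = [
    "2016-02-18-23:19",
    "2016-02-18-23:45",
    "2016-01-03-05:12",
    "2016-01-03-07:57",
]

def hour_key(stamp):
    """Parse 'YYYY-MM-DD-HH:MM' and return the 'YYYY-MM-DD-HH' hour bucket."""
    parsed = datetime.strptime(stamp.strip(), "%Y-%m-%d-%H:%M")
    return parsed.strftime("%Y-%m-%d-%H")

# Count how many timestamps fall into each hour bucket
counts = Counter(hour_key(s) for s in sample)

for key, n in sorted(counts.items()):
    print(key, n, sep="\t")
```

Because the key format is zero-padded and ordered from year down to hour, sorting the strings also sorts them chronologically.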

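Edit:

On second thought, I think I may have slightly misunderstood the question. It looks like you want the hours that are missing from the original file to be added to the output file with a count of zero. I think the following should be more inclusive: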

from collections import Counter
from datetime import datetime, timedelta
import csv


with open('file1.txt') as infile, open('file2.txt', 'w') as outfile:
    csv_writer = csv.writer(outfile, delimiter="\t", lineterminator="\n")

    # Parse each row and truncate it to the start of its hour
    datetimes = [
        datetime.strptime(x.strip(), '%Y-%m-%d-%H:%M').replace(minute=0)
        for x in infile if x.strip()
    ]

    # Get the minimum and maximum values
    min_date, max_date = min(datetimes), max(datetimes)

    # Get the number of whole hours between the min and max dates
    num_hours = int((max_date - min_date).total_seconds() // 3600)

    # Count the values
    counts = Counter(datetimes)

    # Walk every hour from min to max, already in sorted order;
    # Counter returns 0 for hours missing from the original file
    data = []
    for i in range(num_hours + 1):
        hour = min_date + timedelta(hours=i)
        data.append((hour.strftime('%Y-%m-%d-%H'), counts[hour]))

    # Output the data into file2.txt
    csv_writer.writerows(data)

This should produce:

2016-01-03-05   1
2016-01-03-06   0
2016-01-03-07   1
2016-01-03-08   0
2016-01-03-09   0
2016-01-03-10   0
...
2016-02-18-21   0
2016-02-18-22   0
2016-02-18-23   2
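
Since the question mentions a very large data set, the same idea can be sketched as a function that reads lines lazily and keeps only the per-hour counts in memory. This is a sketch under assumptions, not the answer's own code: the name `hourly_counts` and its `fmt` parameter are hypothetical, and the input is assumed to contain one `YYYY-MM-DD-HH:MM` timestamp per line.

```python
from collections import Counter
from datetime import datetime, timedelta

def hourly_counts(lines, fmt="%Y-%m-%d-%H:%M"):
    """Return (hour_string, count) pairs for every hour from the earliest
    to the latest timestamp in `lines`, with 0 for hours that never occur."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Truncate to the start of the hour so equal hours share one key
        counts[datetime.strptime(line, fmt).replace(minute=0)] += 1
    if not counts:
        return []
    current, last = min(counts), max(counts)
    result = []
    while current <= last:
        # Looking up a missing key in a Counter yields 0 without adding it
        result.append((current.strftime("%Y-%m-%d-%H"), counts[current]))
        current += timedelta(hours=1)
    return result

rows = hourly_counts(["2016-01-03-05:50", "2016-01-03-07:10", "2016-01-03-07:30"])
print(rows)
```

`lines` can be an open file object, so the timestamps never need to be held in a list. Truncating to the hour before measuring the range also matters: 05:50 and 07:10 are only 80 minutes apart, yet they span three hour buckets.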

I hope this proves useful.

Answer 2

Here is a pandas solution.

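import pandas as pd
# Splitting on ":" leaves "YYYY-MM-DD-HH" as the index and the minutes in column 'v'
df = pd.read_csv("file1", sep=":", names=['v'])
df.index = pd.to_datetime(df.index, format='%Y-%m-%d-%H')
# pd.TimeGrouper has been removed from pandas; pd.Grouper(freq=...) replaces it
df.groupby(pd.Grouper(freq='H')).size().to_csv("file2", header=False)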

The output file looks like this:

2016-01-03 05:00:00,1
2016-01-03 06:00:00,0
2016-01-03 07:00:00,1
2016-01-03 08:00:00,0
...
2016-02-18 19:00:00,0
2016-02-18 20:00:00,0
2016-02-18 21:00:00,0
2016-02-18 22:00:00,0
2016-02-18 23:00:00,2

Answer 3

I think you can use a regular expression:

import re

regex = re.compile(r'^\d{4}-\d{2}-\d{2}-\d{2}:\d{2}$')
stamps = {}

with open('file1.csv', 'r') as input_file:
    lines = input_file.read().splitlines()

for line in lines:
    if regex.search(line):
        elements = line.split('-')
        elements.extend(elements.pop().split(':'))
        key = elements[0] + '-' + elements[1] + '-' + elements[2] + '-' + elements[3]
        stamps.setdefault(key, 0)
        stamps[key] += 1

with open('file2.csv','w') as output_file:
    for key, value in sorted(stamps.items()):
        output_file.write(key + '\t' + str(value) + '\n')

file1.csv

2016-02-18-23:19
2016-02-18-23:45
2016-01-03-05:12
2016-01-03-07:57

file2.csv

2016-01-03-05 1
2016-01-03-07 1
2016-02-18-23 2
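
As a possible simplification of the split-and-rejoin step, a capture group can pull the hour key straight out of the match. This is only a sketch: `HOUR_RE` and `count_hours` are hypothetical names, and `Counter` is swapped in for the `setdefault` dictionary used above.

```python
import re
from collections import Counter

# Group 1 captures 'YYYY-MM-DD-HH'; the minutes are matched but discarded
HOUR_RE = re.compile(r'^(\d{4}-\d{2}-\d{2}-\d{2}):\d{2}$')

def count_hours(lines):
    """Count timestamps per hour, ignoring lines that do not match."""
    stamps = Counter()
    for line in lines:
        match = HOUR_RE.match(line.strip())
        if match:  # skip anything that is not a well-formed timestamp
            stamps[match.group(1)] += 1
    return stamps

stamps = count_hours([
    "2016-02-18-23:19",
    "2016-02-18-23:45",
    "not a date",
    "2016-01-03-05:12",
])
for key, value in sorted(stamps.items()):
    print(key + '\t' + str(value))
```

Like the answer above, this only counts hours that actually occur; it does not fill in the missing hours with zeros.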

CSV count duplicates
