Question

I have a log (text) file that uses this syntax:

1/21/18, 22:48 - ~text~
1/21/18, 22:48 - ~text~
1/23/18, 22:48 - ~text~
~text~
~text~
1/24/18, 22:48 - ~text~

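I would like to get an array of all the dates, e.g. ["1/21/18","1/21/18","1/23/18","1/24/18"]

My final goal is to build a frequency histogram of the dates, to see how many events occur on each day (just to understand how the events vary over time). So if you have tips that make this part easier, they are welcome!

I have already tried a regular expression based on question 4709652, but it did not work as expected. One of my problems is that the text file is large (hundreds of megabytes), which makes things slow.

What is the best way to achieve this?

Thanks!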

Answer 1

As @Patrick suggested, pandas would be a simpler and more efficient way to do this.

import pandas as pd

# 'logs.txt' stands in for the name of your file
p = pd.read_csv('logs.txt', names=["date", "random"])
# Convert the first column to datetime; text that is not a date becomes NaT
p['date'] = pd.to_datetime(p['date'], errors='coerce')
p = p.dropna(subset=['date'])  # drop rows whose date is NaT
print(p['date'])

Output:

0   2018-01-21
1   2018-01-21
2   2018-01-23
5   2018-01-24

You can even pass the date column straight to a histogram function without dropping the rows, provided that function ignores NaT.
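For the per-day frequency the question asks about, one possible sketch is to count the parsed dates with `value_counts()`, which skips NaT by default. The in-memory sample and the name `events_per_day` are assumptions made for illustration, and the explicit `format` is also an addition here; with a real log you would pass the file name to `read_csv` instead.

```python
import io
import pandas as pd

# Sample data from the question, standing in for the real log file
sample = io.StringIO(
    "1/21/18, 22:48 - ~text~\n"
    "1/21/18, 22:48 - ~text~\n"
    "1/23/18, 22:48 - ~text~\n"
    "~text~\n"
    "~text~\n"
    "1/24/18, 22:48 - ~text~\n"
)

p = pd.read_csv(sample, names=["date", "random"])
# Lines that do not start with a month/day/year date become NaT
p["date"] = pd.to_datetime(p["date"], format="%m/%d/%y", errors="coerce")

# value_counts() drops NaT by default; sort_index() puts the days in order
events_per_day = p["date"].value_counts().sort_index()
print(events_per_day)
```

If matplotlib is installed, `events_per_day.plot(kind="bar")` should draw the counts as a bar chart.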

Answer 2

You can read the file line by line and apply a regular expression to each line, for example:

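import re

# Compile once; the pattern matches dates such as 1/21/18
date_pattern = re.compile(r'(\d+/\d+/\d+)')

dates = []  # do not name this "list", which would shadow the built-in
with open('logs.txt', 'r') as fp:
    # Iterating over the file object reads one line at a time
    for line in fp:
        # extend() appends right away; a bare map() is lazy in Python 3
        dates.extend(date_pattern.findall(line))

print(dates)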

Output:

['1/21/18', '1/21/18', '1/23/18', '1/24/18']
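Since the end goal in the question is a count of events per day, the same line-by-line approach can feed a `collections.Counter` instead of building a full list, which keeps memory use low for a file of hundreds of megabytes. This is only a sketch: the helper name `count_events_per_day` and the stricter pattern anchored at the start of the line are assumptions, not part of the answer above.

```python
import re
from collections import Counter

# A date such as 1/21/18 at the very start of a line, followed by a comma
DATE_AT_START = re.compile(r'^(\d{1,2}/\d{1,2}/\d{2}),')

def count_events_per_day(lines):
    """Count how many lines start with each date (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = DATE_AT_START.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Sample lines from the question; any iterable of lines works
sample = [
    "1/21/18, 22:48 - ~text~\n",
    "1/21/18, 22:48 - ~text~\n",
    "1/23/18, 22:48 - ~text~\n",
    "~text~\n",
    "~text~\n",
    "1/24/18, 22:48 - ~text~\n",
]
counts = count_events_per_day(sample)

# Crude text histogram: one '#' per event on that day
for day, n in counts.items():
    print(day, '#' * n)
```

For a real log, an open file object can be passed directly, e.g. `with open('logs.txt') as fp: counts = count_events_per_day(fp)`, so the file is never loaded in full.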

Answer 3

Assuming the whole text file has the same format, this should work.

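def process():
    dates = []

    # "with" closes the file; iterating avoids loading it all via readlines()
    with open('test.txt') as file:
        for line in file:
            # Continuation lines in the sample start with '~'; skip them
            if not line.startswith('~'):
                # The date is everything before the first comma
                dates.append(line.split(',')[0])

    return dates

print(process())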

This is the output.

['1/21/18', '1/21/18', '1/23/18', '1/24/18']

Answer 4

You can do it with re.findall:

import re
text = '1/21/18, 22:48 - ~text~\n1/21/18, 22:48 - ~text~\n1/23/18, 22:48 - ~text~\n~text~\n~text~\n1/24/18, 22:48 - ~text~'
re.findall(r'^([\d/]+),', text, re.MULTILINE)
# ['1/21/18', '1/21/18', '1/23/18', '1/24/18']
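The strings returned by `re.findall` do not sort chronologically as text (for example, '10/1/18' sorts before '2/1/18'), which matters when plotting events over time. A minimal sketch, assuming the dates are month/day/two-digit-year as in the sample, converts them with `datetime.strptime`; the variable names `raw` and `dates` are illustrative.

```python
import re
from datetime import datetime

text = ('1/21/18, 22:48 - ~text~\n1/21/18, 22:48 - ~text~\n'
        '1/23/18, 22:48 - ~text~\n~text~\n~text~\n1/24/18, 22:48 - ~text~')

# Same pattern as above: digits and slashes at the start of a line
raw = re.findall(r'^([\d/]+),', text, re.MULTILINE)

# Assumes month/day/year order; strptime raises ValueError on anything else
dates = [datetime.strptime(d, '%m/%d/%y').date() for d in raw]
print(dates)
```

The resulting `date` objects compare and sort in calendar order, so they can be counted or binned per day directly.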

Build an array of dates from text
