Question

I have a CSV file with a column holding a million records of different date ranges. For example, I have something like this: 2004-2016; 1980-2016; 1991-2006; 2000-2012; 1998 - 2011

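If I want to find the most common 3-year, 5-year or 7-year range across all of these records, how would I do that in Python? It doesn't matter if some records get dropped, but I'm trying to find the most frequent 7-year or 10-year range across all the ranges. Can anyone help?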

Answer 1

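You can use collections.defaultdict and collections.Counter to handle this. The defaultdict groups the date ranges by their length in years, and each Counter tracks how often every range string of that length occurs: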

from collections import defaultdict, Counter

year_ranges = defaultdict(Counter)

s = '2004-2016; 1980-2016; 1991-2006; 2000-2012; 1998 - 2011; 2004-2016'
for start, end in [r.split('-') for r in s.split('; ')]:
    start, end = int(start), int(end)
    year_ranges[end-start].update(['{}-{}'.format(start, end)])    # update counter for normalised range string

>>> print(year_ranges)
defaultdict(<class 'collections.Counter'>, {12: Counter({'2004-2016': 2, '2000-2012': 1}), 36: Counter({'1980-2016': 1}), 15: Counter({'1991-2006': 1}), 13: Counter({'1998-2011': 1})})

If you want to know the most common range string among the ranges that span 12 years:

>>> year_ranges[12].most_common(1)
[('2004-2016', 2)]

I'm not sure how you want to handle the case where several range strings span the same number of years and also have the same count.
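One possible way to deal with ties, sketched under the assumption that you want every range string sharing the top count rather than just one of them. The helper name most_common_with_ties and the sample counts are made up for illustration:

```python
from collections import Counter

def most_common_with_ties(counter):
    """Return every (item, count) pair that shares the highest count."""
    if not counter:
        return []
    top = max(counter.values())
    # Sort so the result is deterministic regardless of insertion order.
    return sorted((item, n) for item, n in counter.items() if n == top)

# Hypothetical counts for ranges that all span 12 years.
ranges_12 = Counter({'2004-2016': 2, '2000-2012': 2, '1990-2002': 1})
tied = most_common_with_ties(ranges_12)
print(tied)  # [('2000-2012', 2), ('2004-2016', 2)]
```

Counter.most_common(1) would return only one of the tied entries, so comparing against the maximum count is a simple way to see all of them.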

Answer 2

Parse the file to get 2-element tuples: ((2004, 2016), (1980, 2016), ...).
Convert them to differences: (12, 36, ...).
Create a Counter object from this sequence and call its most_common method.
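The three steps above could look roughly like this. It is only a sketch: the in-memory csv_text stands in for the real file, and the assumption that each row holds a single range in its first column may not match your data.

```python
import csv
import io
from collections import Counter

# Stand-in for the real file; the single-column layout is an assumption.
csv_text = "2004-2016\n1980-2016\n1991-2006\n2000-2012\n1998 - 2011\n"

def parse_range(text):
    """Turn a string such as '1998 - 2011' into the tuple (1998, 2011)."""
    start, end = text.split('-')
    return int(start), int(end)  # int() tolerates surrounding spaces

# Step 1: parse the rows into 2-element tuples (blank rows are skipped).
pairs = [parse_range(row[0]) for row in csv.reader(io.StringIO(csv_text)) if row]

# Step 2: convert each tuple into its length in years.
lengths = [end - start for start, end in pairs]

# Step 3: count the lengths and ask for the most common one.
length_counts = Counter(lengths)
print(length_counts.most_common(1))  # [(12, 2)]
```

For the real file you would pass an open file object to csv.reader instead of the io.StringIO wrapper.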

Answer 3

Split your sequence into individual ranges. Suppose you call the resulting sequence diff.

from collections import Counter

diff = ["2004-2016", "1980-2016", "1991-2006", "2000-2012", "1998 - 2011"]

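# eval() evaluates each string as arithmetic ("2004-2016" -> -12), so use it only on trusted input
diff_frequency = Counter( map( lambda x: abs( eval(x) ), diff) ).most_common()

# most_common() sorts by descending count, so the first item is a (length, count) pair
most_common_diff = diff_frequency[0]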
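Since eval will execute whatever the CSV happens to contain, here is a sketch of the same idea without it. It assumes four-digit years separated by a hyphen and silently skips anything else, which seems acceptable given that the question allows dropping some records. The names RANGE_RE, span and raw are illustrative only.

```python
import re
from collections import Counter

# Matches "2004-2016" or "1998 - 2011"; the four-digit-year format is an assumption.
RANGE_RE = re.compile(r'^\s*(\d{4})\s*-\s*(\d{4})\s*$')

def span(text):
    """Return the length in years of a range string, or None if it is malformed."""
    m = RANGE_RE.match(text)
    if m is None:
        return None
    start, end = map(int, m.groups())
    return abs(end - start)

raw = ["2004-2016", "1980-2016", "1991-2006", "2000-2012", "1998 - 2011", "n/a"]

# Drop malformed records, then count how often each length occurs.
spans = [s for s in map(span, raw) if s is not None]
span_frequency = Counter(spans).most_common()
print(span_frequency[0])  # (12, 2)
```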
