Question

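I'm trying to do something very basic. I have a tab-delimited text file with two columns, a name and a date. The dates are in Excel's numeric (serial) format. Here is an example: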

Bill to Name    Date
James Doe       41929
Jane Doe        41852
Adam Adamson    42244
Adam Adamson    41529

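What I need to do is loop over the list of names, find the difference between each person's minimum and maximum date, and write it to another list. The output should look like the input above, except that each name appears only once and the numbers will be smaller. Not everyone has more than one date: some names have only one, others have 30. So far I've gotten little further than reading the file in.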

input_dir = "C:\\Users\\Intern\\Documents\\"
data_file = "Python.txt"
output_dir = "C:\\Users\\Intern\\Documents\\"
output_file_all = "Tenure.txt"

#testing file input
with open(input_dir + data_file,'r') as ifile :
    for idx, row in enumerate(ifile.readlines()) :
        print(row)
        if idx > 0 :
            break

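That works fine, but the loop is what confuses me. I assume it would be something like "for each name in ifile, Tenure = max(date) - min(date)", but I don't think that would iterate correctly.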

Answer 1

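Using the csv module will pay off if the structure of the input file gets more complicated later. A dictionary seems like the right data structure for this problem, and defaultdict saves us a few lines.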

import csv
from collections import defaultdict

d = defaultdict( list )

input_file = 'a.csv'
output_file = 'b.csv'

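# Python 3: the csv module expects text mode with newline=''
with open( input_file, newline='' ) as infile:
    reader = csv.reader(infile, delimiter='\t')
    next(reader, None)  # skip the header
    for row in reader:
        # group the dates (as integers) under each name
        d[ row[0] ].append( int(row[1]) )

with open( output_file, 'w', newline='' ) as outfile:
    writer = csv.writer(outfile, delimiter='\t')
    for key, value in d.items():
        # one row per name: latest date minus earliest date
        writer.writerow( [key, max(value) - min(value)] )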

The output, written to "b.csv" (row order may vary with the Python version):

Jane Doe        0
James Doe       0
Adam Adamson    715
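
The grouping step can also be tried without any files. The following is a minimal sketch rather than the code above: the function name tenure_by_name and the in-memory sample rows are assumptions made for illustration. It uses a plain dict with setdefault, which shows the extra step that defaultdict(list) takes care of implicitly.

```python
def tenure_by_name(rows):
    """Return {name: max(date) - min(date)} for (name, date) pairs."""
    dates = {}
    for name, date in rows:
        # setdefault creates the empty list the first time a name is seen;
        # defaultdict(list) would do the same thing without this call.
        dates.setdefault(name, []).append(int(date))
    return {name: max(values) - min(values) for name, values in dates.items()}


# Sample rows from the question (hypothetical in-memory stand-in for the file).
rows = [
    ("James Doe", "41929"),
    ("Jane Doe", "41852"),
    ("Adam Adamson", "42244"),
    ("Adam Adamson", "41529"),
]
result = tenure_by_name(rows)
print(result)
```

Names with a single date come out as 0, because the minimum and maximum are the same value.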

Answer 2

IIUC, you can do this quite easily with the pandas package:

import pandas as pd
df = pd.DataFrame({'Bill to Name': ['James Doe', 'Jane Doe', 'Adam Adamson', 'Adam Adamson'], 'Date': [41929, 41852, 42244, 41529]})

print(df)
   Bill to Name   Date
0     James Doe  41929
1      Jane Doe  41852
2  Adam Adamson  42244
3  Adam Adamson  41529

result = df.groupby('Bill to Name').agg(lambda x: max(x) - min(x))

print(result)
               Date
Bill to Name       
Adam Adamson    715
James Doe         0
Jane Doe          0
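
The example above builds the DataFrame by hand. To start from the tab-separated file instead, something like the following sketch should work. Here io.StringIO stands in for the real file, and the variable names are assumptions for illustration; passing a file path to pd.read_csv and to_csv in place of the buffers would read from and write to disk.

```python
import io

import pandas as pd

# Stand-in for the input file (assumption: header row, tab delimiter).
sample = io.StringIO(
    "Bill to Name\tDate\n"
    "James Doe\t41929\n"
    "Jane Doe\t41852\n"
    "Adam Adamson\t42244\n"
    "Adam Adamson\t41529\n"
)

df = pd.read_csv(sample, sep="\t")

# Range of dates per name: largest value minus smallest value.
tenure = df.groupby("Bill to Name")["Date"].agg(lambda s: s.max() - s.min())

# Write the result as tab-separated text without a header row.
out = io.StringIO()
tenure.to_csv(out, sep="\t", header=False)
print(out.getvalue())
```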

Answer 3

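This code should do it. Note that because I was working with a very large file, I wrote a binary search to look for duplicates, which makes it a bit more complicated. I also use an index to build up the list of dates for each name.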

import csv

index = []
input_file = 'input.csv'
output_file = 'output.csv'

def find_name(index, name):
    """ Binary search to see if the name exists in the index yet. """
    if len(index) == 0:
        return -1
    start = 0
    limit = len(index) - 1
    while start <= limit:
        guess = (start + limit) // 2  # integer division, so guess is a valid list index
        if index[guess][0] == name:
            return guess
        elif index[guess][0] < name:
            start = guess + 1
        else:
            limit = guess - 1
    return -1

def add_to_index(index, name, date):
    """ Sorts the existing index, then looks the name up with "find_name".
        If the name is found, the date is appended to that name's list.
        If it's not found (-1), a new [name, [date]] entry is appended. """
    index.sort()
    name_index = find_name(index, name)
    if name_index == -1:
        index.append([name, [date]])
    else:
        index[name_index][1].append(date)


# Read through each row of the input file, skipping the header, and
# send each row to the "add_to_index" function.
# Python 3: the csv module expects text mode with newline=''.
with open( input_file, newline='' ) as infile:
    reader = csv.reader(infile, delimiter='\t')
    next(reader, None)  # skip the header
    for row in reader:
        # convert to int so the later sort is numeric, not lexicographic
        add_to_index(index, row[0], int(row[1]))

# Write the output from the index to the output file: one row per
# name, with the difference between the latest and earliest date.
with open( output_file, 'w', newline='' ) as outfile:
    writer = csv.writer(outfile, delimiter='\t')
    for e in index:
        name = e[0]
        e[1].sort()
        days = e[1][-1] - e[1][0]
        writer.writerow([name, days])

So... in the end, this builds a data structure that looks like this:

[[name, [date, date, date]], [name, [date, date]]]

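Each name is unique, and a list of dates is associated with each unique name. To get the difference, I simply sorted the date elements and subtracted the first element [0] from the last element [-1]. I hope that makes sense; it certainly created the file correctly in my tests.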

The output file looks like this:

Adam Adamson    715
James Doe       0
Jane Doe        0
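
Re-sorting the whole index before every lookup is the expensive part of this approach. As an alternative sketch (a swapped-in technique, not what the code above does), the standard-library bisect module can keep the names sorted as they arrive, with a parallel list holding each name's dates. The function name add_date and the two-list layout are assumptions for illustration.

```python
import bisect

names = []   # sorted list of unique names
dates = []   # dates[i] holds the list of dates for names[i]


def add_date(name, date):
    """Store date under name, keeping names sorted via binary search."""
    i = bisect.bisect_left(names, name)
    if i < len(names) and names[i] == name:
        dates[i].append(date)      # name already present
    else:
        names.insert(i, name)      # new name goes at its sorted position
        dates.insert(i, [date])


# Sample rows from the question, already converted to integers.
for name, date in [
    ("James Doe", 41929),
    ("Jane Doe", 41852),
    ("Adam Adamson", 42244),
    ("Adam Adamson", 41529),
]:
    add_date(name, date)

tenures = [(n, max(d) - min(d)) for n, d in zip(names, dates)]
print(tenures)
```

Using max and min here also avoids having to sort each date list before subtracting.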

Iterate over a list of names and create a new list using min/max values
