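I'm trying to do something very basic. I have a tab-delimited text file with two columns, a name and a date. The dates are in Excel's numeric (serial) format. Here is an example: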
Bill to Name Date
James Doe 41929
Jane Doe 41852
Adam Adamson 42244
Adam Adamson 41529
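As a side note on the date format: an Excel serial number counts days, so subtracting two of them gives a difference in days directly. Below is a minimal sketch of turning the serials into calendar dates, assuming the workbook uses Excel's default 1900 date system. The helper name excel_serial_to_date is made up for illustration and is not part of any library.

```python
from datetime import date, timedelta

# Assumption: Excel's default 1900 date system. This offset is only
# valid for serials that fall after February 1900.
EXCEL_EPOCH = date(1899, 12, 30)

def excel_serial_to_date(serial):
    """Convert a whole-number Excel serial to a datetime.date (hypothetical helper)."""
    return EXCEL_EPOCH + timedelta(days=int(serial))

# The two Adam Adamson rows from the sample data.
first = excel_serial_to_date(41529)
last = excel_serial_to_date(42244)
print(first, last, (last - first).days)
```

The day difference between the converted dates is the same as the plain difference between the two serial numbers, so the conversion is only needed for display.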
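What I need to do is go through the list of names, find the difference between the minimum and maximum date for each person, and output that to another list. The output list should look like the input list above, except that each name appears only once and the numbers will be smaller. Not everyone has more than one date: some names have only one, and some have 30. So far I have pretty much only managed to read in the file.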
input_dir = "C:\\Users\\Intern\\Documents\\"
data_file = "Python.txt"
output_dir = "C:\\Users\\Intern\\Documents\\"
output_file_all = "Tenure.txt"

#testing file input
with open(input_dir + data_file,'r') as ifile :
    for idx, row in enumerate(ifile.readlines()) :
        print(row)
        if idx > 0 :
            break
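That part works fine, but the looping has me confused. I assume it would be something like "for each name in ifile, Tenure = max(date) - min(date)", but I don't think that would iterate correctly.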
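Answer 0 (score: 2)
Using the csv module will help in the future if the input file structure gets more complicated. A dictionary seems like the right data structure for this problem, and defaultdict saves us a few lines of code.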
import csv
from collections import defaultdict

d = defaultdict(list)  # name -> list of Excel serial dates
input_file = 'a.csv'
output_file = 'b.csv'

# In Python 3 the csv module expects text-mode files opened with newline=''
with open(input_file, newline='') as infile:
    reader = csv.reader(infile, delimiter='\t')
    next(reader, None)  # skip the header
    for row in reader:
        d[row[0]].append(int(row[1]))

with open(output_file, 'w', newline='') as outfile:
    writer = csv.writer(outfile, delimiter='\t')
    for key, value in d.items():
        # tenure = latest date minus earliest date for this name
        writer.writerow([key, max(value) - min(value)])
This writes the following to "b.csv" (row order may differ):
Jane Doe 0
James Doe 0
Adam Adamson 715
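To illustrate the point about defaultdict, here is a small sketch that runs the same grouping on the question's sample rows held in memory rather than in a file. The function name tenure_by_name and the SAMPLE string are hypothetical, introduced only for this illustration.

```python
import csv
import io
from collections import defaultdict

# Sample rows from the question, tab-delimited (a stand-in for the real file).
SAMPLE = (
    "Bill to Name\tDate\n"
    "James Doe\t41929\n"
    "Jane Doe\t41852\n"
    "Adam Adamson\t42244\n"
    "Adam Adamson\t41529\n"
)

def tenure_by_name(lines):
    """Return {name: max(date) - min(date)} for tab-delimited rows."""
    dates = defaultdict(list)  # a missing key starts out as an empty list
    reader = csv.reader(lines, delimiter="\t")
    next(reader, None)  # skip the header
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or malformed lines
        dates[row[0]].append(int(row[1]))
    return {name: max(vals) - min(vals) for name, vals in dates.items()}

result = tenure_by_name(io.StringIO(SAMPLE))
print(result)
```

With a plain dict, the append line would need something like dates.setdefault(row[0], []).append(...) or an explicit "if name not in dates" check, which is the extra code that defaultdict avoids.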
Answer 1 (score: 1)
IIUC, you can do this easily with the pandas package:
import pandas as pd
df = pd.DataFrame({'Bill to Name': ['James Doe', 'Jane Doe', 'Adam Adamson', 'Adam Adamson'], 'Date': [41929, 41852, 42244, 41529]})
print(df)
Bill to Name Date
0 James Doe 41929
1 Jane Doe 41852
2 Adam Adamson 42244
3 Adam Adamson 41529
result = df.groupby('Bill to Name').agg(lambda x: max(x) - min(x))
print(result)
Date
Bill to Name
Adam Adamson 715
James Doe 0
Jane Doe 0
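Answer 2 (score: 0)
This code should do it. Note that, to cope with a very large file, I wrote a binary search to look up names that have already been seen, which makes it a bit more complicated. I also use an index to build the list of dates for each name.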
import csv

index = []
input_file = 'input.csv'
output_file = 'output.csv'


def find_name(index, name):
    """Binary search to see whether the name already exists in the index."""
    if len(index) == 0:
        return -1
    start = 0
    limit = len(index) - 1
    while start <= limit:
        guess = (start + limit) // 2  # integer division: list indices must be ints
        if index[guess][0] == name:
            return guess
        elif index[guess][0] < name:
            start = guess + 1
        else:
            limit = guess - 1
    return -1


def add_to_index(index, name, date):
    """Sort the existing index, then look the name up with find_name.
    find_name returns the position of the name in the list if it is found,
    and -1 if it is not. New names are appended; for known names the date
    is added to the existing list."""
    index.sort()
    name_index = find_name(index, name)
    if name_index == -1:
        index.append([name, [date]])
    else:
        index[name_index][1].append(date)


# Read through each row of the input file, skipping the header,
# and send each row to add_to_index.
with open(input_file, newline='') as infile:
    reader = csv.reader(infile, delimiter='\t')
    next(reader, None)  # skip the header
    for row in reader:
        # store dates as ints so they sort numerically, not as text
        add_to_index(index, row[0], int(row[1]))

# Write the index to the output file: one row per name, holding the
# difference between that name's latest and earliest date.
with open(output_file, 'w', newline='') as outfile:
    writer = csv.writer(outfile, delimiter='\t')
    for e in index:
        name = e[0]
        e[1].sort()
        days = e[1][-1] - e[1][0]
        writer.writerow([name, days])
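So, in the end, this builds a data structure that looks like this:
[[name, [date, date, date]], [name, [date, date]]]
The names are unique, and a list of dates is associated with each unique name. To get the difference, I simply sort the dates and subtract the first element [0] from the last element [-1]. I hope that makes sense; it did create the file correctly in my tests.
The output file looks like this: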
Adam Adamson 715
James Doe 0
Jane Doe 0
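The hand-written binary search could also be replaced by the standard library's bisect module, which finds the insertion point in an already sorted list, so the index does not need to be re-sorted for every row. This is a swapped-in variant and not the code from the answer above: it keeps only the earliest and latest date per name instead of the full list, and names, spans and add_date are hypothetical names chosen for this sketch.

```python
import bisect

names = []  # unique names, kept in sorted order
spans = []  # spans[i] = [earliest, latest] date seen for names[i]

def add_date(name, day):
    """Record one (name, day) pair while keeping `names` sorted."""
    i = bisect.bisect_left(names, name)  # binary search for the slot
    if i < len(names) and names[i] == name:
        # Known name: widen its [earliest, latest] span if needed.
        spans[i][0] = min(spans[i][0], day)
        spans[i][1] = max(spans[i][1], day)
    else:
        # New name: insert at the position that keeps the list sorted.
        names.insert(i, name)
        spans.insert(i, [day, day])

# Sample rows from the question, already parsed into (name, int) pairs.
for name, day in [("James Doe", 41929), ("Jane Doe", 41852),
                  ("Adam Adamson", 42244), ("Adam Adamson", 41529)]:
    add_date(name, day)

tenures = [(n, hi - lo) for n, (lo, hi) in zip(names, spans)]
print(tenures)
```

Inserting into the middle of a list is still linear in the list length, so for very large inputs a dictionary keyed by name is likely the simpler and faster choice.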