I have a tab-delimited text file that looks like this:
id name age sex Basis Salary
2345 john 23 M Monthly 6000
2345 john 23 M Yearly 72000
4356 mary 26 F Perday 225
4356 mary 26 F Monthly 7000
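Using id as the key, I need to turn the Basis and Salary values into columns in the result file, as shown below.
Note: if there is no value for PerDay, Monthly, or Yearly, that column should be set to ' '.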
id Name age sex PerDay Monthly Yearly
2345 john 23 M ' ' 6000 72000
4356 mary 26 F 225 7000 ' '
How can we do this in a Pythonic way?
Answer 0 (score: 0)
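mypath = '/path/to/file.csv'
with open(mypath) as fh:
    # splitlines() drops the trailing newline that would otherwise stick to salary
    lines = fh.read().splitlines()
# Separate the header row from the data rows
header, body = lines[0], lines[1:]
records = {}
for record in body:
    id, name, age, sex, basis, salary = record.split('\t')
    cached = records.get(id)
    if cached:
        # id already seen: only record the salary for this basis
        cached[basis] = salary
    else:
        # First row for this id: every basis that was not provided gets ' '
        records[id] = {'id': id, 'name': name, 'age': age, 'sex': sex, basis: salary,
                       **{base: ' ' for base in {'Yearly', 'Monthly', 'Perday'} - {basis}}}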
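Brief explanation:
mypath is the path to your .csv file.
I strip off the header, then get all the remaining records as a list of strings. Next, we iterate over that list.
Split each line on \t (the tab character), then unpack it into the original structure.
Look up the record by its id. If that id has already been processed, we only need to add a basis entry with the associated salary. If it has not, we add a record containing all the fields and unpack a ' ' value for every basis that was not provided, as requested.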
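The answer stops once records has been built and does not show how to write the result file. Below is a minimal sketch of that last step, assuming the dictionary layout described above (one dict per id with the keys id, name, age, sex, Perday, Monthly, Yearly). The names write_records and COLUMNS are hypothetical and not part of the original answer.

```python
import csv
import io

# Hypothetical column order; the keys match the dictionaries built above.
COLUMNS = ['id', 'name', 'age', 'sex', 'Perday', 'Monthly', 'Yearly']


def write_records(records, out):
    """Write one tab-delimited line per id to the file-like object `out`."""
    writer = csv.DictWriter(out, fieldnames=COLUMNS, delimiter='\t',
                            lineterminator='\n')
    writer.writeheader()
    for row in records.values():
        writer.writerow(row)


# Usage with an in-memory buffer and one record shaped like the ones above.
sample = {'2345': {'id': '2345', 'name': 'john', 'age': '23', 'sex': 'M',
                   'Monthly': '6000', 'Yearly': '72000', 'Perday': ' '}}
buffer = io.StringIO()
write_records(sample, buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with open(path, 'w', newline='') in place of the buffer.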
Answer 1 (score: 0)
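import re

# Groups: 1 id, 2 name, 3 age, 4 sex, 5 basis, 6 salary
pattern = re.compile(r"(\d+)\s+([A-Za-z]+)\s+(\d+)\s+([MmFf])\s+([A-Za-z]+)\s+(\d+)")
# Position of each basis in the output row
columns = {'perday': 4, 'monthly': 5, 'yearly': 6}
rows = {}  # id -> [id, name, age, sex, PerDay, Monthly, Yearly]
with open('filePath', 'r') as input_file:
    for line in input_file.readlines()[1:]:  # excluding the first line
        m = pattern.search(line)
        # m is a match object for a line such as '2345 john 23 M Monthly 6000'
        if m and m.group(5).lower() in columns:
            # >>> m.groups()
            # ('2345', 'john', '23', 'M', 'Monthly', '6000')
            id_, name, age, sex, basis, salary = m.groups()
            # First row for this id: start with ' ' in all three basis columns
            row = rows.setdefault(id_, [id_, name, age, sex, ' ', ' ', ' '])
            # Based on m.group(5), fill that column and leave the others as ' '
            row[columns[basis.lower()]] = salary
with open('outfile.txt', 'w') as output_file:
    output_file.write('id\tName\tage\tsex\tPerDay\tMonthly\tYearly\n')
    for row in rows.values():
        output_file.write('\t'.join(row) + '\n')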
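To make the group numbering in the regular expression concrete, here is a small check that applies the same pattern to one sample row from the question. This is only an illustration; the names PATTERN, line, and fields are assumptions introduced for this example.

```python
import re

# Same pattern as in the answer, compiled once as a raw string.
PATTERN = re.compile(
    r"(\d+)\s+([A-Za-z]+)\s+(\d+)\s+([MmFf])\s+([A-Za-z]+)\s+(\d+)"
)

line = "2345\tjohn\t23\tM\tMonthly\t6000\n"  # sample row from the question
match = PATTERN.search(line)
fields = match.groups()
print(fields)  # ('2345', 'john', '23', 'M', 'Monthly', '6000')

# Groups 5 and 6 carry the basis and the salary.
basis, salary = match.group(5), match.group(6)
```

Because \s+ matches any run of whitespace, the pattern accepts rows separated by tabs or by spaces.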
Answer 2 (score: 0)
I think something like this works best. It does assume that the ID numbers are unique, though.
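import csv

id_column = 0
melt_column = 4   # "Basis": its values become column names
value_column = 5  # "Salary": its values fill those columns
in_file = "file.csv"
out_file = "out.csv"
new_headers = ['id', 'Name', 'age', 'sex', 'PerDay', 'Monthly', 'Yearly']
# Lower-cased copy so that "Perday" in the data matches "PerDay" in the header
lookup = [h.lower() for h in new_headers]
header = None
data = dict()
with open(in_file, newline="") as csvfile:
    for row in csv.reader(csvfile, delimiter="\t"):
        if header is None:
            header = row  # the first row is the header, skip it
            continue
        melt_idx = lookup.index(row[melt_column].lower())
        if row[id_column] not in data:
            # First row for this id: copy id..sex and add three ' ' placeholders
            data[row[id_column]] = row[id_column:melt_column] + [" ", " ", " "]
        data[row[id_column]][melt_idx] = row[value_column]
with open(out_file, mode="w", newline="") as csvfile:
    writer = csv.writer(csvfile, delimiter="\t")
    writer.writerow(new_headers)
    for val in data.values():
        writer.writerow(val)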
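As a quick way to try the grouping idea without touching the file system, the sketch below runs it on the sample rows from the question held in memory. It uses csv.DictReader instead of column indexes, matches the basis case-insensitively (the data has Perday, the desired header has PerDay), and assumes a single space as the placeholder. The names SAMPLE, BASES, BLANK, and pivot are hypothetical.

```python
import csv
import io

# Sample rows from the question, tab-delimited, held in memory for the demo.
SAMPLE = (
    "id\tname\tage\tsex\tBasis\tSalary\n"
    "2345\tjohn\t23\tM\tMonthly\t6000\n"
    "2345\tjohn\t23\tM\tYearly\t72000\n"
    "4356\tmary\t26\tF\tPerday\t225\n"
    "4356\tmary\t26\tF\tMonthly\t7000\n"
)

BASES = ["PerDay", "Monthly", "Yearly"]  # output column names
BLANK = " "  # assumed placeholder for a missing basis


def pivot(infile):
    """Return {id: row_dict} with one column per basis."""
    by_name = {b.lower(): b for b in BASES}  # case-insensitive lookup
    result = {}
    for row in csv.DictReader(infile, delimiter="\t"):
        # Create the entry on first sight of an id, with blanks for every basis.
        entry = result.setdefault(row["id"], {
            "id": row["id"], "Name": row["name"], "age": row["age"],
            "sex": row["sex"], **{b: BLANK for b in BASES},
        })
        entry[by_name[row["Basis"].lower()]] = row["Salary"]
    return result


table = pivot(io.StringIO(SAMPLE))
for entry in table.values():
    print(entry)
```

A basis value outside the three expected names raises a KeyError here, which makes unexpected input visible instead of silently dropping it.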