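I have a CSV file (original.csv) that contains a unique ID column (uid) and the columns I want to evaluate. From it I want to create a new file (result.csv) that keeps uid unmodified and adds new columns based on the evaluation.
My original file looks like this: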
uid,var01,var02,var03,var04,var05
1,2,3,2,3,1
2,2,2,2,2,1
3,,2,2,1,1
4,2,2,2,1,1
5,1,2,2,1,2
6,3,,2,3,2
7,3,,1,1,1
8,2,3,1,,3
9,3,1,,3,
10,,3,2,3,3
I want to do an evaluation with the same logic as this (written in SQL): case when var01 = 1 then 1 else 0 end as var01_new, case when var02 = 1 then 1 else 0 end as var02_new, ...
The result would be:
uid,var01_new,var02_new,var03_new,var04_new,var05_new
1,0,0,0,0,1
2,0,0,0,0,1
3,0,0,0,1,1
4,0,0,0,1,1
5,1,0,0,1,0
6,0,0,0,0,0
7,0,0,1,1,1
8,0,0,1,0,0
9,0,1,0,0,0
10,0,0,0,0,0
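Given the size of the actual file (~20M rows, 50+ columns), I would like to keep the solution in base Python rather than use memory-bound packages such as Pandas and Numpy. I tried modifying this S/O question, but I could not get it to work for my use case.
I tried this code, but it did not work.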
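Answer 0 (score: 1)
Python is not a purely declarative language like SQL. It is procedural, so you have to describe the control flow, although it does have many declarative constructs. So,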
>>> import csv
>>> import io
>>> s = """uid,var01,var02,var03,var04,var05
... 1,2,3,2,3,1
... 2,2,2,2,2,1
... 3,,2,2,1,1
... 4,2,2,2,1,1
... 5,1,2,2,1,2
... 6,3,,2,3,2
... 7,3,,1,1,1
... 8,2,3,1,,3
... 9,3,1,,3,
... 10,,3,2,3,3"""
>>> reader = csv.reader(io.StringIO(s))
>>> result = io.StringIO()
>>> writer = csv.writer(result)
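The above just lets us pretend we are working with files by using a stream (io.StringIO). You would do this with real files, which you have already done with your with statement. Now, the crux of the problem: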
>>> header = next(reader)
>>> # Keep the uid column name as-is and suffix only the evaluated columns.
>>> writer.writerow([header[0]] + ["{}_new".format(v) for v in header[1:]])
55
>>> for row in reader:
...     new_row = [row[0]]  # uid the same
...     new_row.extend(1 if c == '1' else 0 for c in row[1:])
...     writer.writerow(new_row)
...
13
13
13
13
13
13
13
13
13
14
>>> print(result.getvalue())
uid,var01_new,var02_new,var03_new,var04_new,var05_new
1,0,0,0,0,1
2,0,0,0,0,1
3,0,0,0,1,1
4,0,0,0,1,1
5,1,0,0,1,0
6,0,0,0,0,0
7,0,0,1,1,1
8,0,0,1,0,0
9,0,1,0,0,0
10,0,0,0,0,0
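I used comprehensions and conditional expressions, which allow a nicer, more declarative way of transforming data. Without them, you can do the same thing with if-else statements, building up each row step by step: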
>>> result = io.StringIO()
>>> reader = csv.reader(io.StringIO(s))
>>> writer = csv.writer(result)
>>> header = next(reader)
>>> new_header = [header[0]]  # uid keeps its name
>>> for name in header[1:]:
...     new_header.append("{}_new".format(name))
...
>>> writer.writerow(new_header)
55
>>> for row in reader:
...     new_row = [row[0]]  # uid the same
...     for c in row[1:]:
...         if c == '1':
...             new_row.append(1)
...         else:
...             new_row.append(0)
...     writer.writerow(new_row)
...
13
13
13
13
13
13
13
13
13
14
>>> print(result.getvalue())
uid,var01_new,var02_new,var03_new,var04_new,var05_new
1,0,0,0,0,1
2,0,0,0,0,1
3,0,0,0,1,1
4,0,0,0,1,1
5,1,0,0,1,0
6,0,0,0,0,0
7,0,0,1,1,1
8,0,0,1,0,0
9,0,1,0,0,0
10,0,0,0,0,0
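For real files instead of in-memory streams, the same loop might be sketched as below. This is only an illustration under a few assumptions: the function name recode_csv and its match parameter are hypothetical, the first column is taken to be uid, and cells are compared as strings, so a value such as "1.0" or " 1" would not count as a match. Because rows are read and written one at a time, memory use should stay roughly constant regardless of the number of rows.

```python
import csv


def recode_csv(src_path, dst_path, match="1"):
    """Stream src_path row by row and write 1/0 flags to dst_path.

    Returns the number of data rows written. (Hypothetical helper.)
    """
    # newline="" is what the csv module documentation recommends for files.
    with open(src_path, newline="") as src, \
            open(dst_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)

        header = next(reader, None)
        if header is None:  # empty input file: nothing to write
            return 0

        # Keep the first column name (uid) and suffix the others with _new.
        writer.writerow(
            [header[0]] + ["{}_new".format(name) for name in header[1:]]
        )

        count = 0
        for row in reader:
            if not row:  # skip blank lines
                continue
            # Copy uid unchanged; flag every other cell as 1 or 0.
            writer.writerow(
                [row[0]] + [1 if cell == match else 0 for cell in row[1:]]
            )
            count += 1
        return count
```

With the file names from the question, the call would be recode_csv("original.csv", "result.csv").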
Answer 1 (score: 1)
In your code, the expressions 'uid' = 'uid' and 'var01_new' == 0 are not valid assignments, so your code raises the exception SyntaxError: can't assign to literal.
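As a small illustration of that error, the sketch below uses the built-in compile() to check whether a snippet parses. The helper name raises_syntax_error is hypothetical. Note that only the first expression fails to compile: == is a comparison, so the second one is legal Python but assigns nothing. The exact wording of the error message differs between Python versions, so it is not printed here.

```python
def raises_syntax_error(source):
    """Return True if compiling `source` raises SyntaxError."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError:
        return True
    return False


# Assigning to a string literal is rejected by the parser.
print(raises_syntax_error("'uid' = 'uid'"))
# A comparison compiles, but it does not assign anything.
print(raises_syntax_error("'var01_new' == 0"))
# Assigning to a name is the valid form.
print(raises_syntax_error("uid = 'uid'"))
```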
Alternatively, you can solve your problem without using the csv module. For example, assuming your input file is named id_input.csv and your output file is named new.csv:
# Open both files in one with statement so they are closed automatically.
# Mode 'w' overwrites new.csv instead of appending to an old run.
with open("id_input.csv", 'r') as src, open("new.csv", 'w') as f:
    next(src, None)  # skip the original header line
    f.write("uid,var01_new,var02_new,var03_new,var04_new,var05_new\n")
    for line in src:
        # Strip the trailing newline so the last column compares correctly.
        dd = line.rstrip('\n').split(",")
        f.write(dd[0] + ',' + ",".join('1' if j == '1' else '0' for j in dd[1:]) + '\n')
With the code above and this input:
uid,var01,var02,var03,var04,var05
1,2,3,2,3,1
2,2,2,2,2,1
3,,2,2,1,1
4,2,2,2,1,1
5,1,2,2,1,2
6,3,,2,3,2
7,3,,1,1,1
8,2,3,1,,3
9,3,1,,3,
10,,3,2,3,3
the output file new.csv will contain the following data:
uid,var01_new,var02_new,var03_new,var04_new,var05_new
1,0,0,0,0,1
2,0,0,0,0,1
3,0,0,0,1,1
4,0,0,0,1,1
5,1,0,0,1,0
6,0,0,0,0,0
7,0,0,1,1,1
8,0,0,1,0,0
9,0,1,0,0,0
10,0,0,0,0,0