How do I remove the extra quotes from a CSV file?

Time: 2016-12-27 03:23:35

Tags: csv numpy tensorflow

I am having trouble reading a CSV file.

I tried the string replace method, but NumPy arrays do not support it.

The CSV file looks like this:

"num","phone","sensorID","press","temp","accel","gps_lat","gps_lng","time"
"1","null","A0:E6:F8:7B:16:EA","0","17","1.25","0","0","2016-12-14 13:34:59"
"2","null","A0:E6:F8:7B:16:A9","0","18","1.19","0","0","2016-12-14 13:34:59"
"3","null","A0:E6:F8:7B:15:A5","0","18","1.19","0","0","2016-12-14 13:34:59"
"4","null","A0:E6:F8:7B:16:EA","0","17","1.25","0","0","2016-12-14 13:35:00"
"5","null","A0:E6:F8:7B:16:A9","0","18","1.19","0","0","2016-12-14 13:35:00"
"6","null","A0:E6:F8:7B:15:A5","0","19","1.38","0","0","2016-12-14 13:35:00"
"7","null","A0:E6:F8:7B:16:D6","0","18","1.12","0","0","2016-12-14 13:35:01"
"8","null","A0:E6:F8:7B:16:EA","0","17","1.31","0","0","2016-12-14 13:35:01"
"9","null","A0:E6:F8:7B:15:A5","0","19","1.38","0","0","2016-12-14 13:35:01"

But when I load this file with numpy.loadtxt, every value keeps its double quotes, as shown below.

Source code

import numpy as np
a = np.loadtxt('db_file.csv', delimiter=',', dtype='str', unpack=True)
print(a)

Result

[['"num"' '"1"' '"2"' ..., '"6979"' '"6980"' '"6981"']
 ['"phone"' '"null"' '"null"' ..., '" 821099631345"' '" 821099631345"'
  '" 821099631345"']
 ['"sensorID"' '"A0:E6:F8:7B:16:EA"' '"A0:E6:F8:7B:16:A9"' ...,
  '"A0:E6:F8:7B:16:EA"' '"A0:E6:F8:7B:16:A9"' '"A0:E6:F8:7B:16:D6"']
 ..., 
 ['"gps_lat"' '"0"' '"0"' ..., '37.596332"' '"37.596332"' '"37.596332"']
 ['"gps_lng"' '"0"' '"0"' ..., '"127.031773"' '"127.031773"' '"127.031773"']
 ['"time"' '"2016-12-14 13:34:59"' '"2016-12-14 13:34:59"' ...,
  '"2016-12-15 00:03:11"' '"2016-12-15 00:03:11"' '"2016-12-15 00:03:12"']]

I want to remove the " characters.

So what I really want is this list:

[['num', '1', '2' ..., '6979', '6980', '6981']
 ['phone', 'null', 'null' ..., '821099631345', ' 821099631345'
  ' 821099631345']
 ['sensorID', 'A0:E6:F8:7B:16:EA', 'A0:E6:F8:7B:16:A9' ...,
  'A0:E6:F8:7B:16:EA', 'A0:E6:F8:7B:16:A9', 'A0:E6:F8:7B:16:D6']
 ..., 
 ['gps_lat', '0', '0' ..., '37.596332' '37.596332' '37.596332']
 ['gps_lng' '0' '0' ..., '127.031773' '127.031773' '127.031773']
 ['time' '2016-12-14 13:34:59' '2016-12-14 13:34:59' ...,
  '2016-12-15 00:03:11' '2016-12-15 00:03:11' '2016-12-15 00:03:12']]

What code should I use?

3 answers:

Answer 0: (score: 1)

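In an editor such as Excel, use find and replace to change the double quotes (") to single quotes ('). Since I don't know which editor you are using, here is the step-by-step procedure for replacing any character in MS Excel.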

https://support.office.com/en-us/article/Find-or-replace-text-and-numbers-on-a-worksheet-3a2c910f-01b9-4263-8db2-333dead6ae33

Answer 1: (score: 1)

Use numpy.char.strip, which strips the given characters from both ends of every element.

Code:

a = np.array(['"1"', '"2"', '"3"'])
a = np.char.strip(a, '"')
print(a)

Output:

['1' '2' '3']
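
Applied to the question's array, the same call works on the whole 2-D result of loadtxt in one step. The following is a minimal sketch, assuming a small in-memory sample (the `sample` variable is made up for illustration and stands in for db_file.csv):

```python
import io
import numpy as np

# Hypothetical three-line sample standing in for db_file.csv.
sample = io.StringIO(
    '"num","phone","sensorID"\n'
    '"1","null","A0:E6:F8:7B:16:EA"\n'
    '"2","null","A0:E6:F8:7B:16:A9"\n'
)

# Load every field as a string; the quotes are still part of each value.
raw = np.loadtxt(sample, delimiter=',', dtype=str, unpack=True)

# Strip double quotes from both ends of every element at once.
clean = np.char.strip(raw, '"')
print(clean)
```

Only the quote characters at the ends are removed, so a leading space inside the quotes (as in the phone column of the question) is kept. Newer NumPy releases (1.23 and later) also accept a quotechar argument in loadtxt, which may make the separate strip step unnecessary.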

Answer 2: (score: 0)

With pandas I get:

In [1278]: pd.read_csv('stack41338622.txt')
Out[1278]: 
   num phone           sensorID  press  temp  accel  gps_lat  gps_lng  \
0    1  null  A0:E6:F8:7B:16:EA      0    17   1.25        0        0   
1    2  null  A0:E6:F8:7B:16:A9      0    18   1.19        0        0   
2    3  null  A0:E6:F8:7B:15:A5      0    18   1.19        0        0   
3    4  null  A0:E6:F8:7B:16:EA      0    17   1.25        0        0   
4    5  null  A0:E6:F8:7B:16:A9      0    18   1.19        0        0   
5    6  null  A0:E6:F8:7B:15:A5      0    19   1.38        0        0   
6    7  null  A0:E6:F8:7B:16:D6      0    18   1.12        0        0   
7    8  null  A0:E6:F8:7B:16:EA      0    17   1.31        0        0   
8    9  null  A0:E6:F8:7B:15:A5      0    19   1.38        0        0   

                  time  
0  2016-12-14 13:34:59  
1  2016-12-14 13:34:59  
2  2016-12-14 13:34:59  
3  2016-12-14 13:35:00  
4  2016-12-14 13:35:00  
5  2016-12-14 13:35:00  
6  2016-12-14 13:35:01  
7  2016-12-14 13:35:01  
8  2016-12-14 13:35:01  
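
If the goal is the all-string array from the question rather than a DataFrame, the pandas result can be converted back. This is only a sketch under assumptions: the `sample` text is a made-up stand-in for the file, and header=None is used so that the header row ends up in the array like any other row.

```python
import io
import pandas as pd

# Hypothetical sample standing in for the quoted CSV file.
sample = io.StringIO(
    '"num","phone","sensorID"\n'
    '"1","null","A0:E6:F8:7B:16:EA"\n'
    '"2","null","A0:E6:F8:7B:16:A9"\n'
)

# header=None keeps the header line as data; dtype=str disables type
# inference; keep_default_na=False stops "null" from becoming NaN.
df = pd.read_csv(sample, header=None, dtype=str, keep_default_na=False)

# Transpose so each row holds one column of the file, as with unpack=True.
arr = df.to_numpy().astype(str).T
print(arr)
```

pandas handles the quoting while parsing, so no separate strip step is needed here.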

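As described in "Reading CSV files in numpy where delimiter is ","", converters can remove the extra quotes. Unfortunately dtype=None does not work together with converters, so we have to spell the dtype out. Here is a start: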

In [1327]: def foo(astr):
      ...:     return astr[1:-1]
In [1328]: convs = dict((col, foo) for col in range(9))
In [1329]: dt = ['i','S10','S20','i', 'i','f','i','i','S20']
In [1330]: data = np.genfromtxt('stack41338622.txt', dtype=dt, delimiter=',', names=True, converters=convs)
In [1331]: data
Out[1331]: 
array([ (1, b'null', b'A0:E6:F8:7B:16:EA', 0, 17, 1.25, 0, 0, b'2016-12-14 13:34:59'),
       (2, b'null', b'A0:E6:F8:7B:16:A9', 0, 18, 1.190000057220459, 0, 0, b'2016-12-14 13:34:59'),
       (3, b'null', b'A0:E6:F8:7B:15:A5', 0, 18, 1.190000057220459, 0, 0, b'2016-12-14 13:34:59'),
       (4, b'null', b'A0:E6:F8:7B:16:EA', 0, 17, 1.25, 0, 0, b'2016-12-14 13:35:00'),
       (5, b'null', b'A0:E6:F8:7B:16:A9', 0, 18, 1.190000057220459, 0, 0, b'2016-12-14 13:35:00'),
       (6, b'null', b'A0:E6:F8:7B:15:A5', 0, 19, 1.3799999952316284, 0, 0, b'2016-12-14 13:35:00'),
       (7, b'null', b'A0:E6:F8:7B:16:D6', 0, 18, 1.1200000047683716, 0, 0, b'2016-12-14 13:35:01'),
       (8, b'null', b'A0:E6:F8:7B:16:EA', 0, 17, 1.309999942779541, 0, 0, b'2016-12-14 13:35:01'),
       (9, b'null', b'A0:E6:F8:7B:15:A5', 0, 19, 1.3799999952316284, 0, 0, b'2016-12-14 13:35:01')], 
      dtype=[('num', '<i4'), ('phone', 'S10'), ('sensorID', 'S20'), ('press', '<i4'), ('temp', '<i4'), ('accel', '<f4'), ('gps_lat', '<i4'), ('gps_lng', '<i4'), ('time', 'S20')])

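Given the time I have spent on this, I am inclined to go with the other suggestion: remove the extra quotes in a text editor. A comma-separated file does not need these quotes, and they are more of a nuisance than a help.

In an editor I simply removed every ":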

num,phone,sensorID,press,temp,accel,gps_lat,gps_lng,time
1,null,A0:E6:F8:7B:16:EA,0,17,1.25,0,0,2016-12-14 13:34:59
2,null,A0:E6:F8:7B:16:A9,0,18,1.19,0,0,2016-12-14 13:34:59
3,null,A0:E6:F8:7B:15:A5,0,18,1.19,0,0,2016-12-14 13:34:59
4,null,A0:E6:F8:7B:16:EA,0,17,1.25,0,0,2016-12-14 13:35:00
5,null,A0:E6:F8:7B:16:A9,0,18,1.19,0,0,2016-12-14 13:35:00
...

In [1336]: data = np.genfromtxt('stack41338622_1.txt', dtype=None, delimiter=',', names=True)
In [1337]: data
Out[1337]: 
array([ (1, b'null', b'A0:E6:F8:7B:16:EA', 0, 17, 1.25, 0, 0, b'2016-12-14 13:34:59'),
       (2, b'null', b'A0:E6:F8:7B:16:A9', 0, 18, 1.19, 0, 0, b'2016-12-14 13:34:59'),
       (3, b'null', b'A0:E6:F8:7B:15:A5', 0, 18, 1.19, 0, 0, b'2016-12-14 13:34:59'),
       ..., 
      dtype=[('num', '<i4'), ('phone', 'S4'), ('sensorID', 'S17'), ('press', '<i4'), ('temp', '<i4'), ('accel', '<f8'), ('gps_lat', '<i4'), ('gps_lng', '<i4'), ('time', 'S19')])

The b'' prefix is how Python 3 displays byte strings. You will not see it in Python 2.
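
The manual editor step can also be done in code before the text reaches genfromtxt. The following is a rough sketch, assuming no field contains a comma or a literal quote inside its quotes (a blanket replace would corrupt such data); the `quoted` sample is made up to stand in for the file contents, and encoding='utf-8' is passed so the string columns come back as str rather than bytes.

```python
import io
import numpy as np

# Hypothetical sample standing in for the contents of the quoted file.
quoted = (
    '"num","phone","sensorID","press","temp","accel","gps_lat","gps_lng","time"\n'
    '"1","null","A0:E6:F8:7B:16:EA","0","17","1.25","0","0","2016-12-14 13:34:59"\n'
    '"2","null","A0:E6:F8:7B:16:A9","0","18","1.19","0","0","2016-12-14 13:34:59"\n'
)

# Drop every double quote, which is what the editor step did by hand.
unquoted = quoted.replace('"', '')

# With the quotes gone, dtype=None can infer a type for each column.
data = np.genfromtxt(io.StringIO(unquoted), dtype=None, delimiter=',',
                     names=True, encoding='utf-8')
print(data['num'], data['accel'], data['sensorID'])
```

For a real file, the text would come from reading the file into a string first instead of the inline sample.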