How to use square brackets as a quote character in Pandas.read_csv

Asked: 2016-03-11 19:30:57

Tags: python csv pandas

Say I have a text file that looks like this:

Item,Date,Time,Location
1,01/01/2016,13:41,[45.2344:-78.25453]
2,01/03/2016,19:11,[43.3423:-79.23423,41.2342:-81242]
3,01/10/2016,01:27,[51.2344:-86.24432]

What I'd like to be able to do is read this in with pandas.read_csv, but the second row will throw an error. Here is the code I am currently using:

import pandas as pd
df = pd.read_csv("path/to/file.txt", sep=",", dtype=str)

I have tried setting quotechar to '[', but that obviously just eats up the lines until the next open bracket, and adding a closing bracket results in a "string of length 2 found" error. Any insight would be greatly appreciated. Thanks!
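For context (my addition, not part of the original question): with the default settings the failure is a tokenizing error, because the unquoted comma inside the brackets makes the second data row look like it has five fields instead of four. A minimal reproduction using the sample data above:

```python
import io

import pandas as pd

# Sample data from the question: the bracketed Location field on the
# second data row contains a comma, so the parser sees 5 fields, not 4.
data = """Item,Date,Time,Location
1,01/01/2016,13:41,[45.2344:-78.25453]
2,01/03/2016,19:11,[43.3423:-79.23423,41.2342:-81242]
3,01/10/2016,01:27,[51.2344:-86.24432]
"""

try:
    df = pd.read_csv(io.StringIO(data), sep=",", dtype=str)
except pd.errors.ParserError as exc:
    # Tokenizing error: the row has more fields than the header.
    print(exc)
```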

UPDATE

Three primary solutions were offered: 1) give a long range of names to the data frame so all of the data is read in, then post-process the data, 2) find values in square brackets and put quotes around them, or 3) replace the first n commas with semicolons.

Overall, I don't think option 3 is a viable solution in general (though it happens to work for my data), because a) what if I have quoted values containing commas in some column, and b) what if the column with square brackets is not the last column? That leaves solutions 1 and 2. I think solution 2 is more readable, but solution 1 is more efficient, running in only 1.38 seconds compared to 3.02 seconds for solution 2. The tests were run on a text file containing 18 columns and more than 208,000 rows.
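Concern (a) can be illustrated with a hypothetical row (this example is mine, not from the original post): a naive replacement of the first n commas does not respect existing quoting, so a comma inside an already-quoted field gets clobbered:

```python
# Hypothetical row (illustration only): the Date field is quoted and
# itself contains a comma.
line = '1,"Jan 1, 2016",13:41,[45.2344:-78.25453]'

# Replacing the first 3 commas ignores the quoting and corrupts the
# quoted Date field.
print(line.replace(',', ';', 3))
# 1;"Jan 1; 2016";13:41,[45.2344:-78.25453]
```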

3 Answers:

Answer 0 (score: 1)

I think you can replace the first 3 occurrences of , in each line of the file with ; and then use the parameter sep=";" in read_csv:

import pandas as pd
import io

with open('file2.csv', 'r') as f:
    lines = f.readlines()

fo = io.StringIO()
fo.writelines(u"" + line.replace(',', ';', 3) for line in lines)
fo.seek(0)

df = pd.read_csv(fo, sep=';')
print df
  Item        Date   Time                            Location
0    1  01/01/2016  13:41                 [45.2344:-78.25453]
1    2  01/03/2016  19:11  [43.3423:-79.23423,41.2342:-81242]
2    3  01/10/2016  01:27                 [51.2344:-86.24432]
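(Editor's note: the snippet above is Python 2. A Python 3 sketch of the same comma-to-semicolon idea, assuming exactly three columns precede Location, would be:)

```python
import io

import pandas as pd

data = """Item,Date,Time,Location
1,01/01/2016,13:41,[45.2344:-78.25453]
2,01/03/2016,19:11,[43.3423:-79.23423,41.2342:-81242]
3,01/10/2016,01:27,[51.2344:-86.24432]
"""

# Replace only the first 3 commas of each line, leaving commas
# inside the bracketed Location field untouched.
fixed = "\n".join(line.replace(',', ';', 3) for line in data.splitlines())

df = pd.read_csv(io.StringIO(fixed), sep=';')
print(df)
```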

Or you can try this more complicated approach, because the main problem is that the separator between the values in the lists is the same as the separator between the other column values:

import pandas as pd
import io

temp=u"""Item,Date,Time,Location
1,01/01/2016,13:41,[45.2344:-78.25453]
2,01/03/2016,19:11,[43.3423:-79.23423,41.2342:-81242,41.2342:-81242]
3,01/10/2016,01:27,[51.2344:-86.24432]"""

#after testing replace io.StringIO(temp) to filename
#estimated max number of columns
df = pd.read_csv(io.StringIO(temp), names=range(10))
print df
      0           1      2                    3               4  \
0  Item        Date   Time             Location             NaN
1     1  01/01/2016  13:41  [45.2344:-78.25453]             NaN
2     2  01/03/2016  19:11   [43.3423:-79.23423  41.2342:-81242
3     3  01/10/2016  01:27  [51.2344:-86.24432]             NaN

                 5   6   7   8   9
0              NaN NaN NaN NaN NaN
1              NaN NaN NaN NaN NaN
2  41.2342:-81242] NaN NaN NaN NaN
3              NaN NaN NaN NaN NaN

So you need post-processing:

#remove column with all NaN
df = df.dropna(how='all', axis=1)
#first row get as columns names
df.columns = df.iloc[0,:]
#remove first row
df = df[1:]
#remove columns name
df.columns.name = None

#get position of column Location
print df.columns.get_loc('Location')
3
#df1 with Location values
df1 = df.iloc[:, df.columns.get_loc('Location'): ]
print df1
              Location             NaN              NaN
1  [45.2344:-78.25453]             NaN              NaN
2   [43.3423:-79.23423  41.2342:-81242  41.2342:-81242]
3  [51.2344:-86.24432]             NaN              NaN

#combine values to one column
df['Location'] = df1.apply( lambda x : ', '.join([e for e in x if isinstance(e, basestring)]), axis=1)

#subset of desired columns
print df[['Item','Date','Time','Location']]
  Item        Date   Time                                           Location
1    1  01/01/2016  13:41                                [45.2344:-78.25453]
2    2  01/03/2016  19:11  [43.3423:-79.23423, 41.2342:-81242, 41.2342:-8...
3    3  01/10/2016  01:27                                [51.2344:-86.24432]

Answer 1 (score: 1)

I can't think of a way to trick the CSV parser into accepting distinct open/close quote characters, but you can get away with a fairly simple preprocessing step:


import pandas as pd
import io
import re

# regular expression to capture contents of balanced brackets
location_regex = re.compile(r'\[([^\[\]]+)\]')

with open('path/to/file.txt', 'r') as fi:
    # replace brackets with quotes, pipe into file-like object
    fo = io.StringIO()
    fo.writelines(unicode(re.sub(location_regex, r'"\1"', line)) for line in fi)

    # rewind file to the beginning
    fo.seek(0)

# read transformed CSV into data frame
df = pd.read_csv(fo)
print df

This gives you a result like this:
            Date_Time  Item                             Location
0 2016-01-01 13:41:00     1                  [45.2344:-78.25453]
1 2016-01-03 19:11:00     2  [43.3423:-79.23423, 41.2342:-81242]
2 2016-01-10 01:27:00     3                  [51.2344:-86.24432]

Edit: If memory is not an issue, then you are better off preprocessing the data in bulk rather than line by line, as is done in Max's answer.

# regular expression to capture contents of balanced brackets
location_regex = re.compile(r'\[([^\[\]]+)\]', flags=re.M)

with open('path/to/file.csv', 'r') as fi:
    data = unicode(re.sub(location_regex, r'"\1"', fi.read()))

df = pd.read_csv(io.StringIO(data))

If you know in advance that the only brackets in the document are the ones surrounding the location coordinates, and that they are guaranteed to be balanced, then you can simplify it even further (Max suggests a line-by-line version of this, but I think the iteration is unnecessary):

with open('/path/to/file.csv', 'r') as fi:
    data = unicode(fi.read().replace('[', '"').replace(']', '"'))

df = pd.read_csv(io.StringIO(data))

Here are the timing results I got with a 200k-row, 3-column dataset, averaged over 10 trials each:

  • Data frame post-processing (jezrael's solution): 2.19 s
  • Line-by-line regex: 1.36 s
  • Bulk regex: 0.39 s
  • Bulk string replace: 0.14 s
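A minimal harness to reproduce this kind of comparison (editor's sketch; the dataset size and row contents here are synthetic, not the answerer's actual benchmark data):

```python
import io
import re
import timeit

import pandas as pd

# Synthetic dataset: one header plus many copies of a row whose
# bracketed field contains an embedded comma.
header = "Item,Date,Time,Location\n"
row = "2,01/03/2016,19:11,[43.3423:-79.23423,41.2342:-81242]\n"
data = header + row * 10000

bracket_re = re.compile(r'\[([^\[\]]+)\]')

def bulk_regex():
    # Quote the contents of each bracketed field via regex, then parse.
    return pd.read_csv(io.StringIO(bracket_re.sub(r'"\1"', data)))

def bulk_replace():
    # Turn each bracket into a double quote, then parse.
    return pd.read_csv(io.StringIO(data.replace('[', '"').replace(']', '"')))

# Both strategies yield the same frame; replace is typically faster.
assert bulk_regex().equals(bulk_replace())
print(timeit.timeit(bulk_regex, number=3))
print(timeit.timeit(bulk_replace, number=3))
```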

Answer 2 (score: 1)

We can use a simple trick: quote the balanced square brackets with double quotes:

import re
import six
import pandas as pd


data = """\
Item,Date,Time,Location,junk
1,01/01/2016,13:41,[45.2344:-78.25453],[aaaa,bbb]
2,01/03/2016,19:11,[43.3423:-79.23423,41.2342:-81242],[0,1,2,3]
3,01/10/2016,01:27,[51.2344:-86.24432],[12,13]
4,01/30/2016,05:55,[51.2344:-86.24432,41.2342:-81242,55.5555:-81242],[45,55,65]"""

print('{0:-^70}'.format('original data'))
print(data)
data = re.sub(r'(\[[^\]]*\])', r'"\1"', data, flags=re.M)
print('{0:-^70}'.format('quoted data'))
print(data)
df = pd.read_csv(six.StringIO(data))
print('{0:-^70}'.format('data frame'))

pd.set_option('display.expand_frame_repr', False)
print(df)

Output:

----------------------------original data-----------------------------
Item,Date,Time,Location,junk
1,01/01/2016,13:41,[45.2344:-78.25453],[aaaa,bbb]
2,01/03/2016,19:11,[43.3423:-79.23423,41.2342:-81242],[0,1,2,3]
3,01/10/2016,01:27,[51.2344:-86.24432],[12,13]
4,01/30/2016,05:55,[51.2344:-86.24432,41.2342:-81242,55.5555:-81242],[45,55,65]
-----------------------------quoted data------------------------------
Item,Date,Time,Location,junk
1,01/01/2016,13:41,"[45.2344:-78.25453]","[aaaa,bbb]"
2,01/03/2016,19:11,"[43.3423:-79.23423,41.2342:-81242]","[0,1,2,3]"
3,01/10/2016,01:27,"[51.2344:-86.24432]","[12,13]"
4,01/30/2016,05:55,"[51.2344:-86.24432,41.2342:-81242,55.5555:-81242]","[45,55,65]"
------------------------------data frame------------------------------
   Item        Date   Time                                           Location        junk
0     1  01/01/2016  13:41                                [45.2344:-78.25453]  [aaaa,bbb]
1     2  01/03/2016  19:11                 [43.3423:-79.23423,41.2342:-81242]   [0,1,2,3]
2     3  01/10/2016  01:27                                [51.2344:-86.24432]     [12,13]
3     4  01/30/2016  05:55  [51.2344:-86.24432,41.2342:-81242,55.5555:-81242]  [45,55,65]

UPDATE: If you are sure that all square brackets are balanced, we don't have to use RegEx:

import io
import pandas as pd

with open('35948417.csv', 'r') as f:
    fo = io.StringIO()
    data = f.readlines()
    fo.writelines(line.replace('[', '"[').replace(']', ']"') for line in data)
    fo.seek(0)

df = pd.read_csv(fo)
print(df)
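(Editor's note: whichever approach is used, the Location column is still a plain string after parsing. If numeric coordinate pairs are the eventual goal, which is an assumption on my part, a small post-processing helper such as the hypothetical parse_location below can split them out:)

```python
import pandas as pd

def parse_location(cell):
    """Turn a '[lat:lon,lat:lon]' string into a list of (float, float) tuples."""
    pairs = cell.strip('[]').split(',')
    return [tuple(float(v) for v in pair.split(':')) for pair in pairs]

s = pd.Series(["[45.2344:-78.25453]",
               "[43.3423:-79.23423,41.2342:-81242]"])
print(s.map(parse_location))
```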