Ignoring backslashes when reading a TSV file in Python

Date: 2016-04-20 08:11:28

Tags: python csv python-3.x pandas dataframe

I have a large sep="|" TSV whose address field contains a bunch of values like the following:

...xxx|yyy|Level 1 2 xxx Street\(MYCompany)|...

which ends up as:

line1)  ...xxx|yyy|Level 1 2 xxx Street\
line2)  (MYCompany)|...

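I tried running read_table in pandas with quote = 2 to convert non-numeric values to strings, but it still treats the backslash as a new line. What is an efficient way to ignore rows whose field values contain a backslash escaped to a new line? Is there a way to ignore the new line for \?

Ideally it would prepare the data file so that it can be read into a dataframe in pandas.

Update: showing 5 lines, with the breakage on the 3rd line.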

1788768|1831171|208434489|2014-08-14 13:40:02|108|c||Desktop|coupon|49  XXX  Ave|Australia|Victoria|3025|Melbourne
1788772|1831177|202234489|2014-08-14 13:41:37|108|c||iOS||u7  38-46 South Street|Australia|New South Wales|2116|Sydney
1788776|1831182|205234489|2014-08-14 13:42:41|108|c||Desktop||Level XXX Margaret Street\
(My Company)|Australia|New South Wales|2000|Sydney|Sydney
1788780|1831186|202634489|2014-08-14 13:43:46|108|c||Desktop||Po box ZZZ|Australia|New South Wales|2444|NSW Other|Port Macquarie
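
A quick way to locate such breaks before loading the file can be sketched with the standard library alone. The helper name find_continued_lines is hypothetical, and the sketch assumes a break always shows up as a line whose last character is a backslash, as in the sample above:

```python
import io

# Sample copied from the question; the raw string keeps the backslash
# and the newline that follows it, as they would appear in the file.
sample = r"""1788768|1831171|208434489|2014-08-14 13:40:02|108|c||Desktop|coupon|49  XXX  Ave|Australia|Victoria|3025|Melbourne
1788772|1831177|202234489|2014-08-14 13:41:37|108|c||iOS||u7  38-46 South Street|Australia|New South Wales|2116|Sydney
1788776|1831182|205234489|2014-08-14 13:42:41|108|c||Desktop||Level XXX Margaret Street\
(My Company)|Australia|New South Wales|2000|Sydney|Sydney
1788780|1831186|202634489|2014-08-14 13:43:46|108|c||Desktop||Po box ZZZ|Australia|New South Wales|2444|NSW Other|Port Macquarie"""


def find_continued_lines(lines):
    """Return the 1-based numbers of lines that end with a backslash."""
    return [
        number
        for number, line in enumerate(lines, start=1)
        if line.rstrip("\r\n").endswith("\\")
    ]


# For a real file, pass an open file object instead of io.StringIO(sample).
broken = find_continued_lines(io.StringIO(sample))
print(broken)  # [3]
```
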

2 answers:

Answer 0: (score: 0)

I think you can first try read_csv with a sep that does NOT occur in the values; it seems to read the data correctly:

import pandas as pd
import io

temp=u"""
49  XXX  Ave|Australia
u7  38-46 South Street|Australia
XXX Margaret Street\
New South Wales|Australia
Po box ZZZ|Australia"""
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), sep="^", header=None)
print(df)
                                              0
0                        49  XXX  Ave|Australia
1              u7  38-46 South Street|Australia
2  XXX Margaret StreetNew South Wales|Australia
3                          Po box ZZZ|Australia
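
One caveat about the demo above, offered as a side note rather than as part of the answer: in a regular (non-raw) Python string literal, a backslash immediately followed by a newline is a line continuation, so the two lines inside temp are already joined before pandas sees them. A minimal sketch, using only the standard library, of how a raw string keeps the characters a file on disk would contain:

```python
# Regular literal: Python drops the backslash and the newline after it.
plain = """XXX Margaret Street\
New South Wales|Australia"""

# Raw literal: the backslash and the newline are both kept,
# which matches what the file on disk would contain.
raw = r"""XXX Margaret Street\
New South Wales|Australia"""

print(len(plain.splitlines()))  # 1
print(len(raw.splitlines()))    # 2
```

When testing against an in-memory string rather than the real file, a raw literal is therefore the closer stand-in for the file content.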

Then you can create a new file with to_csv and read it back with read_csv using sep="|":

df.to_csv('myfile.csv', header=False, index=False)
print(pd.read_csv('myfile.csv', sep="|", header=None))
                                    0          1
0                        49  XXX  Ave  Australia
1              u7  38-46 South Street  Australia
2  XXX Margaret StreetNew South Wales  Australia
3                          Po box ZZZ  Australia

The next solution does not create a new file; it writes to the variable output and then calls read_csv on io.StringIO:

import pandas as pd
import io

temp=u"""
49  XXX  Ave|Australia
u7  38-46 South Street|Australia
XXX Margaret Street\
New South Wales|Australia
Po box ZZZ|Australia"""
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), sep=";", header=None)
print(df)
                                              0
0                        49  XXX  Ave|Australia
1              u7  38-46 South Street|Australia
2  XXX Margaret StreetNew South Wales|Australia
3                          Po box ZZZ|Australia

output = df.to_csv(header=False, index=False)
print(output)
49  XXX  Ave|Australia
u7  38-46 South Street|Australia
XXX Margaret StreetNew South Wales|Australia
Po box ZZZ|Australia

print(pd.read_csv(io.StringIO(u""+output), sep="|", header=None))
                                    0          1
0                        49  XXX  Ave  Australia
1              u7  38-46 South Street  Australia
2  XXX Margaret StreetNew South Wales  Australia
3                          Po box ZZZ  Australia

If I test it on your data, it looks like rows 1 and 2 contain 14 fields and the next two contain 15 fields.

So I removed the last item from both rows (3 and 4); maybe it is just a typo (I hope so):

import pandas as pd
import io

temp=u"""1788768|1831171|208434489|2014-08-14 13:40:02|108|c||Desktop|coupon|49  XXX  Ave|Australia|Victoria|3025|Melbourne
1788772|1831177|202234489|2014-08-14 13:41:37|108|c||iOS||u7  38-46 South Street|Australia|New South Wales|2116|Sydney
1788776|1831182|205234489|2014-08-14 13:42:41|108|c||Desktop||Level XXX Margaret Street\
(My Company)|Australia|New South Wales|2000|Sydney
1788780|1831186|202634489|2014-08-14 13:43:46|108|c||Desktop||Po box ZZZ|Australia|New South Wales|2444|NSW Other"""
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), sep=";", header=None)
print(df)
                                                   0
0  1788768|1831171|208434489|2014-08-14 13:40:02|...
1  1788772|1831177|202234489|2014-08-14 13:41:37|...
2  1788776|1831182|205234489|2014-08-14 13:42:41|...
3  1788780|1831186|202634489|2014-08-14 13:43:46|...
output = df.to_csv(header=False, index=False)

print(pd.read_csv(io.StringIO(u""+output), sep="|", header=None))
        0        1          2                    3    4  5   6        7   \
0  1788768  1831171  208434489  2014-08-14 13:40:02  108  c NaN  Desktop   
1  1788772  1831177  202234489  2014-08-14 13:41:37  108  c NaN      iOS   
2  1788776  1831182  205234489  2014-08-14 13:42:41  108  c NaN  Desktop   
3  1788780  1831186  202634489  2014-08-14 13:43:46  108  c NaN  Desktop   

       8                                      9          10               11  \
0  coupon                           49  XXX  Ave  Australia         Victoria   
1     NaN                 u7  38-46 South Street  Australia  New South Wales   
2     NaN  Level XXX Margaret Street(My Company)  Australia  New South Wales   
3     NaN                             Po box ZZZ  Australia  New South Wales   

     12         13  
0  3025  Melbourne  
1  2116     Sydney  
2  2000     Sydney  
3  2444  NSW Other  

But if the data is correct, add the parameter names=range(15) to read_csv:

print(pd.read_csv(io.StringIO(u""+output), sep="|", names=range(15)))
        0        1          2                    3    4  5   6        7   \
0  1788768  1831171  208434489  2014-08-14 13:40:02  108  c NaN  Desktop   
1  1788772  1831177  202234489  2014-08-14 13:41:37  108  c NaN      iOS   
2  1788776  1831182  205234489  2014-08-14 13:42:41  108  c NaN  Desktop   
3  1788780  1831186  202634489  2014-08-14 13:43:46  108  c NaN  Desktop   

       8                                      9          10               11  \
0  coupon                           49  XXX  Ave  Australia         Victoria   
1     NaN                 u7  38-46 South Street  Australia  New South Wales   
2     NaN  Level XXX Margaret Street(My Company)  Australia  New South Wales   
3     NaN                             Po box ZZZ  Australia  New South Wales   

     12         13              14
0  3025  Melbourne             NaN
1  2116     Sydney             NaN
2  2000     Sydney          Sydney
3  2444  NSW Other  Port Macquarie
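
The 14-versus-15 field counts mentioned in this answer can be checked without pandas. The following is only a sketch under the assumption that a record continues whenever a line ends with a backslash; it joins those lines and counts the |-separated fields of each record in the sample from the question:

```python
# Sample copied from the question (raw string, so the backslash survives).
raw = r"""1788768|1831171|208434489|2014-08-14 13:40:02|108|c||Desktop|coupon|49  XXX  Ave|Australia|Victoria|3025|Melbourne
1788772|1831177|202234489|2014-08-14 13:41:37|108|c||iOS||u7  38-46 South Street|Australia|New South Wales|2116|Sydney
1788776|1831182|205234489|2014-08-14 13:42:41|108|c||Desktop||Level XXX Margaret Street\
(My Company)|Australia|New South Wales|2000|Sydney|Sydney
1788780|1831186|202634489|2014-08-14 13:43:46|108|c||Desktop||Po box ZZZ|Australia|New South Wales|2444|NSW Other|Port Macquarie"""

# Drop every backslash + newline pair so each record sits on one line.
joined = raw.replace("\\\n", "")

# Count the fields of each logical record.
counts = [len(line.split("|")) for line in joined.splitlines()]
print(counts)  # [14, 14, 15, 15]
```

Because the widest record has 15 fields, passing 15 column names lets the shorter rows be padded with NaN, which is what the names=range(15) output shows.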


Answer 1: (score: 0)

Here is another solution using a regular expression:

import pandas as pd
import re
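# Read the whole file into memory
with open('input.tsv') as f:
    fl = f.read()

# Replace backslash + newline with a single backslash,
# so that each record stays on one line
fl = re.sub(r'\\\n', r'\\', fl)

# Write the repaired content to a new file
with open('input_fix.tsv', 'w') as o:
    o.write(fl)

cols = range(1, 17)
# Prime the number of columns by specifying a name for each column.
# This takes care of the variable number of columns.
# Note: read_csv is given the path of the repaired file, not its content.
df = pd.read_csv('input_fix.tsv', sep='|', names=cols)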

This produces the result shown in the original post's screenshot (image not included here).
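
Since the question mentions a large file, a streaming variant may be preferable to reading everything with f.read(). The sketch below is not from either answer: join_continued_lines is a hypothetical helper, and it assumes that a trailing backslash always means the record continues on the next line. Like the regex above, it keeps the backslash and drops only the newline:

```python
import io


def join_continued_lines(lines):
    """Yield logical lines, merging each line that ends with a backslash
    with the line that follows it (the backslash itself is kept)."""
    pending = ""
    for line in lines:
        stripped = line.rstrip("\r\n")
        if stripped.endswith("\\"):
            # Record continues on the next physical line.
            pending += stripped
        else:
            yield pending + stripped
            pending = ""
    if pending:
        # The input ended on a continued line; emit what was collected.
        yield pending


# Tiny stand-in for an open file; iterate over open('input.tsv') for real data.
sample = io.StringIO("a|b\\\nc|d\ne|f\n")
fixed = list(join_continued_lines(sample))
print(fixed)  # ['a|b\\c|d', 'e|f']
```

The repaired lines can then be written to a new file line by line, or, for data that fits in memory, joined with newlines and handed to pd.read_csv through io.StringIO with sep='|' and the names argument as in the answer above.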