Capturing the text between ","

Date: 2019-04-26 15:23:37

Tags: python regex csv

I have a line of text that contains commas, and I want to capture the data between the commas.

line = "",,,,,,,,,ce: appears to assume ,that\n

I am capturing with this regex pattern = (""),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*)\\n

The output is:

Output 1
1.  ""
2.  ,
3.  Empty
4.  Empty
5.  Empty
6.  Empty
7.  Empty
8.  Empty
9.  ce: appears to assume
10. that

I would like the output to be:

Output 2
1.  ""
2.  Empty
3.  Empty
4.  Empty
5.  Empty
6.  Empty
7.  Empty
8.  Empty
9.  Empty
10. ce: appears to assume, that

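Basically, I am looking for some generic greedy approach that ignores the commas ',' inside the text.

4 Answers:

Answer 0 (score: 2)

A regex seems like the wrong solution here. If you know how many matches you want (you specified 10), then you know how many commas to expect. Use str.split with a maxsplit of one less than the number of fields: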

>>> line.split(',', 9)
['""', '', '', '', '', '', '', '', '', 'ce: appears to assume ,that\n']
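As a rough sketch of the same idea, the split can be wrapped in a small helper that takes the expected number of fields. The name `split_fields` and the `n_fields` parameter are made up for illustration, and stripping the trailing newline is an assumption about what you want:

```python
def split_fields(line, n_fields):
    """Split into at most n_fields parts; the last part keeps any extra commas."""
    # maxsplit is one less than the number of fields we expect
    return line.rstrip("\n").split(",", n_fields - 1)


line = '"",,,,,,,,,ce: appears to assume ,that\n'
fields = split_fields(line, 10)

# Print in the numbered style used in the question
for i, field in enumerate(fields, start=1):
    print(f"{i:>2}.  {field or 'Empty'}")
```

Because maxsplit is `n_fields - 1`, any commas after the ninth stay inside the last field.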

Answer 1 (score: 2)

You can use itertools.groupby here to filter on length:

import itertools

someline = '"",,,,,,,,ce: appears to assume ,that\n'

# Group by length greater than 0
res = [(i, ','.join(x)) for i,x in itertools.groupby(someline.split(','), key=lambda x: len(x)>0)]

# [(True, '""'), (False, ',,,,,,'), (True, 'ce: appears to assume ,that\n')]

# Then you can just gather your results
results = []
for i, x in res:
    if i is True:
        results.append(x)
    else:
        results.extend(x.split(','))

results
# ['""', '', '', '', '', '', '', '', 'ce: appears to assume ,that\n']

This means you do not have to check for a specific number of commas, in case that number is not fixed for every line.
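For reuse, the grouping logic could be wrapped in a function. This is only a sketch, and `regroup_fields` is a hypothetical name. It also shows one thing to watch for: any run of adjacent non-empty pieces is joined back together, so two neighbouring non-empty fields end up as one.

```python
import itertools


def regroup_fields(line):
    """Re-join runs of non-empty pieces; keep each empty piece as its own field."""
    results = []
    pieces = line.split(",")
    for non_empty, group in itertools.groupby(pieces, key=lambda p: len(p) > 0):
        group = list(group)
        if non_empty:
            # Consecutive non-empty pieces are treated as one field containing commas
            results.append(",".join(group))
        else:
            # Every empty piece stays a separate empty field
            results.extend(group)
    return results


print(regroup_fields('"",,,,,,,,ce: appears to assume ,that\n'))
print(regroup_fields("a,b,,c"))  # "a" and "b" are merged into one field
```

So this works when the only non-empty neighbours are the two halves of the text field, which is the case in the sample line.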

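A different format

However, I think the real problem is that the comma is both the delimiter and an element of the data, which makes the problem somewhat ambiguous. According to the docs, it looks like you can specify a different output format, such as .tsv, which separates fields with \t and so avoids the problem entirely: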

tabula.convert_into("test.pdf", "output.tsv", output_format="tsv", pages='all')

Your line would then look like this:

someline = '""\t\t\t\t\t\t\t\tce: appears to assume ,that\n'

# Much easier to handle
someline.split('\t')

# ['""', '', '', '', '', '', '', '', 'ce: appears to assume ,that\n']
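If you go the TSV route, the standard csv module can read such lines as well. A minimal sketch, assuming the file content looks like the sample line above (the in-memory `tsv_text` stands in for the real output.tsv). Note that with its default settings csv treats `""` as an empty quoted field, so the first entry comes back as '' rather than '""':

```python
import csv
import io

# Stand-in for the contents of output.tsv (assumed to match the sample line)
tsv_text = '""\t\t\t\t\t\t\t\tce: appears to assume ,that\n'

# delimiter="\t" makes the reader split on tabs instead of commas
reader = csv.reader(io.StringIO(tsv_text, newline=""), delimiter="\t")
rows = list(reader)

print(rows[0])
```

The reader also drops the trailing newline, so the comma in the last field is the only thing left to worry about, and it is no longer a delimiter.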

Answer 2 (score: 0)

I am not sure whether you need all the blanks. Maybe this is what you are looking for:

separados = line.split(',,')

for i in range(len(separados)):
    # you can add more custom filters here
    # strip a single leading comma left over from the split
    if separados[i].startswith(','):
        separados[i] = separados[i][1:]
    # strip a single trailing comma left over from the split
    if separados[i].endswith(','):
        separados[i] = separados[i][:-1]

This is what you get:

'""'
''
''
''
'ce: appears to assume ,that\n'
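If the blanks really are not needed, a shorter route might be to keep only the first field and the trailing text. This is just a sketch, and it assumes that everything between the first field and the text is empty, as in the sample line:

```python
line = '"",,,,,,,,,ce: appears to assume ,that\n'

# Split off the first field at the first comma
first, _, rest = line.partition(",")

# Drop the run of empty fields (leading commas) and the trailing newline
text = rest.lstrip(",").rstrip("\n")

print([first, text])
```

Only leading commas are stripped from the remainder, so the comma inside the text is left alone.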

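Answer 3 (score: 0)

The problem is that .* matches too many characters, including commas. You should create groups that match any character except a comma, for example: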

^(""),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),(.*)$

The last group is allowed to match commas, so it can match the comma in ce: appears to assume ,that

#!/usr/bin/env python

import re

reg = re.compile('^(""),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),(.*)$')

match = reg.match('"",,,,,,,,,ce: appears to assume ,that\n')

for i in range(1,11):
    print('{:>2s}.  {}'.format(str(i),"Empty" if len(match.group(i))==0 else match.group(i)))

which gives the desired output:

 1.  ""
 2.  Empty
 3.  Empty
 4.  Empty
 5.  Empty
 6.  Empty
 7.  Empty
 8.  Empty
 9.  Empty
10.  ce: appears to assume ,that
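To make the difference between the two patterns concrete, here is a small sketch that runs the greedy pattern from the question next to a pattern built for an arbitrary number of fields. `build_pattern` and `n_fields` are hypothetical names, the first field is matched by a generic comma-free group rather than a literal "", and the trailing \\n from the question's pattern is left out:

```python
import re


def build_pattern(n_fields):
    """n_fields - 1 comma-free groups, then one last group that may contain commas."""
    head = ",".join(["([^,]*)"] * (n_fields - 1))
    return re.compile("^" + head + ",(.*)$")


line = '"",,,,,,,,,ce: appears to assume ,that\n'

# Pattern from the question: every group is a greedy .*
greedy = re.match(r'(""),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*)', line)

# Comma-free groups everywhere except the last field
strict = build_pattern(10).match(line)

print(greedy.groups())
print(strict.groups())
```

In the greedy version the second group swallows a comma, which shifts everything after it. With the comma-free groups, every extra comma can only end up in the last field.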