Finding separators/delimiters in lists of strings

Time: 2016-07-11 23:02:25

Tags: python

I am trying to find separators in a file that may or may not have separators, and what those separators are - if any - is also not known.

So far I have written the following code in an attempt to "solve" this:

strings = [
    'cabhb2k4ack_sfdfd~ffrref_lk',
    'iodja_24ed~092oi3jelk_fcjcad',
    'lkn04432m_90osidjlknxc~o_pf'
]

# Process first line
line1 = strings[0]
separators = set()
for sep in set(line1):
    separators.add(( sep, line1.count(sep) ))

# Process all the others
for line in strings:
    for sep,sepcount in separators.copy():
        if line.count(sep) != sepcount: separators.remove( (sep,sepcount) )

print separators

It returns the set: set([('_', 2), ('~', 1)]), which is good, but unfortunately it does not contain the order of the separators in the file. In fact, it's not even known whether these separators occur in a consistent order.

The rules for separators are simple:

  1. They must occur the same number of times per line,
  2. They must occur in the same order on each line,
  3. None of the non-separator characters can be separator characters.

Note that in the example above, '4' is excluded as a separator because it occurs twice in the third string but only once in the other two, which violates rules 1 and 3.
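
To make the rule 1 check concrete, here is a minimal sketch that tabulates how often a character occurs on each line of the example. The helper name per_line_counts is made up for illustration and is not part of the code above.

```python
strings = [
    'cabhb2k4ack_sfdfd~ffrref_lk',
    'iodja_24ed~092oi3jelk_fcjcad',
    'lkn04432m_90osidjlknxc~o_pf'
]

def per_line_counts(lines, char):
    """Return how many times `char` occurs on each line (hypothetical helper)."""
    return [line.count(char) for line in lines]

for char in "_~4":
    counts = per_line_counts(strings, char)
    # A character passes rule 1 only if every line has the same count.
    verdict = "passes rule 1" if len(set(counts)) == 1 else "fails rule 1"
    print(repr(char), counts, verdict)
```

Here '_' and '~' give the same count on every line, while '4' gives [1, 1, 2] and is dropped.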

The Question
How can I modify this code to check rule 2 and correctly print the order of the separators?

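2 answers:

Answer 0: (score: 1)

I'd use a Counter instead of .count, take skrrgwasme's suggestion to use a list, and use itertools.combinations to help iterate over the subsets of possible separators: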

from collections import Counter
from itertools import combinations

def subsets(elems):
    for width in range(1, len(elems)+1):
        for comb in combinations(elems, width):
            yield comb

def sep_order(string, chars):
    chars = set(chars)
    order = tuple(c for c in string if c in chars)
    return order

def find_viable_separators(strings):
    counts = [Counter(s) for s in strings]
    chars = {c for c in counts[0]
             if all(count[c]==counts[0][c] for count in counts)}
    for seps in subsets(chars):
        orders = {sep_order(s, seps) for s in strings}
        if len(orders) == 1:
            yield seps, next(iter(orders))

which gives me

>>> 
... strings = [
...     'cabhb2k4ack_sfdfd~ffrref_lk',
...     'iodja_24ed~092oi3jelk_fcjcad',
...     'lkn04432m_90osidjlknxc~o_pf'
... ]
... 
... for seps, order in find_viable_separators(strings):
...     print("possible separators:", seps, "with order:", order)
...             
possible separators: ('~',) with order: ('~',)
possible separators: ('_',) with order: ('_', '_')
possible separators: ('~', '_') with order: ('_', '~', '_')
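
If only the largest consistent set is wanted, one option is to try the widest subsets first and stop at the first hit. The sketch below is an assumption about how that could look, not part of the answer above. The name largest_separator_set is hypothetical, and the candidates are sorted only to make the result reproducible.

```python
from collections import Counter
from itertools import combinations

def largest_separator_set(strings):
    """Return (separators, order) for the widest consistent set, or None."""
    if not strings:
        return None
    counts = [Counter(s) for s in strings]
    # Rule 1: keep characters whose count is identical on every line.
    candidates = sorted(c for c in counts[0]
                        if all(cnt[c] == counts[0][c] for cnt in counts))
    # Try the widest subsets first; still exponential in the worst case.
    for width in range(len(candidates), 0, -1):
        for seps in combinations(candidates, width):
            chosen = set(seps)
            # Rule 2: the separators must appear in the same order on every line.
            orders = {tuple(c for c in s if c in chosen) for s in strings}
            if len(orders) == 1:
                return seps, orders.pop()
    return None

strings = [
    'cabhb2k4ack_sfdfd~ffrref_lk',
    'iodja_24ed~092oi3jelk_fcjcad',
    'lkn04432m_90osidjlknxc~o_pf'
]
print(largest_separator_set(strings))
```

For the example strings this returns the separators ('_', '~') together with the order ('_', '~', '_').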

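Answer 1: (score: 1)

According to rule 1, each separator has a number of occurrences per line that is stable from the first line to the last.

I don't find rule 3 well expressed. I think it must be understood as: "no character used as a separator may also be found among the other characters of a line, which are considered non-separators".

So, according to rules 1 and 3, any character whose number of occurrences per line changes even once between two successive lines cannot be a separator.

The principle of the following code is therefore:
· first create a list sep_n of all the characters occurring in the first line, each paired with its number of occurrences in that line
· then iterate over the list of lines S and eliminate from sep_n every character whose number of occurrences does not stay the same.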

S = [
    'cabhb2k4ack_sfdfd~ffrref_lk',
    'iodja_24ed~092oi3jelk_fcjcad',
    'lkn04432m_90osidjlknxc~o_pf',
    'hgtr5v_8mgojnb5+87rt~lhiuhfj_n547'
    ]
# 1.They must occur the same number of times per line, 
line0 = S.pop(0)
sep_n = [ (c,line0.count(c)) for c in line0]
print(line0); print(sep_n,'\n')

for line in S:
    sep_n = [x for x in sep_n if line.count(x[0]) == x[1]]
    print(line); print(sep_n,'\n')

S.insert(0, line0)

# 2.They must occur in the same order on each line,
separators_in_order = [x[0] for x in sep_n]
print('separators_in_order : ',separators_in_order)
separators          = ''.join(set(separators_in_order))

for i,line in enumerate(S):
    if [c for c in line if c in separators] != separators_in_order:
        print(i,line)

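If the characters in the lines (other than the separators) vary enough in their number of occurrences, the length of sep_n in my code decreases rapidly as the list is iterated.

The instruction sep_n = [ (c,line0.count(c)) for c in line0] is the reason why the final order obtained in separators_in_order is the order found in the first line of list S.

But I can't think of a way to test, during the iteration, that the order of the separators stays the same from one line to the next. In fact, it seems to me that such a test is impossible during the iteration, because the complete list of separators is only known once the iteration has fully run.

That is why a secondary check must be performed after the value of sep_n has been obtained. It requires iterating over the list S a second time. The problem is that, although any character whose number of occurrences per line changes even once between two successive lines cannot be a separator, a non-separator character may happen to occur exactly the same number of times in every line, so it cannot be detected as a non-separator on the basis of its occurrence count alone.
However, since such a non-separator character will not always sit at the same position in the list of characters with stable occurrence counts, a secondary verification is possible.

Finally, one extreme case remains: a non-separator character that occurs exactly the same number of times in every line and sits at the same position among the separators in every line, so that even the secondary verification cannot detect it;
I don't know how to solve that case...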
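
The corner case can be reproduced with a tiny made-up input. In the sketch below, the data and the helper names stable_chars and order_of are invented for illustration. The character 'x' is meant to be ordinary data, yet it passes both the count check and the order check, so nothing in the input distinguishes it from a real separator.

```python
# Made-up data: 'x' is intended as ordinary content, not as a separator.
lines = ['ab_x~cd', 'ef_x~gh', 'ij_x~kl']

def stable_chars(lines):
    """Characters of the first line whose count is the same on every line."""
    first = lines[0]
    return {c for c in set(first)
            if all(line.count(c) == first.count(c) for line in lines)}

def order_of(line, chars):
    """Sequence of the characters from `chars` as they appear in `line`."""
    return [c for c in line if c in chars]

candidates = stable_chars(lines)
orders = [order_of(line, candidates) for line in lines]
print(sorted(candidates))
print(orders)
```

Every line yields the same sequence containing 'x', so resolving this case would need information from outside the data, for example a list of characters that are allowed to be separators.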

The result for the list S above is

cabhb2k4ack_sfdfd~ffrref_lk
[('c', 2), ('a', 2), ('b', 2), ('h', 1), ('b', 2), ('2', 1), ('k', 3), ('4', 1), ('a', 2), ('c', 2), ('k', 3), ('_', 2), ('s', 1), ('f', 5), ('d', 2), ('f', 5), ('d', 2), ('~', 1), ('f', 5), ('f', 5), ('r', 2), ('r', 2), ('e', 1), ('f', 5), ('_', 2), ('l', 1), ('k', 3)] 

iodja_24ed~092oi3jelk_fcjcad
[('c', 2), ('a', 2), ('4', 1), ('a', 2), ('c', 2), ('_', 2), ('~', 1), ('_', 2), ('l', 1)] 

lkn04432m_90osidjlknxc~o_pf
[('_', 2), ('~', 1), ('_', 2)] 

hgtr5v_8mgojnb5+87rt~lhiuhfj_n547
[('_', 2), ('~', 1), ('_', 2)] 

separators_in_order :  ['_', '~', '_']

And with

S = [
    'cabhb2k4ack_sfd#fd~ffrref_lk',
    'iodja_24ed~092oi#3jelk_fcjcad',
    'lkn04432m_90osi#djlknxc~o_pf',
    'h#gtr5v_8mgojnb5+87rt~lhiuhfj_n547'
    ]

the result is

cabhb2k4ack_sfd#fd~ffrref_lk
[('c', 2), ('a', 2), ('b', 2), ('h', 1), ('b', 2), ('2', 1), ('k', 3), ('4', 1), ('a', 2), ('c', 2), ('k', 3), ('_', 2), ('s', 1), ('f', 5), ('d', 2), ('#', 1), ('f', 5), ('d', 2), ('~', 1), ('f', 5), ('f', 5), ('r', 2), ('r', 2), ('e', 1), ('f', 5), ('_', 2), ('l', 1), ('k', 3)] 

iodja_24ed~092oi#3jelk_fcjcad
[('c', 2), ('a', 2), ('4', 1), ('a', 2), ('c', 2), ('_', 2), ('#', 1), ('~', 1), ('_', 2), ('l', 1)] 

lkn04432m_90osi#djlknxc~o_pf
[('_', 2), ('#', 1), ('~', 1), ('_', 2)] 

h#gtr5v_8mgojnb5+87rt~lhiuhfj_n547
[('_', 2), ('#', 1), ('~', 1), ('_', 2)] 

separators_in_order :  ['_', '#', '~', '_']
1 iodja_24ed~092oi#3jelk_fcjcad
3 h#gtr5v_8mgojnb5+87rt~lhiuhfj_n547
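
The two flagged lines show that the combined order is inconsistent, but not which candidate is to blame. One possible follow-up, sketched here as an assumption rather than as part of the answer, is to drop each candidate in turn and see whether the remaining ones then appear in the same order on every line. The function names are hypothetical.

```python
S = [
    'cabhb2k4ack_sfd#fd~ffrref_lk',
    'iodja_24ed~092oi#3jelk_fcjcad',
    'lkn04432m_90osi#djlknxc~o_pf',
    'h#gtr5v_8mgojnb5+87rt~lhiuhfj_n547'
]

def order_of(line, chars):
    """Sequence of the characters from `chars` as they appear in `line`."""
    return tuple(c for c in line if c in chars)

def removable_culprits(lines, candidates):
    """Candidates whose removal alone makes the order identical on all lines."""
    culprits = []
    for c in candidates:
        rest = set(candidates) - {c}
        if len({order_of(line, rest) for line in lines}) == 1:
            culprits.append(c)
    return culprits

# The distinct candidates that survived the count check above.
print(removable_culprits(S, ['_', '#', '~']))
```

For this data only '#' is reported: without it, every line reads '_', '~', '_'. Removing one candidate at a time is a simplification; several misplaced characters would need a search over subsets.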


NB 1
The instruction line0 = S.pop(0) is used to avoid the instruction for line in S[1:]:
because S[1:] creates a new list, which may be costly.
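
Another way to skip the first line without copying or mutating S is itertools.islice from the standard library. This is a suggested alternative, not what the code above does, and the list contents here are placeholders.

```python
from itertools import islice

S = ['first line', 'second line', 'third line']  # placeholder data

line0 = S[0]
# islice(S, 1, None) lazily yields S[1], S[2], ... without building a new list.
rest = list(islice(S, 1, None))  # materialised here only to display it
print(line0)
print(rest)
```

With this form S stays intact, so the later S.insert(0, line0) would no longer be needed.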

NB 2
To avoid creating a new sep_n list at every turn of the iteration over S, it is better to write the iteration as follows:

for line in S:
    for x in sep_n:
        # Rebuild sep_n only if some candidate's count differs on this line
        if line.count(x[0]) != x[1]:
            sep_n = [y for y in sep_n if line.count(y[0]) == y[1]]
            break
    print(line); print(sep_n,'\n')