How do I split text/characters and rank the pieces?

Time: 2015-02-19 12:44:18

Tags: python

This is probably easy, but as a beginner it does not seem trivial to me.

I have text like this, or a file containing this text:

'fdhdhjduvduvfbvhufbvufvhifbusdbjhkbueigvuerafvguavgugvg'

How can I use Python to split the text like this:

'fdh dhj duv duv fbv huf bvu fvh ifb usd bjh kbu eig vue raf vgu avg ugvg'
'f dhd hjd uvd uvf bvh ufb vuf vhi fbu sdb jhk bue igv uer afv gua vgu gvg'
'fd hdh jdu vdu vfb vhu fbv ufv hif bus dbj hkb uei gvu era fvg uav gug vg'

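Then I need to count the frequency of each three-character sequence (for example, how many times 'fdh' occurs) and rank all sequences from most to least frequent.

I saw the answers here: What is the most "pythonic" way to iterate over a list in chunks?

But I don't know which of them fits my case. I also need to read the text from one file and write the result to another file. Please give me your expert advice.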

Edit:

with open(fasta, 'r') as fin, open(outfile, 'w') as fout:
        for item in Counter(s[i:i+4] for i in range(len(fin))).most_common():
            fout.write(item)

This gives me an error.

2 answers:

Answer 0: (score: 0)

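Use a regular expression to split the string into chunks of 3, then use a dict comprehension to build a dict that counts the occurrences of each chunk.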

import re

chunked = re.findall('...', your_string)
result = {key: chunked.count(key) for key in set(chunked)}
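Note that re.findall('...', your_string) only yields the first of the three splits shown in the question and drops any trailing characters that do not fill a whole chunk. As a rough sketch of producing all three splits with plain slicing (the helper name split_frames and its size parameter are made up here for illustration, they are not from the question or from any library):

```python
def split_frames(text, size=3):
    """Return one list of chunks per starting offset (0 .. size-1)."""
    frames = []
    for offset in range(size):
        # Leading partial chunk, e.g. 'f' or 'fd', when the offset is non-zero.
        head = [text[:offset]] if offset else []
        # Fixed-size chunks from the offset onward; the last one may be shorter.
        body = [text[i:i + size] for i in range(offset, len(text), size)]
        frames.append(head + body)
    return frames


s = 'fdhdhjduvduvfbvhufbvufvhifbusdbjhkbueigvuerafvguavgugvg'
frames = split_frames(s)
for frame in frames:
    print(' '.join(frame))
```

One difference from the question: for offset 0 this sketch keeps the leftover character as its own short chunk, whereas the question's first line attaches it to the last chunk ('ugvg').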

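Edit: to do the chunking without a regular expression, and to capture the different ways of splitting the string into chunks of 3, use a list comprehension: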

chunked = [your_string[i:i+3] for i in range(len(your_string))]

It is not elegant, but to handle the 'f' and 'fd' cases you can simply append them to the end of chunked:

chunked = [your_string[i:i+3] for i in range(len(your_string))] + [your_string[:1], your_string[:2]]

Then apply the dict comprehension as before:

result = {key: chunked.count(key) for key in set(chunked)}

Result:

{'afv': 1,
'avg': 1,
'bjh': 1,
'bue': 1,
'bus': 1,
'bvh': 1,
'bvu': 1,
'dbj': 1,
'dhd': 1,
'dhj': 1,
'duv': 2,
'eig': 1,
'era': 1,
'f': 1,
'fbu': 1,
'fbv': 2,
'fd': 1,
'fdh': 1,
'fvg': 1,
'fvh': 1,
'g': 1,
'gua': 1,
'gug': 1,
'gvg': 1,
'gvu': 1,
'hdh': 1,
'hif': 1,
'hjd': 1,
'hkb': 1,
'huf': 1,
'ifb': 1,
'igv': 1,
'jdu': 1,
'jhk': 1,
'kbu': 1,
'raf': 1,
'sdb': 1,
'uav': 1,
'uei': 1,
'uer': 1,
'ufb': 1,
'ufv': 1,
'ugv': 1,
'usd': 1,
'uvd': 1,
'uvf': 1,
'vdu': 1,
'vfb': 1,
'vg': 1,
'vgu': 2,
'vhi': 1,
'vhu': 1,
'vue': 1,
'vuf': 1}
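The dict above holds the counts but is not ranked, and the question also asks for a ranking. A minimal sketch of one way to sort it, most frequent first with ties broken alphabetically (the tie-breaking rule and the variable name ranked are assumptions, the question does not specify either):

```python
s = 'fdhdhjduvduvfbvhufbvufvhifbusdbjhkbueigvuerafvguavgugvg'

# Sliding window of width 3; the last two chunks are shorter than 3.
chunked = [s[i:i + 3] for i in range(len(s))]
result = {key: chunked.count(key) for key in set(chunked)}

# Sort by descending count, then by chunk text so the order is stable.
ranked = sorted(result.items(), key=lambda pair: (-pair[1], pair[0]))

for chunk, count in ranked[:5]:
    print(chunk, count)
```

Calling chunked.count once per distinct chunk rescans the whole list each time, which is fine for a string this short but may get slow on large inputs.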

Answer 1: (score: 0)

>>> from collections import Counter
>>> s = 'fdhdhjduvduvfbvhufbvufvhifbusdbjhkbueigvuerafvguavgugvg'
>>> for item in Counter(s[i:i+3] for i in range(len(s))).most_common():
...     print(item)
... 
('fbv', 2)
('vgu', 2)
('duv', 2)
('raf', 1)
('fbu', 1)
('dbj', 1)
('uei', 1)
('bvu', 1)
('vg', 1)
('bjh', 1)
('hjd', 1)
('bvh', 1)
('uvd', 1)
('ugv', 1)
('uvf', 1)
('kbu', 1)
('igv', 1)
('usd', 1)
('dhj', 1)
('fvh', 1)
('fvg', 1)
('dhd', 1)
('gvg', 1)
('afv', 1)
('uer', 1)
('gvu', 1)
('huf', 1)
('eig', 1)
('bus', 1)
('ufb', 1)
('avg', 1)
('sdb', 1)
('hif', 1)
('hkb', 1)
('gug', 1)
('uav', 1)
('ufv', 1)
('bue', 1)
('vuf', 1)
('gua', 1)
('vue', 1)
('vdu', 1)
('g', 1)
('vhu', 1)
('fdh', 1)
('jhk', 1)
('vfb', 1)
('vhi', 1)
('era', 1)
('ifb', 1)
('jdu', 1)
('hdh', 1)
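For the file-to-file part of the question, the same Counter idea could be wrapped roughly as below. The snippet in the question's edit fails because len() is called on the file object rather than on its contents, and because write() is given a tuple instead of a string. This is only a sketch: the function names count_chunks and write_ranking, the tab-separated output format, and the choice to strip whitespace and skip chunks shorter than size are all assumptions, and no special handling of FASTA header lines is attempted.

```python
from collections import Counter


def count_chunks(text, size=3):
    """Count every overlapping chunk of exactly `size` characters."""
    # Drop newlines and other whitespace so chunks do not span line breaks.
    text = ''.join(text.split())
    return Counter(text[i:i + size] for i in range(len(text) - size + 1))


def write_ranking(in_path, out_path, size=3):
    """Read text from in_path and write 'chunk<TAB>count' lines to out_path."""
    with open(in_path) as fin, open(out_path, 'w') as fout:
        counts = count_chunks(fin.read(), size)
        # most_common() yields (chunk, count) pairs, highest count first.
        for chunk, count in counts.most_common():
            fout.write('{}\t{}\n'.format(chunk, count))
```

It would be called as write_ranking(fasta, outfile), using the two path variables from the question. Unlike the interactive session above, this version stops the window at len(text) - size + 1, so the short trailing pieces such as 'vg' and 'g' are not counted.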