Question

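Beginner programmer here, still learning a lot. Right now I'm working with a very large text file, and I want to look at character frequencies in different chunks of the text. For example, how often do the characters "a" and "b" appear in text[0:600] versus [600:1200] versus [1200:1800], and so on? Right now I only know how to print text[0:600], but I don't know how to write the syntax that tells Python to look for "a" and "b" only in that part of the text.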

I was thinking maybe the best way is something like, "for each chunk I have, tell me the frequency of 'a' and 'b'." Does that seem workable?

Thank you so much!

Here's what I have so far, if you want to see it. It's very simple:

import re

with open('text.txt') as f:
    fa = f.read()

fa = fa.lower()
corn = re.sub(r'chr', '', fa)  # delete chromosome title
potato = re.sub(r'[^atcg]', '', corn)  # delete all other characters

print(potato[0:50])

Answer 1

You already know how to slice the text. The general case is:

interval = 600
chunks = [text[idx:idx+interval] for idx in range(0, len(text), interval)]

And to count the occurrences of a substring (here 'a') in each chunk:

term = 'a'
term_counts = [chunk.count(term) for chunk in chunks]
# zip them together to make it nicer (note that zip returns an iterator in Python 3)
chunks_with_counts = zip(chunks, term_counts)

Example:

>>> text = "The quick brown fox jumps over the lazy dog"
>>> interval = 3
>>> chunks = [text[idx:idx+interval] for idx in range(0, len(text), interval)]
>>> chunks
['The', ' qu', 'ick', ' br', 'own', ' fo', 'x j', 'ump', 's o', 'ver', ' th', 'e l', 'azy', ' do', 'g']
>>> term='o'
>>> term_counts = [chunk.count(term) for chunk in chunks]
>>> term_counts
[0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0]
>>> chunks_with_counts = zip(chunks, term_counts)
>>> list(chunks_with_counts)
[('The', 0), (' qu', 0), ('ick', 0), (' br', 0), ('own', 1), (' fo', 1), ('x j', 0), ('ump', 0), ('s o', 1), ('ver', 0), (' th', 0), ('e l', 0), ('azy', 0), (' do', 1), ('g', 0)]
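
To cover both characters from the question at once, the same chunking can be paired with one count per term. This is only a minimal sketch, assuming the text is already held in a string; the helper name `count_terms_per_chunk` and the sample string are made up for illustration:

```python
def count_terms_per_chunk(text, terms, interval):
    """Return one dict per chunk, mapping each term to its count in that chunk."""
    chunks = [text[idx:idx + interval] for idx in range(0, len(text), interval)]
    return [{term: chunk.count(term) for term in terms} for chunk in chunks]


# Hypothetical sample text, split into chunks of 4 characters
counts = count_terms_per_chunk("aaab" + "abbb" + "ab", terms=("a", "b"), interval=4)
print(counts)
```

Each dict lines up with one chunk, so `counts[1]` describes text[4:8] here, or text[600:1200] with an interval of 600.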

Answer 2

You can position the file cursor and read from there:

with open('myfile.txt') as myfile:
    myfile.seek(1200)
    text = myfile.read(600)

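This reads 600 bytes starting at position 1200. Note that the positions may be off when the text contains Unicode characters, since a single character can take up more than one byte.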
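
If the file is too large to read in one go, the same idea can be extended to walk through it block by block. The following is a rough sketch rather than a definitive implementation: the generator name `read_chunks` is an assumption, and the file is opened in binary mode so that every offset is an exact byte position (which suits plain ASCII data such as a/t/c/g sequences):

```python
def read_chunks(path, chunksize=600):
    """Yield successive chunks of at most `chunksize` bytes from the file at `path`."""
    with open(path, 'rb') as myfile:  # binary mode: positions are exact byte offsets
        while True:
            chunk = myfile.read(chunksize)
            if not chunk:  # an empty result means end of file
                break
            yield chunk


# Usage with a hypothetical file name:
# for chunk in read_chunks('myfile.txt'):
#     print(chunk.count(b'a'), chunk.count(b'b'))
```

Because the chunks are `bytes` objects, the terms passed to `count` need to be bytes as well (`b'a'`, not `'a'`).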

Answer 3

Yes, that seems workable. You can loop over your text:

def compare_characters(chunk):
    # check for frequency of a and b or whatever
    pass

chunksize = 600
i = 0
while i*chunksize < len(text):
    compare_characters(text[i*chunksize:(i+1)*chunksize])
    i+=1
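
As one possible way to fill in the `compare_characters` placeholder, the function could return the relative frequency of each character in the chunk. This is just a sketch; the `characters` parameter and the sample text are assumptions added for illustration:

```python
def compare_characters(chunk, characters=('a', 'b')):
    """Return the fraction of `chunk` taken up by each character."""
    if not chunk:
        return {ch: 0.0 for ch in characters}
    return {ch: chunk.count(ch) / float(len(chunk)) for ch in characters}


# Hypothetical sample text, processed in chunks of 4 characters
text = 'aabb' + 'abbb' + 'gg'
chunksize = 4
for start in range(0, len(text), chunksize):
    print(start, compare_characters(text[start:start + chunksize]))
```

Returning fractions instead of raw counts keeps a shorter final chunk comparable with the full-size ones.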

How do I look at different chunks of a string in my Python program?
