Question

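I am writing code to retrieve specific characters from a text file by position. For example, I want the character sequence between positions 1043-1049 in the text, such as: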

......... acddex .............

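......and so on. I want to pull that "acddex" sequence out of the text. I know its order and position. So far I can only open the file and enter the positions I want, but I do not know how to count positions across the whole text. Harder still, the whole file is a combination of samples, so I also need the character count to restart at each ">" character, like: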

agoejngodgfjnsodjnfvsojdnvodfjnodjnfbodjngodjgndojgndlkfnvldfkngldjnfgdfjgnldjfngldjfngldfjngldjfngldjnfg dkjdnfgkjdnfgkjndfkgjndfjgnojfgnlfjngdljfngldjfng kdfjngkdfjngkjdndksjngskfjgndkfjgn

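I need the sequence from each of these samples, which are all in the same file, given that I know the start position of the desired sequence. How can I do that?

Note: these are not short sequences. Each is about 200,000 characters long, and I want the code to report, for example, the characters between positions 1046 and 1052.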

Answer 1

Seek to the byte position of the start of the sequence you want, then call read and tell it how many bytes you want.
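A minimal sketch of that seek-then-read approach, assuming a single-byte encoding such as ASCII and zero-based offsets. The helper name `read_span` and the temporary demo file are made up for illustration and are not part of the original answer:

```python
import os
import tempfile


def read_span(path, start, length):
    """Return `length` bytes starting at zero-based byte offset `start`."""
    # Binary mode, so offsets are plain byte positions with no newline translation.
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(length)


# Hypothetical demo file: 1043 filler bytes, then the sequence, then more filler.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"." * 1043 + b"acddex" + b"." * 20)
    demo_path = tmp.name

span = read_span(demo_path, 1043, 6)
print(span.decode("ascii"))  # acddex
os.remove(demo_path)
```

If the positions in the question are one-based, subtract 1 from the start offset before seeking.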

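Note: This answer assumes the file is ASCII-encoded, or uses another encoding in which each character is exactly one byte in the file.

If you are extracting a lot of sequences, try to sort them into position order before you start seeking, so that you do not jump back and forth through the file. Once that is working, consider profiling the code with mmap on the file instead of opening it normally. You may see some speedup. (But as with all optimizations, profile first to see whether this part of the code is really the part that needs optimizing!)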
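As a rough illustration of both suggestions, the sketch below sorts the requested spans by start offset and slices them out of a memory-mapped file. The function name `read_spans_mmap`, the `(start, length)` tuple layout, and the demo data are assumptions made for this example:

```python
import mmap
import os
import tempfile


def read_spans_mmap(path, spans):
    """Map each (start, length) pair to the bytes found at that span."""
    results = {}
    with open(path, "rb") as f:
        # Length 0 maps the whole file; the file must not be empty.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Visit spans in ascending offset order to avoid jumping around.
            for start, length in sorted(spans):
                results[(start, length)] = mm[start:start + length]
    return results


# Hypothetical demo file with two known sequences at offsets 1043 and 1149.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"." * 1043 + b"acddex" + b"." * 100 + b"xyz")
    demo_path = tmp.name

found = read_spans_mmap(demo_path, [(1149, 3), (1043, 6)])
print(found[(1043, 6)], found[(1149, 3)])
os.remove(demo_path)
```

Whether this beats plain `seek` and `read` depends on the file size and access pattern, so it is worth timing both on real data.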

Answer 2

stuff = "agoejngodgfjnsodjnfvsojdnvodfjnodjnfbodjngodjgndojgndlkfnvldfkngldjnfgdfjgnldjfn"

print(stuff[10:20])

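This prints the characters at indices 10 through 19. Python slices are zero-based and exclude the end index.

So, if you want 1043-1049: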

print(stuff[1043:1049])
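The question also asks for the position count to restart at each ">" character. One way to sketch that with slicing, assuming each sample starts with ">" followed by a header line and then sequence lines (a FASTA-like layout, which the question does not confirm). The function name `slice_samples` and the sample text are hypothetical:

```python
def slice_samples(text, start, end):
    """Return {header: sequence[start:end]} for every '>'-delimited sample."""
    out = {}
    for record in text.split(">"):
        if not record.strip():
            continue  # skip the empty chunk before the first '>'
        # Assumption: the first line of a record is its header.
        header, _, body = record.partition("\n")
        # Join the remaining lines so line breaks do not count as positions.
        sequence = "".join(body.split())
        out[header.strip()] = sequence[start:end]
    return out


sample_text = (
    ">sample1\nagoejngodg\nfjnsodjnfv\n"
    ">sample2\nkdfjngkdfj\nngkjdndksj\n"
)
slices = slice_samples(sample_text, 8, 12)
print(slices)  # {'sample1': 'dgfj', 'sample2': 'fjng'}
```

This reads the whole file into memory, which should be acceptable for samples of roughly 200,000 characters each.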

Extracting a string from a file, given a position and length
