Question

I have a file with about 16,000 lines. They all have the same format. Here is a simple example:

ATOM    139  C1  DPPC   18      17.250  58.420  10.850  1.00  0.00
<...>
ATOM    189  C1  DPPC   19      23.050  20.800  11.000  1.00  0.00

I need to check whether the lines that contain the string DPPC and have identifier 18 form a block of 50 lines before the identifier switches to 19, and so on.

So for now, I have the following code:

cnt = 0
with open('test_file.pdb') as f1:
    with open('out','a') as f2:
        lines = f1.readlines()
        for i, line in enumerate(lines):
            if "DPPC" in line:
                A = line.strip()[22:26]
                if A[i] == A [i+1]:
                    cnt = cnt + 1
                elif A[i] != A[i+1]:
                    cnt = 0

Here I'm stuck. I found some examples of how to compare subsequent lines, but a similar approach didn't work here. I still can't figure out how to compare the value of A on one line with the value of A on the next line.

Answer 1

Try this (the explanation is in the comments).

data = """ATOM    139  C1  DPPC   18      17.250  58.420  10.850  1.00  0.00
ATOM    139  C1  DPPC   18      17.250  58.420  10.850  1.00  0.00
ATOM    139  C1  DPPC   18      17.250  58.420  10.850  1.00  0.00
ATOM    139  C1  DPPC   18      17.250  58.420  10.850  1.00  0.00
ATOM    139  C1  DPPC   18      17.250  58.420  10.850  1.00  0.00
ATOM    139  C1  DPPC   18      17.250  58.420  10.850  1.00  0.00
ATOM    139  C1  DPPC   18      17.250  58.420  10.850  1.00  0.00
ATOM    139  C1  DPPC   18      17.250  58.420  10.850  1.00  0.00
ATOM    139  C1  DPPC   18      17.250  58.420  10.850  1.00  0.00
ATOM    189  C1  DPPC   19      23.050  20.800  11.000  1.00  0.00
ATOM    189  C1  DPPC   19      23.050  20.800  11.000  1.00  0.00
ATOM    189  C1  DPPC   19      23.050  20.800  11.000  1.00  0.00
ATOM    189  C1  DPPC   19      23.050  20.800  11.000  1.00  0.00
ATOM    189  C1  DPPC   19      23.050  20.800  11.000  1.00  0.00
ATOM    189  C1  DPPC   19      23.050  20.800  11.000  1.00  0.00
ATOM    189  C1  DPPC   19      23.050  20.800  11.000  1.00  0.00
ATOM    189  C1  DPPC   19      23.050  20.800  11.000  1.00  0.00
ATOM    189  C1  DPPC   19      23.050  20.800  11.000  1.00  0.00
ATOM    189  C1  DPPC   19      23.050  20.800  11.000  1.00  0.00"""

# The last code seen in the 5th column.
code = None

# The count of lines of the current code.
count = 0

for line in data.split("\n"):
    # Get the 5th column.
    c = line.split()[4]

    # The code in the 5th column changed.
    if c != code:
        # If we aren't at the start of the file, print the count
        # for the code that just ended.
        if code is not None:
            print("{}: {}".format(code, count))

        # Remember the new code and restart the count for it.
        code = c
        count = 0

    # Count the line.
    count = count + 1

# Print the count for the last code.
print("{}: {}".format(code, count))

Output:

18: 9
19: 10
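
The same "count each run of equal values" idea can also be expressed with itertools.groupby from the standard library, which groups consecutive items that share a key. The following is only a minimal sketch: the helper name count_runs and the shortened sample list are made up for illustration, and it assumes every non-blank line has at least five whitespace-separated columns.

```python
from itertools import groupby


def count_runs(lines, column=4):
    """Return (value, run_length) pairs for consecutive lines
    that share the same value in the given column."""
    # Split each non-blank line into whitespace-separated fields.
    rows = (line.split() for line in lines if line.strip())
    # groupby only merges *adjacent* rows with the same key,
    # which is exactly the "block" behaviour asked about.
    return [(key, sum(1 for _ in group))
            for key, group in groupby(rows, key=lambda fields: fields[column])]


# Hypothetical, shortened sample in the same layout as the data above.
sample = [
    "ATOM    139  C1  DPPC   18      17.250  58.420  10.850  1.00  0.00",
    "ATOM    139  C1  DPPC   18      17.250  58.420  10.850  1.00  0.00",
    "ATOM    189  C1  DPPC   19      23.050  20.800  11.000  1.00  0.00",
    "ATOM    189  C1  DPPC   19      23.050  20.800  11.000  1.00  0.00",
    "ATOM    189  C1  DPPC   19      23.050  20.800  11.000  1.00  0.00",
]

for value, length in count_runs(sample):
    print("{}: {}".format(value, length))
```

Because groupby never merges non-adjacent groups, an identifier that reappears later in the file is reported as a separate run.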

Answer 2

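Since your data appears to consist of fixed-width fields in fixed-width records, you can use the struct module to quickly split each line into its individual fields.

Parsing every field of each line may be overkill when you only need one of them, but I did it as shown to illustrate how it would be done if you needed other processing, and using the struct module keeps it relatively fast in any case.

Say the input file contains only the following lines of data: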

ATOM    139  C1  DPPC   18      17.250  58.420  10.850  1.00  0.00
ATOM    139  C1  DPPC   18      17.250  58.420  10.850  1.00  0.00
ATOM    139  C1  DPPC   18      17.250  58.420  10.850  1.00  0.00
ATOM    139  C1  DPPC   18      17.250  58.420  10.850  1.00  0.00
ATOM    139  C1  DPPC   18      17.250  58.420  10.850  1.00  0.00
ATOM    139  C1  DPPC   18      17.250  58.420  10.850  1.00  0.00
ATOM    139  C1  DPPC   18      17.250  58.420  10.850  1.00  0.00
ATOM    139  C1  DPPC   18      17.250  58.420  10.850  1.00  0.00
ATOM    139  C1  DPPC   18      17.250  58.420  10.850  1.00  0.00
ATOM    189  C1  DPPC   19      23.050  20.800  11.000  1.00  0.00
ATOM    189  C1  DPPC   19      23.050  20.800  11.000  1.00  0.00
ATOM    189  C1  DPPC   19      23.050  20.800  11.000  1.00  0.00
ATOM    189  C1  DPPC   19      23.050  20.800  11.000  1.00  0.00
ATOM    189  C1  DPPC   19      23.050  20.800  11.000  1.00  0.00
ATOM    189  C1  DPPC   19      23.050  20.800  11.000  1.00  0.00
ATOM    189  C1  DPPC   19      23.050  20.800  11.000  1.00  0.00
ATOM    189  C1  DPPC   20      23.050  20.800  11.000  1.00  0.00
ATOM    189  C1  DPPC   20      23.050  20.800  11.000  1.00  0.00
ATOM    189  C1  DPPC   20      23.050  20.800  11.000  1.00  0.00

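All you need to do is remember the field value from the previous line so that you can compare it with the one on the current line. To get the process started, the first line has to be read and parsed separately, so that there is a prev value to compare subsequent lines against. Also note that the fifth field is the one at index [4], because the first field is at index [0].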

import struct

# negative widths represent ignored padding fields
fieldwidths = 4, -4, 3, -2, 2, -2, 4, -3, 2, -6, 6, -2, 6, -2, 6, -2, 4, -2, 4
fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                                    for fw in fieldwidths)
fieldstruct = struct.Struct(fmtstring)
parse = fieldstruct.unpack_from  # a function to split line up into fields

# Open in binary mode: in Python 3, struct unpacks bytes, not str.
with open('test_file.pdb', 'rb') as f1:
    prev = parse(next(f1))[4]  # remember value of fifth field
    cnt = 1
    for line in f1:
        curr = parse(line)[4]  # get value of fifth field
        if curr == prev:  # same as last one?
            cnt += 1
        else:
            print('{} occurred {} times'.format(prev.decode(), cnt))
            prev = curr
            cnt = 1
    print('{} occurred {} times'.format(prev.decode(), cnt))  # for last line

Output:

18 occurred 9 times
19 occurred 7 times
20 occurred 3 times
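
The question goes one step further than counting: it asks whether every block has a given length (50 lines). A minimal sketch of that check, built on the same prev/cnt bookkeeping, could look like the following. The function name irregular_blocks, its parameters, and the tiny demo list are hypothetical, and for brevity it splits on whitespace instead of using struct, which assumes adjacent columns never run together.

```python
def irregular_blocks(lines, expected=50, residue="DPPC"):
    """Return (identifier, run_length) for every run of `residue`
    lines whose length differs from `expected`."""
    bad = []
    prev, cnt = None, 0
    for line in lines:
        fields = line.split()
        # Skip blank lines and lines that belong to other residues.
        if len(fields) < 5 or fields[3] != residue:
            continue
        curr = fields[4]
        if curr == prev:
            cnt += 1
            continue
        # The identifier changed: check the run that just ended.
        if prev is not None and cnt != expected:
            bad.append((prev, cnt))
        prev, cnt = curr, 1
    # The final run has no following line to trigger the check.
    if prev is not None and cnt != expected:
        bad.append((prev, cnt))
    return bad


# Hypothetical demo with an expected block size of 3 instead of 50.
demo = (["ATOM 1 C1 DPPC 18"] * 3
        + ["ATOM 2 C1 DPPC 19"] * 2
        + ["ATOM 3 C1 DPPC 20"] * 3)
print(irregular_blocks(demo, expected=3))
```

An empty result means every block had the expected length. For the real file, the lines could come straight from the open file object, for example irregular_blocks(open('test_file.pdb')).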

Answer 3

You can also solve this fairly easily with parallel lists:

data = []
with open('data.txt', 'r') as datafile:
    for line in datafile:
        line=line.strip()
        if line:
            data.append(line)


keywordList = []
for line in data:
    line = line.split()
    if (line[4] not in keywordList):
        keywordList.append(line[4])


counterList = []
for item in keywordList:
    counter = 0
    for line in data:
        line = line.split()
        if (line[4] == item):
            counter+=1
    counterList.append(counter)


for i in range(len(keywordList)):
    print("%s: %d" % (keywordList[i], counterList[i]))

But if you are familiar with dict, I would go with Lutz's answer.
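
For readers who have not used dict for this kind of tally, here is a minimal sketch of what a dict-based version might look like, using collections.Counter (a dict subclass from the standard library). It is not taken from any answer in this thread; the helper name count_identifiers and the sample lines are made up for illustration. Like the parallel lists, it counts how often each value occurs in the whole input, not the length of each consecutive run.

```python
from collections import Counter


def count_identifiers(lines, column=4):
    """Count how many non-blank lines carry each value in the given column."""
    return Counter(line.split()[column] for line in lines if line.strip())


# Hypothetical sample in the same layout as the question's data.
sample = [
    "ATOM    139  C1  DPPC   18      17.250  58.420  10.850  1.00  0.00",
    "ATOM    139  C1  DPPC   18      17.250  58.420  10.850  1.00  0.00",
    "ATOM    189  C1  DPPC   19      23.050  20.800  11.000  1.00  0.00",
]

# Plain dicts keep insertion order, so values print in first-seen order.
for value, total in count_identifiers(sample).items():
    print("%s: %d" % (value, total))
```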

How do I compare a string on one line with the string on the next line?