Question

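I think this should be very simple, but it's a Friday afternoon and my brain is clearly not working.

I'm writing a small file parser. The code below converts a set of strings into a DataFrame, splitting each string into columns.

Here are some example strings: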

1. NC_002523_1  Serratia entomophila plasmid pADAP, complete sequence.

2. NZ_CM003366_0    Pantoea ananatis strain CFH 7-1 plasmid CFH1-7plasmid2, whole genome shotgun sequence.

3. NZ_CP014491_0    Escherichia coli strain G749 plasmid pG749_3, complete sequence.

4. NC_015062_0  Rahnella sp. Y9602 plasmid pRAHAQ01, complete sequence.

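I hadn't anticipated the . after sp in the fourth entry. As you can see in the code below, I split on . to get the leading rank as an integer. As a result, I get a ValueError because there are more columns than expected.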

# Define the column headers for the section since the file's own headers are too verbose and ambiguous
SigHit.Columns = ["Rank", "ID", "Description"]

# Store the table of loci and associated data (tab separated, removing the last blank column).

# Use StringIO object to imitate a file, which means that we can use read_table and have the dtypes
# assigned automatically (necessary for functions like min() to work correctly on integers)

SigHit.Table = pd.read_table(
               io.StringIO(u'\n'.join([row.rstrip('.') for row in sighits_section])),
               sep=r'\.|\t',
               engine='python',
               names=SigHit.Columns)

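The simplest solution I can think of (until some other edge case breaks it) is to replace every occurrence of . except the first. How can I do that?

I found that .replace takes a maxreplace argument, but that does the opposite of what I want: it replaces only the first instance.

Any suggestions? (A more robust parsing approach is also a valid option, but I need to change as little of the code as possible.)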

Answer 1

Use a positive lookbehind to make sure the dot is preceded by a digit - sep='(?<=\d)\.|\t'

For example:

import pandas as pd
import io

columns = ["Rank", "ID", "Description"]

sighits_section = '''1. NC_002523_1\tSerratia entomophila plasmid pADAP, complete sequence.
2. NZ_CM003366_0\tPantoea ananatis strain CFH 7-1 plasmid CFH1-7plasmid2, whole genome shotgun sequence.
3. NZ_CP014491_0\tEscherichia coli strain G749 plasmid pG749_3, complete sequence.
4. NC_015062_0\tRahnella sp. Y9602 plasmid pRAHAQ01, complete sequence.'''.splitlines()

tab = pd.read_table(io.StringIO(u'\n'.join([row.rstrip('.') for row in sighits_section])),
                    sep=r'(?<=\d)\.|\t',
                    engine='python',
                    names=columns)

print(tab)

This prints:

   Rank              ID                                        Description
0     1     NC_002523_1  Serratia entomophila plasmid pADAP, complete s...
1     2   NZ_CM003366_0  Pantoea ananatis strain CFH 7-1 plasmid CFH1-7...
2     3   NZ_CP014491_0  Escherichia coli strain G749 plasmid pG749_3, ...
3     4     NC_015062_0  Rahnella sp. Y9602 plasmid pRAHAQ01, complete ...

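To be safer, you may want to require whitespace after the dot as part of the separator - sep='(?<=\d)\.\s|\t' - to mitigate cases such as a 10.1 in your description. This is by no means bulletproof.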
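The difference between the two separators can be checked on a single row with `re.split`. This is only a minimal sketch: the row below is made up for illustration and does not come from the question's file.

```python
import re

# Hypothetical row whose description contains a decimal number.
row = "5. NC_000000_0\tExample plasmid version 10.1, complete sequence"

# Dot preceded by a digit: this also splits inside "10.1".
loose = re.split(r'(?<=\d)\.|\t', row)

# Dot preceded by a digit AND followed by whitespace: "10.1" is left alone.
strict = re.split(r'(?<=\d)\.\s|\t', row)

print(len(loose), loose)
print(len(strict), strict)
```

The first pattern yields four fields, which would trigger the same ValueError as before, while the second yields the expected three.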

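Safer still - since the data is processed one line at a time, you can additionally assert that the digit is the first character of the string: sep='(?<=^\d)\.\s|\t'. However, this breaks for ranks of 10 or higher, because the lookbehind only accepts a single digit at the start of the line.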
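If ranks can run to two or more digits, one possible alternative is to match each row against an anchored pattern that uses `\d+`, rather than relying on a separator regex. This is a sketch of a different technique from the lookbehind above, under the assumption that every row looks like "rank, dot, whitespace, ID, whitespace, description". `ROW_PATTERN` and `parse_row` are hypothetical names, and the second row's rank of 12 is invented to exercise the multi-digit case.

```python
import re

# Assumed row layout: "<rank>. <id><whitespace><description>"
ROW_PATTERN = re.compile(r'^(\d+)\.\s+(\S+)\s+(.*)$')

def parse_row(row):
    """Split one row into (rank, id, description); raise on unexpected input."""
    match = ROW_PATTERN.match(row.rstrip('.'))
    if match is None:
        raise ValueError(f"Unrecognised row: {row!r}")
    rank, ident, description = match.groups()
    return int(rank), ident, description

rows = [
    "4. NC_015062_0\tRahnella sp. Y9602 plasmid pRAHAQ01, complete sequence.",
    # Rank 12 is made up here to show that multi-digit ranks are handled.
    "12. NZ_CP014491_0\tEscherichia coli strain G749 plasmid pG749_3, complete sequence.",
]

records = [parse_row(r) for r in rows]
for record in records:
    print(record)
```

The resulting tuples could then be passed to `pd.DataFrame(records, columns=["Rank", "ID", "Description"])`. The trade-off is that this changes more of the original code than swapping the `sep` argument does.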

Answer 2

The naive approach

Replace every occurrence of . except the first:

line = "4. NC_015062_0  Rahnella sp. Y9602 plasmid pRAHAQ01, complete sequence."
count = line.count(".")
line = line[::-1].replace(".", "", count-1)[::-1]

As a one-liner:

row[::-1].replace(".","",row.count(".")-1)[::-1]
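The same result can probably be reached without reversing the string, by splitting at the first dot with `str.partition` and stripping dots only from the remainder. This is a minimal sketch and `keep_first_dot` is a hypothetical helper name, not part of the answer above.

```python
def keep_first_dot(row):
    """Remove every '.' from row except the first one."""
    # partition splits at the first "." only; dot is "" if there is none.
    head, dot, tail = row.partition(".")
    return head + dot + tail.replace(".", "")

line = "4. NC_015062_0  Rahnella sp. Y9602 plasmid pRAHAQ01, complete sequence."
print(keep_first_dot(line))
```

Note that either variant also removes dots that belong to the description (such as the one in "sp."), so the text stored in the table is slightly altered.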
