Question

I'm looking for Python code to convert:

scaffold_356_1-1000_+__Genus_species

into

scaffold_356_Gen_spe

So the idea is first to shorten each word after the __ part to its first 3 letters, going from Genus_species to Gen_spe,

and then to remove the number-number part, i.e. delete _1-1000_+_

Thanks for your help :)

Here is what I know how to do so far:

import re 
name = "scaffold_356_1-1000_+__Genus_species"
name=re.sub(r'\d+\-\d*',"",name)
name = re.sub(r'__.__',"_",name)

which gives me:

scaffold_356_Genus_species

Answer 1

You're almost there. I would split the string into a prefix and a suffix, modify each separately, and then join them back together.

import re
s = 'scaffold_356_1-1000_+__Genus_species'

# Split into prefix (before '__') and suffix (after '__')
prefix, suffix = s.split('__')
# 'scaffold_356_1-1000_+', 'Genus_species'

# Keep the first three characters of each word in the suffix
modified_suffix = '_'.join([part[0:3] for part in suffix.split('_')])
# 'Gen_spe'

# Remove the digits-digits range, then strip the trailing underscores and '+'
modified_prefix = re.sub(r'\d+-\d*', "", prefix).rstrip('_+')
# 'scaffold_356'

# Join the strings back
final_s = modified_prefix + '_' + modified_suffix
print(final_s)
# scaffold_356_Gen_spe
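
If you need this for more than one name, the steps above can be wrapped in a small function. This is only a sketch: the name shorten_name and the keep parameter are made up for illustration, and it assumes every input contains exactly one '__' separator and a '+' strand marker, as in the example.

```python
import re

def shorten_name(name, keep=3):
    """Hypothetical helper: drop the 'start-end_+' part and abbreviate the taxon."""
    # Assumes exactly one '__' in the name; otherwise this raises ValueError
    prefix, taxon = name.split('__')
    # Keep the first `keep` characters of each word after '__'
    short_taxon = '_'.join(part[:keep] for part in taxon.split('_'))
    # Remove the digits-digits range, then the trailing underscores and '+'
    prefix = re.sub(r'\d+-\d*', '', prefix).rstrip('_+')
    return prefix + '_' + short_taxon

names = ['scaffold_356_1-1000_+__Genus_species']
print([shorten_name(n) for n in names])
```

A name without '__' (or with more than one) raises ValueError at the unpacking step, which at least makes unexpected input visible.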

Answer 2

Here is my solution; it depends heavily on your exact input pattern:

name = "scaffold_356_1-1000_+__Genus_species"
comp_list = name.split("_")
result = comp_list[0] + "_" + comp_list[1] + "_" + comp_list[5][0:3] + "_" + comp_list[6][0:3]
print(result) # scaffold_356_Gen_spe

The biggest advantage of this solution is its readability (IMHO).
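
A slightly less position-dependent variant of the same idea, as a sketch: take the first two fields and the last two fields, so the number of fields in between does not matter. The function name shorten_by_fields is hypothetical, and it still assumes the name starts with two fields to keep and ends with genus and species.

```python
def shorten_by_fields(name):
    # Hypothetical helper: split on every underscore, as in the answer above
    parts = name.split("_")
    # Keep the first two fields whole and abbreviate the last two to 3 characters
    return "_".join(parts[:2] + [p[:3] for p in parts[-2:]])

print(shorten_by_fields("scaffold_356_1-1000_+__Genus_species"))
```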

Answer 3

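It looks like you're doing pattern-based text manipulation, so regular expressions are a good fit. It's hard to generalize from a single example; if you describe the transformation more precisely, it becomes easier to craft a regex that does what you need. The Python documentation on regular expressions is a useful reference: https://docs.python.org/3/library/re.html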

If I had to generalize the pattern from your example and description, I would write the following regex:

import re

myre = re.compile(
    r'([A-Za-z]+_[\d]+)' # This will match "scaffold_356" in the first group
    r'_[\d]+-[\d]+_\+_' # This will match "_1-1000_+_" ungrouped
    r'(_[A-Za-z]{3})' # This will match _Gen and put it in the second group
    r'[A-Za-z]*' # This will match any additional letters, ungrouped
    r'(_[A-Za-z]{3})' # This will match _spe and put it in the third group
)

If you then try this regex, you can see that it extracts the parts you want to assemble into the final result:

matches = myre.match('scaffold_356_1-1000_+__Genus_species')
# A match object is not iterable; .groups() returns the captured groups as a tuple
print(''.join(matches.groups())) # prints scaffold_356_Gen_spe

Of course, this regex only works for a very specific pattern and is unforgiving if the input does not strictly follow it.
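
Because a non-matching name makes match() return None, it may be worth guarding against that case. The following is a sketch rather than a drop-in replacement: the names PATTERN and shorten are made up, accepting '-' as well as '+' for the strand is my assumption, and fullmatch is used so that trailing junk is rejected as well.

```python
import re

# Assumed layout: letters_digits _ digits-digits _ strand __ Genus _ species
PATTERN = re.compile(
    r'([A-Za-z]+_\d+)'          # group 1: e.g. "scaffold_356"
    r'_\d+-\d+_[+-]__'          # the range and strand, not captured
    r'([A-Za-z]{3})[A-Za-z]*'   # group 2: first three letters of the genus
    r'_([A-Za-z]{3})[A-Za-z]*'  # group 3: first three letters of the species
)

def shorten(name):
    m = PATTERN.fullmatch(name)
    if m is None:
        # The name does not follow the assumed layout
        return None
    return '_'.join(m.groups())

print(shorten('scaffold_356_1-1000_+__Genus_species'))
print(shorten('not_a_match'))
```

Returning None is just one choice; raising an exception would be equally reasonable if malformed names should stop the script.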

Answer 4

Probably not the most elegant solution, but it works as long as your input always follows the pattern string_3digits_1digit-4digits_+__string_string.

import re

a_string = 'scaffold_356_1-1000_+__Genus_species'

# Raw string so that \+ is passed to the regex engine as written
new = re.findall(r'^([a-zA-Z]+_[0-9][0-9][0-9]_).+?_\+__([a-zA-Z][a-zA-Z][a-zA-Z]).*(_[a-zA-Z][a-zA-Z][a-zA-Z]).*', a_string)

print(''.join(list(new[0])))
# scaffold_356_Gen_spe

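This example uses a regex pattern with capture groups. You may want to play around with regex a bit to understand how the pattern is structured. If you paste this pattern into regex101, it will explain each element for you.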
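
To see the structure without leaving Python, the same pattern can be written with re.VERBOSE, which ignores whitespace and allows comments inside the pattern. This is a sketch of an equivalent pattern, with [0-9]{3} and [a-zA-Z]{3} used as shorthand for the repeated classes; the variable name pattern is mine.

```python
import re

pattern = re.compile(r"""
    ^([a-zA-Z]+_[0-9]{3}_)   # group 1: letters, underscore, three digits, underscore
    .+?                      # lazily skip the range, e.g. 1-1000
    _\+__                    # the literal text _+__
    ([a-zA-Z]{3})            # group 2: first three letters of the genus
    .*                       # greedy: rest of the genus
    (_[a-zA-Z]{3})           # group 3: underscore plus first three letters of the species
    .*                       # rest of the species
""", re.VERBOSE)

a_string = 'scaffold_356_1-1000_+__Genus_species'
print(''.join(pattern.match(a_string).groups()))
```

Note that the greedy .* before group 3 means group 3 is taken from the last underscore that is followed by three letters.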

Remove a specific part of a variable in Python
