I asked a similar question before: Split multi delimiter columns into multiple column
The df I am working with now:
Unique
3:107912234-107912321(-):Cep290
4:107913333-107913322(+):Myra1
Y:222002110-221002100(+):Znpl1
MT:34330044-343123232(-):Brca2
X:838377373-834121212(+):AC007040.11
df_new = df['Unique'].str.extract("(?P<chr>.*?):(?P<start>\d+)-(?P<end>\d+)\((?P<strand>[-+]:(?P<gene_n>[A-Za-z]d+))", expand=True)
print(df_new.head(5))
chr start end strand gene_n
0 NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN
How can I split out the last string, the gene name?
That element can take either of these forms:
genename[. or -][numbers]
genename[numbers]
Answer 0 (score: 2)
This regular expression does the job:
df['Unique'].str.extract(r'(?P<chr>.*):(?P<start>\d+)-(?P<end>\d+)\((?P<strand>.*)\):(?P<gene_n>.*)')
which gives:
chr start end strand gene_n
0 3 107912234 107912321 - Cep290
1 4 107913333 107913322 + Myra1
2 Y 222002110 221002100 + Znpl1
3 MT 34330044 343123232 - Brca2
4 X 838377373 834121212 + AC007040.11
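For anyone who wants to reproduce this, here is a minimal self-contained sketch. It assumes the column is named Unique and holds exactly the five rows from the question. The conversion of start and end to integers is an extra step that the answer does not perform; it is included only because str.extract returns strings.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({"Unique": [
    "3:107912234-107912321(-):Cep290",
    "4:107913333-107913322(+):Myra1",
    "Y:222002110-221002100(+):Znpl1",
    "MT:34330044-343123232(-):Brca2",
    "X:838377373-834121212(+):AC007040.11",
]})

# Raw string so that \d and \( reach the regex engine unchanged.
pattern = r"(?P<chr>.*):(?P<start>\d+)-(?P<end>\d+)\((?P<strand>.*)\):(?P<gene_n>.*)"

# Each named group becomes a column of the resulting DataFrame.
df_new = df["Unique"].str.extract(pattern, expand=True)

# Optional extra step (not part of the answer): make the coordinates numeric.
df_new = df_new.astype({"start": "int64", "end": "int64"})

print(df_new)
```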
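Your pattern does not handle the closing parenthesis after strand, and gene_n here is a mix of letters and digits. The usual way to match alphanumerics is \w+; as others pointed out, [A-Za-z]d+ does not work, because it matches a single letter followed by one or more literal d characters.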
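One caveat on \w+: it does not match "." or "-", so on its own it would stop short on a name such as AC007040.11. The sketch below compares \w+ with a slightly stricter pattern that covers both forms listed in the question, genename[numbers] and genename[. or -][numbers]. The stricter pattern is my own assumption about what a valid name looks like, and "Abc-12" is a hypothetical name added only to exercise the "-" case.

```python
import re

# Three names from the question plus one hypothetical name with a "-" suffix.
names = ["Cep290", "Brca2", "AC007040.11", "Abc-12"]

# Letters, digits and underscore only.
word_only = re.compile(r"\w+")

# A letter, then word characters, then an optional ".digits" or "-digits" suffix.
with_sep = re.compile(r"[A-Za-z]\w*(?:[.-]\d+)?")

# For each name: (does \w+ cover the whole name?, does the stricter pattern?)
results = {
    name: (bool(word_only.fullmatch(name)), bool(with_sep.fullmatch(name)))
    for name in names
}

for name, (plain, strict) in results.items():
    print(name, plain, strict)
```

If you want the whole row validated rather than just captured, the same gene-name sub-pattern can replace the final .* in the extract pattern, and [-+] can replace .* inside the strand group.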