Pandas has a very fast and convenient string method, extract(). It works well with a regular expression like this one:
strict_pattern = r"^(?P<pre_spacer>ACGAG)(?P<UMI>.{9,13})(?P<post_spacer>TGGAGTCT)"
test_df
R1
21 ACGAGTTTTCGTATTTTTGGAGTCTTGTGG
22 ACGAGTAGGGAGGGGGGTGGAGTCTCAGCG
23 ACGAGGGGGGGGAGGCTGGAGTCTCCGGGT
24 ACGAGAATAACGTTTGGTGGAGTCTACCAC
25 ACGAGGGGAATAAATATTGGAGTCTCCTCC
26 ACGAGATTGGGTATGCTGGAGTCTCTGTTC
27 ACGAGGTACCCGCGCCATGGAGTCTCTCTG
28 ACGAGTGGTTTTTGTCGTGGAGTCTCACCA
29 ACGAGACGTGTCCACCATGGAGTCTTGTCT
test_df.R1.str.extract(strict_pattern)
pre_spacer UMI post_spacer
21 ACGAG TTTTCGTATTTT TGGAGTCT
22 ACGAG TAGGGAGGGGGG TGGAGTCT
23 ACGAG GGGGGGGAGGC TGGAGTCT
24 ACGAG AATAACGTTTGG TGGAGTCT
25 ACGAG GGGAATAAATAT TGGAGTCT
26 ACGAG ATTGGGTATGC TGGAGTCT
27 ACGAG GTACCCGCGCCA TGGAGTCT
28 ACGAG TGGTTTTTGTCG TGGAGTCT
29 ACGAG ACGTGTCCACCA TGGAGTCT
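However, because it uses the re module rather than the regex package (if I remember correctly), it does not support patterns that allow mismatches (fuzzy matching), such as this one: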
lax_pattern = r"^(?P<pre_spacer>ACGAG){s<=1}(?P<UMI>.{9,13})(?P<post_spacer>TGGAGTCT){s<=1}"
This pattern allows one substitution in each of the pre_spacer and post_spacer sequences.
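To make the meaning of {s<=1} concrete: for a fixed-length literal such as a spacer, it amounts to accepting any string of the same length that differs from the literal in at most one position. A minimal pure-Python sketch of that check follows; the helper name within_one_substitution is hypothetical and is not part of regex or pandas.

```python
def within_one_substitution(observed: str, expected: str) -> bool:
    """Return True if `observed` equals `expected` up to one substituted character."""
    # Substitutions never change the length, so differing lengths cannot match.
    if len(observed) != len(expected):
        return False
    # Count the positions where the two strings disagree.
    mismatches = sum(a != b for a, b in zip(observed, expected))
    return mismatches <= 1


print(within_one_substitution("ACGTG", "ACGAG"))  # one mismatch at index 3
print(within_one_substitution("ACTTG", "ACGAG"))  # two mismatches
```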
The regex package does accept this kind of pattern, as this example shows:
seq = 'ACGAGCGCCCACCCGCCTGGAGTCTACCAACGGTAACAGCTG'
lax_pattern = r"^(?P<pre_spacer>ACGAG){s<=1}(?P<UMI>.{9,13})(?P<post_spacer>TGGAGTCT){s<=1}"
m = regex.match(lax_pattern,seq)
m.groupdict()
{'pre_spacer': 'ACGAG', 'UMI': 'CGCCCACCCGCC', 'post_spacer': 'TGGAGTCT'}
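I would like to make extract() work with this kind of pattern, or to find any fast workaround.
I have a working solution, but it is 12 times slower than extract(), and I am working with very large DataFrames.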
def extract_regex(pattern, seq):
    m = regex.match(pattern, seq)
    try:
        d = m.groupdict()
        return list(d.values())
    except AttributeError:
        return [np.nan] * 3
test_df["pre_spacer"], test_df["UMI"], test_df["post_spacer"] = zip(*test_df.apply(lambda row: extract_regex(lax_pattern, row.R1), axis=1))
test_df
R1 pre_spacer UMI post_spacer
21 ACGAGTTTTCGTATTTTTGGAGTCTTGTGG ACGAG TTTTCGTATTTT TGGAGTCT
22 ACGAGTAGGGAGGGGGGTGGAGTCTCAGCG ACGAG TAGGGAGGGGGG TGGAGTCT
23 ACGAGGGGGGGGAGGCTGGAGTCTCCGGGT ACGAG GGGGGGGAGGC TGGAGTCT
24 ACGAGAATAACGTTTGGTGGAGTCTACCAC ACGAG AATAACGTTTGG TGGAGTCT
25 ACGAGGGGAATAAATATTGGAGTCTCCTCC ACGAG GGGAATAAATAT TGGAGTCT
26 ACGAGATTGGGTATGCTGGAGTCTCTGTTC ACGAG ATTGGGTATGC TGGAGTCT
27 ACGAGGTACCCGCGCCATGGAGTCTCTCTG ACGAG GTACCCGCGCCA TGGAGTCT
28 ACGAGTGGTTTTTGTCGTGGAGTCTCACCA ACGAG TGGTTTTTGTCG TGGAGTCT
29 ACGAGACGTGTCCACCATGGAGTCTTGTCT ACGAG ACGTGTCCACCA TGGAGTCT
Any ideas on how to adapt the pandas extract() method, or how to get the same functionality at a similar speed?
Thanks!
Paul.
Answer 0 (score: 1)
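Until pandas compiles its patterns with the regex library, you cannot use these features in .extract.
You will probably have to fall back on a custom function applied with .apply: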
import regex
import pandas as pd
test_df = pd.DataFrame({"R1": ['ACGAGTTTTCGTATTTTTGGAGTCTTGTGG', 'AAAAGGGA']})
# Compile once so the fuzzy pattern is not re-parsed for every row.
lax_pattern = regex.compile(r"^(?P<pre_spacer>ACGAG){s<=1}(?P<UMI>.{9,13})(?P<post_spacer>TGGAGTCT){s<=1}")
# Returned for rows that do not match the pattern.
empty_val = pd.Series(["", "", ""], index=['pre_spacer', 'UMI', 'post_spacer'])
def extract_regex(seq):
    m = lax_pattern.search(seq)
    if m:
        # One Series per row; .apply expands these into three columns.
        return pd.Series(list(m.groupdict().values()), index=['pre_spacer', 'UMI', 'post_spacer'])
    else:
        return empty_val
test_df[["pre_spacer", "UMI", "post_spacer"]] = test_df['R1'].apply(extract_regex)
Output:
>>> test_df
R1 pre_spacer UMI post_spacer
0 ACGAGTTTTCGTATTTTTGGAGTCTTGTGG ACGAG TTTTCGTATTTT TGGAGTCT
1 AAAAGGGA
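Another possible workaround, offered here as a sketch rather than a tested solution: because both spacers have a fixed length and only substitutions are allowed, the fuzzy part can be expanded into an ordinary alternation that the standard re module understands. The helper names one_substitution and build_lax_pattern are hypothetical, and in ambiguous cases the match chosen may differ from what regex would return.

```python
import re


def one_substitution(literal: str) -> str:
    """Build a plain-`re` alternation matching `literal` with at most one substituted character."""
    # Replace each position in turn with "." (any character, including the original one).
    variants = [
        re.escape(literal[:i]) + "." + re.escape(literal[i + 1:])
        for i in range(len(literal))
    ]
    return "(?:" + "|".join(variants) + ")"


def build_lax_pattern(pre: str = "ACGAG", post: str = "TGGAGTCT") -> str:
    """Assemble a named-group pattern that tolerates one substitution in each spacer."""
    return (
        rf"^(?P<pre_spacer>{one_substitution(pre)})"
        r"(?P<UMI>.{9,13})"
        rf"(?P<post_spacer>{one_substitution(post)})"
    )


lax_re_pattern = build_lax_pattern()

# The pre-spacer below has one substitution (ACGTG instead of ACGAG).
m = re.match(lax_re_pattern, "ACGTGTTTTCGTATTTTTGGAGTCTTGTGG")
print(m.groupdict())
# {'pre_spacer': 'ACGTG', 'UMI': 'TTTTCGTATTTT', 'post_spacer': 'TGGAGTCT'}
```

Since the resulting string is a regular re pattern with named groups, it should be usable directly as test_df.R1.str.extract(lax_re_pattern), which would keep the vectorized extract() path instead of a per-row .apply. The alternation grows with spacer length, so this is only practical for short literals and a small number of allowed substitutions.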