Pandas extract() with a regex that allows mismatches

Date: 2019-09-13 09:38:05

Tags: regex python-3.x pandas extract fuzzy-search

Pandas has a fast and convenient string method, extract(). It works well with a regular expression like this one:

strict_pattern = r"^(?P<pre_spacer>ACGAG)(?P<UMI>.{9,13})(?P<post_spacer>TGGAGTCT)"

test_df

    R1
21  ACGAGTTTTCGTATTTTTGGAGTCTTGTGG
22  ACGAGTAGGGAGGGGGGTGGAGTCTCAGCG
23  ACGAGGGGGGGGAGGCTGGAGTCTCCGGGT
24  ACGAGAATAACGTTTGGTGGAGTCTACCAC
25  ACGAGGGGAATAAATATTGGAGTCTCCTCC
26  ACGAGATTGGGTATGCTGGAGTCTCTGTTC
27  ACGAGGTACCCGCGCCATGGAGTCTCTCTG
28  ACGAGTGGTTTTTGTCGTGGAGTCTCACCA
29  ACGAGACGTGTCCACCATGGAGTCTTGTCT
test_df.R1.str.extract(strict_pattern)

    pre_spacer  UMI     post_spacer
21  ACGAG   TTTTCGTATTTT    TGGAGTCT
22  ACGAG   TAGGGAGGGGGG    TGGAGTCT
23  ACGAG   GGGGGGGAGGC     TGGAGTCT
24  ACGAG   AATAACGTTTGG    TGGAGTCT
25  ACGAG   GGGAATAAATAT    TGGAGTCT
26  ACGAG   ATTGGGTATGC     TGGAGTCT
27  ACGAG   GTACCCGCGCCA    TGGAGTCT
28  ACGAG   TGGTTTTTGTCG    TGGAGTCT
29  ACGAG   ACGTGTCCACCA    TGGAGTCT

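However, since it is built on re rather than the regex package (if I remember correctly), it does not support regular expressions that allow mismatches, such as this one: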

lax_pattern = r"^(?P<pre_spacer>ACGAG){s<=1}(?P<UMI>.{9,13})(?P<post_spacer>TGGAGTCT){s<=1}"

This regex allows one substitution in each of the pre_spacer and post_spacer sequences.
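To make the {s<=1} constraint concrete: a budget of one substitution means the matched text has the same length as the literal and differs from it in at most one position. The sketch below spells that out in plain Python; the helper names substitutions and within_one_substitution are illustrative only and are not part of regex or pandas.

```python
def substitutions(observed: str, expected: str) -> int:
    """Count the positions where two equal-length strings differ."""
    if len(observed) != len(expected):
        raise ValueError("substitutions only compare equal-length strings")
    return sum(a != b for a, b in zip(observed, expected))


def within_one_substitution(observed: str, expected: str) -> bool:
    """True when `observed` equals `expected` up to one substituted character."""
    return len(observed) == len(expected) and substitutions(observed, expected) <= 1


print(within_one_substitution("ACGAG", "ACGAG"))  # exact spacer -> True
print(within_one_substitution("ACGTG", "ACGAG"))  # one substitution -> True
print(within_one_substitution("TTGAG", "ACGAG"))  # two substitutions -> False
```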

The regex package accepts this kind of pattern, as this example shows:

import regex

seq = 'ACGAGCGCCCACCCGCCTGGAGTCTACCAACGGTAACAGCTG'
lax_pattern = r"^(?P<pre_spacer>ACGAG){s<=1}(?P<UMI>.{9,13})(?P<post_spacer>TGGAGTCT){s<=1}"
m = regex.match(lax_pattern, seq)
m.groupdict()

{'pre_spacer': 'ACGAG', 'UMI': 'CGCCCACCCGCC', 'post_spacer': 'TGGAGTCT'}

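I would like to make extract() work with this kind of regex, or to find any fast workaround.

I have come up with the following, but it is about 12 times slower than extract(), and I work with very large DataFrames.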

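import numpy as np
import regex

def extract_regex(pattern, seq):
    # Return the three named groups, or NaN placeholders when there is no match.
    m = regex.match(pattern, seq)
    if m is None:
        return [np.nan] * 3
    return list(m.groupdict().values())

# Row-wise apply: the Python function is called once per row.
test_df["pre_spacer"], test_df["UMI"], test_df["post_spacer"] = zip(*test_df.apply(lambda row: extract_regex(lax_pattern, row.R1), axis=1))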

test_df

    R1  pre_spacer  UMI     post_spacer
21  ACGAGTTTTCGTATTTTTGGAGTCTTGTGG  ACGAG   TTTTCGTATTTT    TGGAGTCT
22  ACGAGTAGGGAGGGGGGTGGAGTCTCAGCG  ACGAG   TAGGGAGGGGGG    TGGAGTCT
23  ACGAGGGGGGGGAGGCTGGAGTCTCCGGGT  ACGAG   GGGGGGGAGGC     TGGAGTCT
24  ACGAGAATAACGTTTGGTGGAGTCTACCAC  ACGAG   AATAACGTTTGG    TGGAGTCT
25  ACGAGGGGAATAAATATTGGAGTCTCCTCC  ACGAG   GGGAATAAATAT    TGGAGTCT
26  ACGAGATTGGGTATGCTGGAGTCTCTGTTC  ACGAG   ATTGGGTATGC     TGGAGTCT
27  ACGAGGTACCCGCGCCATGGAGTCTCTCTG  ACGAG   GTACCCGCGCCA    TGGAGTCT
28  ACGAGTGGTTTTTGTCGTGGAGTCTCACCA  ACGAG   TGGTTTTTGTCG    TGGAGTCT
29  ACGAGACGTGTCCACCATGGAGTCTTGTCT  ACGAG   ACGTGTCCACCA    TGGAGTCT

Any ideas on how to adapt the pandas extract() method, or how to get this functionality at a similar speed?

Thanks!

Paul.

1 answer:

Answer 0 (score: 1)

Until pandas compiles its patterns with the regex library, you cannot use these features in .extract.
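For the substitution-only case, one possible way to stay within what .extract accepts is to spell the fuzziness out as an ordinary re alternation: one variant per position, each with that position replaced by a wildcard. The sketch below is an assumption-laden illustration, not a drop-in equivalent of regex's fuzzy matching: the helper names one_substitution and build_lax_pattern are hypothetical, the approach cannot express insertions or deletions, and in ambiguous cases (for example when several UMI lengths fit) it may pick a different match than regex would.

```python
import re


def one_substitution(literal: str) -> str:
    """Alternation matching `literal` exactly or with one character substituted."""
    # "." also matches the original character, so the exact literal is covered.
    variants = [
        re.escape(literal[:i]) + "." + re.escape(literal[i + 1:])
        for i in range(len(literal))
    ]
    return "|".join(variants)


def build_lax_pattern(pre: str = "ACGAG", post: str = "TGGAGTCT") -> str:
    """Plain-`re` pattern with the same named groups as the strict pattern."""
    return (
        rf"^(?P<pre_spacer>{one_substitution(pre)})"
        r"(?P<UMI>.{9,13})"
        rf"(?P<post_spacer>{one_substitution(post)})"
    )


pattern = build_lax_pattern()
# One substitution in the pre-spacer (ACGTG instead of ACGAG) still matches.
m = re.match(pattern, "ACGTGTTTTCGTATTTTTGGAGTCTTGTGG")
print(m.groupdict())
```

Because the result is an ordinary re pattern string, it could in principle be passed straight to test_df.R1.str.extract(pattern); how it compares in speed to the strict pattern has not been measured here.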

You will probably have to fall back on .apply with a custom function:

import regex
import pandas as pd

test_df = pd.DataFrame({"R1": ['ACGAGTTTTCGTATTTTTGGAGTCTTGTGG', 'AAAAGGGA']})

lax_pattern = regex.compile(r"^(?P<pre_spacer>ACGAG){s<=1}(?P<UMI>.{9,13})(?P<post_spacer>TGGAGTCT){s<=1}")

empty_val = pd.Series(["","",""], index=['pre_spacer','UMI','post_spacer'])

def extract_regex(seq):
    # Return the named groups as a Series, or empty strings when nothing matches.
    m = lax_pattern.search(seq)
    if m:
        return pd.Series(list(m.groupdict().values()), index=['pre_spacer','UMI','post_spacer'])
    else:
        return empty_val


test_df[["pre_spacer","UMI","post_spacer"]] = test_df['R1'].apply(extract_regex)

Output:

>>> test_df
                               R1 pre_spacer           UMI post_spacer
0  ACGAGTTTTCGTATTTTTGGAGTCTTGTGG      ACGAG  TTTTCGTATTTT    TGGAGTCT
1                        AAAAGGGA
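If building a pd.Series for every row turns out to be a bottleneck, a possible variant is to collect plain tuples in one pass and construct the columns once at the end. The sketch below assumes a hypothetical helper, extract_groups, that accepts any compiled pattern exposing .match and .groupindex (both re and regex patterns do). It is demonstrated with the strict pattern so that it runs on the standard library alone, and it yields None rather than empty strings for non-matching rows.

```python
import re
from typing import Iterable, List, Optional, Tuple


def extract_groups(
    compiled, sequences: Iterable[str]
) -> Tuple[List[str], List[Tuple[Optional[str], ...]]]:
    """Match each sequence once and collect its named groups as a plain tuple."""
    # Group names ordered by group number.
    names = sorted(compiled.groupindex, key=compiled.groupindex.get)
    empty = (None,) * len(names)
    rows = []
    for seq in sequences:
        m = compiled.match(seq)
        rows.append(tuple(m.group(n) for n in names) if m else empty)
    return names, rows


strict = re.compile(r"^(?P<pre_spacer>ACGAG)(?P<UMI>.{9,13})(?P<post_spacer>TGGAGTCT)")
names, rows = extract_groups(strict, ["ACGAGTTTTCGTATTTTTGGAGTCTTGTGG", "AAAAGGGA"])
print(names)
print(rows)
```

With the compiled lax_pattern from above passed in place of strict, the result could be attached with pd.DataFrame(rows, columns=names, index=test_df.index). Whether this is actually faster than the .apply version would need to be timed on real data.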