Question

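I am writing a long piece of code that takes too long to execute. I ran cProfile on it and found that the following function is called 150 times and takes 1.3 seconds per call, which adds up to about 200 seconds for this function alone. The function is -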

def makeGsList(sentences,org):
    gs_list1=[]
    gs_list2=[]
    for s in sentences:
        if s.startswith(tuple(StartWords)):
            s = s.lower()
            if org=='m':
                gs_list1 = [k for k in m_words if k in s]
            if org=='h':
                gs_list1 = [k for k in h_words if k in s]
            for gs_element in gs_list1:
                gs_list2.append(gs_element)
    gs_list3 = list(set(gs_list2))
    return gs_list3

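The code takes a list of sentences and a flag org. It goes through each line, checks whether it starts with any of the words in the list StartWords, and if so lowercases it. Then, depending on the value of org, it builds a list of all the entries of m_words or h_words that occur in the current sentence. It appends these entries to another list, gs_list2. Finally it de-duplicates gs_list2 through a set and returns the result.

Can anyone give me suggestions on how to optimize this function to reduce the execution time?

Note - the entries in h_words / m_words are not all single words; many of them are phrases containing 3-4 words.

Some examples -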

StartWords = ['!Series_title','!Series_summary','!Series_overall_design','!Sample_title','!Sample_source_name_ch1','!Sample_characteristics_ch1']

sentences = [u'!Series_title\t"Transcript profiles of DCs of PLOSL patients show abnormalities in pathways of actin bundling and immune response"\n', u'!Series_summary\t"This study was aimed to identify pathways associated with loss-of-function of the DAP12/TREM2 receptor complex and thus gain insight into pathogenesis of PLOSL (polycystic lipomembranous osteodysplasia with sclerosing leukoencephalopathy). Transcript profiles of PLOSL patients\' DCs showed differential expression of genes involved in actin bundling and immune response, but also for the stability of myelin and bone remodeling."\n', u'!Series_summary\t"Keywords: PLOSL patient samples vs. control samples"\n', u'!Series_overall_design\t"Transcript profiles of in vitro differentiated DCs of three controls and five PLOSL patients were analyzed."\n', u'!Series_type\t"Expression profiling by array"\n', u'!Sample_title\t"potilas_DC_A"\t"potilas_DC_B"\t"potilas_DC_C"\t"kontrolli_DC_A"\t"kontrolli_DC_C"\t"kontrolli_DC_D"\t"potilas_DC_E"\t"potilas_DC_D"\n',  u'!Sample_characteristics_ch1\t"in vitro differentiated DCs"\t"in vitro differentiated DCs"\t"in vitro differentiated DCs"\t"in vitro differentiated DCs"\t"in vitro differentiated DCs"\t"in vitro differentiated DCs"\t"in vitro differentiated DCs"\t"in vitro differentiated DCs"\n', u'!Sample_description\t"DAP12mut"\t"DAP12mut"\t"DAP12mut"\t"control"\t"control"\t"control"\t"TREM2mut"\t"TREM2mut"\n']

h_words = ['pp1665', 'glycerophosphodiester phosphodiesterase domain containing 5', 'gde2', 'PLOSL patients', 'actin bundling', 'glycerophosphodiester phosphodiesterase 2', 'glycerophosphodiester phosphodiesterase domain-containing protein 5']

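m_words is similar.

Regarding sizes -

The h_words and m_words lists each contain about 250,000 entries. Each entry is 2 words long on average. The list of sentences is about 10-20 sentences long, and I have provided an example list to give you an idea of how big each sentence is.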

Answer 1

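Don't use global variables for m_words and h_words.
Put the if statements outside the for loop.
Cast tuple(StartWords) once and for all.
Use a programmatically built regex instead of a list comprehension.
Precompile everything that can be precompiled.
Extend your list directly instead of iterating through it to append() each element.
Use a set from the beginning instead of a list.
Use a set comprehension instead of explicit for loops.

import re

m_reg = re.compile("|".join(re.escape(w) for w in m_words))
h_reg = re.compile("|".join(re.escape(w) for w in h_words))

def make_gs_list(sentences, start_words, m_reg, h_reg, org):
    # Pick the precompiled pattern once, outside the loop
    if org == 'm':
        reg = m_reg
    elif org == 'h':
        reg = h_reg

    # start_words must already be a tuple for str.startswith
    matched = {w for s in sentences if s.startswith(start_words)
               for w in reg.findall(s.lower())}

    return matched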
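To make the precompiled-regex idea concrete, here is a minimal sketch on a tiny, made-up data set. The helper name find_phrases and the three-entry word list are assumptions for illustration only; they stand in for the real 250,000-entry lists and are not from the question.

```python
import re

# Hypothetical miniature data standing in for the real lists.
start_words = ('!Series_title', '!Series_summary')  # already a tuple
h_words = ['actin bundling', 'gde2', 'pp1665']

# Build and compile the alternation once, not once per call.
h_reg = re.compile("|".join(re.escape(w) for w in h_words))

def find_phrases(sentences, start_words, reg):
    # Only lines with a wanted prefix are lowercased and scanned.
    return {w for s in sentences if s.startswith(start_words)
            for w in reg.findall(s.lower())}

sentences = [
    '!Series_title\t"Abnormalities in pathways of actin bundling"\n',
    '!Series_type\t"actin bundling appears here too"\n',
]
result = find_phrases(sentences, start_words, h_reg)
print(result)
```

One thing to keep in mind: findall returns non-overlapping matches, so when two entries overlap in the text (or one is a prefix of another) the result can differ from the original `k in s` test, which checks every entry independently.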


Answer 2

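I would try this

import re

# optionally change these regexes
# (as written, FIRST_WORD_RE does not match a leading '!' as in '!Series_title')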
FIRST_WORD_RE = re.compile(r"^[a-zA-Z]+")
LOWER_WORD_RE = re.compile(r"[a-z]+")
m_or_h_words = {'m': set(m_words), 'h': set(h_words)}
startwords_set = set(StartWords)

def makeGsList(sentences, org):
    words = m_or_h_words[org]
    gs_set2 = set()
    for s in sentences:
        mo = FIRST_WORD_RE.match(s)
        if mo and mo.group(0) in startwords_set:
            gs_set2 |= set(LOWER_WORD_RE.findall(s.lower())) & words
    return list(gs_set2)

Answer 3

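I think you can solve this by tokenizing your sentences

So you would do:

# use a regex here instead of split; split is used only for illustration
sentences = tuple(s.split(' ') for s in sentences)

Then, instead of using startswith, take your StartWords and put them in a set

so sw_set = {w for w in StartWords}

Then when you iterate through your sentences, do:

if s[0] in sw_set: # continue with the rest of your logic

I think that is where you are taking the biggest performance hit.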
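Put together, the tokenize-then-look-up idea might look like the sketch below. It is only an illustration under assumptions: the two sample lines are shortened versions of the ones in the question, and split() is called with no argument (rather than split(' ')) because the sample lines separate the tag from the text with a tab.

```python
StartWords = ['!Series_title', '!Series_summary']
sw_set = set(StartWords)  # built once; membership tests are hash lookups

sentences = [
    '!Series_title\t"Transcript profiles of DCs"\n',
    '!Series_type\t"Expression profiling by array"\n',
]

# split() with no argument splits on any whitespace, including tabs.
tokenized = tuple(s.split() for s in sentences)

# Keep only the lines whose first token is one of the start words.
kept = [tokens for tokens in tokenized if tokens and tokens[0] in sw_set]
print(kept)
```

Unlike startswith, this matches the first token exactly, so a tag such as '!Series_title_extra' (a made-up example) would no longer be accepted as a prefix match.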

Answer 4

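In Python, searching a set is much faster than searching a list, so you can always convert your list to a set and then look words up in the set instead of the list. Here is my sample code: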

 from nltk.corpus import stopwords  # assuming NLTK is the source of `stopwords`

 # Excerpt from inside a function; raw_review, num_reviews and BS_reviews are defined earlier
 # Convert the stopwords from a list to a set, once, outside the loop
 stops = set(stopwords.words("english"))
 for i in range(0, num_reviews):
     text = raw_review["review"][i].lower()  # Convert to lower case
     words = text.split()  # Split into words
     # Remove stop words from "words"
     meaningful_words = [w for w in words if w not in stops]
     # Join the words back into one string
     BS_reviews.append(" ".join(meaningful_words))
 return BS_reviews
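If you want to see the difference for yourself, a rough timing sketch along these lines should do. The vocabulary here is synthetic (100,000 made-up strings), and the absolute numbers will vary by machine, so treat it as an illustration rather than a benchmark.

```python
import timeit

# Hypothetical vocabulary; the real lists hold about 250,000 entries.
word_list = [f"word{i}" for i in range(100_000)]
word_set = set(word_list)
probe = "word99999"  # last element: worst case for the list scan

# Membership in a list is a linear scan; in a set it is a hash lookup.
list_time = timeit.timeit(lambda: probe in word_list, number=100)
set_time = timeit.timeit(lambda: probe in word_set, number=100)
print(f"list: {list_time:.6f}s  set: {set_time:.6f}s")
```

Note that this speeds up exact lookups of whole tokens only. The function in the question tests whether each entry is a substring of the sentence (`k in s`), so multi-word phrases would need a different treatment before a set lookup can replace it.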
