Optimizing a string-cleaning function

Time: 2014-01-27 00:33:52

Tags: python regex r replace

I have a string-cleaning function implemented in both Python and R, but the Python version runs much slower - roughly 2-3 times the R version's time.

So I'm wondering whether anyone has suggestions for optimizing it. I tried to pin down the slow part and it seems to be the use of the re library, but I don't see a way to avoid it. If anyone could point me in the right direction to make it faster, that would be great.
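To pin down the slow part, a profiling run along these lines can be used (a minimal sketch; sample_titles here is just a placeholder for whatever list of titles is being cleaned and is not defined anywhere else in this post):

import cProfile
import pstats

def profileClean(titles):
    # Run cleanTitle over every sample so the profiler attributes
    # time to the individual helper functions
    for t in titles:
        cleanTitle(t)

cProfile.run('profileClean(sample_titles)', 'clean_stats')
pstats.Stats('clean_stats').sort_stats('cumulative').print_stats(10)  # top 10 by cumulative time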

Required imports:

import re
import pandas as pd
from nltk.stem import SnowballStemmer
import string
import numpy as np

with open('input_data/stopwords.txt', 'rb') as f:
    stopwords = [line.strip() for line in f]
# Single alternation pattern that matches any stopword as a whole word
stopwords = r'\b(' + '|'.join(stopwords) + r')\b'
protwords = [line.strip() for line in open("input_data/protwords.txt", 'r')]
exceptions = [line.strip() for line in open("input_data/exceptions.txt", 'r')]

The main function:

def cleanTitle(title):
    length = len(title)
    # Normalize the encoding before doing any text processing
    title = title.decode('windows-1252')
    title = title.encode('utf-8', 'replace')
    title = title.lower()
    # Only run the HTML entity replacement when an entity is present
    if '&#' in title:
        title = replaceHTML(title)

    title = trimWhitespace(title, False)
    # Strip punctuation, map synonyms to their canonical form, drop stopwords
    title = title.translate(string.maketrans("", ""), string.punctuation)
    title = replaceSynonyms(title)
    title = re.sub(stopwords, '', title)
    titleWords = set(title.split())

    # Stem every remaining word unless it is in the protected list
    stemmedWords = []
    for word in titleWords:
        if word.isalpha():
            if word not in protwords:
                word = stemWord(word)
            if word:
                stemmedWords.append(word)

    title = ' '.join(set(stemmedWords))
    return str(title)

The helper functions:

# STEMMER
stemmer = SnowballStemmer("english")

def stemWord(word):
    # Returns the stemmed word, or None if the stemmer raises
    stemmedWord = None
    try:
        stemmedWord = stemmer.stem(word)
    except Exception:
        print 'cannot stem'
    return stemmedWord

# SYNONYMS
with open('input_data/synonyms.txt', 'rb') as f:
    synonyms = [line.strip().split(',') for line in f]

def replaceSynonyms(title):
    # For each synonym group, replace any whole-word match with the
    # first (canonical) entry of that group
    for words in synonyms:
        pattern = re.compile(r'\b(' + '|'.join(words) + r')\b')
        title = pattern.sub(words[0], title)
    return title

# HTML CODE REMOVER
htmlCodes = pd.read_csv('input_data/html_codes.csv', sep=',', header=0, na_values=[' '])

with open('input_data/synonyms.txt', 'rb') as f:
    codes = [line.strip().split(',') for line in f]

def replaceHTML(title):
    # Replace hex entities and "friendly" entities with the display character
    for name, row in htmlCodes.iterrows():
        friend = row['friendly']
        display = re.escape(str(row['display']))
        hexC = row['hex']

        if isinstance(display, str):
            title = re.sub(hexC, display, title)
        if isinstance(friend, str):
            title = re.sub(friend, display, title)

    # Strip anything that still looks like an entity or a short tag
    title = re.sub("&.{1,6};|<.{1,7}>", "", title)
    return title

Unfortunately I can't provide the data files themselves (stopwords, protwords, synonyms, exceptions), but they should be easy to recreate. The trickier one is html_codes.csv, which has this format:

display     friendly       hex
------------------------------
                           &#x09;
                           &#x10;
                           &#x20;
!                          &#x21;
"           &quot;         &#x22;
#                          &#x23;
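For anyone who wants to run the code without my data, stand-in input files along these lines would work (a rough sketch; every value below is a made-up placeholder, not the real contents):

import os

if not os.path.isdir('input_data'):
    os.makedirs('input_data')

with open('input_data/stopwords.txt', 'w') as f:
    f.write('the\nand\nof\n')                      # one stopword per line
with open('input_data/protwords.txt', 'w') as f:
    f.write('pandas\nnumpy\n')                     # words that must not be stemmed
with open('input_data/exceptions.txt', 'w') as f:
    f.write('\n')
with open('input_data/synonyms.txt', 'w') as f:
    f.write('car,automobile,auto\nfast,quick\n')   # first word of each line is the canonical form
with open('input_data/html_codes.csv', 'w') as f:
    f.write('display,friendly,hex\n!,,&#x21;\n#,,&#x23;\n')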

EDIT: I have now tried using re2 instead of re -> it increased the time.
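The re2 swap was attempted roughly like this (a sketch; it assumes the pyre2 package, which is meant to mirror the standard re interface for compile/sub):

try:
    import re2 as re   # RE2-backed drop-in for the re module
except ImportError:
    import re          # fall back to the standard library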

0 Answers:

There are no answers.