So I have the same string-cleaning function in both Python and R, but the Python version runs noticeably slower, roughly 2-3x the R runtime. I'm wondering if anyone has suggestions for optimizing it. I tried to pin down where the time goes and it seems to be the use of the re library, but I don't see a way to avoid it. If someone could point me in the right direction to make this faster, that would be great.
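For reference, this is roughly how the timing can be broken down with cProfile; the sample_titles list below is just a placeholder for a batch of raw titles, not my actual data:

import cProfile
import pstats

sample_titles = []  # placeholder: fill with raw title byte strings

def profile_clean():
    for t in sample_titles:
        cleanTitle(t)

cProfile.run('profile_clean()', 'clean_title.prof')
pstats.Stats('clean_title.prof').sort_stats('cumulative').print_stats(15)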
Required imports:
import re
import pandas as pd
from nltk.stem import SnowballStemmer
import string
import numpy as np
with open('input_data/stopwords.txt', 'rb') as f:
    stopwords = [line.strip() for line in f]
stopwords = r'\b(' + '|'.join(stopwords) + r')\b'  # single \b(word1|word2|...)\b alternation
protwords = [line.strip() for line in open("input_data/protwords.txt", 'r')]
exceptions = [line.strip() for line in open("input_data/exceptions.txt", 'r')]
The main function is as follows:
def cleanTitle(title):
    length = len(title)
    title = title.decode('windows-1252')
    title = title.encode('utf-8', 'replace')
    title = title.lower()
    if '&#' in title:
        title = replaceHTML(title)
    title = trimWhitespace(title, False)
    # strip punctuation, normalize synonyms, drop stopwords
    title = title.translate(string.maketrans("", ""), string.punctuation)
    title = replaceSynonyms(title)
    title = re.sub(stopwords, '', title)
    titleWords = set(title.split())
    stemmedWords = []
    for word in titleWords:
        if word.isalpha():
            if word not in protwords:  # protected words are kept unstemmed
                word = stemWord(word)
            if word:
                stemmedWords.append(word)
    title = ' '.join(list(set(stemmedWords)))
    return str(title)
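For what it's worth, the two objects that cleanTitle builds on every call (the punctuation translation table and, indirectly, the compiled stopword pattern) can be created once at module level; this is only a sketch and I haven't verified it changes the timing much:

# Build once at import time and reuse inside cleanTitle.
stopwords_re = re.compile(stopwords)      # the \b(...)\b stopword alternation
punct_table = string.maketrans("", "")    # identity table; punctuation goes in deletechars

# the corresponding lines inside cleanTitle would then become:
#     title = title.translate(punct_table, string.punctuation)
#     title = stopwords_re.sub('', title)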
The helper functions are as follows:
# STEMMER
stemmer = SnowballStemmer("english")

def stemWord(word):
    stemedWord = None
    try:
        stemedWord = stemmer.stem(word)
    except:
        print 'cannot stem'
    return stemedWord
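One idea for the stemmer (untested): since the same words appear in many titles, the results of stemWord can be cached in a plain dict across calls. A minimal sketch:

_stem_cache = {}

def stemWordCached(word):
    # Hypothetical cached wrapper around stemWord; repeated words are only stemmed once.
    if word not in _stem_cache:
        _stem_cache[word] = stemWord(word)
    return _stem_cache[word]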
# SYNONYMS
with open('input_data/synonyms.txt', 'rb') as f:
    synonyms = [line.strip().split(',') for line in f]

def replaceSynonyms(title):
    for words in synonyms:
        pattern = re.compile(r'\b(' + '|'.join(words) + r')\b')
        title = pattern.sub(words[0], title)
    return title
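Because replaceSynonyms rebuilds every pattern for every title, one variant worth trying is compiling them a single time at import (same synonyms.txt format assumed); a rough sketch:

# Compile each synonym group's pattern once, paired with its canonical first word.
synonym_patterns = [(re.compile(r'\b(' + '|'.join(words) + r')\b'), words[0])
                    for words in synonyms]

def replaceSynonymsCompiled(title):
    # Hypothetical drop-in replacement for replaceSynonyms.
    for pattern, canonical in synonym_patterns:
        title = pattern.sub(canonical, title)
    return title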
# HTML CODE REMOVER
htmlCodes = pd.read_csv('input_data/html_codes.csv', sep=',', header=0, na_values=[' '])
with open('input_data/synonyms.txt', 'rb') as f:
    codes = [line.strip().split(',') for line in f]

def replaceHTML(title):
    for name, row in htmlCodes.iterrows():
        friend = row['friendly']
        display = re.escape(str(row['display']))
        hexC = row['hex']
        if isinstance(display, str):
            title = re.sub(hexC, display, title)
        if isinstance(friend, str):
            title = re.sub(friend, display, title)
    title = re.sub("&.{1,6};|<.{1,7}>", "", title)
    return title
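Along the same lines, replaceHTML walks the html_codes DataFrame with iterrows for every title; the (pattern, replacement) pairs could be pulled out once up front. A sketch, assuming the same column names and that the hex/friendly cells are either strings or NaN:

# Extract the substitutions once instead of iterating the DataFrame per title.
html_subs = []
for _, row in htmlCodes.iterrows():
    display = str(row['display'])
    if isinstance(row['hex'], str):
        html_subs.append((re.compile(row['hex']), display))
    if isinstance(row['friendly'], str):
        html_subs.append((re.compile(row['friendly']), display))
leftover_re = re.compile("&.{1,6};|<.{1,7}>")

def replaceHTMLCompiled(title):
    # Hypothetical drop-in replacement for replaceHTML.
    for pattern, replacement in html_subs:
        title = pattern.sub(replacement, title)
    return leftover_re.sub("", title)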
Unfortunately I can't provide the data files themselves (the stopwords, protwords, synonyms, and exceptions lists), but those should be easy to substitute. The trickier one is html_codes.csv, which is formatted like this:
display     friendly    hex
__________________________________
(tab)                   &#09;
(newline)               &#10;
(space)                 &#32;
!                       &#33;
"           &quot;      &#34;
#                       &#35;
Edit: I have now tried using re2 instead of re -> it increased the time.