Replicating R text preprocessing exactly in Python

Date: 2014-04-01 21:38:44

Tags: python r nlp analytics scikit-learn

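I would like to preprocess a corpus of documents in Python the same way I do in R. For example, given an initial corpus, corpus, I would like to end up with a preprocessed corpus that corresponds to the one produced by the following R code: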

library(tm)
library(SnowballC)

corpus = tm_map(corpus, tolower)
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, c("myword", stopwords("english")))
corpus = tm_map(corpus, stemDocument)

Is there a simple or straightforward (preferably pre-built) way to do this in Python? Is there a way to ensure exactly the same results?


For instance, I would like to preprocess

@Apple ear pods are AMAZING! Best sound from in-ear headphones I've ever had!

into

ear pod amaz best sound inear headphon ive ever

2 answers:

Answer 0 (score: 3)

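Getting the preprocessing steps to behave exactly the same in nltk and tm seems tricky, so I think the best approach is to use rpy2 to run the preprocessing in R and pull the results into Python: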

import rpy2.robjects as ro
preproc = [x[0] for x in ro.r('''
tweets = read.csv("tweets.csv", stringsAsFactors=FALSE)
library(tm)
library(SnowballC)
corpus = Corpus(VectorSource(tweets$Tweet))
corpus = tm_map(corpus, tolower)
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, c("apple", stopwords("english")))
corpus = tm_map(corpus, stemDocument)''')]

Then you can load the result into scikit-learn. The only thing you need to do to make CountVectorizer match DocumentTermMatrix is to drop terms shorter than 3 characters:

from sklearn.feature_extraction.text import CountVectorizer
def mytokenizer(x):
    return [y for y in x.split() if len(y) > 2]

# Full document-term matrix
cv = CountVectorizer(tokenizer=mytokenizer)
X = cv.fit_transform(preproc)
X
# <1181x3289 sparse matrix of type '<type 'numpy.int64'>'
#   with 8980 stored elements in Compressed Sparse Column format>

# Sparse terms removed
cv2 = CountVectorizer(tokenizer=mytokenizer, min_df=0.005)
X2 = cv2.fit_transform(preproc)
X2
# <1181x309 sparse matrix of type '<type 'numpy.int64'>'
#   with 4669 stored elements in Compressed Sparse Column format>
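To see what these two settings do, here is a minimal pure-Python sketch of the same filtering, with no scikit-learn involved. The names `tokenize` and `filter_by_doc_frequency` and the toy documents are illustrative assumptions only. The threshold rule follows the scikit-learn documentation for a float `min_df`: terms whose document frequency is strictly below that proportion of documents are ignored.

```python
from collections import Counter


def tokenize(text):
    # Same rule as mytokenizer above: whitespace split, keep tokens of 3+ characters.
    return [tok for tok in text.split() if len(tok) > 2]


def filter_by_doc_frequency(docs, min_df):
    """Return the sorted terms that occur in at least min_df * len(docs) documents."""
    doc_freq = Counter()
    for doc in docs:
        # set() so that a term counts once per document, however often it repeats.
        doc_freq.update(set(tokenize(doc)))
    threshold = min_df * len(docs)
    return sorted(term for term, df in doc_freq.items() if df >= threshold)


# Toy documents, made up for illustration (not taken from tweets.csv).
toy_docs = [
    "ear pod amaz best sound",
    "ear pod go on sale",
    "best sound ever",
    "ear ear ear",
]

print(filter_by_doc_frequency(toy_docs, 0.0))
# ['amaz', 'best', 'ear', 'ever', 'pod', 'sale', 'sound']
print(filter_by_doc_frequency(toy_docs, 0.5))
# ['best', 'ear', 'pod', 'sound']
```

With the 1181 documents above, 0.005 * 1181 is about 5.9, so under this rule a term has to appear in at least 6 documents to survive `min_df=0.005`.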

Let's verify that this matches R:

tweets = read.csv("tweets.csv", stringsAsFactors=FALSE)
library(tm)
library(SnowballC)
corpus = Corpus(VectorSource(tweets$Tweet))
corpus = tm_map(corpus, tolower)
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, c("apple", stopwords("english")))
corpus = tm_map(corpus, stemDocument)
dtm = DocumentTermMatrix(corpus)
dtm
# A document-term matrix (1181 documents, 3289 terms)
# 
# Non-/sparse entries: 8980/3875329
# Sparsity           : 100%
# Maximal term length: 115 
# Weighting          : term frequency (tf)

sparse = removeSparseTerms(dtm, 0.995)
sparse
# A document-term matrix (1181 documents, 309 terms)
# 
# Non-/sparse entries: 4669/360260
# Sparsity           : 99%
# Maximal term length: 20 
# Weighting          : term frequency (tf)

As you can see, the number of stored elements and the number of terms now match exactly between the two approaches.
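Matching counts do not strictly prove that the terms themselves are identical. A minimal sketch of a stricter check, assuming the R terms have been exported to a Python list by some means and that the Python terms come from something like `list(cv.vocabulary_)`. The helper `compare_vocabularies` and the sample lists are hypothetical:

```python
def compare_vocabularies(r_terms, py_terms):
    """Return (only_in_r, only_in_python) as sorted lists of terms."""
    r_set, py_set = set(r_terms), set(py_terms)
    return sorted(r_set - py_set), sorted(py_set - r_set)


# Made-up sample lists; in practice py_terms could be list(cv.vocabulary_)
# and r_terms the terms exported from the R document-term matrix.
r_terms = ["amaz", "best", "ear", "headphon", "inear"]
py_terms = ["amaz", "best", "ear", "headphon", "in-ear"]

only_in_r, only_in_python = compare_vocabularies(r_terms, py_terms)
print(only_in_r)       # ['inear']
print(only_in_python)  # ['in-ear']
```

Two empty lists would mean both sides produced the same vocabulary.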

Answer 1 (score: 1)

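CountVectorizer and TfidfVectorizer can be customized as described in the docs. In particular, you will want to write a custom tokenizer, which is a function that takes a document and returns a list of terms. Using NLTK: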

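import re

import nltk.corpus  # stopwords is an attribute of nltk.corpus, not an importable module
import nltk.stem

def smart_tokenizer(doc):
    # Lowercase, split into word tokens, drop English stop words, then stem.
    # Requires the NLTK stopwords corpus to be installed.
    doc = doc.lower()
    doc = re.findall(r'\w+', doc, re.UNICODE)
    return [nltk.stem.PorterStemmer().stem(term)
            for term in doc
            if term not in nltk.corpus.stopwords.words('english')]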

Demo:

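>>> from sklearn.feature_extraction.text import CountVectorizer
>>> doc = "@Apple ear pods are AMAZING! Best sound from in-ear headphones I've ever had!"  # example tweet from the question
>>> v = CountVectorizer(tokenizer=smart_tokenizer)
>>> v.fit_transform([doc]).toarray()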
array([[1, 1, 1, 2, 1, 1, 1, 1, 1]])
>>> from pprint import pprint
>>> pprint(v.vocabulary_)
{u'amaz': 0,
 u'appl': 1,
 u'best': 2,
 u'ear': 3,
 u'ever': 4,
 u'headphon': 5,
 u'pod': 6,
 u'sound': 7,
 u've': 8}

(The example I linked to actually uses a class to cache the lemmatizer, but a function works too.)
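A minimal sketch of that class-based idea, assuming the stemming function and the stop-word collection are passed in (for example `nltk.stem.PorterStemmer().stem` and `nltk.corpus.stopwords.words('english')`), so they are built once rather than on every call. `CachingTokenizer` and the toy `strip_s` stemmer are hypothetical names used only for illustration:

```python
import re


class CachingTokenizer:
    """Callable tokenizer that stores its stemmer and stop words once."""

    def __init__(self, stem, stop_words):
        self.stem = stem                         # function: str -> str
        self.stop_words = frozenset(stop_words)  # fast membership tests
        self.token_re = re.compile(r"\w+")       # compiled once, reused per call

    def __call__(self, doc):
        tokens = self.token_re.findall(doc.lower())
        return [self.stem(t) for t in tokens if t not in self.stop_words]


def strip_s(word):
    # Toy stand-in for a real stemmer: drops one trailing "s".
    return word[:-1] if word.endswith("s") else word


tokenizer = CachingTokenizer(strip_s, ["are", "the"])
print(tokenizer("The ear pods are amazing"))  # ['ear', 'pod', 'amazing']
```

Because the instance is callable, it can be passed as `CountVectorizer(tokenizer=tokenizer)` in the same way as the function above.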