I have a DataFrame df like this:
Pattern String
101 hi, how are you?
104 what are you doing?
108 Python is good to learn.
I want to create n-grams for the String column. So far I have used split() and stack():
new= df.String.str.split(expand=True).stack()
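However, I want to create n-grams (bigrams, trigrams, quadgrams, etc.).
Answer 0 (score: 4)
Do some preprocessing on the text column, then a bit of shifting and concatenation: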
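import pandas as pd
# generate unigrams: lowercase, strip punctuation, one token per entry
i = df.String\
.str.lower()\
.str.replace(r'[^a-z\s]', '', regex=True)\
.str.split(expand=True)\
.stack()
# generate bigrams by joining each unigram with the next one
# note: shift() runs over the whole stacked series, so n-grams span row boundaries (e.g. 'you what')
j = i + ' ' + i.shift(-1)
# generate trigrams by joining each bigram with the unigram two positions ahead
k = j + ' ' + i.shift(-2)
# concatenate all series vertically, and remove NaNs
pd.concat([i, j, k]).dropna().reset_index(drop=True)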
0 hi
1 how
2 are
3 you
4 what
5 are
6 you
7 doing
8 python
9 is
10 good
11 to
12 learn
13 hi how
14 how are
15 are you
16 you what
17 what are
18 are you
19 you doing
20 doing python
21 python is
22 is good
23 good to
24 to learn
25 hi how are
26 how are you
27 are you what
28 you what are
29 what are you
30 are you doing
31 you doing python
32 doing python is
33 python is good
34 is good to
35 good to learn
dtype: object
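The output above contains n-grams such as "you what" and "doing python", which join the end of one row to the start of the next. If that is not wanted, one possible variation is to shift within each original row by grouping on the first index level. The sketch below is only an illustration under assumptions: the helper name `shifted_ngrams` and its `max_n` parameter are hypothetical and not part of the answer, and the sample frame is rebuilt inline from the question.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Pattern": [101, 104, 108],
    "String": ["hi, how are you?", "what are you doing?", "Python is good to learn."],
})

# One token per entry, indexed by (original row, position within the row).
tokens = (
    df["String"]
    .str.lower()
    .str.replace(r"[^a-z\s]", "", regex=True)
    .str.split(expand=True)
    .stack()
    .dropna()  # drop padding cells in case stack() keeps them
)


def shifted_ngrams(tokens, max_n):
    """Hypothetical helper: build 1..max_n-grams that stay inside each row."""
    by_row = tokens.groupby(level=0)
    levels = [tokens]
    for k in range(1, max_n):
        # Append the token k positions ahead within the same row;
        # positions that run past the end of the row become NaN.
        levels.append(levels[-1] + " " + by_row.shift(-k))
    return pd.concat(levels).dropna().reset_index(drop=True)


result = shifted_ngrams(tokens, 4)
print(result.tolist())
```

Because the shift is done per group, the last tokens of a row pair with NaN instead of the first tokens of the next row, and the final dropna() discards those incomplete n-grams.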
Answer 1 (score: 0)
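The everygrams() function returns n-grams of every consecutive order in a given range. For example, the following returns 1-grams through 3-grams (the order in which they are yielded may differ between NLTK versions):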
>>> from nltk import everygrams
>>> everygrams('a b c d'.split(), 1, 3)
<generator object everygrams at 0x1147e3410>
>>> list(everygrams('a b c d'.split(), 1, 3))
[('a',), ('b',), ('c',), ('d',), ('a', 'b'), ('b', 'c'), ('c', 'd'), ('a', 'b', 'c'), ('b', 'c', 'd')]
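To make clear what is being computed, here is a minimal pure-Python sketch that produces the same list as shown above, shortest n-grams first. It is not NLTK's implementation; the function name `everygrams_by_length` is hypothetical and is only meant to illustrate the idea.

```python
def everygrams_by_length(tokens, min_len=1, max_len=3):
    """Hypothetical stand-in for everygrams: all n-grams, shortest first."""
    grams = []
    for n in range(min_len, max_len + 1):
        # Zipping n staggered views of the list yields every window of n tokens.
        grams.extend(zip(*(tokens[i:] for i in range(n))))
    return grams


result = everygrams_by_length("a b c d".split(), 1, 3)
print(result)
```

Unlike the NLTK generator, this version expects a list (it slices the input) and builds the whole result in memory, which is fine for short sentences.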
Using apply:
>>> import pandas as pd
>>> from itertools import chain
>>> from nltk import everygrams, word_tokenize
>>> df = pd.read_csv('x.tsv', sep='\t')
>>> df
Pattern String
0 101 hi, how are you?
1 104 what are you doing?
2 108 Python is good to learn.
>>> df['1to3grams'] = df['String'].apply(lambda x: [' '.join(ng) for ng in everygrams(word_tokenize(x), 1, 3)])
>>> df['1to3grams']
0 [hi, ,, how, are, you, ?, hi ,, , how, how are...
1 [what, are, you, doing, ?, what are, are you, ...
2 [Python, is, good, to, learn, ., Python is, is...
Name: 1to3grams, dtype: object
>>> list(chain.from_iterable(df['1to3grams']))
['hi', ',', 'how', 'are', 'you', '?', 'hi ,', ', how', 'how are', 'are you', 'you ?', 'hi , how', ', how are', 'how are you', 'are you ?', 'what', 'are', 'you', 'doing', '?', 'what are', 'are you', 'you doing', 'doing ?', 'what are you', 'are you doing', 'you doing ?', 'Python', 'is', 'good', 'to', 'learn', '.', 'Python is', 'is good', 'good to', 'to learn', 'learn .', 'Python is good', 'is good to', 'good to learn', 'to learn .']