Question

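I have an Excel dataset containing usertype, ID, and a description of properties. I imported this file into a pandas DataFrame (df) in Python.

Now I want to split the contents of Description into one-word, two-word, and three-word tokens. I can do one-word tokenization with the help of the NLTK library, but I am stuck on two- and three-word tokenization. For example, one of the rows in the Description column has the sentence

A Brand new residential apartment on the main road of Mumbai with portable water.

I want this sentence to be split as

"A Brand", "Brand new", "new residential", "residential apartment" ... "portable water".

This split should be applied to every row of that column.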

Image of my dataset in excel format

Answer 1

Here is a small example using ngrams from nltk. Hope it helps:

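import pandas as pd
from nltk import word_tokenize
from nltk.util import ngrams

# word_tokenize relies on NLTK's Punkt tokenizer data, installed separately via nltk.download()
# Create a test dataframe
df = pd.DataFrame({'text': ['my first sentence',
                            'this is the second sentence',
                            'third sent of the dataframe']})
print(df)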

Input dataframe:

    text
0   my first sentence
1   this is the second sentence
2   third sent of the dataframe

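Now we can use ngrams together with word_tokenize to build bigrams and trigrams and apply it to every row of the dataframe. For bigrams we pass 2 to the ngrams function along with the tokenized words, and for trigrams we pass 3. ngrams returns a generator, so the result is converted to a list. For each row, the lists of bigrams and trigrams are stored in separate columns.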

df['bigram'] = df['text'].apply(lambda row: list(ngrams(word_tokenize(row), 2)))
df['trigram'] = df['text'].apply(lambda row: list(ngrams(word_tokenize(row), 3)))
print(df)

Result:

                     text  \
0            my first sentence   
1  this is the second sentence   
2  third sent of the dataframe   

                                                   bigram  \
0                            [(my, first), (first, sentence)]   
1  [(this, is), (is, the), (the, second), (second, sentence)]   
2    [(third, sent), (sent, of), (of, the), (the, dataframe)]   

                                                     trigram  
0                                        [(my, first, sentence)]  
1  [(this, is, the), (is, the, second), (the, second, sentence)]  
2     [(third, sent, of), (sent, of, the), (of, the, dataframe)]
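
The columns above hold tuples, while the question asks for strings such as "A Brand" and "Brand new". A minimal sketch of one way to get there is to join each tuple with a space. The helper name ngram_strings, the frame df_desc, and the regex tokenizer (used here so the example does not need NLTK's tokenizer data) are assumptions for illustration, not part of the original answer; word_tokenize could be swapped back in.

```python
import re

import pandas as pd
from nltk.util import ngrams


def ngram_strings(text, n):
    """Return the n-grams of `text` as space-joined strings."""
    # Assumption: a token is a run of word characters, so punctuation is dropped.
    tokens = re.findall(r"\w+", text)
    return [" ".join(gram) for gram in ngrams(tokens, n)]


# Hypothetical one-row frame standing in for the Description column from the question.
df_desc = pd.DataFrame({
    "Description": [
        "A Brand new residential apartment on the main road of Mumbai with portable water."
    ]
})

# One column per n-gram size; extra keyword arguments to apply() are passed to the helper.
for n, col in [(1, "unigram"), (2, "bigram"), (3, "trigram")]:
    df_desc[col] = df_desc["Description"].apply(ngram_strings, n=n)

print(df_desc.loc[0, "bigram"][:4])
# ['A Brand', 'Brand new', 'new residential', 'residential apartment']
```

Because the regex tokenizer discards punctuation, the last bigram here is "portable water" without the trailing period; with word_tokenize the period would appear as its own token.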
