Forming bigrams of words in a pandas dataframe

Time: 2019-02-14 15:37:57

Tags: python pandas nltk n-gram

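I have been trying to convert a pandas dataframe containing tokenized words into bigrams, without success. I have tried several pieces of code, but I keep getting error messages or strange results. I only started using Python about two weeks ago and I am really struggling with this. Any help would be appreciated. Thanks.

Here is what I have tried so far.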

from nltk.util import ngrams

generic_tweets['bigrams'] = generic_tweets['tweet'].apply(lambda row: list(map(lambda x:ngrams(x,2), row)))   
generic_tweets['bigrams'].head()

where

generic_tweets['tweet'].head() 

0         [awww, thats, bummer, shoulda, got, david, car...
1         [upset, that, he, cant, update, his, facebook,...
2         [dived, many, time, ball, managed, save, rest,...
3            [whole, body, feel, itchy, like, it, on, fire]
4         [no, it, not, behaving, at, all, im, mad, why,...
5                                        [not, whole, crew]
6                                               [need, hug]

What I want is

0         [(awww, thats), (thats, bummer), (bummer, shoulda)...
1         [(upset, that), (that, he), (he, cant), (cant, update)...
2         [(dived, many), (many, time), (time, ball), (ball, managed)...

But what I get is

0    [<generator object ngrams at 0x000002A38014B84...
1    [<generator object ngrams at 0x000002A30BA0AB1...
2    [<generator object ngrams at 0x000002A3A9182B8...
3    [<generator object ngrams at 0x000002A3A918713...
4    [<generator object ngrams at 0x000002A3A91874F...
Name: bigrams, dtype: object

2 answers:

Answer 0: (score: 1)

The reason for this output lies in the body of the lambda function you are applying:

generic_tweets['bigrams'] = generic_tweets['tweet'].apply(lambda row: list(map(lambda x:ngrams(x,2), row))) 

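I believe that, instead of applying ngrams(x, 2) to each word, you should apply list(ngrams(row, 2)) to the whole row. This gets rid of the generators you are seeing in your output and gives you n-grams at the level of words rather than letters: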

import nltk

generic_tweets['bigrams'] = generic_tweets['tweet'].apply(lambda row: list(nltk.ngrams(row, 2)))

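Another thing to note: ngrams returns a lazy generator, so if you access a value from the dataframe without wrapping the call in list, you will still see the generator object rather than the tuples.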
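To make the difference concrete, here is a minimal sketch on a toy dataframe. The three rows are shortened from the question's sample and the variable names per_word and char_bigrams are made up for illustration; it is not the asker's real data.

```python
import pandas as pd
from nltk.util import ngrams

# Toy stand-in for generic_tweets (assumption: rows are lists of tokens).
generic_tweets = pd.DataFrame({
    "tweet": [
        ["whole", "body", "feel", "itchy"],
        ["not", "whole", "crew"],
        ["need", "hug"],
    ]
})

# Original attempt: ngrams is called once per word, so each element is an
# unconsumed generator over the characters of that word.
per_word = generic_tweets["tweet"].apply(
    lambda row: list(map(lambda x: ngrams(x, 2), row))
)
char_bigrams = list(per_word[0][0])  # letter pairs of "whole"

# Fix: pass the whole token list to ngrams and materialise it with list().
generic_tweets["bigrams"] = generic_tweets["tweet"].apply(
    lambda row: list(ngrams(row, 2))
)
print(generic_tweets["bigrams"].tolist())
```

Consuming one of the generators from the first attempt yields pairs of letters, which is why the answer describes that result as letter-level rather than word-level.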

Answer 1: (score: 1)

If the rows of your pandas Series are not in array form (for example, each row is a single string rather than a list of tokens), use the following to get bigrams:

import nltk

generic_tweets['bigrams'] = generic_tweets['tweet'].apply(lambda row: list(nltk.bigrams(row.split(' '))))

This is similar to

list(nltk.bigrams(['abc', 'def', 'ghi']))

The output will be

[('abc', 'def'), ('def', 'ghi')]
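If it is unclear whether a row holds a string or a list of tokens, one possible sketch is a small helper that accepts both. The name to_bigrams is hypothetical, and it builds the pairs with plain zip instead of nltk, so treat it as an alternative under those assumptions rather than what either answer proposes.

```python
def to_bigrams(row):
    """Return adjacent token pairs; `row` may be a string or a list of tokens."""
    # Assumption: string rows are whitespace-separated tokens.
    tokens = row.split() if isinstance(row, str) else list(row)
    # Pair each token with its successor; fewer than two tokens gives [].
    return list(zip(tokens, tokens[1:]))

rows = ["need hug", ["not", "whole", "crew"], ["hug"]]
result = [to_bigrams(r) for r in rows]
print(result)
```

The helper can be passed to a Series through apply, for example generic_tweets['tweet'].apply(to_bigrams). Note that split() with no argument collapses runs of whitespace, whereas split(' ') produces empty-string tokens when there are consecutive spaces.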