I have a column in a pandas df that I have tokenized using:
df['token_col'] = df.col.apply(word_tokenize)

Now I'm trying to tag those tokenized words using:

df['pos_col'] = nltk.tag.pos_tag(df['token_col'])
df['wordnet_tagged_pos_col'] = [(w,get_wordnet_pos(t)) for (w, t) in (df['pos_col'])]

but I'm getting an error I can't make sense of:
AttributeError Traceback (most recent call last)
<ipython-input-28-99d28433d090> in <module>()
1 #tag tokenized lists
----> 2 df['pos_col'] = nltk.tag.pos_tag(df['token_col'])
3 df['wordnet_tagged_pos_col'] = [(w,get_wordnet_pos(t)) for (w, t) in (df['pos_col'])]
C:\Users\egagne\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tag\__init__.py in pos_tag(tokens, tagset, lang)
125 """
126 tagger = _get_tagger(lang)
--> 127 return _pos_tag(tokens, tagset, tagger)
128
129
C:\Users\egagne\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tag\__init__.py in _pos_tag(tokens, tagset, tagger)
93
94 def _pos_tag(tokens, tagset, tagger):
---> 95 tagged_tokens = tagger.tag(tokens)
96 if tagset:
97 tagged_tokens = [(token, map_tag('en-ptb', tagset, tag)) for (token, tag) in tagged_tokens]
C:\Users\egagne\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tag\perceptron.py in tag(self, tokens)
150 output = []
151
--> 152 context = self.START + [self.normalize(w) for w in tokens] + self.END
153 for i, word in enumerate(tokens):
154 tag = self.tagdict.get(word)
C:\Users\egagne\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tag\perceptron.py in <listcomp>(.0)
150 output = []
151
--> 152 context = self.START + [self.normalize(w) for w in tokens] + self.END
153 for i, word in enumerate(tokens):
154 tag = self.tagdict.get(word)
C:\Users\egagne\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tag\perceptron.py in normalize(self, word)
236 if '-' in word and word[0] != '-':
237 return '!HYPHEN'
--> 238 elif word.isdigit() and len(word) == 4:
239 return '!YEAR'
240 elif word[0].isdigit():
AttributeError: 'list' object has no attribute 'isdigit'
My df has over 70 columns, so this is only a small snapshot of it. If it makes a difference, my next step will be lemmatizing using those tagged tokens:
df['lmtzd_col'] = [(lmtzr.lemmatize(w, pos=t if t else 'n').lower(),t) for (w,t) in wordnet_tagged_pos_col]
print(len(set(wordnet_tagged_pos_col)),(len(set(df['lmtzd_col']))))
Answer (score: 1)
You can use apply to get the part-of-speech tags, i.e.

df['pos_col'] = df['token_col'].apply(nltk.tag.pos_tag)

nltk.tag.pos_tag expects a single flat list of token strings, so calling it on the whole column makes the tagger treat each row's token list as a single "word", which is why normalize() fails with AttributeError: 'list' object has no attribute 'isdigit'. Applied row by row instead, df['pos_col'] looks like:

0    [(Assessment, NNP), ( of, NNP), ( Improvement,...
1    [(A, DT), ( member, NNP), ( of, NNP), ( the, N...
2    [(During, IN), ( our, JJ), ( second, NN), ( an...
Name: pos_col, dtype: object

Similarly, you are better off using apply with a lambda to run the function on every row, rather than a list comprehension over the whole column, because get_wordnet_pos has to be applied to each cell of the column:
df['wordnet_tagged_pos_col'] = df['pos_col'].apply(lambda x: [(w, get_wordnet_pos(t)) for (w, t) in x])
0    [(Assessment, (N, n)), ( of, (N, n)), ( Improv...
1    [(A, (D, n)), ( member, (N, n)), ( of, (N, n))...
2    [(During, (I, n)), ( our, (J, a)), ( second, (...
Name: wordnet_tagged_pos_col, dtype: object
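The lemmatizing step from the question should get the same treatment, i.e. run it per cell with apply as well. A minimal sketch, assuming lmtzr is an NLTK WordNetLemmatizer and that get_wordnet_pos returns a WordNet POS string or an empty string (the question doesn't show its actual definition):

from nltk.stem import WordNetLemmatizer

lmtzr = WordNetLemmatizer()  # assumption: this is what lmtzr refers to in the question

# Reuse the question's own expression, but apply it per cell of the column.
df['lmtzd_col'] = df['wordnet_tagged_pos_col'].apply(
    lambda x: [(lmtzr.lemmatize(w, pos=t if t else 'n').lower(), t) for (w, t) in x]
)

Note that lemmatize() only accepts WordNet's POS constants ('n', 'v', 'a', 'r', 's'), which is why the fallback to 'n' is there when get_wordnet_pos returns nothing.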
Hope it helps.
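For completeness, here is a self-contained toy run of the whole tokenize -> tag -> WordNet-map pipeline. The column contents and the get_wordnet_pos mapping below are purely illustrative (the question's own helper isn't shown and may differ), and it assumes the punkt tokenizer and the averaged perceptron tagger data have been downloaded via nltk.download:

import nltk
import pandas as pd
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize

# Tiny frame standing in for the real 70+ column df.
df = pd.DataFrame({'col': ['Assessment of Improvement', 'A member of the team']})

df['token_col'] = df['col'].apply(word_tokenize)
df['pos_col'] = df['token_col'].apply(nltk.tag.pos_tag)

def get_wordnet_pos(treebank_tag):
    # One common Treebank-to-WordNet mapping; illustrative only.
    return {'J': wordnet.ADJ, 'V': wordnet.VERB,
            'N': wordnet.NOUN, 'R': wordnet.ADV}.get(treebank_tag[:1], '')

df['wordnet_tagged_pos_col'] = df['pos_col'].apply(
    lambda x: [(w, get_wordnet_pos(t)) for (w, t) in x]
)

print(df[['pos_col', 'wordnet_tagged_pos_col']])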