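AttributeError when building a list comprehension for wordnet.synsets().definition()

Asked: 2018-09-18 17:51:29

Tags: python pandas list-comprehension attributeerror wordnet

First off, I'm a Python noob and only half understand how some of this works. I've been trying to build a word matrix for a tagging project and hoped to solve this myself, but I haven't found much documentation on my particular error. So I apologize in advance if this is too obvious.

I'm trying to get a set of functions to work in several different variations, but I keep getting "AttributeError: 'list' object has no attribute 'definition'".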

import pandas as pd
from pandas import DataFrame, Series
import nltk.data
from nltk.corpus import stopwords
from nltk.corpus import wordnet as wn
from nltk.tokenize import TreebankWordTokenizer

# Gets synsets for a given term.

def get_synset(word):
    for word in wn.synsets(word):
        return word.name()

#Gets definitions for a synset.

def get_def(syn):
    return wn.synsets(syn).defnition()

# Creates a dataframe called sector_matrix based on another dataframe's column. Should be followed with an export.

def sector_tagger(frame):
    sentences = frame.tolist()
    tok_list = [tok.tokenize(w) for w in frame]
    split_words = [w.lower() for sub in tok_list for w in sub]
    clean_words = [w for w in split_words if w not in english_stops]
    synset = [get_synset(w) for w in clean_words]
    sector_matrix = DataFrame({'Categories': clean_words,
                               'Synsets': synset})
    sec_syn = sector_matrix['Synsets'].tolist()
    sector_matrix['Definition'] = [get_def(w) for w in sector_matrix['Synsets']]
    return sector_matrix

These functions are called on a dataframe that I read in from Excel:

test = pd.read_excel('data.xlsx')

The sector_tagger function is called like this:

agri_matrix = sector_tagger(agri['Category'])

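An earlier version called wn.synsets(w).definition() inside the list comprehension that populates the DataFrame. Another tried to call definition() after the fact in a Jupyter Notebook. I almost always get the attribute error. That said, when I check the dtype of sector_matrix['Synsets'] I get "object", and when I print that column I don't see [] around the items.

I have tried:

  • Wrapping "w" in str()
  • Calling the list comprehension inside and outside the function (i.e. deleting the line and calling it in my notebook)
  • Passing the 'Synsets' column to a new list and building the list comprehension around that list

Oddly, I was playing with this yesterday and got it working directly in the notebook, but it is (a) messy, (b) not scalable, and (c) doesn't work on the other categories I apply it to.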

agrimask = (df['Agri-Food']==1) & (df['Total']==1)
df_agri = df.loc[agrimask,['Category']]
agri_words = [tok.tokenize(a) for a in df_agri['Category']]
agri_cip_words = [a.lower() for sub in agri_words for a in sub]
agri_clean = [w for w in agri_cip_words if w not in english_stops]
df_agri_clean = DataFrame({'Category': agri_clean})
df_agri_clean = df_agri_clean[df_agri_clean != ','].replace('horticulture/horticultural','horticulture').dropna().drop_duplicates()
df_agri_clean['Synsets'] = [x[0].name() for x in df_agri_clean['Category'].apply(syn)]
df_agri_clean['Definition'] = [wn.synset(x).definition() for x in df_agri_clean['Synsets']]
df_agri_clean['Lemma'] = [wn.synset(x).lemmas()[0].name() for x in df_agri_clean['Synsets']]
df_agri_clean

Edit 1: Here is a link to a sample of the data.

Edit 2: Also, the static variables I'm using are here (all based on the standard NLTK library):

tok = TreebankWordTokenizer()
english_stops = set(stopwords.words('english'))
french_stops = set(stopwords.words('french'))

Edit 3: You can see a working version of this code here: Working Code

1 Answer:

Answer 0: (score: 1)

2018-09-18_CIP.ipynb

import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.corpus import wordnet as wn
from nltk.tokenize import TreebankWordTokenizer as tok

english_stops = set(stopwords.words('english'))

# Gets synsets for a given term.
def get_synset(word):
    for word in wn.synsets(word):
        return word.name()

#Gets definitions for a synset.
def get_def(syn):
    return wn.synset(syn).definition()  # wn.synset (singular) takes a synset name such as 'animal.n.01'; 'definition' was also misspelled in the question

# Creates a dataframe called sector_matrix based on another dataframe's column. Should be followed with an export.
def sector_tagger(frame):
    tok_list = tok().tokenize(frame)
    split_words = [w.lower() for w in tok_list]
    clean_words = [w for w in split_words if w not in english_stops]
    synset = [get_synset(w) for w in clean_words]
    sector_matrix = pd.DataFrame({'Categories': clean_words,
                                  'Synsets': synset})
    sec_syn = list(sector_matrix['Synsets'])
    sector_matrix['Definition'] = [get_def(w) if w is not None else '' for w in sec_syn]
    return sector_matrix

agri_matrix = df['Category'].apply(sector_tagger)

If this answers your question, please mark it as the accepted answer.

get_def returns one definition string per synset name, so the list comprehension over it produces a list of phrases.
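To make the cause of the original error concrete: wn.synsets(word) returns a list of Synset objects, while wn.synset(name) returns a single one, and only a single Synset has a definition() method. The sketch below reproduces that shape without needing the WordNet corpus. FakeSynset, LEXICON, synsets and first_definition are hypothetical stand-ins written for this illustration; they are not part of NLTK.

```python
# Hypothetical stand-in for an NLTK Synset, for illustration only.
class FakeSynset:
    def __init__(self, name, definition):
        self._name = name
        self._definition = definition

    def name(self):
        return self._name

    def definition(self):
        return self._definition


# Hypothetical mini-lexicon: word -> list of synsets.
LEXICON = {
    "animal": [
        FakeSynset("animal.n.01",
                   "a living organism characterized by voluntary movement"),
    ],
}


def synsets(word):
    """Mimics the shape of wn.synsets: always a list, possibly empty."""
    return LEXICON.get(word, [])


def first_definition(word):
    """Definition of the first synset, or '' when the word has none."""
    found = synsets(word)
    return found[0].definition() if found else ""


# Same mistake as in the question: calling .definition() on the list itself.
try:
    synsets("animal").definition()
except AttributeError as exc:
    error_message = str(exc)

animal_def = first_definition("animal")      # index into the list first
missing_def = first_definition("qwertyuiop")  # unknown word, empty list
```

Indexing the list (or looking up one synset by name, as get_def does) is what removes the AttributeError. The empty-list check plays the same role as the `None` guard in sector_tagger.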

Alternate Approach

def sector_tagger(frame):
    mapping = [('/', ' '), ('(', ''), (')', ''), (',', '')]
    for k, v in mapping:
        frame = frame.replace(k, v)
    tok_list = tok().tokenize(frame)  # note () after tok
    split_words = [w.lower() for w in tok_list]
    clean_words = [w for w in split_words if w not in english_stops]
    synset = [get_synset(w) for w in clean_words]
    def_matrix = [get_def(w) if w is not None else '' for w in synset]
    return clean_words, synset, def_matrix


poo = df['Category'].apply(sector_tagger)

poo[0] = 
(['agricultural', 'domestic', 'animal', 'services'],
 ['agricultural.a.01', 'domestic.n.01', 'animal.n.01', 'services.n.01'],
 ['relating to or used in or promoting agriculture or farming',
  'a servant who is paid to perform menial tasks around the household',
  'a living organism characterized by voluntary movement',
  'performance of duties or provision of space and equipment helpful to others'])
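As a side note on the character replacements at the top of the alternate sector_tagger: the same cleanup can be done in one pass with a translation table. This is only a sketch of an alternative, not what the code above does internally. The helper name strip_punctuation and the sample string are made up for illustration.

```python
# One translation table: '/' becomes a space, the other characters are dropped.
_PUNCT_TABLE = str.maketrans({"/": " ", "(": None, ")": None, ",": None})


def strip_punctuation(text):
    """Apply all four replacements from the mapping in a single pass."""
    return text.translate(_PUNCT_TABLE)


# Hypothetical category label, used only to show the effect.
cleaned = strip_punctuation("horticulture/horticultural (general), other")
```

The loop over (k, v) pairs is arguably easier to extend with multi-character patterns. The table version only handles single characters.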

list_clean_words = []
list_synset = []
list_def_matrix = []
for x in poo:
    list_clean_words.append(x[0])
    list_synset.append(x[1])
    list_def_matrix.append(x[2])

agri_matrix = pd.DataFrame()
agri_matrix['Categories'] = list_clean_words
agri_matrix['Synsets'] = list_synset
agri_matrix['Definition'] = list_def_matrix
agri_matrix

   Categories                                  Synsets                                            Definition
0  [agricultural, domestic, animal, services]  [agricultural.a.01, domestic.n.01, animal.n.01...  [relating to or used in or promoting agricultu...
1  [agricultural, food, products, processing]  [agricultural.a.01, food.n.01, merchandise.n.0...  [relating to or used in or promoting agricultu...
2  [agricultural, business, management]        [agricultural.a.01, business.n.01, management....  [relating to or used in or promoting agricultu...
3  [agricultural, mechanization]               [agricultural.a.01, mechanization.n.01]            [relating to or used in or promoting agricultu...
4  [agricultural, production, operations]      [agricultural.a.01, production.n.01, operation...  [relating to or used in or promoting agricultu...
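The three parallel lists built by the for loop can also be produced by transposing the sequence of tuples with zip. A minimal sketch, assuming every element is a 3-tuple shaped like poo[0]. The rows variable and the definition strings are placeholders for illustration, not real WordNet output.

```python
# Hypothetical stand-in for the Series of (words, synsets, definitions) tuples.
rows = [
    (["agricultural", "mechanization"],
     ["agricultural.a.01", "mechanization.n.01"],
     ["placeholder definition 1", "placeholder definition 2"]),
    (["agricultural", "food"],
     ["agricultural.a.01", "food.n.01"],
     ["placeholder definition 1", "placeholder definition 3"]),
]

# zip(*rows) regroups the tuples column-wise: all word lists, all synset
# lists, all definition lists. This raises ValueError if rows is empty.
list_clean_words, list_synset, list_def_matrix = (
    list(column) for column in zip(*rows)
)
```

The explicit loop is more forgiving when the input might be empty. The zip form is shorter when that cannot happen.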

Split each list of lists into a long list (they're ordered)

def create_long_list_from_list_of_lists(list_of_lists):
    long_list = []
    for one_list in list_of_lists:
        for word in one_list:
            long_list.append(word)
    return long_list

long_list_clean_words = create_long_list_from_list_of_lists(list_clean_words)
long_list_synset = create_long_list_from_list_of_lists(list_synset)
long_list_def_matrix = create_long_list_from_list_of_lists(list_def_matrix)
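The flattening helper above can also be written with the standard library's itertools.chain.from_iterable, which concatenates the inner lists in order. A short sketch follows. The name flatten and the sample data are chosen here for illustration.

```python
from itertools import chain


def flatten(list_of_lists):
    """Concatenate the inner lists into one list, preserving order."""
    return list(chain.from_iterable(list_of_lists))


# Small made-up input shaped like list_clean_words.
nested = [["agricultural", "domestic"], ["agricultural", "food"]]
flat = flatten(nested)
```

Because order is preserved, flattening the words, synsets and definitions separately keeps the three long lists aligned position by position.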

Turn it into a DataFrame of Unique Categories

agri_df = pd.DataFrame.from_dict(dict([('Categories', long_list_clean_words), ('Synsets', long_list_synset), ('Definitions', long_list_def_matrix)])).drop_duplicates().reset_index(drop=True)

agri_df.head(4)

   Categories     Synsets             Definitions
0  ceramic        ceramic.n.01        an artifact made of hard brittle material prod...
1  horticultural  horticultural.a.01  of or relating to the cultivation of plants
2  construction   construction.n.01   the act of constructing something
3  building       building.n.01       a structure that has a roof and walls and stan...

Final Note

from nltk.tokenize import TreebankWordTokenizer as tok

or:

from nltk.tokenize import word_tokenize

to use:

tok().tokenize(string_text_phrase)  # text is a string phrase, not a list of words

or:

word_tokenize(string_text_phrase)

Both methods appear to produce the same output, which is a list of words.

input = "Agricultural and domestic animal services"

output_of_both_methods = ['Agricultural', 'and', 'domestic', 'animal', 'services']