Question

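I have a DataFrame in which column1 contains text data and column2 contains the category of the text in column1. I want to find the words that appear in the text of one category (e.g. Informal) but do not appear in any other category. Multiple rows in the DataFrame will share the same category.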

Textual                          Category
Hi johnny how are you today      Informal
Dear Johnny                      Formal
Hey Johnny                       Informal
To Johnny                        Formal

Example output:

Informal: [Hi, how, are, you, today, Hey]
Formal: [Dear, To]

Answer 1

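import pandas as pd

# Remove punctuation (regex=False so '.' and '?' are treated as literal
# characters rather than regex metacharacters)
df.Textual = df.Textual.str.replace('.', '', regex=False)
df.Textual = df.Textual.str.replace(',', '', regex=False)
df.Textual = df.Textual.str.replace('?', '', regex=False)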

# get list of all words per Category
df1 = df.groupby(['Category'])['Textual'].apply(' '.join).reset_index()
df1['Textual'] = df1.Textual.str.split().apply(lambda x: list(filter(None, list(set(x)))))
print(df1)

# Split the list in different columns
df = pd.DataFrame(df1.Textual.values.tolist(), index= df1.index)
print(df)

# Reshape the df to have a line for each word
df['Category'] = df1.Category
df = df.set_index("Category")
df = df.stack()
print(df)

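# Drop words that appear in more than one category (compared case-insensitively;
# note that str.upper() also upper-cases the words kept in the output)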
df = df.str.upper().drop_duplicates(keep=False)
print(df)

# Reshape the df to the expected output
df = df.groupby('Category').apply(list)
print(df)
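
The punctuation-removal step at the top of this answer can also be written as a single call with a regex character class, since `.` and `?` are literal inside `[...]`. This is a minimal sketch on a made-up Series (the sample strings and the name `cleaned` are assumptions for illustration, not part of the original answer):

```python
import pandas as pd

# Hypothetical sample strings containing the three punctuation marks
s = pd.Series(["Hi johnny, how are you today?", "Dear Johnny."])

# One pass: remove every '.', ',' and '?' using a regex character class
cleaned = s.str.replace(r"[.,?]", "", regex=True)
print(cleaned.tolist())
# ['Hi johnny how are you today', 'Dear Johnny']
```

Passing `regex=True` explicitly keeps the behaviour the same across pandas versions, where the default for `regex` has changed.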

Answer 2

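You can create a dictionary via groupby + to_dict. Then count how many categories each word occurs in, and keep only the words that occur in exactly one category by using a set intersection inside a dictionary comprehension. Note that, unlike your example, I don't do any case checking, e.g. I assume Johnny will always have a capital J.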

from collections import Counter
from itertools import chain
import pandas as pd

df = pd.DataFrame({'Textual': ['Hi Johnny how are you today', 'Dear Johnny', 'Hey Johnny', 'To Johnny'],
                   'Category': ['Informal', 'Formal', 'Informal', 'Formal']})

def return_unique(x):
    return list(set(' '.join(x.values).split()))

res = df.groupby('Category')['Textual'].apply(return_unique).to_dict()

c = Counter(chain.from_iterable(res.values())).items()

unique = {k for k, v in c if v == 1}

res = {k: list(set(v) & unique) for k, v in res.items()}

{'Formal': ['To', 'Dear'],
 'Informal': ['today', 'how', 'Hi', 'Hey', 'are', 'you']}
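
If case should be ignored, as with the lowercase "johnny" in the question's first row, one possible variant is to lower-case the words before comparing them. This is only a sketch under that assumption: the names `words`, `counts` and `res_ci` are made up for illustration, and the words in the output come back lower-cased rather than in their original spelling.

```python
from collections import Counter
import pandas as pd

# Sample data from the question, including the lowercase "johnny"
df = pd.DataFrame({
    'Textual': ['Hi johnny how are you today', 'Dear Johnny', 'Hey Johnny', 'To Johnny'],
    'Category': ['Informal', 'Formal', 'Informal', 'Formal'],
})

# Map each category to the set of lower-cased words found in its rows
words = {}
for cat, text in zip(df['Category'], df['Textual']):
    words.setdefault(cat, set()).update(text.lower().split())

# Count in how many categories each lower-cased word occurs
counts = Counter(w for ws in words.values() for w in ws)

# Keep only the words seen in exactly one category (sorted for stable output)
res_ci = {cat: sorted(w for w in ws if counts[w] == 1) for cat, ws in words.items()}
print(res_ci)
# {'Informal': ['are', 'hey', 'hi', 'how', 'today', 'you'], 'Formal': ['dear', 'to']}
```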

How to find unique words in a category - Python
