Question

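I have a column in a DataFrame that contains lists, as can be seen in the image below.

I would like to know how to extract all the words from that column without any duplicates, and then iterate over the unique word list from 0 to len(uniquewordlist), assigning each word a value based on the iteration I am in.

Thank you for your help.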

Answer 1

This is what your data looks like:

import pandas as pd
df = pd.DataFrame([[['kubernetes', 'client', 'bootstrapping', 'ponda']], [['micro', 'insu']], [['motor', 'upi']],[['secure', 'app', 'installation']],[['health', 'insu', 'express', 'credit', 'customer']],[['secure', 'app', 'installation']],[['aap', 'insta']],[['loan', 'house', 'loan', 'customers']]])

df.columns = ['ingredients']

print(df)

Output:

                                 ingredients
0  [kubernetes, client, bootstrapping, ponda]
1                               [micro, insu]
2                                [motor, upi]
3                 [secure, app, installation]
4   [health, insu, express, credit, customer]
5                 [secure, app, installation]
6                                [aap, insta]
7              [loan, house, loan, customers]

Here is the code that produces the list of unique words.

# Join each row's list of words into one space-separated string
df['string'] = df['ingredients'].apply(" ".join)

df.drop(['ingredients'], axis = 1, inplace = True)

from sklearn.feature_extraction.text import CountVectorizer

countvec = CountVectorizer()

counts = countvec.fit_transform(df['string'])

vocab = pd.DataFrame(counts.toarray())
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
vocab.columns = countvec.get_feature_names_out()

print(list(vocab.columns))

This gives:

['aap', 'app', 'bootstrapping', 'client', 'credit', 'customer', 'customers', 'express', 'health', 'house', 'insta', 'installation', 'insu', 'kubernetes', 'loan', 'micro', 'motor', 'ponda', 'secure', 'upi']

You now have a list of unique words. If you can clarify how the values should be assigned, I can extend this answer.

Extended answer:

wordlist = list(vocab.columns)


worddict = {}

for i in range(0, len(wordlist)):

    worddict[wordlist[i]] = i

print(worddict)
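
For comparison, the same mapping can be sketched without scikit-learn. This is an alternative to the CountVectorizer route above, not part of the original answer, and `rows` is a hypothetical plain-list copy of the column. One reason to consider it: with its default settings CountVectorizer lowercases the text and ignores tokens shorter than two characters, so it only reproduces the lists exactly when the words are lowercase, free of punctuation, and at least two characters long, as they are here.

```python
from itertools import chain

# Hypothetical plain-Python copy of the 'ingredients' column
# (with pandas: df['ingredients'].tolist(), taken before the column is dropped)
rows = [
    ['kubernetes', 'client', 'bootstrapping', 'ponda'],
    ['micro', 'insu'],
    ['motor', 'upi'],
    ['secure', 'app', 'installation'],
    ['health', 'insu', 'express', 'credit', 'customer'],
    ['secure', 'app', 'installation'],
    ['aap', 'insta'],
    ['loan', 'house', 'loan', 'customers'],
]

# Flatten the lists, drop duplicates, then sort so every run gives the same order
wordlist = sorted(set(chain.from_iterable(rows)))

# Assign each word its position in the sorted list
worddict = {word: i for i, word in enumerate(wordlist)}

print(worddict['aap'], worddict['upi'])  # 0 19
```

For this data the sorted list is the same as the vocabulary printed above, so the resulting indices agree with the loop in this answer.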

Answer 2

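You can use enumerate and itertools.chain in a dictionary comprehension. Wrapping the words in set removes duplicates, so each word gets exactly one index. Because a set is unordered, the particular index a word receives can differ from run to run.

Using the data from @Abhishek's answer: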

from itertools import chain

res = {v: k for k, v in enumerate(set(chain.from_iterable(df['ingredients'])))}

print(res)

{'aap': 15,
 'app': 3,
 'bootstrapping': 1,
 ...
 'ponda': 0,
 'secure': 17,
 'upi': 5}
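
If reproducible indices matter, one possible variation is to de-duplicate with dict.fromkeys instead of set, since a dict keeps keys in first-seen order (Python 3.7+). This is a sketch under the assumption that a stable numbering is wanted, which the answer does not state; `rows` is a hypothetical plain-list copy of the column.

```python
from itertools import chain

# Hypothetical plain-Python copy of the 'ingredients' column
# (with pandas: df['ingredients'].tolist())
rows = [
    ['kubernetes', 'client', 'bootstrapping', 'ponda'],
    ['micro', 'insu'],
    ['motor', 'upi'],
    ['secure', 'app', 'installation'],
    ['health', 'insu', 'express', 'credit', 'customer'],
    ['secure', 'app', 'installation'],
    ['aap', 'insta'],
    ['loan', 'house', 'loan', 'customers'],
]

# dict.fromkeys drops duplicates while keeping the order of first appearance
unique_words = dict.fromkeys(chain.from_iterable(rows))

# Number the words in that order
res = {word: i for i, word in enumerate(unique_words)}

print(res['kubernetes'], res['insu'], res['customers'])  # 0 5 19
```

Sorting the set with sorted(...) would be another way to get a stable order, at the cost of losing the order in which the words first appear.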

Answer 3

You can get @jpp's answer with a different one-liner (it also works for data frames):

import pandas as pd
from collections import Counter
s = pd.Series([['apple', 'orange', 'raspberry'],
               ['apple', 'cucumber', 'strawberry', 'orange']])
s.apply(Counter).sum()

Counter({'apple': 2,
     'cucumber': 1,
     'orange': 2,
     'raspberry': 1,
     'strawberry': 1})

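If you use

list(s.apply(Counter).sum().keys())

you get the same list of unique words as in @Abhishek's answer, and I think this is easier to understand. You cannot apply set in the same way, because + is not defined for sets.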
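
Although sets cannot be summed with +, they can be combined with set.union. The following is a minimal sketch of that idea rather than part of the answer; `rows` is a hypothetical plain-list copy of the Series above.

```python
# Hypothetical plain-Python copy of the Series above (with pandas: s.tolist())
rows = [
    ['apple', 'orange', 'raspberry'],
    ['apple', 'cucumber', 'strawberry', 'orange'],
]

# set.union accepts any number of iterables, so the lists can be unpacked into it
unique_words = set().union(*rows)

print(sorted(unique_words))
# ['apple', 'cucumber', 'orange', 'raspberry', 'strawberry']
```

Unlike the Counter version, this keeps no counts, so it only fits when just the unique words are needed.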

Assigning values to unhashable lists in a pandas DataFrame

3 answers: