在pandas专栏中分组相似的单词/句子

时间:2018-04-08 09:51:21

标签: python pandas

我尝试使用代码:

Counter(" ".join(df["text"]).split()).most_common(100)

获得最常用的单词,但我想要的是句子中常用单词的数量。 例如:

1. A123 B234 C345 test data.
2. A123 B234 C345 D555 test data.
3. A123 B234 test data.
4. A123 B234 C345 more data.

我想要数:

 A123 B234 data- 4
 A123 B234 test data - 3
 A123 B234 C345 test data- 3

我正在寻找一组常见且数量很多的单词。我怎样才能在pandas / python中得到它

例句:

Money transferred from xyz@abc.com to account no.123
Money transferred from xyz@abc.net to account no.abc
Money failed transferring from xyz@abc. to account no.cde
Money transferred from example@yyy.com to account no.www
Money failed transferring from xyz@abc.com to account no.ttt

2 个答案:

答案 0 :(得分:0)

使用groupby作为输入,然后使用size方法

df.groupby(['col1','col2','col3']).size().sort_values()

答案 1 :(得分:0)

一种可能的解决方案:

df = df['col'].str.get_dummies(' ')
print (df)
   A123  B234  C345  D555  data  more  test
0     1     1     1     0     1     0     1
1     1     1     1     1     1     0     1
2     1     1     0     0     1     0     1
3     1     1     1     0     1     1     0

替代:

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
df = pd.DataFrame(mlb.fit_transform(df['col'].str.split()),
                  columns=mlb.classes_, 
                  index=df.index)
print (df)
   A123  B234  C345  D555  data  more  test
0     1     1     1     0     1     0     1
1     1     1     1     1     1     0     1
2     1     1     0     0     1     0     1
3     1     1     1     0     1     1     0

获取所有列组合min_lengthmax的所有组合(words):

from  itertools import combinations
a = df.columns
min_length = 3
comb = [j for i in range(len(a), min_length -1, -1) for j in combinations(a,i)]

在列表理解计数值中:

df1 = pd.DataFrame([(', '.join(x), df.loc[:, x].all(axis=1).sum(), len(x)) for x in comb], 
                    columns=['words','count','len'])

TOP = 2
TOP_count = sorted(df1['count'].unique())[-TOP:]
df1 = df1[df1['count'].isin(TOP_count)].sort_values(['count', 'len'], ascending=False)
print (df1)
                     words  count  len
66        A123, B234, data      4    3
30  A123, B234, C345, data      3    4
37  A123, B234, data, test      3    4
64        A123, B234, C345      3    3
68        A123, B234, test      3    3
70        A123, C345, data      3    3
77        A123, data, test      3    3
80        B234, C345, data      3    3
87        B234, data, test      3    3

编辑:

Pure python解决方案:

from  itertools import combinations, takewhile
from collections import Counter

min_length = 3
d = Counter()
for a in df['col'].str.split():
    for i in range(len(a), min_length -1, -1):
        for j in combinations(a,i):
            d[j] +=1
#print (d)

#https://stackoverflow.com/a/26831143
def get_items_upto_count(dct, n):
  data = dct.most_common()
  val = data[n-1][1] #get the value of n-1th item
  #Now collect all items whose value is greater than or equal to `val`.
  return list(takewhile(lambda x: x[1] >= val, data))

L = get_items_upto_count(d, 2)

s = pd.DataFrame(L, columns=['val','count'])
print (s)
                        val  count
0        (A123, B234, data)      4
1  (A123, B234, C345, data)      3
2  (A123, B234, test, data)      3
3        (A123, B234, C345)      3
4        (A123, B234, test)      3
5        (A123, C345, data)      3
6        (A123, test, data)      3
7        (B234, C345, data)      3
8        (B234, test, data)      3