How to get word frequency counts grouped by a second variable (Python)

Time: 2020-06-06 02:03:05

Tags: python pandas

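I'm new to Python, so it's quite possible I simply haven't worded my searches correctly to find the answer.

Using pandas, I'm able to find the N most frequent words in the description field across all records of my data. However, I have two columns: a categorical column and the description field. How can I find the most common words within each category?

Example data: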

 - Property|Description
 - House| Blue, Two stories, pool
 - Car| Green, Dented, Manual, New
 - Car| Blue, Automatic, Heated Seat
 - House|New, Furnished, HOA
 - Car|Blue, Old, Multiple Owners

My current code returns Blue = 3, New = 2, and so on. What I need to know is that Blue appears twice for Car and once for House.

Current relevant code

from collections import Counter
import pandas

words = (data.Description.str.lower().str.cat(sep=' ').split())
keywords = pandas.DataFrame(Counter(words).most_common(10), columns=['Words', 'Frequency'])


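3 answers:

Answer 0: (score: 1)

Try this: split each row's value on the delimiter, apply explode to turn each element of the resulting list into its own row, and finally group by both columns.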

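# Strip leading/trailing whitespace, collapse ", " to ",", then split into lists.
# regex=True is required on recent pandas versions for the pattern to be treated as a regex.
df['Description'] = (
    df['Description'].str.strip()
    .str.replace(r",\s+", ",", regex=True)
    .str.split(',')
)

# Explode each list element into its own row, then count each word per Property.
print(df.explode('Description')
      .groupby(["Property", "Description"]).size().reset_index(name='count'))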
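
For anyone who wants to run the split / explode / groupby idea end to end, here is a minimal sketch built on the sample rows from the question. The variable names `exploded`, `counts` and `lookup` are only illustrative, and trimming whitespace after the explode (rather than before the split) is a choice made here for simplicity, not necessarily what the answer intended.

```python
import pandas as pd

# Sample rows copied from the question, including the uneven spacing.
df = pd.DataFrame({
    "Property": ["House", "Car", "Car", "House", "Car"],
    "Description": [
        " Blue, Two stories, pool",
        " Green, Dented, Manual, New",
        " Blue, Automatic, Heated Seat",
        "New, Furnished, HOA",
        "Blue, Old, Multiple Owners",
    ],
})

# Split on commas and give every word its own row (fresh index to avoid duplicates).
exploded = (
    df.assign(Description=df["Description"].str.split(","))
    .explode("Description", ignore_index=True)
)
# Trim the whitespace left around each word by the split.
exploded["Description"] = exploded["Description"].str.strip()

# Count how often each word occurs within each Property.
counts = (
    exploded.groupby(["Property", "Description"])
    .size()
    .reset_index(name="count")
)
print(counts)

# Convenience lookup keyed by (Property, word).
lookup = counts.set_index(["Property", "Description"])["count"]
print(lookup.loc[("Car", "Blue")], lookup.loc[("House", "Blue")])
```

With the question's data, Blue is counted twice for Car and once for House, which is the breakdown the question asks for.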

Answer 1: (score: 1)

Data

df = pd.DataFrame({
    'Property': ['House', 'Car', 'Car', 'House', 'Car'],
    'Description': ['Blue,Two stories,pool', 'Green,Dented,Manual,New',
                    'Blue,Automatic,Heated Seat', 'Blue,Furnished,HOA',
                    'Blue,Old,Multiple Owners'],
})

Chained solution

df.assign(words=df.Description.str.lower().str.split(',')).explode('words').groupby('Property')['words'].value_counts()

Step-by-step explanation

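# Create a list of lower-cased words for each row
df['words'] = df.Description.str.lower().str.split(',')

# Explode the lists into one row per word, then count words within each Property
df = df.explode('words').groupby('Property')['words'].value_counts()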

Property  words          
Car       blue               2
          automatic          1
          dented             1
          green              1
          heated seat        1
          manual             1
          multiple owners    1
          new                1
          old                1
House     blue               2
          furnished          1
          hoa                1
          pool               1
          two stories        1
Name: words, dtype: int64
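
If only the most frequent words per category are wanted, one possible follow-up is to keep the first few rows of each group. This is a sketch rather than part of the answer: `top_n` is a hypothetical parameter, and it relies on `value_counts` returning each group's counts in descending order.

```python
import pandas as pd

# Same data as in this answer.
df = pd.DataFrame({
    "Property": ["House", "Car", "Car", "House", "Car"],
    "Description": [
        "Blue,Two stories,pool",
        "Green,Dented,Manual,New",
        "Blue,Automatic,Heated Seat",
        "Blue,Furnished,HOA",
        "Blue,Old,Multiple Owners",
    ],
})

top_n = 1  # hypothetical: how many words to keep per Property

# Per-Property word counts, sorted in descending order within each group.
counts = (
    df.assign(words=df["Description"].str.lower().str.split(","))
    .explode("words")
    .groupby("Property")["words"]
    .value_counts()
)

# Keep only the first top_n rows of each Property group.
top_words = counts.groupby(level="Property").head(top_n)
print(top_words)
```

Ties are not handled specially here: when several words share the same count, which of them lands in the first `top_n` rows depends on the sort order.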

Answer 2: (score: 0)

Use groupby before counting: doc. Then you can count within each group.

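df = pd.DataFrame(...)

# Iterating over a groupby yields (group key, sub-DataFrame) pairs
for name, group in df.groupby('column_name'):
    do_counting(group)  # placeholder: count the words within this group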
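
To make the loop concrete, here is one way the counting step could look, using `collections.Counter` on the sample data from the question. `count_words` and `per_category` are hypothetical names introduced for this sketch; the answer itself leaves the counting step open.

```python
from collections import Counter

import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "Property": ["House", "Car", "Car", "House", "Car"],
    "Description": [
        " Blue, Two stories, pool",
        " Green, Dented, Manual, New",
        " Blue, Automatic, Heated Seat",
        "New, Furnished, HOA",
        "Blue, Old, Multiple Owners",
    ],
})


def count_words(descriptions):
    """Count comma-separated, lower-cased words across a Series of strings."""
    counter = Counter()
    for text in descriptions:
        # Trim and lower-case each word; skip empty pieces.
        counter.update(
            word.strip().lower() for word in text.split(",") if word.strip()
        )
    return counter


# One Counter per category; grouping by a single column name yields string keys.
per_category = {
    name: count_words(group["Description"])
    for name, group in df.groupby("Property")
}

for name, counter in per_category.items():
    print(name, counter.most_common(3))
```

An explicit loop like this is easy to follow, but on large frames the vectorised explode / groupby approaches are likely to be faster.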