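I'm new to Python, so it's quite possible I just haven't worded my search correctly to find the answer.
Using pandas, I was able to find the N most frequent words in the description field across all of my records. However, I have two columns: a category column and a description field. How do I find the most common words within each category?
Example data: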
- Property|Description
- House| Blue, Two stories, pool
- Car| Green, Dented, Manual, New
- Car| Blue, Automatic, Heated Seat
- House|New, Furnished, HOA
- Car|Blue, Old, Multiple Owners
My current code returns Blue = 3, New = 2, and so on. What I need to know is that Blue appears twice for Car and once for House.
Current relevant code:
words = (data.Description.str.lower().str.cat(sep=' ').split())
keywords=pandas.DataFrame(Counter(words).most_common(10), columns=['Words', 'Frequency'])
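Answer 0: (score: 1)
Try this: split each row's value on the delimiter, then apply explode to turn each element of the resulting list into its own row, and finally group by.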
# strip leading/trailing whitespace, drop spaces after commas, then split on the delimiter
df['Description'] = df['Description'].str.strip()\
    .str.replace(r",\s+", ",", regex=True)\
    .str.split(',')
# explode to one word per row, then group by to get the count of each word
print(df.explode(column='Description').
      groupby(["Property", "Description"]).size().reset_index(name='count'))
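For anyone who wants to run this end to end, here is a minimal sketch using the sample data from the question. The column name `word` and the variables `words` and `counts` are my own placeholders, and I strip whitespace after exploding rather than with a regex replace, which is a small variation on the code above rather than the same code.

```python
import pandas as pd

# Sample data copied from the question (spacing kept as posted).
df = pd.DataFrame({
    "Property": ["House", "Car", "Car", "House", "Car"],
    "Description": [
        " Blue, Two stories, pool",
        " Green, Dented, Manual, New",
        " Blue, Automatic, Heated Seat",
        "New, Furnished, HOA",
        "Blue, Old, Multiple Owners",
    ],
})

# Lowercase, split on commas, and put each word on its own row.
words = (
    df.assign(word=df["Description"].str.lower().str.split(","))
      .explode("word")
      .reset_index(drop=True)
)

# Trim the spaces left around each word after the split.
words["word"] = words["word"].str.strip()

# Count each (Property, word) pair.
counts = words.groupby(["Property", "word"]).size().reset_index(name="count")
print(counts)
```

Lowercasing is optional here; I added it so that "Blue" and "blue" would be counted together, as in the question's own code.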
Answer 1: (score: 1)
Data
df=pd.DataFrame({'Property':['House','Car','Car','House','Car'],'Description':['Blue,Two stories,pool','Green,Dented,Manual,New','Blue,Automatic,Heated Seat','Blue,Furnished,HOA','Blue,Old,Multiple Owners']})
Chained solution
df.assign(words=df.Description.str.lower().str.split(',')).explode('words').groupby('Property')['words'].value_counts()
Detailed explanation
#Create list
df['words'] = df.Description.str.lower().str.split(',')
#Explode and count
df=df.explode('words').groupby('Property')['words'].value_counts()
Property  words
Car       blue               2
          automatic          1
          dented             1
          green              1
          heated seat        1
          manual             1
          multiple owners    1
          new                1
          old                1
House     blue               2
          furnished          1
          hoa                1
          pool               1
          two stories        1
Name: words, dtype: int64
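If you only want the N most frequent words per category, one possible follow-up is to take the first N rows of each group from the counted result. This is a sketch under a couple of assumptions: `n` and `top_n` are names I made up, and it relies on the grouped `value_counts()` returning counts in descending order within each group. Ties are kept in whatever order pandas happens to return them.

```python
import pandas as pd

# Same data as in this answer.
df = pd.DataFrame({
    "Property": ["House", "Car", "Car", "House", "Car"],
    "Description": [
        "Blue,Two stories,pool",
        "Green,Dented,Manual,New",
        "Blue,Automatic,Heated Seat",
        "Blue,Furnished,HOA",
        "Blue,Old,Multiple Owners",
    ],
})

# Per-category word counts, sorted by frequency within each Property.
counts = (
    df.assign(words=df["Description"].str.lower().str.split(","))
      .explode("words")
      .groupby("Property")["words"]
      .value_counts()
)

n = 1  # hypothetical: how many top words to keep per category

# Group again on the first index level (Property) and keep the first n rows.
top_n = counts.groupby(level=0).head(n)
print(top_n)
```

With `n = 1` this keeps just the single most frequent word for Car and for House; raise `n` to keep more.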
Answer 2: (score: 0)
Use groupby before counting (see the groupby doc). Then you can count within each group.
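df = pd.DataFrame(...)
groups = df.groupby('column_name')
# iterating a groupby yields (group key, sub-DataFrame) pairs
for name, group in groups:
    do_counting(group)  # placeholder for your own counting logic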
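To make the loop concrete, here is a rough sketch that fills in the counting step with `collections.Counter`, using the sample data from the question. The dictionary `top_words` and the cutoff of 3 are my own assumptions, not something from the original post.

```python
from collections import Counter

import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Property": ["House", "Car", "Car", "House", "Car"],
    "Description": [
        "Blue, Two stories, pool",
        "Green, Dented, Manual, New",
        "Blue, Automatic, Heated Seat",
        "New, Furnished, HOA",
        "Blue, Old, Multiple Owners",
    ],
})

top_words = {}
# Each iteration gives the category name and the rows belonging to it.
for name, group in df.groupby("Property"):
    # Split every description on commas and trim the surrounding spaces.
    words = [
        word.strip()
        for description in group["Description"].str.lower()
        for word in description.split(",")
    ]
    # Keep the three most common words for this category.
    top_words[name] = Counter(words).most_common(3)

print(top_words)
```

This is likely slower than the explode-based approaches on large frames, but it stays close to the Counter code in the question.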