Question

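I have a pandas DataFrame that contains queries and their counts over a given time period, and I would like to turn it into counts of unique words. For example, if the DataFrame contains the following: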

query          count
foo bar        10
super          8 
foo            4
super foo bar  2

I would like to get the following DataFrame. For example, the word "foo" appears exactly 16 times in the table (10 + 4 + 2).

word    count
foo     16
bar     12
super   10

I am using the function below, but it does not seem to be the best approach, and it ignores the total count of each row.

import re
from collections import Counter

def _words(df):
  return Counter(re.findall(r'\w+', ' '.join(df['query'])))

Any help would be greatly appreciated.

Thanks in advance!

Answer 1

Option 1

df['query'].str.get_dummies(sep=' ').T.dot(df['count'])

bar      12
foo      16
super    10
dtype: int64
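
To make the one-liner above easier to follow, here is a minimal sketch that builds the sample frame from the question and shows the intermediate indicator matrix. The variable names indicators and word_counts are only illustrative. Note that get_dummies records presence, so a word repeated inside a single query would still be counted once for that row.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "query": ["foo bar", "super", "foo", "super foo bar"],
    "count": [10, 8, 4, 2],
})

# One 0/1 indicator column per distinct word, one row per query.
indicators = df["query"].str.get_dummies(sep=" ")
print(indicators)

# Transposing and taking the dot product with the counts sums, for each word,
# the counts of the rows whose query contains that word.
word_counts = indicators.T.dot(df["count"])
print(word_counts)
```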

Option 2

df['query'].str.get_dummies(sep=' ').mul(df['count'], axis=0).sum()

bar      12
foo      16
super    10
dtype: int64

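Option 3
numpy.bincount + pd.factorize
Also highlighting the use of cytoolz.mapcat. It returns an iterator that maps a function over a sequence and concatenates the results. That's cool!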

import pandas as pd, numpy as np, cytoolz

q = df['query'].values
c = df['count'].values

f, u = pd.factorize(list(cytoolz.mapcat(str.split, q.tolist())))
l = np.char.count(q.astype(str), ' ') + 1

pd.Series(np.bincount(f, c.repeat(l)).astype(int), u)

foo      16
bar      12
super    10
dtype: int64
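
The same idea can be spelled out step by step. The sketch below swaps in itertools.chain from the standard library for cytoolz.mapcat, and counts the words per query with str.split rather than by counting spaces; both substitutions and all variable names are assumptions made for illustration.

```python
from itertools import chain

import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "query": ["foo bar", "super", "foo", "super foo bar"],
    "count": [10, 8, 4, 2],
})

queries = df["query"].tolist()
counts = df["count"].to_numpy()

# Flatten every query into one flat list of words.
words = list(chain.from_iterable(q.split() for q in queries))

# codes[i] is the integer label of words[i]; uniques maps labels back to words.
codes, uniques = pd.factorize(np.asarray(words, dtype=object))

# Repeat each row's count once per word in that row so it lines up with `words`.
words_per_query = np.array([len(q.split()) for q in queries])
weights = np.repeat(counts, words_per_query)

# bincount sums the weights that fall on each integer label.
totals = np.bincount(codes, weights=weights).astype(int)
result = pd.Series(totals, index=uniques)
print(result)
```

Because pd.factorize labels words in order of first appearance, the result here is ordered foo, bar, super rather than alphabetically.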

Option 4
An absurd way of doing things... just use Option 1.

pd.DataFrame(dict(
    query=' '.join(df['query']).split(),
    count=df['count'].repeat(df['query'].str.count(' ') + 1)
)).groupby('query')['count'].sum()

query
bar      12
foo      16
super    10
Name: count, dtype: int64
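
For comparison with the Counter-based function in the question, a plain-Python variant that does respect each row's count might look like the following. The helper name weighted_word_counts is hypothetical. Unlike the get_dummies options, this adds the row's count once for every occurrence of a word, so a word repeated within one query is counted more than once.

```python
from collections import Counter

import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "query": ["foo bar", "super", "foo", "super foo bar"],
    "count": [10, 8, 4, 2],
})

def weighted_word_counts(frame):
    """Hypothetical helper: add each row's count once per word in its query."""
    totals = Counter()
    for query, count in zip(frame["query"], frame["count"]):
        for word in query.split():
            totals[word] += int(count)
    return totals

totals = weighted_word_counts(df)

# Turn the Counter into the two-column frame the question asks for.
result = pd.DataFrame(totals.most_common(), columns=["word", "count"])
print(result)
```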

Answer 2

Another alternative, using melt + groupby + sum:
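
A minimal sketch of how melt, groupby and sum might be combined for this, assuming the queries are first spread into one column per word with str.split(expand=True). That splitting step, the add_prefix call, and the names wide and long_form are assumptions for illustration, not taken from the answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "query": ["foo bar", "super", "foo", "super foo bar"],
    "count": [10, 8, 4, 2],
})

# Assumption: one column per word position (word_0, word_1, ...);
# shorter queries are padded with missing values.
wide = df["query"].str.split(expand=True).add_prefix("word_")
wide["count"] = df["count"]

# melt stacks the word columns into rows, carrying each row's count along;
# dropna discards the padding from shorter queries.
long_form = wide.melt(id_vars="count", var_name="position", value_name="word")
long_form = long_form.dropna(subset=["word"])

# Sum the carried counts per word.
result = long_form.groupby("word")["count"].sum()
print(result)
```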

Counting words in a column of strings in Pandas
