Question

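I have a dataframe df with 2 columns (a word and its meaning/definition). For each word's definition I want to use a collections.Counter object and count the frequency of the words that appear in it, in the most pythonic way possible.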

The traditional approach is to iterate over the dataframe with the iterrows() method and do the counting row by row.
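For reference, the row-by-row baseline could look like the sketch below. The sample rows and the column names Word, Meaning and Word Freq are assumptions taken from the sample output, not the asker's actual data.

```python
from collections import Counter

import pandas as pd

# Hypothetical sample data; the real dataframe is not shown in the question.
df = pd.DataFrame({
    'Word': ['Array', 'List'],
    'Meaning': ['collection of homogeneous datatype',
                'ordered collection of items'],
})

# Baseline: visit each row and build one Counter per definition.
word_freq = []
for _, row in df.iterrows():
    word_freq.append(Counter(row['Meaning'].split()))

# Store the per-row counters in a new column.
df['Word Freq'] = word_freq
print(df['Word Freq'].iloc[0])
```

This works, but it loops in Python over every row, which is what the answers try to avoid.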

Sample output


<table style="height: 59px;" border="True" width="340">
  <tbody>
    <tr>
      <td>Word</td>
      <td>Meaning</td>
      <td>Word Freq</td>
    </tr>
    <tr>
      <td>Array</td>
      <td>collection of homogeneous datatype</td>
      <td>{'collection':1,'of':1....}</td>
    </tr>
    <tr>
      <td>&nbsp;</td>
      <td>&nbsp;</td>
      <td>&nbsp;</td>
    </tr>
  </tbody>
</table>


Answer 1

I would take advantage of the pandas str accessor methods and do this:

from collections import Counter
Counter(df.definition.str.cat(sep=' ').split())

Some test data

import pandas as pd

df = pd.DataFrame({'word': ['some', 'words', 'yes'], 'definition': ['this is a definition', 'another definition', 'one final definition']})

print(df)
             definition   word
0  this is a definition   some
1    another definition  words
2  one final definition    yes

Then join the definitions with spaces, split on whitespace, and pass the result to Counter

Counter(df.definition.str.cat(sep=' ').split())

Counter({'a': 1,
         'another': 1,
         'definition': 3,
         'final': 1,
         'is': 1,
         'one': 1,
         'this': 1})

Answer 2

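Assuming df has the two columns 'word' and 'definition', you can simply use the .map method on the definition Series, applying Counter to each definition after splitting it on whitespace. Then sum the results.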

from collections import Counter

def_counts = df.definition.map(lambda x: Counter(x.split()))
all_counts = def_counts.sum()
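To get the per-definition counts asked for in the question as well as the overall total, the same idea can be sketched as follows. The column name word_freq is hypothetical, and the built-in sum with an explicit empty Counter as the start value is swapped in here for Series.sum(), so the result does not depend on how pandas reduces object columns.

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({
    'word': ['some', 'words', 'yes'],
    'definition': ['this is a definition',
                   'another definition',
                   'one final definition'],
})

# One Counter per row, stored in a new (hypothetical) column.
df['word_freq'] = df['definition'].map(lambda text: Counter(text.split()))

# Add the per-row counters together, starting from an empty Counter.
all_counts = sum(df['word_freq'], Counter())

print(df['word_freq'].iloc[0])
print(all_counts)
```

Each cell of word_freq is an ordinary Counter, so methods such as most_common() are available per row.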

Answer 3

I intend this answer to be useful, but not to be the chosen answer. In fact, I am just making the case for Counter and for @TedPetrou's answer.

Create a large example of random words

from string import ascii_lowercase

import numpy as np
import pandas as pd

a = np.random.choice(list(ascii_lowercase), size=(100000, 5))

definitions = pd.Series(
    pd.DataFrame(a).sum(1).values.reshape(-1, 10).tolist()).str.join(' ')

definitions.head()

0    hmwnp okuat sexzr jsxhh bdoyc kdbas nkoov moek...
1    iiuot qnlgs xrmss jfwvw pmogp vkrvl bygit qqon...
2    ftcap ihuto ldxwo bvvch zuwpp bdagx okhtt lqmy...
3    uwmcs nhmxa qeomd ptlbg kggxr hpclc kwnix rlon...
4    npncx lnors gyomb dllsv hyayw xdynr ctwvh nsib...
dtype: object

Timing

Counter is 1000 times faster than the fastest alternative I could think of.
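The timing code itself is not included here. A minimal sketch of how the two Counter-based approaches from the earlier answers might be compared with timeit is shown below. The data size, word length, and function names are assumptions chosen to keep the run short, and the numbers it prints will vary by machine and library version.

```python
import random
import timeit
from collections import Counter
from string import ascii_lowercase

import pandas as pd

random.seed(0)


def random_definition(n_words=10, word_len=2):
    """Build one fake definition made of random lowercase words."""
    return ' '.join(
        ''.join(random.choices(ascii_lowercase, k=word_len))
        for _ in range(n_words)
    )


# Small hypothetical data set: 200 definitions of 10 words each.
definitions = pd.Series([random_definition() for _ in range(200)])


def join_then_count():
    # Answer 1: join everything into one string, split, count once.
    return Counter(definitions.str.cat(sep=' ').split())


def count_then_sum():
    # Answer 2 (variant): one Counter per row, then add them up.
    return sum(definitions.map(lambda text: Counter(text.split())), Counter())


for fn in (join_then_count, count_then_sum):
    seconds = timeit.timeit(fn, number=3)
    print(f'{fn.__name__}: {seconds:.4f} s for 3 runs')
```

Both functions return the same totals, so the comparison is only about speed.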
