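Besides counting the frequency of words in a document, I also want to count the number of distinct IDs each word is associated with. It's easier to explain with an example: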
from pandas import DataFrame, Series
from collections import defaultdict

d = {'ID': Series(['a', 'a', 'b', 'c', 'c', 'c']),
     'words': Series(["apple banana apple strawberry banana lemon",
                      "apple", "banana", "banana lemon", "kiwi", "kiwi lemon"])}
df = DataFrame(d)
>>> df
  ID                                       words
0  a  apple banana apple strawberry banana lemon
1  a                                       apple
2  b                                      banana
3  c                                banana lemon
4  c                                        kiwi
5  c                                  kiwi lemon
# count frequency of words using defaultdict
wc = defaultdict(int)
for line in df.words:
    linesplit = line.split()
    for word in linesplit:
        wc[word] += 1
# defaultdict(<type 'int'>, {'kiwi': 2, 'strawberry': 1, 'lemon': 3, 'apple': 3, 'banana': 4})
# turn into a DataFrame
dwc = {"word": Series(list(wc.keys())),
       "count": Series(list(wc.values()))}
dfwc = DataFrame(dwc)
>>> dfwc
   count        word
0      2        kiwi
1      1  strawberry
2      3       lemon
3      3       apple
4      4      banana
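Counting the word frequencies is the easy part, as shown above. What I'd like to get is output like the following, which gives the number of distinct IDs associated with each word: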
   count        word  ids
0      2        kiwi    1
1      1  strawberry    1
2      3       lemon    2
3      3       apple    1
4      4      banana    3
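Ideally I'd like this to happen at the same time as counting the word frequencies, but I'm not sure how to integrate it.
Any pointers would be greatly appreciated!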
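Answer 0 (score: 1)
I don't have much experience with pandas, but you could do something like this. This approach keeps a dict whose keys are the words and whose values are the set of all IDs each word appears with.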
wc = defaultdict(int)
idc = defaultdict(set)
for ID, words in zip(df.ID, df.words):
    lwords = words.split()
    for word in lwords:
        wc[word] += 1
        # You don't really need the if statement (since a set will only hold one
        # of each ID at most) but I feel like it makes things much clearer.
        if ID not in idc[word]:
            idc[word].add(ID)
The resulting idc looks like:
defaultdict(<type 'set'>, {'kiwi': set(['c']), 'strawberry': set(['a']), 'lemon': set(['a', 'c']), 'apple': set(['a']), 'banana': set(['a', 'c', 'b'])})
So you then have to take the length of each set. I used this:
lenidc = dict((key, len(value)) for key, value in idc.items())
After adding lenidc.values() as an entry in dwc and initializing dfwc, I got:
   count  ids        word
0      2    1        kiwi
1      1    1  strawberry
2      3    2       lemon
3      3    1       apple
4      4    3      banana
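The pitfall of this approach is that it uses two separate dicts (wc and idc), and the keys (words) in them are not guaranteed to be in the same order. So you'll want to merge the dicts to eliminate that problem. This is how I did it: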
# Makes it so the values in the wc dict are a tuple in
# (word_count, id_count) form
for key, value in lenidc.items():
    wc[key] = (wc[key], value)
# Now, when you construct dwc, for count and ids you only want to use
# the first and second tuple elements respectively.
dwc = {"word": Series(list(wc.keys())),
       "count": Series([v[0] for v in wc.values()]),
       "ids": Series([v[1] for v in wc.values()])}
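As a rough sketch of the same idea with the merge folded in, the version below does the counting in one pass and returns a single dict keyed by word, so the ordering problem never comes up. It uses only the standard library. The function name `count_words_and_ids` and the `rows` list are made up for illustration and are not part of the code above.

```python
from collections import defaultdict

# Same sample data as in the question, as plain (ID, words) pairs.
rows = [
    ("a", "apple banana apple strawberry banana lemon"),
    ("a", "apple"),
    ("b", "banana"),
    ("c", "banana lemon"),
    ("c", "kiwi"),
    ("c", "kiwi lemon"),
]


def count_words_and_ids(pairs):
    """Return {word: (word_count, distinct_id_count)} in a single pass.

    Hypothetical helper: the name and return shape are assumptions.
    """
    counts = defaultdict(int)  # word -> number of occurrences
    ids = defaultdict(set)     # word -> set of IDs it appears with
    for row_id, words in pairs:
        for word in words.split():
            counts[word] += 1
            ids[word].add(row_id)  # a set ignores duplicate IDs
    # Merge both results under one key so they cannot get out of step.
    return {word: (counts[word], len(ids[word])) for word in counts}


stats = count_words_and_ids(rows)

# Sorted (word, count, ids) records, ready to pass to DataFrame(...).
records = [(word, count, n_ids) for word, (count, n_ids) in sorted(stats.items())]
for record in records:
    print(record)
```

With pandas available, something like `DataFrame(records, columns=["word", "count", "ids"])` should turn `records` into the frame the question asks for.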
Answer 1 (score: 0)
There may be a slicker way, but I'd approach it in two steps. First flatten it, then build a new DataFrame with the information we want:
import pandas as pd

# make a new, flattened object
s = df["words"].apply(lambda x: pd.Series(x.split())).stack()
index = s.index.get_level_values(0)
new = df.loc[index]  # .ix in the original; .ix was removed in pandas 1.0
new["words"] = s.values
# now group and build
grouped = new.groupby("words")["ID"]
summary = pd.DataFrame({"ids": grouped.nunique(), "count": grouped.size()})
summary = summary.reset_index().rename(columns={"words": "word"})
which produces
>>> summary
         word  count  ids
0       apple      3    1
1      banana      4    3
2        kiwi      2    1
3       lemon      3    2
4  strawberry      1    1
Step by step. We start with the original DataFrame:
>>> df
  ID                                       words
0  a  apple banana apple strawberry banana lemon
1  a                                       apple
2  b                                      banana
3  c                                banana lemon
4  c                                        kiwi
5  c                                  kiwi lemon
Pull apart the multi-fruit elements:
>>> s = df["words"].apply(lambda x: pd.Series(x.split())).stack()
>>> s
0  0         apple
   1        banana
   2         apple
   3    strawberry
   4        banana
   5         lemon
1  0         apple
2  0        banana
3  0        banana
   1         lemon
4  0          kiwi
5  0          kiwi
   1         lemon
dtype: object
Get the indices that align these with the original frame:
>>> index = s.index.get_level_values(0)
>>> index
Int64Index([0, 0, 0, 0, 0, 0, 1, 2, 3, 3, 4, 5, 5], dtype=int64)
Then take a view of the original frame from this perspective:
>>> new = df.loc[index]  # .ix in the original; .ix was removed in pandas 1.0
>>> new["words"] = s.values
>>> new
  ID       words
0  a       apple
0  a      banana
0  a       apple
0  a  strawberry
0  a      banana
0  a       lemon
1  a       apple
2  b      banana
3  c      banana
3  c       lemon
4  c        kiwi
5  c        kiwi
5  c       lemon
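This is more like something we can work with. In my experience, half the effort is getting your data into the right format to begin with. After this, it's easy: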
>>> grouped = new.groupby("words")["ID"]
>>> summary = pd.DataFrame({"ids": grouped.nunique(), "count": grouped.size()})
>>> summary
            count  ids
words
apple           3    1
banana          4    3
kiwi            2    1
lemon           3    2
strawberry      1    1
>>> summary = summary.reset_index().rename(columns={"words": "word"})
>>> summary
         word  count  ids
0       apple      3    1
1      banana      4    3
2        kiwi      2    1
3       lemon      3    2
4  strawberry      1    1
Note that we could also have simply used .describe():
>>> new.groupby("words")["ID"].describe()
words
apple       count     3
            unique    1
            top       a
            freq      3
banana      count     4
            unique    3
            top       a
            freq      2
kiwi        count     2
            unique    1
            top       c
            freq      2
lemon       count     3
            unique    2
            top       c
            freq      2
strawberry  count     1
            unique    1
            top       a
            freq      1
dtype: object
We could also start from this and then pivot to get the desired output.
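On newer pandas versions the flatten-then-group idea can probably be written more compactly. The sketch below assumes pandas 0.25 or later, where `DataFrame.explode` and named aggregation are available. It swaps `str.split` plus `explode` in for the apply/stack step, so it is a variation on the approach above rather than the same code; the intermediate column name `word` is my own choice.

```python
import pandas as pd

# Same sample data as in the question.
df = pd.DataFrame({
    "ID": ["a", "a", "b", "c", "c", "c"],
    "words": ["apple banana apple strawberry banana lemon",
              "apple", "banana", "banana lemon", "kiwi", "kiwi lemon"],
})

# Flatten: split each string into a list, then give every word its own row.
# The ID of the originating row is repeated alongside each word.
flat = df.assign(word=df["words"].str.split()).explode("word")

# Group by word: size() counts occurrences, nunique() counts distinct IDs.
summary = (
    flat.groupby("word")["ID"]
        .agg(count="size", ids="nunique")
        .reset_index()
)
print(summary)
```

The grouping step is the same as before (`size` for the word frequency, `nunique` for the number of distinct IDs); only the flattening differs.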