I have the following dataframe:
MESSAGE DOCUMENT_ID
0 @Zuora wants to help @Network4Good with Hurricane and hurriacane... 263403828328665088
1 @ztrip please help spread the good word on hello and hello... 264142543883739136
2 #ZSwaggers @Zendaya96 did this,you should too. You... 265122997348753408
3 @Zendaya96 u have inspired me girl! So can eve... 265499798952628224
4 ''@Zendaya96 let's help the Hurricane Sandy vi... 265161977662435328
5 @Zendaya96 Help the hurricane Sandy victims . ... 265496790881669120
6 @Zendaya96 Help the hurricane Sandy victims¡¡ ... 265496111257624576
7 @Zendaya96 @bellathorne : Help the Hurricane ... 265192268137373696
8 Your Personal Discount Co... 263385298296270848
9 Your help is needed! Donate $10 to the America... 265578540001554432
How can I create a pandas dataframe from the word counts of the words in MESSAGE? For example:
DOCUMENT_ID word count
263403828328665088 hurricane 2
263403828328665088 with 1
.........
264142543883739136 hello 2
...........
I tried using the functions below, but I don't know how to attach the DOCUMENT_ID to each word:
import re
import itertools
from collections import Counter, OrderedDict
from nltk.corpus import stopwords

def wordsplit(wordlist):
    '''Strips digits, retweet markers, mentions, URLs and punctuation,
    then yields the cleaned, lowercased text if it is not a stop word.'''
    j = wordlist
    j = re.sub(r'\d+', '', j)
    j = re.sub('RT', '', j)
    j = re.sub('http', '', j)
    j = re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+://\S+)", " ", j)
    j = j.lower().strip()
    if j not in stopwords.words('english'):
        yield j

def wordSplitCount(wordlist):
    '''Merges a list into a string, splits it, removes stop words and
    then counts the occurrences, returning an ordered dictionary.'''
    string1 = ''.join(itertools.chain(filter(None, wordlist)))
    cnt = Counter()
    for i in string1.split(" "):
        i = re.sub(r'&', ' ', i.lower())
        if i not in stopwords.words('english'):
            cnt[i] += 1
    return OrderedDict(cnt)

def sortedValues(wordlist):
    '''Creates a list of (word, count) pairs with counts descending.'''
    d = wordSplitCount(wordlist)
    return sorted(d.items(), key=lambda t: t[1], reverse=True)
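
Calling these on the whole column gives me corpus-wide counts, but the DOCUMENT_ID is lost along the way. A rough sketch of what I can do so far (assuming df is the dataframe above):

# Corpus-wide word counts -- no DOCUMENT_ID attached to any word
total_counts = wordSplitCount(df.MESSAGE.tolist())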
Answer 0 (score: 0)
Use nltk to tokenize MESSAGE, then take the Cartesian product of each DOCUMENT_ID with its words, and finish with groupby and count.
import nltk
import pandas as pd
from itertools import product
from functools import reduce  # reduce is not a builtin on Python 3
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

# Tokenize every message into a list of words
df["WORDS"] = df.MESSAGE.apply(nltk.word_tokenize)

# Build (DOCUMENT_ID, WORD) pairs: the Cartesian product of each
# single-element id list with that row's token list
document_id = df.DOCUMENT_ID.apply(lambda x: [str(x)])
cartesian_product = map(lambda x: product(x[0], x[1]), zip(document_id, df.WORDS))
df2 = pd.DataFrame(reduce(lambda x, y: x + y, map(list, cartesian_product)),
                   columns=["DOCUMENT_ID", "WORD"])

# Count per (document, word), drop stop words, sort counts descending per id
result = df2.groupby(["DOCUMENT_ID", "WORD"])["DOCUMENT_ID"].count().reset_index(name="COUNT")
result = result[~result.WORD.isin(stop_words)]
result = result.sort_values(by=["DOCUMENT_ID", "COUNT"], ascending=[1, 0])
result
Output
DOCUMENT_ID WORD COUNT
0 263385298296270848 ... 1
1 263385298296270848 Co 1
2 263385298296270848 Discount 1
3 263385298296270848 Personal 1
4 263385298296270848 Your 1
6 263403828328665088 @ 2
5 263403828328665088 ... 1
7 263403828328665088 Hurricane 1
8 263403828328665088 Network4Good 1
9 263403828328665088 Zuora 1
11 263403828328665088 help 1
12 263403828328665088 hurriacane 1
14 263403828328665088 wants 1
20 264142543883739136 hello 2
16 264142543883739136 ... 1
17 264142543883739136 @ 1
19 264142543883739136 good 1
21 264142543883739136 help 1
23 264142543883739136 please 1
24 264142543883739136 spread 1
26 264142543883739136 word 1
27 264142543883739136 ztrip 1
28 265122997348753408 # 1
29 265122997348753408 , 1
30 265122997348753408 . 1
31 265122997348753408 ... 1
32 265122997348753408 @ 1
33 265122997348753408 You 1
34 265122997348753408 ZSwaggers 1
35 265122997348753408 Zendaya96 1
... ... ... ...
63 265496111257624576 Sandy 1
64 265496111257624576 Zendaya96 1
65 265496111257624576 hurricane 1
67 265496111257624576 victims¡¡ 1
68 265496790881669120 . 1
69 265496790881669120 ... 1
70 265496790881669120 @ 1
71 265496790881669120 Help 1
72 265496790881669120 Sandy 1
73 265496790881669120 Zendaya96 1
74 265496790881669120 hurricane 1
76 265496790881669120 victims 1
77 265499798952628224 ! 1
78 265499798952628224 ... 1
79 265499798952628224 @ 1
80 265499798952628224 So 1
81 265499798952628224 Zendaya96 1
83 265499798952628224 eve 1
84 265499798952628224 girl 1
86 265499798952628224 inspired 1
88 265499798952628224 u 1
89 265578540001554432 ! 1
90 265578540001554432 $ 1
91 265578540001554432 ... 1
92 265578540001554432 10 1
93 265578540001554432 America 1
94 265578540001554432 Donate 1
95 265578540001554432 Your 1
96 265578540001554432 help 1
98 265578540001554432 needed 1
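
Note that nltk.word_tokenize keeps punctuation tokens ("@", "...", "!"), which is why they appear in the counts above. If you only want words, a minimal follow-up (assuming the result dataframe from above) is to keep alphabetic tokens:

# Optional: drop punctuation tokens such as "@", "..." and "!"
result = result[result.WORD.str.isalpha()]

As a side note, on pandas 0.25 or newer the hand-rolled Cartesian product can be replaced with DataFrame.explode; a minimal sketch of the same pipeline:

# Same (DOCUMENT_ID, WORD) pairs via explode instead of itertools.product
df2 = (df.assign(WORD=df.MESSAGE.apply(nltk.word_tokenize))
         .explode("WORD")[["DOCUMENT_ID", "WORD"]])
result = (df2.groupby(["DOCUMENT_ID", "WORD"]).size()
             .reset_index(name="COUNT"))
result = result[~result.WORD.isin(stop_words)]
result = result.sort_values(by=["DOCUMENT_ID", "COUNT"], ascending=[1, 0])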