Splitting strings and assigning a unique ID per word in a Pandas DataFrame

Asked: 2018-04-25 00:23:40

Tags: regex python-3.x pandas numpy

I have the following data frame:

MESSAGE                                                     DOCUMENT_ID
0   @Zuora wants to help @Network4Good with Hurricane and hurriacane... 263403828328665088
1   @ztrip please help spread the good word on hello and hello...   264142543883739136
2   #ZSwaggers @Zendaya96 did this,you should too. You...   265122997348753408
3   @Zendaya96 u have inspired me girl! So can eve...   265499798952628224
4   ''@Zendaya96 let's help the Hurricane Sandy vi...   265161977662435328
5   @Zendaya96 Help the hurricane Sandy victims . ...   265496790881669120
6   @Zendaya96 Help the hurricane Sandy victims¡¡ ...   265496111257624576
7   @Zendaya96 @bellathorne : Help the Hurricane ...    265192268137373696
8   Your Personal  Discount Co...   263385298296270848
9   Your help is needed! Donate $10 to the America...   265578540001554432

How can I create a pandas data frame containing the counts of the words in each MESSAGE?

For example:

DOCUMENT_ID        word      count
263403828328665088 hurricane 2
263403828328665088 with      1
.........
264142543883739136 hello     2
...........

I tried using functions like the ones below, but I don't know how to attach the DOCUMENT_ID to each word:

import re
import itertools
from collections import Counter, OrderedDict
from nltk.corpus import stopwords

def wordsplit(wordlist):
    '''cleans one message and yields it if it is not a stop word'''
    j = wordlist
    j = re.sub(r'\d+', '', j)            # drop digits
    j = re.sub('RT', '', j)              # drop the retweet marker
    j = re.sub('http', '', j)            # drop bare "http"
    j = re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", j)  # drop mentions, punctuation and URLs
    j = j.lower()
    j = j.strip()
    if j not in stopwords.words('english'):
        yield j

def wordSplitCount(wordlist):
    '''merges a list into a string, splits it, removes stop words and
    then counts the occurrences, returning an ordered dictionary'''
    string1 = ' '.join(itertools.chain(filter(None, wordlist)))  # join with spaces so words do not run together
    cnt = Counter()
    for i in string1.split(" "):
        i = re.sub(r'&', ' ', i.lower())
        if i not in stopwords.words('english'):
            cnt[i] += 1
    return OrderedDict(cnt)

def sortedValues(wordlist):
    '''creates a list of (word, count) pairs with counts descending'''
    d = wordSplitCount(wordlist)
    return sorted(d.items(), key=lambda t: t[1], reverse=True)
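
For reference, one way to keep the DOCUMENT_ID attached is to count words one message at a time and emit a row per (id, word, count) pair. This is only a minimal sketch under my own assumptions: word_counts and rows are hypothetical names, and the per-message counter here is a simplified stand-in for wordSplitCount above.

import re
from collections import Counter

import pandas as pd
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

def word_counts(message):
    '''count non-stopword tokens in a single message (hypothetical helper)'''
    tokens = re.findall(r'[a-z0-9]+', message.lower())
    return Counter(t for t in tokens if t not in stop_words)

rows = []
for doc_id, message in zip(df.DOCUMENT_ID, df.MESSAGE):
    for word, count in word_counts(message).items():
        rows.append((doc_id, word, count))   # the id travels with every word

word_df = pd.DataFrame(rows, columns=["DOCUMENT_ID", "word", "count"])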

1 answer:

Answer 0 (score: 0)

Tokenize MESSAGE with nltk, take the Cartesian product of the DOCUMENT_ID with the tokens, then use groupby and count:

import nltk
import pandas as pd
from functools import reduce          # reduce lives in functools on Python 3
from itertools import product
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

# tokenize each message and pair every token with its document id
df["WORDS"] = df.MESSAGE.apply(nltk.word_tokenize)
document_id = df.DOCUMENT_ID.apply(lambda x: [str(x)])
Cartesian_product = map(lambda x: product(x[0], x[1]), zip(document_id, df.WORDS))

# flatten into one long (DOCUMENT_ID, WORD) table, count, drop stop words, sort
df2 = pd.DataFrame(reduce(lambda x, y: x + y, map(list, Cartesian_product)),
                   columns=["DOCUMENT_ID", "WORD"])
result = df2.groupby(["DOCUMENT_ID", "WORD"])["DOCUMENT_ID"].count().reset_index(name="COUNT")
result = result[~result.WORD.isin(stop_words)]
result = result.sort_values(by=["DOCUMENT_ID", "COUNT"], ascending=[True, False])
result

Output:

    DOCUMENT_ID WORD    COUNT
0   263385298296270848  ... 1
1   263385298296270848  Co  1
2   263385298296270848  Discount    1
3   263385298296270848  Personal    1
4   263385298296270848  Your    1
6   263403828328665088  @   2
5   263403828328665088  ... 1
7   263403828328665088  Hurricane   1
8   263403828328665088  Network4Good    1
9   263403828328665088  Zuora   1
11  263403828328665088  help    1
12  263403828328665088  hurriacane  1
14  263403828328665088  wants   1
20  264142543883739136  hello   2
16  264142543883739136  ... 1
17  264142543883739136  @   1
19  264142543883739136  good    1
21  264142543883739136  help    1
23  264142543883739136  please  1
24  264142543883739136  spread  1
26  264142543883739136  word    1
27  264142543883739136  ztrip   1
28  265122997348753408  #   1
29  265122997348753408  ,   1
30  265122997348753408  .   1
31  265122997348753408  ... 1
32  265122997348753408  @   1
33  265122997348753408  You 1
34  265122997348753408  ZSwaggers   1
35  265122997348753408  Zendaya96   1
... ... ... ...
63  265496111257624576  Sandy   1
64  265496111257624576  Zendaya96   1
65  265496111257624576  hurricane   1
67  265496111257624576  victims¡¡   1
68  265496790881669120  .   1
69  265496790881669120  ... 1
70  265496790881669120  @   1
71  265496790881669120  Help    1
72  265496790881669120  Sandy   1
73  265496790881669120  Zendaya96   1
74  265496790881669120  hurricane   1
76  265496790881669120  victims 1
77  265499798952628224  !   1
78  265499798952628224  ... 1
79  265499798952628224  @   1
80  265499798952628224  So  1
81  265499798952628224  Zendaya96   1
83  265499798952628224  eve 1
84  265499798952628224  girl    1
86  265499798952628224  inspired    1
88  265499798952628224  u   1
89  265578540001554432  !   1
90  265578540001554432  $   1
91  265578540001554432  ... 1
92  265578540001554432  10  1
93  265578540001554432  America 1
94  265578540001554432  Donate  1
95  265578540001554432  Your    1
96  265578540001554432  help    1
98  265578540001554432  needed  1
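
As a side note, on newer pandas versions (0.25+) the same result can be obtained without building the Cartesian product by exploding the token lists into rows. A short sketch, assuming df and stop_words as defined in the answer above:

tokens = (df.assign(WORD=df.MESSAGE.apply(nltk.word_tokenize))
            .explode("WORD")[["DOCUMENT_ID", "WORD"]])
result = (tokens[~tokens.WORD.isin(stop_words)]
            .groupby(["DOCUMENT_ID", "WORD"]).size()
            .reset_index(name="COUNT")
            .sort_values(["DOCUMENT_ID", "COUNT"], ascending=[True, False]))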