Create a list of anagrams from a pandas column

Time: 2016-03-03 06:30:24

Tags: python pandas

I have a sample DataFrame like this:

df = pd.DataFrame({
'notes': pd.Series(['meth cook makes meth with purity of over 96%', 'meth cook is also called Heisenberg', 'meth cook has cancer', 'he is known as the best meth cook', 'Meth Dealer added chili powder to his batch', 'Meth Dealer learned to make the best meth', 'everyone goes to this Meth Dealer for best shot', 'girlfriend of the meth dealer died', 'this lawyer is a people pleasing person', 'cinnabon has now hired the lawyer as a baker', 'lawyer had to take off in the end', 'lawyer has a lot of connections who knows other guy']), 
'name': pd.Series([np.nan, 'Walter White', np.nan, np.nan, np.nan, np.nan, 'Jessie Pinkman', np.nan, 'Saul Goodman', np.nan, np.nan, np.nan]), 
'occupation': pd.Series(['meth cook', np.nan, np.nan, np.nan, np.nan, np.nan, 'meth dealer', np.nan, np.nan, 'lawyer', np.nan, np.nan])
})

It looks like this:

name                                    notes                                       occupation
NaN                     meth cook makes meth with purity of over 96%                meth cook   
Walter White            meth cook is also called Heisenberg                             NaN
NaN                     meth cook has cancer                                            NaN
NaN                     he is known as the best meth cook                               NaN
NaN                     Meth Dealer added chili powder to his batch                     NaN
NaN                     Meth Dealer learned to make the best meth                       NaN
Jessie Pinkman          everyone goes to this Meth Dealer for best shot             meth dealer
NaN                     girlfriend of the meth dealer died                              NaN
Saul Goodman            this lawyer is a people pleasing person                         NaN
NaN                     cinnabon has now hired the lawyer as a baker                  lawyer
NaN                     lawyer had to take off in the end                               NaN
NaN                     lawyer has a lot of connections who knows other guy             NaN

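I want to create a list of words/anagrams from the 'notes' column. I also want to exclude any numbers/special characters from the 'notes' column (for example, I don't want 96% in the output).

I also want to write all the words (without duplicates) to a text file.

How can I do this in Python?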

1 Answer:

Answer 0: (score: 2)

IIUC, you can use str.replace to remove digits and special characters:

import pandas as pd
import numpy as np

df = pd.DataFrame({
'notes': pd.Series(['meth cook makes meth with purity of over 96%', 'meth cook is also called Heisenberg', 'meth cook has cancer', 'he is known as the best meth cook', 'Meth Dealer added chili powder to his batch', 'Meth Dealer learned to make the best meth', 'everyone goes to this Meth Dealer for best shot', 'girlfriend of the meth dealer died', 'this lawyer is a people pleasing person', 'cinnabon has now hired the lawyer as a baker', 'lawyer had to take off in the end', 'lawyer has a lot of connections who knows other guy']), 
'name': pd.Series([np.nan, 'Walter White', np.nan, np.nan, np.nan, np.nan, 'Jessie Pinkman', np.nan, 'Saul Goodman', np.nan, np.nan, np.nan]), 
'occupation': pd.Series(['meth cook', np.nan, np.nan, np.nan, np.nan, np.nan, 'meth dealer', np.nan, np.nan, 'lawyer', np.nan, np.nan])
})

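# remove digits and the characters % and *
# (regex=True is required on pandas 2.0+, where str.replace defaults to literal matching)
df['notes'] = df['notes'].str.replace(r"[0-9%*]+", "", regex=True)
print(df)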
              name                                              notes  \
0              NaN          meth cook makes meth with purity of over    
1     Walter White                meth cook is also called Heisenberg   
2              NaN                               meth cook has cancer   
3              NaN                  he is known as the best meth cook   
4              NaN        Meth Dealer added chili powder to his batch   
5              NaN          Meth Dealer learned to make the best meth   
6   Jessie Pinkman    everyone goes to this Meth Dealer for best shot   
7              NaN                 girlfriend of the meth dealer died   
8     Saul Goodman            this lawyer is a people pleasing person   
9              NaN       cinnabon has now hired the lawyer as a baker   
10             NaN                  lawyer had to take off in the end   
11             NaN  lawyer has a lot of connections who knows othe...   

     occupation  
0     meth cook  
1           NaN  
2           NaN  
3           NaN  
4           NaN  
5           NaN  
6   meth dealer  
7           NaN  
8           NaN  
9        lawyer  
10          NaN  
11          NaN 
# join all notes into one big string; the space separator keeps adjacent notes from fusing
l = " ".join(df['notes'])
print(l)
meth cook makes meth with purity of over  meth cook is also called Heisenberg meth cook has cancer he is known as the best meth cook Meth Dealer added chili powder to his batch Meth Dealer learned to make the best meth everyone goes to this Meth Dealer for best shot girlfriend of the meth dealer died this lawyer is a people pleasing person cinnabon has now hired the lawyer as a baker lawyer had to take off in the end lawyer has a lot of connections who knows other guy

print(type(l))
<class 'str'>

# remove duplicate words, keeping first-seen order (comparison is case-sensitive)
words = l.split()
individual_words = " ".join(sorted(set(words), key=words.index))
print(individual_words)
meth cook makes with purity of over is also called Heisenberg has cancer he known as the best Meth Dealer added chili powder to his batch learned make everyone goes this for shot girlfriend dealer died lawyer a people pleasing person cinnabon now hired baker had take off in end lot connections who knows other guy

#write to file  
with open("Output.txt", "w") as text_file:
    text_file.write(individual_words)
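
The regex above removes only digits, % and *, so other punctuation would still end up in the word list. A possible alternative, sketched here under assumptions that are not part of the original answer, is to extract runs of letters with str.findall instead of deleting unwanted characters. The helper name unique_words and its lowercase flag are hypothetical, and treating "Meth" and "meth" as the same word is a design choice rather than something the question asked for.

```python
import pandas as pd


def unique_words(notes, lowercase=True):
    """Return the distinct alphabetic words of a Series, in first-seen order.

    Hypothetical helper for illustration; not part of pandas.
    """
    # findall keeps only runs of ASCII letters, so "96%" contributes nothing
    tokens = notes.dropna().str.findall(r"[A-Za-z]+")
    seen = {}  # dicts preserve insertion order, so this doubles as an ordered set
    for row in tokens:
        for word in row:
            key = word.lower() if lowercase else word
            seen.setdefault(key, None)
    return list(seen)


# Small sample taken from the question's data, plus a missing value
notes = pd.Series([
    "meth cook makes meth with purity of over 96%",
    "Meth Dealer added chili powder to his batch",
    None,
])
words = unique_words(notes)
print(words)
```

The resulting list can be written to the file in the same way as above, for example with text_file.write("\n".join(words)) to get one word per line. Passing lowercase=False keeps the case-sensitive behaviour of the original answer.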