I have a sample DataFrame like this:
df = pd.DataFrame({
'notes': pd.Series(['meth cook makes meth with purity of over 96%', 'meth cook is also called Heisenberg', 'meth cook has cancer', 'he is known as the best meth cook', 'Meth Dealer added chili powder to his batch', 'Meth Dealer learned to make the best meth', 'everyone goes to this Meth Dealer for best shot', 'girlfriend of the meth dealer died', 'this lawyer is a people pleasing person', 'cinnabon has now hired the lawyer as a baker', 'lawyer had to take off in the end', 'lawyer has a lot of connections who knows other guy']),
'name': pd.Series([np.nan, 'Walter White', np.nan, np.nan, np.nan, np.nan, 'Jessie Pinkman', np.nan, 'Saul Goodman', np.nan, np.nan, np.nan]),
'occupation': pd.Series(['meth cook', np.nan, np.nan, np.nan, np.nan, np.nan, 'meth dealer', np.nan, np.nan, 'lawyer', np.nan, np.nan])
})
It looks like this:
name notes occupation
NaN meth cook makes meth with purity of over 96% meth cook
Walter White meth cook is also called Heisenberg NaN
NaN meth cook has cancer NaN
NaN he is known as the best meth cook NaN
NaN Meth Dealer added chili powder to his batch NaN
NaN Meth Dealer learned to make the best meth NaN
Jessie Pinkman everyone goes to this Meth Dealer for best shot meth dealer
NaN girlfriend of the meth dealer died NaN
Saul Goodman this lawyer is a people pleasing person NaN
NaN cinnabon has now hired the lawyer as a baker lawyer
NaN lawyer had to take off in the end NaN
NaN lawyer has a lot of connections who knows other guy NaN
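I want to create a list of words (a vocabulary) from the 'notes' column. I also want to exclude any numbers and special characters from the 'notes' column (for example, I don't want 96% in the output).
I also want to write all the words (without duplicates) to a text file.
How can I do this in Python?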
Answer 0 (score: 2)
IIUC, you can use str.replace to remove the numbers and special characters:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'notes': pd.Series(['meth cook makes meth with purity of over 96%', 'meth cook is also called Heisenberg', 'meth cook has cancer', 'he is known as the best meth cook', 'Meth Dealer added chili powder to his batch', 'Meth Dealer learned to make the best meth', 'everyone goes to this Meth Dealer for best shot', 'girlfriend of the meth dealer died', 'this lawyer is a people pleasing person', 'cinnabon has now hired the lawyer as a baker', 'lawyer had to take off in the end', 'lawyer has a lot of connections who knows other guy']),
'name': pd.Series([np.nan, 'Walter White', np.nan, np.nan, np.nan, np.nan, 'Jessie Pinkman', np.nan, 'Saul Goodman', np.nan, np.nan, np.nan]),
'occupation': pd.Series(['meth cook', np.nan, np.nan, np.nan, np.nan, np.nan, 'meth dealer', np.nan, np.nan, 'lawyer', np.nan, np.nan])
})
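# remove all digits plus the characters % and *
# (regex=True is needed on pandas 2.0+, where str.replace matches literally by default)
df['notes'] = df['notes'].str.replace(r"[0-9%*]+", "", regex=True)
print(df)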
name notes \
0 NaN meth cook makes meth with purity of over
1 Walter White meth cook is also called Heisenberg
2 NaN meth cook has cancer
3 NaN he is known as the best meth cook
4 NaN Meth Dealer added chili powder to his batch
5 NaN Meth Dealer learned to make the best meth
6 Jessie Pinkman everyone goes to this Meth Dealer for best shot
7 NaN girlfriend of the meth dealer died
8 Saul Goodman this lawyer is a people pleasing person
9 NaN cinnabon has now hired the lawyer as a baker
10 NaN lawyer had to take off in the end
11 NaN lawyer has a lot of connections who knows othe...
occupation
0 meth cook
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 meth dealer
7 NaN
8 NaN
9 lawyer
10 NaN
11 NaN
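The character class above only removes digits, % and *. If the notes may contain other punctuation, a whitelist pattern that keeps only letters and whitespace may be more robust. A minimal sketch, assuming ASCII letters are enough for this data; the helper name `clean_notes` is hypothetical and not part of pandas:

```python
import pandas as pd

def clean_notes(notes: pd.Series) -> pd.Series:
    """Drop every character that is not an ASCII letter or whitespace."""
    # regex=True makes the pattern a regular expression rather than a literal string
    return notes.str.replace(r"[^A-Za-z\s]+", "", regex=True)

sample = pd.Series(["purity of over 96%", "call #42, now!"])
cleaned = clean_notes(sample)
print(cleaned.tolist())
```

Whitespace left behind by removed tokens is harmless here, because a later `split()` collapses runs of spaces.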
# join all notes into one big string, separated by spaces
# (df['notes'].sum() would glue the last word of one note to the first word of the next)
text = " ".join(df['notes'])
print(text)
meth cook makes meth with purity of over  meth cook is also called Heisenberg meth cook has cancer he is known as the best meth cook Meth Dealer added chili powder to his batch Meth Dealer learned to make the best meth everyone goes to this Meth Dealer for best shot girlfriend of the meth dealer died this lawyer is a people pleasing person cinnabon has now hired the lawyer as a baker lawyer had to take off in the end lawyer has a lot of connections who knows other guy
print(type(text))
<class 'str'>
# remove duplicate words, keeping first-occurrence order
words = text.split()
individual_words = " ".join(sorted(set(words), key=words.index))
print(individual_words)
meth cook makes with purity of over is also called Heisenberg has cancer he known as the best Meth Dealer added chili powder to his batch learned make everyone goes this for shot girlfriend dealer died lawyer a people pleasing person cinnabon now hired baker had take off in end lot connections who knows other guy
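Note that the deduplication above is case-sensitive, so `meth` and `Meth` are kept as separate words. If they should count as one word, lower-casing first is one option. A sketch using `dict.fromkeys`, which preserves insertion order on Python 3.7+ and avoids the repeated `words.index` lookups; the function name `unique_words` is hypothetical:

```python
def unique_words(text, ignore_case=True):
    """Return the distinct words of `text` in first-seen order."""
    words = text.split()
    if ignore_case:
        words = [w.lower() for w in words]
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(words))

print(unique_words("Meth Dealer met the meth dealer"))
# ['meth', 'dealer', 'met', 'the']
```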
# write to file
with open("Output.txt", "w") as text_file:
    text_file.write(individual_words)
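If one word per line is easier to consume downstream, the whole pipeline can also be written with pandas string methods. This is only a sketch: it assumes pandas 0.25+ for `Series.explode`, treats words case-insensitively, and uses a hypothetical function `write_vocab` and a hypothetical file name `vocab.txt` in the system temp directory:

```python
import os
import tempfile

import pandas as pd

def write_vocab(notes: pd.Series, path: str) -> list:
    """Extract distinct alphabetic words from `notes` and write one per line."""
    words = (
        notes.dropna()
             .str.lower()
             .str.findall(r"[a-z]+")   # letters only, so '96%' yields no word
             .explode()                # one row per word
             .dropna()                 # notes that contained no words at all
             .drop_duplicates()        # keeps the first occurrence of each word
             .tolist()
    )
    with open(path, "w") as fh:
        fh.write("\n".join(words))
    return words

path = os.path.join(tempfile.gettempdir(), "vocab.txt")
vocab = write_vocab(pd.Series(["meth cook makes meth", "purity of over 96%"]), path)
print(vocab)
```

Calling `write_vocab(df['notes'], "Output.txt")` would apply the same steps to the DataFrame from the question.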