My code reads a text xlsx file and shows word frequencies (how many times each word occurs). But I want to remove punctuation, symbols (#, $, %), and unnecessary word forms (stop words) before they are counted or printed.
Code:
import pandas as pd
import re
stop_words = [
"a", "about", "above", "across", "after", "afterwards",
"again", "all", "almost", "alone", "along", "already", "also",
"although", "always", "am", "among", "amongst", "amoungst", "amount", "an",
"and", "another", "any", "anyhow", "anyone", "anything", "anyway", "anywhere", "are", "as", "at", "be", "became",
"because", "become","becomes", "becoming", "been", "before", "behind", "being", "beside", "besides", "between",
"beyond", "both", "but", "by","can", "cannot", "cant", "could", "couldnt", "de", "describe", "do", "done", "each",
"eg", "either", "else", "enough", "etc", "even", "ever", "every", "everyone", "everything", "everywhere", "except", "few", "find","for",
"found", "four", "from", "further", "get", "give", "go", "had", "has", "hasnt", "have", "he", "hence", "her", "here", "hereafter", "hereby", "herein",
"hereupon", "hers", "herself", "him", "himself", "his", "how", "however", "i", "ie", "if", "in", "indeed", "is", "it", "its", "itself", "keep", "least",
"less", "ltd", "made", "many", "may", "me", "meanwhile", "might", "mine", "more", "moreover", "most", "mostly", "much", "must", "my", "myself", "name",
"namely", "neither", "never", "nevertheless", "next","no", "nobody", "none", "noone", "nor", "not", "nothing", "now", "nowhere", "of", "off", "often",
"on", "once", "one", "only", "onto", "or", "other", "others", "otherwise", "our", "ours", "ourselves", "out", "over", "own", "part","perhaps", "please",
"put", "rather", "re", "same", "see", "seem", "seemed", "seeming", "seems", "she", "should","since", "sincere","so", "some", "somehow", "someone",
"something", "sometime", "sometimes", "somewhere", "still", "such", "take","than", "that", "the", "their", "them", "themselves", "then", "thence", "there"
"thereafter", "thereby", "therefore", "therein", "thereupon", "these", "they",
"this", "those", "though", "through", "throughout",
"thru", "thus", "to", "together", "too", "toward", "towards",
"under", "until", "up", "upon", "us",
"very", "was", "we", "well", "were", "what", "whatever", "when",
"whence", "whenever", "where", "whereafter", "whereas", "whereby",
"wherein", "whereupon", "wherever", "whether", "which", "while",
"who", "whoever", "whom", "whose", "why", "will", "with",
"within", "without", "would", "yet", "you", "your", "yours", "yourself", "yourselves"
]
df = pd.read_excel('C:\\Users\\farid-PC\\Desktop\\Tester.xlsx')
pd.set_option('display.max_colwidth', 1000)
frequency = df.Text.str.split(expand=True).stack().value_counts()
T = 450  # total number of words in the file
word_freq = frequency/T
print(word_freq)
Answer 0 (score: 0)
If you are using Python 3, try the str.maketrans() method and look at the simple code below. Note that when the string is printed, all of the unwanted characters have been removed.
intab = "!#&" #string of chars you don't want
outtab = " " # must have same no. of spaces as chars in intab
trantab = str.maketrans(intab, outtab)
str="This ! string # has & unwanted ! stuff &"
print(str.translate(trantab))
output = This   string   has   unwanted   stuff
Read the code comments carefully! The outtab variable, which holds whatever you want to replace the unwanted characters with, must contain the same number of characters as intab.
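If you only need to delete the unwanted characters rather than replace them, the three-argument form of str.maketrans() sidesteps the equal-length requirement entirely. A minimal sketch, which also shows applying the same table to the question's DataFrame column with pandas' Series.str.translate (assuming the df loaded in the question):

# a minimal sketch: the third argument of str.maketrans() lists characters to delete
deltab = str.maketrans('', '', '!#&')
s = "This ! string # has & unwanted ! stuff &"
print(s.translate(deltab))  # This  string  has  unwanted  stuff

# the same translation table works on a pandas string column
# df['Text'] = df['Text'].str.translate(deltab)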
Hope this helps! Bill
Answer 1 (score: 0)
Probably not an efficient solution, but it seems to produce the correct output:
import re
import string

import pandas as pd

# stop_words is the list defined in the question
exclude = list(string.punctuation) + stop_words + ['--']
remove = re.compile('[%s]' % re.escape(string.punctuation))

df = pd.read_excel('C:\\Users\\farid-PC\\Desktop\\Tester.xlsx')
pd.set_option('display.max_colwidth', 1000)

# count the words in the file
# count = 0
# for l in df['Text']:
#     count += len(l.split())

f = []
for s in df['Text']:
    try:
        s = s.lower()
    except AttributeError:
        s = str(s)  # non-string cells (e.g. NaN) would otherwise break re.sub below
    no_nums = re.sub(r'[0-9]+', '', s)  # strip digits
    o = remove.sub('', no_nums)         # strip punctuation
    line = o.split()
    common = set(line).intersection(exclude)
    line = ' '.join(word for word in line if word not in common)
    f.append(line)

ndf = pd.DataFrame({'Text': f})
frequency = ndf.Text.str.split(expand=True).stack().value_counts()
T = 450  # consider changing this to the value in `count`
word_freq = frequency / T
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(word_freq)
Output:
says 0.012632
percent 0.010526
million 0.008421
federal 0.008421
trump 0.008421
first 0.006316
government 0.006316
know 0.006316
donald 0.006316
year 0.006316
clinton 0.004211
half 0.004211
worth 0.004211
hillary 0.004211
reagan 0.004211
banks 0.004211
there 0.004211
years 0.004211
people 0.004211
tax 0.004211
ronald 0.004211
did 0.004211
democrats 0.004211
goes 0.004211
food 0.004211
company 0.004211
gave 0.004211
paid 0.002105
plan 0.002105
play 0.002105
campaign 0.002105
advocated 0.002105
scott 0.002105
legislation 0.002105
equality 0.002105
newt 0.002105
address 0.002105
vehicle 0.002105
health 0.002105
law 0.002105
pace 0.002105
wall 0.002105
individual 0.002105
minimum 0.002105
proceeds 0.002105
spend 0.002105
center 0.002105
false 0.002105
faced 0.002105
county 0.002105
bringing 0.002105
help 0.002105
got 0.002105
requires 0.002105
projects 0.002105
handling 0.002105
clintons 0.002105
worse 0.002105
gov 0.002105
package 0.002105
foundation 0.002105
retirement 0.002105
vice 0.002105
like 0.002105
bill 0.002105
agriculture 0.002105
biggest 0.002105
stabilize 0.002105
meetings 0.002105
employees 0.002105
walker 0.002105
congress 0.002105
confiscation 0.002105
back 0.002105
economic 0.002105
scammed 0.002105
marriage 0.002105
road 0.002105
per 0.002105
biden 0.002105
documents 0.002105
congressman 0.002105
texas 0.002105
toxic 0.002105
drop 0.002105
fed 0.002105
superiors 0.002105
sales 0.002105
shelby 0.002105
deport 0.002105
edwards 0.002105
alcohol 0.002105
ginsburg 0.002105
american 0.002105
created 0.002105
proposed 0.002105
act 0.002105
nodded 0.002105
proposes 0.002105
layoffs 0.002105
during 0.002105
mike 0.002105
john 0.002105
receive 0.002105
operations 0.002105
disability 0.002105
state 0.002105
joint 0.002105
wisconsin 0.002105
medicare 0.002105
given 0.002105
citizenship 0.002105
billion 0.002105
north 0.002105
increase 0.002105
scalia 0.002105
halfcent 0.002105
big 0.002105
president 0.002105
criminal 0.002105
commute 0.002105
transportation 0.002105
tennessee 0.002105
double 0.002105
birthright 0.002105
recent 0.002105
suzanne 0.002105
advocating 0.002105
attacks 0.002105
building 0.002105
contributors 0.002105
fact 0.002105
poll 0.002105
recession 0.002105
say 0.002105
schools 0.002105
mccain 0.002105
usmexico 0.002105
mandate 0.002105
just 0.002105
nations 0.002105
threat 0.002105
including 0.002105
security 0.002105
stimulus 0.002105
seniors 0.002105
flores 0.002105
morning 0.002105
considering 0.002105
wants 0.002105
time 0.002105
cut 0.002105
gun 0.002105
role 0.002105
recovery 0.002105
military 0.002105
five 0.002105
single 0.002105
georgia 0.002105
want 0.002105
stamps 0.002105
advantage 0.002105
benefits 0.002105
literally 0.002105
vets 0.002105
reporter 0.002105
gallup 0.002105
afternoon 0.002105
tasked 0.002105
violate 0.002105
bomb 0.002105
days 0.002105
spending 0.002105
rid 0.002105
joe 0.002105
marijuana 0.002105
bonamici 0.002105
care 0.002105
korea 0.002105
votes 0.002105
fund 0.002105
scheme 0.002105
major 0.002105
ri 0.002105
laws 0.002105
number 0.002105
deceased 0.002105
yes 0.002105
session 0.002105
trillion 0.002105
wage 0.002105
said 0.002105
past 0.002105
pence 0.002105
republicans 0.002105
gingrich 0.002105
asked 0.002105
against 0.002105
americans 0.002105
plus 0.002105
current 0.002105
foreign 0.002105
politifact 0.002105
committed 0.002105
affecting 0.002105
supports 0.002105
choice 0.002105
admits 0.002105
border 0.002105
secretary 0.002105
hes 0.002105
former 0.002105
recently 0.002105
country 0.002105
dtype: float64
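As the comment on T in the code above suggests, the hardcoded 450 can be derived from the data instead. A minimal sketch, using the frequency series already computed:

# a sketch: take the denominator from the counted words instead of hardcoding 450
count = frequency.sum()        # total number of word tokens after cleaning
word_freq = frequency / count  # true relative frequencies, which sum to 1.0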