Question

我有两个文本文件，一个是sample.txt，另一个是common.txt。首先，我想从sample.txt中删除常用单词。在common.txt和代码sample.txt中找到常用词已根据需要进行了修改。 common.txt是：

a
about
after
again
against
ago
all
along
also
always
an
and
another
any
are
around
as
at
away
back
be
because
been
before
began
being
between
both
but
by
came
can
come
could
course
day
days
did
do
down
each
end
even
ever
every
first
for
four
from
get
give
go
going
good
got
great
had
half
has
have
he
head
her
here
him
his
house
how
hundred
i
if
in
into
is
it
its
just
know
last
left
life
like
little
long
look
made
make
man
many
may
me
men
might
miles
more
most
mr
much
must
my
never
new
next
no
not
nothing
now
of
off
old
on
once
one
only
or
other
our
out
over
own
people
pilot
place
put
right
said
same
saw
say
says
see
seen
she
should
since
so
some
state
still
such
take
tell
than
that
the
their
them
then
there
these
they
thing
think
this
those
thousand
three
through
time
times
to
told
too
took
two
under
up
upon
us
use
used
very
want
was
way
we
well
went
were
what
when
where
which
while
who
will
with
without
work
world
would
year
years
yes
yet
you
young
your

sample.txt是：

    THE Mississippi is well worth reading about. It is not a commonplace
river, but on the contrary is in all ways remarkable. Considering the
Missouri its main branch, it is the longest river in the world--four
thousand three hundred miles. It seems safe to say that it is also the
crookedest river in the world, since in one part of its journey it uses
up one thousand three hundred miles to cover the same ground that the
crow would fly over in six hundred and seventy-five. It discharges three
times as much water as the St. Lawrence, twenty-five times as much
as the Rhine, and three hundred and thirty-eight times as much as the
Thames. No other river has so vast a drainage-basin: it draws its water
supply from twenty-eight States and Territories; from Delaware, on the
Atlantic seaboard, and from all the country between that and Idaho on
the Pacific slope--a spread of forty-five degrees of longitude. The
Mississippi receives and carries to the Gulf water from fifty-four
subordinate rivers that are navigable by steamboats, and from some
hundreds that are navigable by flats and keels. The area of its
drainage-basin is as great as the combined areas of England, Wales,
Scotland, Ireland, France, Spain, Portugal, Germany, Austria, Italy,
and Turkey; and almost all this wide region is fertile; the Mississippi
valley, proper, is exceptionally so.

删除常用词后，我需要将其分解为句子并使用＆＃34;。＆＃34;句子中的目标词的完整停止和计数。此外，需要为目标词创建配置文件以显示关联的单词及其计数。例如，如果＆＃34; river＆＃34;是目标词，相关词包括＆＃34;普通＆＃34;，＆＃34;相反＆＃34;以及＃34; river＆＃34;在同一句话中（在一个句号内）发生的事情。所需的输出按降序列出：

river 4
     ground: 1
     journey: 1
     longitude: 1
     main: 1
     world--four: 1
     contrary: 1
     cover: 1
     ...
mississippi 3
     area: 1
     steamboats: 1
     germany: 1
     reading: 1
     france: 1
     proper: 1
     ...

三个点表示关联的单词应该更多，并且不在此处列出。现在这里是编码到目前为止：

def open_file(file):
file = "/Users/apple/Documents/sample.txt"
file1 = "/Users/apple/Documents/common.txt"
with open(file1, "r") as f:  
    common_words = {i.strip() for i in f}  

punctionmark = ":;,'\"."   
trans_table = str.maketrans(punctionmark, " " * len(punctionmark))

word_counter = {} 
with open(file, "r") as f: 
    for line in f: 
        for word in line.translate(trans_table).split(): 
            if word.lower() not in common_words: 
                word_counter[word.lower()] = word_counter.get(word, 0) + 1 
                #print(word_counter)

print("\n".join("{} {}".format(w, c) for w, c in word_counter.items()))

现在我的输出是：

mississipi 1
reading 1
about 1
commonplace 1
river 4
.
.
.

到目前为止，我已计算了目标词的出现次数，但仍坚持按降序对目标词进行排序，并获得相关词的计数。任何人都可以在不导入其他模块的情非常感谢。

Answer 1

您可以使用re.findall对文本进行标记，过滤和分组，然后遍历目标和相关单词的结构以查找最终计数：

import re, string
from collections import namedtuple
import itertools
stop_words = [i.strip('\n') for i in open('filename.txt')]
text = open('filename.txt').read()
grammar = {'punctuation':string.punctuation, 'stopword':stop_words}
token = namedtuple('token', ['name', 'value'])
tokenized_file = [token((lambda x:'word' if not x else x[0])([a for a, b in grammar.items() if i.lower() in b]), i) for i in re.findall('\w+|\!|\-|\.|;|,:', text)]
filtered_file = [i for i in tokenized_file if i.name != 'stopword']
grouped_data = [list(b) for _, b in itertools.groupby(filtered_file, key=lambda x:x.value not in '!.?')]
text_with_sentences = ' '.join([' '.join([c.value for c in grouped_data[i]])+grouped_data[i+1][0].value for i in range(0, len(grouped_data), 2)])

目前，text_with_sentences的结果是：

'Mississippi worth reading. commonplace river contrary ways remarkable. Considering Missouri main branch longest river - -. seems safe crookedest river part journey uses cover ground crow fly six seventy - five. discharges water St. Lawrence twenty - five Rhine thirty - eight Thames. river vast drainage - basin draws water supply twenty - eight States Territories ; Delaware Atlantic seaboard country Idaho Pacific slope - - spread forty - five degrees longitude. Mississippi receives carries Gulf water fifty - subordinate rivers navigable steamboats hundreds navigable flats keels. area drainage - basin combined areas England Wales Scotland Ireland France Spain Portugal Germany Austria Italy Turkey ; almost wide region fertile ; Mississippi valley proper exceptionally.'

要查找关键字分析的计数，您可以使用collections.Counter：

import collections
counts = collections.Counter(map(str.lower, re.findall('[\w\-]+', text)))
structure = [['river', ['ground', 'journey', 'longitude', 'main', 'world--four', 'contrary', 'cover']], ['mississippi', ['area', 'steamboats', 'germany', 'reading', 'france', 'proper']]]
new_structure = [{'keyword':counts.get(a, 0), 'associated':{i:counts.get(i, 0) for i in b}} for a, b in structure]

输出：

[{'associated': {'cover': 1, 'longitude': 1, 'journey': 1, 'contrary': 1, 'main': 1, 'world--four': 1, 'ground': 1}, 'keyword': 4}, {'associated': {'area': 1, 'france': 1, 'germany': 1, 'proper': 1, 'reading': 1, 'steamboats': 1}, 'keyword': 3}]

不使用任何模块，可以使用str.split：

words = [[i[:-1], i[-1]] if i[-1] in string.punctuation else [i] for i in text.split()]
new_words = [i for b in words for i in b if i.lower() not in stop_words]
def find_groups(d, _pivot = '.'):
   current = [] 
   for i in d: 
     if i == _pivot:
       yield ' '.join(current)+'.'
       current = []
     else:
       current.append(i)

print(list(find_groups(new_words)))
counts = {}
for i in new_words:
   if i.lower() not in counts:
     counts[i.lower()] = 1
   else:
     counts[i.lower()] += 1

structure = [['river', ['ground', 'journey', 'longitude', 'main', 'world--four', 'contrary', 'cover']], ['mississippi', ['area', 'steamboats', 'germany', 'reading', 'france', 'proper']]]
new_structure = [{'keyword':counts.get(a, 0), 'associated':{i:counts.get(i, 0) for i in b}} for a, b in structure]

输出：

['Mississippi worth reading.', 'commonplace river , contrary ways remarkable.', 'Considering Missouri main branch , longest river world--four.', 'seems safe crookedest river , part journey uses cover ground crow fly six seventy-five.', 'discharges water St.', 'Lawrence , twenty-five Rhine , thirty-eight Thames.', 'river vast drainage-basin : draws water supply twenty-eight States Territories ; Delaware , Atlantic seaboard , country Idaho Pacific slope--a spread forty-five degrees longitude.', 'Mississippi receives carries Gulf water fifty-four subordinate rivers navigable steamboats , hundreds navigable flats keels.', 'area drainage-basin combined areas England , Wales , Scotland , Ireland , France , Spain , Portugal , Germany , Austria , Italy , Turkey ; almost wide region fertile ; Mississippi valley , proper , exceptionally.']
[{'associated': {'cover': 1, 'longitude': 1, 'journey': 1, 'contrary': 1, 'main': 1, 'world--four': 1, 'ground': 1}, 'keyword': 4}, {'associated': {'area': 1, 'france': 1, 'germany': 1, 'proper': 1, 'reading': 1, 'steamboats': 1}, 'keyword': 3}]

如何按字典值对目标词进行排序并计算相关词？

1 个答案: