Question

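I have been working on a program that finds words that occur only once in a text. However, when the program finds such a word, I would like it to show some context for that word.

Here is my code.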

from collections import Counter
from string import punctuation

text = str("bible.txt")
with open(text) as f:
    word_counts = Counter(word.strip(punctuation) for line in f for word in line.split())

unique = [word.lower() for word, count in word_counts.items() if count == 1]

with open(text, 'r') as myfile:
    wordlist = myfile.read().lower()

print(unique)
print(len(unique), " unique words found.")

for word in unique:
    first = 1
    second = 1
    index = wordlist.index(word)
    if wordlist[index - first:index] is not int():
        first += 1
    if wordlist[index:index + second] is not ".":
        second += 1
    print(" ")

    first_part = wordlist[index - first:index]
    second_part = wordlist[index:index + second]
    print(word)
    print("%s %s" % ("".join(first_part), "".join(second_part)))

This is the input text.

Ideally, it would display:

sojournings
1 Jacob lived in the land of his father's sojournings, in the land of 
Canaan.

generations
2 These are the generations of Jacob.

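Basically, I want it to display the sentence the word is in, with the verse number at the beginning. I know I will have to do something with the index, but honestly I have no idea how to go about it.

Any help would be greatly appreciated.

Thanks, Ben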

Answer 1

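I would retrieve the index of the first letter of the chosen word (within the whole string, which for the Bible will be long ;') and then find the first "." before that letter. I would also find the next "." after it, but possibly enforce a minimum length so that short sentences still give enough context. That gives you the range to include/print/display.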

def stringer():

    mystring = """ the quick brown fox. Which jumped over the lazy dog and died a horrible death. ad ipsum valorem"""

    word_posn = mystring.find("lazy")
    start_posn = mystring[:word_posn].rfind(".") + 1
    end_posn = mystring[word_posn:].find(".")+word_posn +1

    return '"' + mystring[start_posn:end_posn].strip() + '"'

This code was written very quickly, so apologies for any errors.
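The minimum-length idea mentioned above is not in the snippet, so here is a rough sketch of one way it could work. The function name `context_with_min_length` and the `min_length` parameter are made up for illustration, and it assumes sentences end with a full stop only.

```python
def context_with_min_length(text, word, min_length=40):
    """Return the sentence around `word`, widened to at least `min_length` chars.

    `min_length` is a hypothetical parameter, not part of the snippet above.
    """
    word_posn = text.find(word)
    if word_posn == -1:
        return None  # word not present in the text

    # Start just after the previous full stop (or at 0 if there is none).
    start_posn = text.rfind(".", 0, word_posn) + 1

    # End at the next full stop after the word (or at the end of the text).
    end_posn = text.find(".", word_posn + len(word))
    end_posn = len(text) if end_posn == -1 else end_posn + 1

    # If the range is too short, keep extending to the following full stops.
    while end_posn - start_posn < min_length and end_posn < len(text):
        next_stop = text.find(".", end_posn)
        end_posn = len(text) if next_stop == -1 else next_stop + 1

    return text[start_posn:end_posn].strip()


sample = "The quick brown fox. It ran. The lazy dog slept all day. The end."
print(context_with_min_length(sample, "ran", min_length=20))
```

With `min_length=0` this behaves like the plain previous-stop/next-stop lookup; a larger value pulls in the following sentence(s) when the sentence containing the word is very short.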

Answer 2

I'll leave the full code here for anyone who comes across this later.

from collections import Counter
from string import punctuation

path = input("Path to file: ")
with open(path) as f:
    word_counts = Counter(word.strip(punctuation) for line in f for word in line.split())

with open(path) as f:
    wordlist = f.read().replace('\n', '')

unique = [word for word, count in word_counts.items() if count == 1]

print(unique)
print(len(unique), " unique words found.")

for word in unique:
    print(" ")
    word_posn = wordlist.find(word)
    # `"." or "," or "!" or "?"` evaluates to just ".", so only full stops act as boundaries.
    start_posn = wordlist[:word_posn].rfind(".") + 1
    end_posn = wordlist[word_posn:].find(".") + word_posn + 1
    print(word)
    print(wordlist[start_posn:end_posn])

Also, a shout-out to @lb_so for the help!
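One thing to watch out for: `str.find` matches substrings, so a word like "generations" can be found inside a longer word first, and only "." is treated as a sentence boundary. Below is a possible sketch that matches whole words with `re` and treats ".", "!" and "?" as sentence endings. The helper name `sentence_containing` and the choice of terminators are assumptions for illustration, not part of the code above.

```python
import re

# Assumption: a sentence ends with one of these characters.
TERMINATORS = ".!?"


def sentence_containing(text, word):
    """Return the sentence holding the first whole-word match of `word`, or None."""
    # \b keeps "generations" from matching inside "Regenerations".
    match = re.search(r"\b" + re.escape(word) + r"\b", text)
    if match is None:
        return None

    # The sentence starts right after the closest terminator before the word
    # (rfind returns -1 when there is none, which gives a start of 0).
    start = max(text.rfind(ch, 0, match.start()) for ch in TERMINATORS) + 1

    # The sentence ends at the closest terminator after the word, if any.
    ends = [i for i in (text.find(ch, match.end()) for ch in TERMINATORS) if i != -1]
    end = min(ends) + 1 if ends else len(text)

    return text[start:end].strip()


sample = "Regenerations happen! 2 These are the generations of Jacob."
print(sentence_containing(sample, "generations"))
```

Because the verse number sits right after the previous sentence's terminator, it ends up at the start of the returned sentence without any extra handling.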

Working with a text file when looking for keywords
