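I am trying to read a text file and build a list of seed words (the words that start a sentence), plus a second list containing all the adjacent words, excluding the seed words.
The problem I'm running into is that words containing an apostrophe are split at the apostrophe, and the rest of the word is dropped. How can I keep them as they appear in the file?
Text contained in the file: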
This doesn't seem to work. Is findall or sub the correct approach? Or neither?
CODE:
import re
my_string = open('sample.txt', 'r').read()
starter = list(set(re.findall(r"(?<![a-z]\s)[A-Z]\w+", my_string)))
adjacent = re.findall(r"(?<!(?<![a-z]\s))\w+", my_string)
print(my_string)
Result:
['doesn', 'seem', 'to', 'work', 'sub', 'or', 'findall', 'the', 'correct', 'approach', 'neither']
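Answer 0: (score: 0)
The problem I'm running into is that words containing an apostrophe are split at the apostrophe, and the rest of the word is dropped.
\w+ is not your friend here. It is shorthand for alphabetic characters, digits, and the underscore; it does not include hyphens or apostrophes.
Use a character class instead. That way you can include the apostrophe and exclude digits and underscores: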
r"[A-Za-z\']+" # works better than \w+
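To make the difference concrete, here is a minimal sketch that runs both patterns over the sample sentence from the question. The variable names (text, with_w, words) are only for illustration.

```python
import re

text = "This doesn't seem to work. Is findall or sub the correct approach? Or neither?"

# \w covers letters, digits and the underscore, so the match stops at the apostrophe.
with_w = re.findall(r"\w+", text)
print(with_w[:3])   # ['This', 'doesn', 't']

# A character class that lists the apostrophe keeps contractions in one piece.
words = re.findall(r"[A-Za-z']+", text)
print(words[:3])    # ['This', "doesn't", 'seem']
```

Inside a character class the apostrophe needs no backslash when the pattern is written in a double-quoted raw string, so r"[A-Za-z']+" and r"[A-Za-z\']+" match the same text.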
Answer 1: (score: 0)
It's easier with two regular expressions:
import re
txt="""\
This doesn't seem to work. Is findall or sub the correct approach? Or neither? Isn't it grand?
"""
first_words=set(re.findall(r'(?:^|(?:[.!?]\s))(\b[a-zA-Z\']+)', txt))
rest={word for word in re.findall(r'(\b[a-zA-Z\']+)', txt) if word not in first_words}
print(first_words)
# e.g. {'This', 'Is', 'Or', "Isn't"} (set order may vary)
print(rest)
# e.g. {"doesn't", 'sub', 'grand', 'the', 'work', 'it', 'findall', 'to', 'neither', 'correct', 'seem', 'approach', 'or'} (set order may vary)
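Since the question asks for lists rather than sets, the same two-regex idea can be sketched so that the words stay in the order they appear. This is only one possible variant, not the answerer's code: the helper name split_words and the WORD constant are made up for this example, and it assumes a sentence starter is any word at the start of the text or after ., ! or ? followed by whitespace.

```python
import re

# Hypothetical word pattern: letters plus the apostrophe.
WORD = r"[A-Za-z']+"

def split_words(text):
    """Return (starters, others) as ordered lists without duplicates."""
    # A starter sits at the start of the text or after ., ! or ? plus whitespace.
    starters = re.findall(r"(?:^|[.!?]\s+)(" + WORD + ")", text)
    starter_set = set(starters)
    # Every other word, skipping anything that also occurs as a starter.
    others = [w for w in re.findall(WORD, text) if w not in starter_set]
    # dict.fromkeys drops repeats while keeping first-seen order (Python 3.7+).
    return list(dict.fromkeys(starters)), list(dict.fromkeys(others))

sample = "This doesn't seem to work. Is findall or sub the correct approach? Or neither?"
starters, others = split_words(sample)
print(starters)  # ['This', 'Is', 'Or']
print(others)
```

To run it on the file from the question, read the contents first, for example with open('sample.txt') as f: text = f.read(), and pass text to split_words.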