Question

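I'm trying to read from a text file and build a list of seed words that start a sentence, plus a second list containing all the adjacent words, excluding the seed words.

The problem I'm having is that words containing an apostrophe are split at the apostrophe and the rest of the word is dropped. How do you keep them as they appear in the file?

Text contained in the file: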

This doesn't seem to work. Is findall or sub the correct approach? Or neither?

Code:

import re

my_string = open('sample.txt', 'r').read()

starter = list(set(re.findall(r"(?<![a-z]\s)[A-Z]\w+", my_string)))
adjacent = re.findall(r"(?<!(?<![a-z]\s))\w+", my_string) 

print(adjacent)

Result:

['doesn', 'seem', 'to', 'work', 'sub', 'or', 'findall', 'the', 'correct', 'approach', 'neither']

Answer 1

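The problem I'm having is that words containing an apostrophe are split at the apostrophe and the rest of the word is dropped.

Backslash-w-plus (\w+) is not your friend. It is shorthand for alphabetic characters, digits, and the underscore. It does not include hyphens or apostrophes.

Use a character range instead. That way you can include the apostrophe and exclude digits and underscores: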

r"[A-Za-z\']+"           # works better than \w+
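
To make the difference concrete, here is a small sketch that runs both patterns over the sentence from the question. The variable names (`sample`, `with_w`, `with_class`) are only illustrative and are not from the original post.

```python
import re

# Sample sentence taken from the question's text file.
sample = "This doesn't seem to work. Is findall or sub the correct approach? Or neither?"

# \w+ stops at the apostrophe, so "doesn't" comes back as two pieces.
with_w = re.findall(r"\w+", sample)

# A character class that lists the apostrophe keeps the contraction whole.
with_class = re.findall(r"[A-Za-z']+", sample)

print(with_w[:3])      # ['This', 'doesn', 't']
print(with_class[:3])  # ['This', "doesn't", 'seem']
```

Note that this class would also keep a stray leading or trailing apostrophe (for example around a single-quoted word), so it may need tightening for other inputs.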

Answer 2

It's easier with two regular expressions:

import re

txt="""\
This doesn't seem to work. Is findall or sub the correct approach? Or neither? Isn't it grand?
"""

first_words=set(re.findall(r'(?:^|(?:[.!?]\s))(\b[a-zA-Z\']+)', txt))

rest={word for word in re.findall(r'(\b[a-zA-Z\']+)', txt) if word not in first_words}

print(first_words)
# {'This', 'Is', 'Or', "Isn't"}  (set order may vary)
print(rest)
# {"doesn't", 'sub', 'grand', 'the', 'work', 'it', 'findall', 'to', 'neither', 'correct', 'seem', 'approach', 'or'}  (set order may vary)
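
Since the question asks for lists rather than sets, the same two-regex idea can be sketched as a function that returns de-duplicated lists in order of first appearance. The name `split_seeds` and the constant `WORD` are hypothetical helpers introduced for this illustration, and the order-preserving `dict.fromkeys` trick assumes Python 3.7 or later.

```python
import re

# Letters plus the apostrophe, so contractions stay in one piece.
WORD = r"[A-Za-z']+"

def split_seeds(text):
    """Return (starters, adjacent) as de-duplicated lists in order of first appearance."""
    # A seed word sits at the start of the text or right after ., ! or ? plus whitespace.
    seeds = re.findall(r"(?:^|[.!?]\s+)(" + WORD + ")", text)
    starters = list(dict.fromkeys(seeds))
    seen = set(starters)
    # Every other word, skipping anything already recorded as a seed word.
    others = (w for w in re.findall(WORD, text) if w not in seen)
    adjacent = list(dict.fromkeys(others))
    return starters, adjacent

starters, adjacent = split_seeds(
    "This doesn't seem to work. Is findall or sub the correct approach? Or neither?"
)
print(starters)  # ['This', 'Is', 'Or']
print(adjacent)
```

The comparison is case-sensitive, so the lowercase 'or' stays in the second list even though 'Or' is a seed word.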
