Question

I am trying to split a sentence into words.

words = content.lower().split()

This gives me a list of words like

'evening,', 'and', 'there', 'was', 'morning--the', 'first', 'day.'

and with this code:

def clean_up_list(word_list):
    clean_word_list = []
    for word in word_list:
        symbols = "~!@#$%^&*()_+`{}|\"?><`-=\][';/.,']"
        for i in range(0, len(symbols)):
            word = word.replace(symbols[i], "")
        if len(word) > 0:
            clean_word_list.append(word)

I get something like:

'evening', 'and', 'there', 'was', 'morningthe', 'first', 'day'

If you look at the word "morningthe" in the list, it used to have "--" between the two words. Is there any way I can split it into two words, like 'morning', 'the'?

Answer 1

I would suggest a regex-based solution:

import re

def to_words(text):
    return re.findall(r'\w+', text)

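This finds every word, that is, every run of word characters (letters, digits, and underscores), and ignores symbols, separators, and whitespace.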

>>> to_words("The morning-the evening")
['The', 'morning', 'the', 'evening']

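Note that if you are looping over the words, re.finditer may be a better choice: it returns an iterator, so you don't have to hold the whole list of words in memory at once.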
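A minimal sketch of the re.finditer variant mentioned above. The helper name iter_words is made up for this illustration; it reuses the same r'\w+' pattern as to_words.

```python
import re

def iter_words(text):
    """Yield words one at a time instead of building a full list.

    iter_words is a hypothetical helper name used only for this sketch.
    """
    for match in re.finditer(r'\w+', text):
        # Each match object covers one run of word characters.
        yield match.group()

for word in iter_words("evening, and there was morning--the first day."):
    print(word)
```

Because the function is a generator, words are produced only as the loop asks for them, which matters mostly for very large inputs.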

Answer 2

Alternatively, you can use itertools.groupby together with str.isalpha to extract the alphabetic-only words from the string:

>>> from itertools import groupby
>>> sentence = 'evening, and there was morning--the first day.'

>>> [''.join(j) for i, j in groupby(sentence, str.isalpha) if i]
['evening', 'and', 'there', 'was', 'morning', 'the', 'first', 'day']
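To see why this works, it may help to look at the groups before they are filtered. The sketch below (variable names are illustrative) shows that groupby splits the string into consecutive runs that share the same str.isalpha result; the list comprehension above then keeps only the runs whose key is True.

```python
from itertools import groupby

# Collect every run of characters together with its key:
# True for alphabetic runs, False for everything else.
runs = [(is_alpha, ''.join(chars))
        for is_alpha, chars in groupby("morning--the", str.isalpha)]

for is_alpha, run in runs:
    print(is_alpha, repr(run))
```

Note that digits are not alphabetic, so with this approach a token such as "day1" would be cut at the digit.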

PS: The regex-based solution is much cleaner. I mention this only as a possible alternative.

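Specific to the OP: if all you want is to also split on -- in the resulting list, you can replace the hyphens '-' with a space ' ' before performing the split. Your code would then be:

words = content.lower().replace('-', ' ').split()

where words will hold the value you want.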
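As a rough sketch of how that one-line change could be combined with punctuation stripping, the snippet below uses str.strip with string.punctuation in place of the OP's clean_up_list. That substitution and the sample content string are assumptions made for illustration, not part of the answer.

```python
import string

# Sample input, assumed for this sketch.
content = "Evening, and there was morning--the first day."

# Replace hyphens with spaces first, so "morning--the" becomes two tokens.
tokens = content.lower().replace('-', ' ').split()

# Strip leading/trailing punctuation from each token and drop empties.
words = [t.strip(string.punctuation) for t in tokens]
words = [w for w in words if w]

print(words)
```

Unlike clean_up_list, strip only removes punctuation at the ends of a token, so characters inside a word (such as the apostrophe in "don't") are kept.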

Answer 3

Trying to do this with regular expressions will drive you crazy. For example:

>>> re.findall(r'\w+', "Don't read O'Rourke's books!")
['Don', 't', 'read', 'O', 'Rourke', 's', 'books']

Definitely take a look at the nltk package.
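For cases where pulling in nltk is not an option, here is a rough sketch of a slightly more tolerant pattern. The pattern and the helper name to_words_keep_apostrophes are assumptions for illustration, not something from this answer; they only handle the plain ASCII apostrophe and still say nothing about hyphenated words, abbreviations, or typographic quotes.

```python
import re

# Hypothetical pattern: a run of word characters, optionally followed by
# one or more "apostrophe + word characters" groups, as in "O'Rourke's".
WORD_WITH_APOSTROPHES = re.compile(r"\w+(?:'\w+)*")

def to_words_keep_apostrophes(text):
    """Return words, keeping apostrophes that sit between word characters."""
    return WORD_WITH_APOSTROPHES.findall(text)

print(to_words_keep_apostrophes("Don't read O'Rourke's books!"))
```

Each new edge case tends to need another tweak to the pattern, which is the point the answer is making.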

Answer 4

In addition to the solutions already given, you could also improve your clean_up_list function so that it does a better job.

def clean_up_list(word_list):
    clean_word_list = []
    # Move the symbol string out of the loop so that it doesn't
    # have to be re-created for every word.
    symbols = "~!@#$%^&*()_+`{}|\"?><`-=\\][';/.,']"

    for word in word_list:
        current_word = ''
        for char in word:
            if char in symbols:
                # A symbol ends the current word, if there is one.
                if current_word:
                    clean_word_list.append(current_word)
                    current_word = ''
            else:
                current_word += char

        if current_word:
            # Append possible last current_word
            clean_word_list.append(current_word)

    return clean_word_list

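In fact, you can apply the block inside for word in word_list: to the whole sentence and get the same result, provided whitespace is also treated as a separator.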
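A minimal sketch of that whole-sentence variant. The function name split_sentence is made up here, and adding whitespace to the separator set is an assumption of this sketch: without it, the spaces between words would end up inside the results.

```python
def split_sentence(sentence):
    """Split a raw sentence into words in a single pass (illustrative name)."""
    # Same symbols as clean_up_list, plus whitespace (added for this sketch).
    separators = set("~!@#$%^&*()_+`{}|\"?><-=\\][';/.,") | set(" \t\n")

    words = []
    current_word = ''
    for char in sentence:
        if char in separators:
            # A separator ends the current word, if there is one.
            if current_word:
                words.append(current_word)
                current_word = ''
        else:
            current_word += char

    if current_word:
        # Append a possible trailing word.
        words.append(current_word)
    return words

print(split_sentence("evening, and there was morning--the first day."))
```

Using a set for the separators keeps the membership test cheap, and the prior call to split() becomes unnecessary.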

Answer 5

You could also do it this way:

import re

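def word_list(text):
    # Raw string so that \W is passed to the regex engine unchanged;
    # filter(None, ...) drops the empty strings re.split can produce.
    return list(filter(None, re.split(r'\W+', text)))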

print(word_list("Here we go round the mulberry-bush! And even---this and!!!this."))

This prints:

['Here', 'we', 'go', 'round', 'the', 'mulberry', 'bush', 'And', 'even', 'this', 'and', 'this']
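A small sketch of why the filter(None, ...) step is there: re.split returns an empty string on whichever side of the text starts or ends with a separator. The sample string is made up for illustration.

```python
import re

# The text begins and ends with non-word characters, so re.split
# produces an empty string at both ends of the result.
raw = re.split(r'\W+', "--and!!!this.")
print(raw)

# filter(None, ...) keeps only truthy items, dropping the empty strings.
cleaned = list(filter(None, raw))
print(cleaned)
```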

Splitting a sentence in Python
