Question

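Okay, so my search of this site for parsing long strings (or sentences, if you prefer) in Python came up empty. If there is a previously answered question of the same nature, please point me to it! Anyway, hi! I'm a beginner programmer (teaching myself Python using the internet) and I'm looking for help with a (seemingly simple) problem. If you have any input on this problem, feel free to answer however you see fit, but it would help me much more if you explained your solution or code example! Also, my only idea for solving this is to remove all punctuation using ASCII values, which would be one very long if statement, and then split the remaining text on the remaining spaces while appending the pieces to a list. To save your time and so that I learn something new, I would rather not see the world's longest expression statement! Also note that this is a function that returns a list, so please don't convert it (afterwards) into a string or into a different data type (such as a dictionary). Thanks in advance for any help you can provide!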

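Here is the problem:

Parsing Strings

Create a function that takes a string as input and returns a list of all the words in the string. It should remove all punctuation and replace dashes with spaces.

Examples (calls):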

    >>> parse("Listen, strange women lyin' in ponds distributin' swords is no basis for a system of government. Supreme executive power derives from a mandate from the masses, not from some farcical aquatic ceremony.") 
    [Listen, strange, women, lyin, in, ponds, distributin, swords, is, no, basis, for, a, system, of, government, Supreme, executive, power, derives, from, a, mandate, from, the, masses, not, from, some, farcical, aquatic, ceremony]
    >>> parse("What... is the air-speed velocity of an unladen swallow?") 
    [What, is, the, air, speed, velocity, of, an, unladen, swallow]

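Sorry about the long run of code! Anyway, I think you can all see what needs to be done from the problem itself. Any suggestions or unique/efficient solutions are absolutely welcome! - Winkleson

P.S. Sorry for the run-on sentences and the "wall of text". I'm a bit chatty... Anyway, thanks again for any help!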

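Note that the output is not a list! No extra symbols may be included in the answer! Please don't forget! Thanks again for your help! Sorry for the vague rambling; the author of this question is at a loss!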

Answer 1

This is very simple with the Natural Language Toolkit (nltk).

    import nltk, string
    text = "Listen, strange women lyin' in ponds distributin' swords is no basis for a system of government. Supreme executive power derives from a mandate from the masses, not from some farcical aquatic ceremony."

    # split the text into word and punctuation tokens
    tokens = nltk.word_tokenize(text)

    # remove punctuation
    # (note: replace() only swaps a dash inside a token for a space; it does not split the token in two)
    tokens = [word.replace("-"," ") for word in tokens if word not in string.punctuation]

In use:

    >>> text = "Listen, strange women lyin' in ponds distributin' swords is no basis for a system of government. Supreme executive power derives from a mandate from the masses, not from some farcical aquatic ceremony."
    >>> tokens = nltk.word_tokenize(text)
    >>> tokens = [word.replace("-"," ") for word in tokens if word not in string.punctuation]
    >>> tokens
    ['Listen', 'strange', 'women', 'lyin', 'in', 'ponds', 'distributin', 'swords', 'is', 'no', 'basis', 'for', 'a', 'system', 'of', 'government.', 'Supreme', 'executive', 'power', 'derives', 'from', 'a', 'mandate', 'from', 'the', 'masses', 'not', 'from', 'some', 'farcical', 'aquatic', 'ceremony']
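
Two things in the session above may not match what the question asks for: 'government.' keeps its trailing period, and replace("-", " ") would turn a token like 'air-speed' into the single string 'air speed' rather than two words. Here is a minimal sketch of a clean-up pass that uses only the standard library. Note that clean_tokens is a hypothetical helper name (it is not part of nltk), and it assumes tokens is a list of strings from whatever tokenizer you use.

```python
import string


def clean_tokens(tokens):
    """Hypothetical helper: split tokens on dashes and trim punctuation."""
    cleaned = []
    for token in tokens:
        # Swap dashes for spaces, then split, so 'air-speed' yields two words.
        for part in token.replace("-", " ").split():
            # strip() only trims punctuation at the start and end of the word.
            word = part.strip(string.punctuation)
            if word:  # skip tokens that were nothing but punctuation
                cleaned.append(word)
    return cleaned


print(clean_tokens(['government.', 'air-speed', ',', 'lyin', "'", '...']))
# ['government', 'air', 'speed', 'lyin']
```

Because strip() only touches the ends of each piece, an apostrophe in the middle of a word such as "don't" is left alone; whether that is what you want depends on how you read the problem statement.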

Admittedly, the output you want is quite unclear, but if you are looking for a string version of that output, you can take the tokens variable and do:

    print('[' + ', '.join(tokens) + ']')

Which looks like:

    >>> print('['+', '.join(tokens)+']')
    [Listen, strange, women, lyin, in, ponds, distributin, swords, is, no, basis, for, a, system, of, government., Supreme, executive, power, derives, from, a, mandate, from, the, masses, not, from, some, farcical, aquatic, ceremony]

Your "wall of text" really does make it hard to figure out what you want.

Answer 2

    In [133]: punc = set('.,<>!@#$%^&*()-_+=]}{[\\|')

    In [134]: [''.join(char for char in word if char not in punc) for word in "Listen, strange women lyin' in ponds distributin' swords is no basis for a system of government. Supreme executive power derives from a mandate from the masses, not from some farcical aquatic ceremony.".split()]
    Out[134]: 
    ['Listen',
     'strange',
     'women',
     "lyin'",
     'in',
     'ponds',
     "distributin'",
     'swords',
     'is',
     'no',
     'basis',
     'for',
     'a',
     'system',
     'of',
     'government',
     'Supreme',
     'executive',
     'power',
     'derives',
     'from',
     'a',
     'mandate',
     'from',
     'the',
     'masses',
     'not',
     'from',
     'some',
     'farcical',
     'aquatic',
     'ceremony']
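
The punc set above is written by hand, so any character it leaves out (the apostrophe and the question mark, for instance) stays in the words, as "lyin'" shows in the output. Dashes are also deleted rather than replaced with spaces, so 'air-speed' would come out as 'airspeed'. Here is a sketch of one alternative, assuming Python 3 and assuming that every character in string.punctuation should be removed. The name parse is simply the function name from the question.

```python
import string


def parse(text):
    """Return the words in text, with ASCII punctuation removed."""
    # Replace dashes with spaces first so hyphenated words split apart.
    text = text.replace("-", " ")
    # Build a translation table that deletes every ASCII punctuation character.
    table = str.maketrans("", "", string.punctuation)
    # split() with no arguments splits on any run of whitespace.
    return text.translate(table).split()


print(parse("What... is the air-speed velocity of an unladen swallow?"))
# ['What', 'is', 'the', 'air', 'speed', 'velocity', 'of', 'an', 'unladen', 'swallow']
```

The result is an ordinary list of strings, so printing it shows quotes around each word. The words themselves contain no punctuation.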

Answer 3

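I would suggest using a regular expression, like this:

    import re

    re.findall(r'[a-zA-Z]+', input_string)

Or, to process several strings, compile the regular expression first:

    regexp = re.compile(r'[a-zA-Z]+')
    regexp.findall(input_string)

Basically, this finds every run of consecutive letters and returns each run as one word. If you want to include contractions, you can add ' to the expression, as follows (note the double quotes around the pattern, so the apostrophe does not end the string):

    re.findall(r"[a-zA-Z']+", input_string)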
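
To tie this back to the question, here is a minimal sketch that wraps the letters-only pattern in a function. The names WORD_RE and parse are just illustrative choices, and the sketch assumes that only ASCII letters count as part of a word.

```python
import re

# Compiled once so the pattern can be reused across many calls.
WORD_RE = re.compile(r"[a-zA-Z]+")


def parse(text):
    """Return every run of ASCII letters in text as a list of words."""
    return WORD_RE.findall(text)


print(parse("What... is the air-speed velocity of an unladen swallow?"))
# ['What', 'is', 'the', 'air', 'speed', 'velocity', 'of', 'an', 'unladen', 'swallow']
```

Since the dash is not a letter, 'air-speed' comes out as two words without any extra step. The same rule means digits and accented letters are dropped, and "don't" would be split into 'don' and 't' unless the apostrophe is added to the character class.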

How do I parse a sentence (or other longer string) in Python? (ProblemSetQuestion)
