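OK, so my searches on this site for parsing long strings (or sentences, if you prefer) in Python came up empty. If a question of the same nature has already been answered, please point me to it! Anyway, hi! I'm a beginner programmer (teaching myself Python from the internet), and I'm looking for help with a (seemingly simple) problem. If you have any input on it, feel free to answer however you see fit, but it would help me far more if you explained your solution or code example! Also, my only idea for solving this is to remove all the punctuation using ASCII values, which would take one very long if statement, and then split the remaining text on the remaining spaces while appending the pieces to a list. To save you time, and so that I learn something new, I'd rather not see the longest possible if statement! Also note that this is a function that returns a list, so don't convert the result (afterwards) to a string or to a different data type (such as a dictionary). Thanks in advance for any help you can offer!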
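Here is the problem:
Parsing a string
Create a function that takes a string as input and returns a list of all the words in the string. It should remove all punctuation and replace dashes with spaces.
Examples (calls):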
>>> parse("Listen, strange women lyin' in ponds distributin' swords is no basis for a system of government. Supreme executive power derives from a mandate from the masses, not from some farcical aquatic ceremony.")
[Listen, strange, women, lyin, in, ponds, distributin, swords, is, no, basis, for, a, system, of, government, Supreme, executive, power, derives, from, a, mandate, from, the, masses, not, from, some, farcical, aquatic, ceremony]
>>> parse("What... is the air-speed velocity of an unladen swallow?")
[What, is, the, air, speed, velocity, of, an, unladen, swallow]
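Sorry about how far those code lines run! Anyway, I think you all understand what it is supposed to do from the problem itself. Any suggestions or unique/efficient solutions are absolutely welcome! - 温克尔森
P.S. Sorry for the run-on sentences and the "wall of text". I'm a bit talkative... Anyway, thanks again for any help!
Please note that the output is not a list! Moreover, symbols must not be included in the answer! Please don't forget! Thanks again for your help! Sorry for the unclear ramble; the author of the problem is at a loss!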
Answer 0 (score: 3)
This is very simple with the Natural Language Toolkit (nltk).
import nltk, string
text = "Listen, strange women lyin' in ponds distributin' swords is no basis for a system of government. Supreme executive power derives from a mandate from the masses, not from some farcical aquatic ceremony."
tokens = nltk.word_tokenize(text)
# remove punctuation
tokens = [word.replace("-"," ") for word in tokens if word not in string.punctuation]
In use:
>>> text = "Listen, strange women lyin' in ponds distributin' swords is no basis for a system of government. Supreme executive power derives from a mandate from the masses, not from some farcical aquatic ceremony."
>>> tokens = nltk.word_tokenize(text)
>>> tokens = [word.replace("-"," ") for word in tokens if word not in string.punctuation]
>>> tokens
['Listen', 'strange', 'women', 'lyin', 'in', 'ponds', 'distributin', 'swords', 'is', 'no', 'basis', 'for', 'a', 'system', 'of', 'government.', 'Supreme', 'executive', 'power', 'derives', 'from', 'a', 'mandate', 'from', 'the', 'masses', 'not', 'from', 'some', 'farcical', 'aquatic', 'ceremony']
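Admittedly, your desired output is quite unclear, but if you are looking for a string version of that output, you can take the tokens variable and do:
print('[' + ', '.join(tokens) + ']')
Which looks like:
>>> print('[' + ', '.join(tokens) + ']')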
[Listen, strange, women, lyin, in, ponds, distributin, swords, is, no, basis, for, a, system, of, government., Supreme, executive, power, derives, from, a, mandate, from, the, masses, not, from, some, farcical, aquatic, ceremony]
Your "wall of text" really does make it hard to figure out what you want.
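One detail about the list comprehension above is worth spelling out: calling `replace("-", " ")` on an individual token turns a hyphenated word into a single list element that contains a space, not into two elements. A minimal sketch of the difference, assuming Python 3 and using plain `str.split` as a stand-in for the tokenizer (so nothing here depends on nltk; the variable names are illustrative only):

```python
text = "What... is the air-speed velocity of an unladen swallow?"

# Replacing inside each token: "air-speed" becomes the single element "air speed".
per_token = [word.replace("-", " ") for word in text.split()]

# Replacing in the whole text first: the split then yields "air" and "speed".
whole_text = text.replace("-", " ").split()

print(per_token)
print(whole_text)
```

If the goal is one list element per word, doing the dash replacement on the full string before splitting or tokenizing is probably the simpler route.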
Answer 1 (score: 2)
In [133]: punc = set('.,<>!@#$%^&*()-_+=]}{[\\|')
In [134]: [''.join(char for char in word if char not in punc) for word in "Listen, strange women lyin' in ponds distributin' swords is no basis for a system of government. Supreme executive power derives from a mandate from the masses, not from some farcical aquatic ceremony.".split()]
Out[134]:
['Listen',
'strange',
'women',
"lyin'",
'in',
'ponds',
"distributin'",
'swords',
'is',
'no',
'basis',
'for',
'a',
'system',
'of',
'government',
'Supreme',
'executive',
'power',
'derives',
'from',
'a',
'mandate',
'from',
'the',
'masses',
'not',
'from',
'some',
'farcical',
'aquatic',
'ceremony']
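Note that the hand-written `punc` set above does not contain the apostrophe, which is why `lyin'` and `distributin'` keep theirs in the output. A possible variation, sketched here under the assumption of Python 3, uses `string.punctuation` together with `str.translate` so that dashes become spaces and every other ASCII punctuation character is dropped. The function name `parse` is taken from the question; the rest is illustrative, not the answerer's code.

```python
import string

def parse(text):
    """Return the words of text, with dashes treated as spaces
    and all other ASCII punctuation removed."""
    # Delete every punctuation character except "-", which maps to a space.
    to_delete = string.punctuation.replace("-", "")
    table = str.maketrans("-", " ", to_delete)
    return text.translate(table).split()

print(parse("What... is the air-speed velocity of an unladen swallow?"))
# ['What', 'is', 'the', 'air', 'speed', 'velocity', 'of', 'an', 'unladen', 'swallow']
```

This only handles ASCII punctuation; typographic quotes or dashes would pass through unchanged.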
Answer 2 (score: 1)
I suggest using a regular expression, like this:
import re
re.findall(r'[a-zA-Z]+',input_string)
Or, to process multiple strings, compile the regular expression first:
regexp=re.compile(r'[a-zA-Z]+')
regexp.findall(test)
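Basically, this asks for every run of consecutive letters, with each run returned as one match. If you want to keep contractions, you can add ' to the character class (using double quotes around the pattern so the apostrophe does not end the string literal), like so:
re.findall(r"[a-zA-Z']+", input_string)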
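Putting the two patterns above into the function the question asks for might look like the following sketch. It assumes Python 3; the name `parse` comes from the question, while `keep_apostrophes` and the two module-level pattern names are hypothetical additions for illustration.

```python
import re

# Compiled once so repeated calls reuse the same pattern objects.
LETTERS_ONLY = re.compile(r"[a-zA-Z]+")
LETTERS_AND_APOSTROPHES = re.compile(r"[a-zA-Z']+")

def parse(text, keep_apostrophes=False):
    """Return a list of the words in text.

    Anything that is not an ASCII letter (or, optionally, an apostrophe)
    acts as a separator, so punctuation is dropped and dashes split words.
    """
    pattern = LETTERS_AND_APOSTROPHES if keep_apostrophes else LETTERS_ONLY
    return pattern.findall(text)

print(parse("What... is the air-speed velocity of an unladen swallow?"))
# ['What', 'is', 'the', 'air', 'speed', 'velocity', 'of', 'an', 'unladen', 'swallow']
```

One trade-off to be aware of: without `keep_apostrophes`, a contraction such as `don't` comes back as two items (`don` and `t`), and digits or non-ASCII letters are never matched by either pattern.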