How do I parse a sentence (or other longer string) in Python? (problem set question)

Time: 2012-11-20 19:46:33

Tags: python python-3.x

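OK, so my search of this site for parsing long strings (or sentences, if you prefer) in Python came up empty. If there is a previously answered question of the same nature, please point me to it! Anyway, hi! I'm a beginner programmer (teaching myself Python from the internet) and I'm looking for help with a (seemingly simple) problem. If you have any input on this problem, feel free to answer however you see fit, but it would help me much more if you explained your solution or code example! Also, the only idea I have for solving this is to remove all punctuation by ASCII value, which would be one very long if statement, and then split the remaining text on the remaining spaces while appending the pieces to a list. To save your time, and so that I learn something new, I would rather not see the longest possible expression! Also note that this is a function that returns a list, so don't convert the result (afterwards) to a string or to a different data type (such as a dictionary). Thanks in advance for any help you can offer!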

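Here is the problem:

Parsing a string

Create a function that takes a string as input and returns a list of all the words in the string. It should remove all punctuation, replacing dashes with spaces.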


Examples (calls):

    >>> parse("Listen, strange women lyin' in ponds distributin' swords is no basis for a system of government. Supreme executive power derives from a mandate from the masses, not from some farcical aquatic ceremony.")
    [Listen, strange, women, lyin, in, ponds, distributin, swords, is, no, basis, for, a, system, of, government, Supreme, executive, power, derives, from, a, mandate, from, the, masses, not, from, some, farcical, aquatic, ceremony]
    >>> parse("What... is the air-speed velocity of an unladen swallow?")
    [What, is, the, air, speed, velocity, of, an, unladen, swallow]

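Sorry about the run-on length of the code! Anyway, I think you all understand from the problem itself what needs to be done. Any suggestions or unique/efficient solutions are absolutely welcome! - Winkleson

P.S. Sorry for the run-on sentences and the "wall of text". I'm a bit chatty... Anyway, thanks again for any help!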

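Please note that the output is not a list! Also, the symbols must not be included in the answer! Please don't forget! Thanks again for your help! Apologies for the uncertainty; the author of the problem has left me at a loss!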

3 answers:

Answer 0 (score: 3)

This is very simple with the Natural Language Toolkit (nltk).

import nltk, string
text = "Listen, strange women lyin' in ponds distributin' swords is no basis for a system of government. Supreme executive power derives from a mandate from the masses, not from some farcical aquatic ceremony."

tokens = nltk.word_tokenize(text)

# remove punctuation
tokens = [word.replace("-"," ") for word in tokens if word not in string.punctuation]

In use:

>>> text = "Listen, strange women lyin' in ponds distributin' swords is no basis for a system of government. Supreme executive power derives from a mandate from the masses, not from some farcical aquatic ceremony."
>>> tokens = nltk.word_tokenize(text)
>>> tokens = [word.replace("-"," ") for word in tokens if word not in string.punctuation]
>>> tokens
['Listen', 'strange', 'women', 'lyin', 'in', 'ponds', 'distributin', 'swords', 'is', 'no', 'basis', 'for', 'a', 'system', 'of', 'government.', 'Supreme', 'executive', 'power', 'derives', 'from', 'a', 'mandate', 'from', 'the', 'masses', 'not', 'from', 'some', 'farcical', 'aquatic', 'ceremony']

Admittedly, your desired output is quite unclear, but if you are looking for a string version of that output, you can take the tokens variable and do:

print('[' + ', '.join(tokens) + ']')

Which looks like:

>>> print('[' + ', '.join(tokens) + ']')
[Listen, strange, women, lyin, in, ponds, distributin, swords, is, no, basis, for, a, system, of, government., Supreme, executive, power, derives, from, a, mandate, from, the, masses, not, from, some, farcical, aquatic, ceremony]
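Since the question is tagged python-3.x, here is a minimal sketch of that formatting step as a small helper. The name `format_tokens` is hypothetical (it is not from the question or from nltk); it only joins whatever list of words you already have.

```python
def format_tokens(tokens):
    """Return the words as '[a, b, c]', without quotes around each word."""
    # ', '.join() needs every item to be a string, which is true for word lists.
    return "[" + ", ".join(tokens) + "]"


print(format_tokens(["What", "is", "the", "air", "speed"]))
# prints: [What, is, the, air, speed]
```

The function still works from a real list, so you can return the list from parse and only format it when displaying.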

Your "wall of text" really does make it hard to figure out what you want.

Answer 1 (score: 2)

In [133]: punc = set('.,<>!@#$%^&*()-_+=]}{[\\|')

In [134]: [''.join(char for char in word if char not in punc) for word in "Listen, strange women lyin' in ponds distributin' swords is no basis for a system of government. Supreme executive power derives from a mandate from the masses, not from some farcical aquatic ceremony.".split()]
Out[134]: 
['Listen',
 'strange',
 'women',
 "lyin'",
 'in',
 'ponds',
 "distributin'",
 'swords',
 'is',
 'no',
 'basis',
 'for',
 'a',
 'system',
 'of',
 'government',
 'Supreme',
 'executive',
 'power',
 'derives',
 'from',
 'a',
 'mandate',
 'from',
 'the',
 'masses',
 'not',
 'from',
 'some',
 'farcical',
 'aquatic',
 'ceremony']
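Note that the set above does not contain the apostrophe or the question mark, so "lyin'" keeps its apostrophe. As a rough sketch of the same character-filtering idea wrapped in a function, the version below swaps the hand-written set for `string.punctuation` and uses `str.translate`. That swap is my assumption about what "all punctuation" means (ASCII punctuation only), not something the problem statement spells out.

```python
import string

# Assumption: "punctuation" means the ASCII characters in string.punctuation.
# Every punctuation character is deleted, except the dash, which becomes a space.
_TABLE = {ord(ch): None for ch in string.punctuation}
_TABLE[ord("-")] = " "


def parse(text):
    """Return the words of text as a list, with punctuation removed."""
    # translate() applies the per-character table; split() cuts on whitespace.
    return text.translate(_TABLE).split()


print(parse("What... is the air-speed velocity of an unladen swallow?"))
# ['What', 'is', 'the', 'air', 'speed', 'velocity', 'of', 'an', 'unladen', 'swallow']
```

Replacing the dash with a space before splitting is what turns "air-speed" into two separate words.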

Answer 2 (score: 1)

I'd suggest using a regular expression, like this:

import re

re.findall(r'[a-zA-Z]+',input_string)

Or, to process several strings, compile the regular expression first:

regexp=re.compile(r'[a-zA-Z]+')
regexp.findall(test)

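Basically, this asks for every run of consecutive letters, returned as one item per run. If you want to include contractions, you can add ' to the expression, as follows (the pattern is written in double quotes so that the ' inside it does not end the string):

re.findall(r"[a-zA-Z']+",input_string)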
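Put together as the function the problem asks for, this might look like the sketch below. The names `WORDS`, `WORDS_WITH_APOSTROPHES` and `keep_apostrophes` are hypothetical; the two patterns are the ones from this answer. Keep in mind that `[a-zA-Z]` only matches ASCII letters, so digits and accented characters act as separators here.

```python
import re

# Compiled once so the patterns can be reused across many strings.
WORDS = re.compile(r"[a-zA-Z]+")
WORDS_WITH_APOSTROPHES = re.compile(r"[a-zA-Z']+")


def parse(text, keep_apostrophes=False):
    """Return every run of ASCII letters in text as a list of words."""
    pattern = WORDS_WITH_APOSTROPHES if keep_apostrophes else WORDS
    return pattern.findall(text)


print(parse("What... is the air-speed velocity of an unladen swallow?"))
# ['What', 'is', 'the', 'air', 'speed', 'velocity', 'of', 'an', 'unladen', 'swallow']
print(parse("lyin' in ponds", keep_apostrophes=True))
# ["lyin'", 'in', 'ponds']
```

Because anything that is not a letter ends a word, the dash in "air-speed" splits it into two words without any explicit replacement step.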