Question

Linked: Removing escaped entities from a String in Python

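My code reads a large CSV file of tweets and parses it into two dictionaries (depending on the sentiment of each tweet). I then create a new dictionary and unescape everything with the HTML parser before using the translate() method to remove all punctuation from the text. Finally, I am trying to keep only words that are at least 3 characters long. This is my code: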

tweets = []
for (text, sentiment) in pos_tweets.items() + neg_tweets.items():
    text = HTMLParser.HTMLParser().unescape(text.decode('ascii'))
    remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)
    shortenedText = [e.lower() and e.translate(remove_punctuation_map) for e in text.split() if len(e) >= 3 and not e.startswith(('http', '@')) ]
    print shortenedText

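However, I'm finding that while most of what I want is done, I'm still getting words of length 2 (none of length 1, though), and I'm getting a few blank entries in my dictionary.
For example: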

(: !!!!!! - so I wrote something last week
* enough said *
.... Do I need to say it?

Produces:

[u'', u'wrote', u'something', u'last', u'week']
[u'enough', u'said']
[u'', u'need', u'even', u'say', u'it']

What is wrong with my code? How can I remove all words of length less than 2, including the blank entries?

Answer 1

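I think your problem is that when you test len(e) >= 3, e still contains punctuation, so "it?" is not filtered out. Maybe do it in two steps? Clean the punctuation first, then filter by size?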

Something like:

cleanedText = [e.translate(remove_punctuation_map).lower() for e in text.split() if not e.startswith(('http', '@')) ]
shortenedText = [e for e in cleanedText if len(e) >= 3]
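For anyone who wants to try the two-step idea end to end, here is a minimal self-contained sketch. It assumes Python 3 (the question's code is Python 2), so it uses html.unescape and str.maketrans instead of HTMLParser and the ord-keyed dict. The names clean_tweet and min_length are made up for illustration and are not from the original code.

```python
import html
import string

# Translation table that deletes every ASCII punctuation character.
REMOVE_PUNCTUATION = str.maketrans('', '', string.punctuation)


def clean_tweet(text, min_length=3):
    """Unescape HTML entities, strip punctuation, then filter by length."""
    text = html.unescape(text)
    # Step 1: skip links and mentions, then remove punctuation and lowercase.
    cleaned = [
        token.translate(REMOVE_PUNCTUATION).lower()
        for token in text.split()
        if not token.startswith(('http', '@'))
    ]
    # Step 2: measure length only after the punctuation is gone, so
    # punctuation-only tokens (now empty strings) are dropped as well.
    return [word for word in cleaned if len(word) >= min_length]


samples = [
    "(: !!!!!! - so I wrote something last week",
    "* enough said *",
    ".... Do I need to say it?",
]
for sample in samples:
    print(clean_tweet(sample))
```

Filtering after the cleanup is what removes both kinds of leftovers: "it?" has length 3 before stripping but only 2 afterwards, and "!!!!!!" passes the length test and then becomes an empty string. As a side note, `e.lower() and e.translate(...)` in the question evaluates to just the translated string whenever e is non-empty, so the lowercasing is silently discarded there; chaining the two calls, as above, avoids that.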
