Question

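I have a list like the one below. Can I use Python to remove \xe2\x80\x99, \xe2\x80\x9c, and similar sequences from it? Is there any way to eliminate this data from my list? Is there a common pattern I can use?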

['guest', 'demo', ':', 'eric', 'iverson', '\xe2\x80\x99s', 'itty', 'bitty', 'search', 'february', '16', 'th', ',', '2010', 'by', 'daniel', 'tunkelang', 'respond', 'i', '\xe2\x80\x99m', 'back', 'from', 'vacation', ',', 'and', 'still', 'digging', 'my', 'way', 'out', 'of', 'everything', 'that', '\xe2\x80\x99s', 'piled', 'up', 'while', 'i', '\xe2\x80\x99ve', 'been', 'offline', 'while', 'i', 'catch', 'up', ',', 'i', 'thought', 'i', '\xe2\x80\x99d', 'share', 'with', 'you', 'a', 'demo', 'that', 'eric', 'iverson', 'was', 'gracious', 'enough', 'to', 'share', 'with', 'me', 'it', 'uses', 'yahoo', '!', 'boss', 'to', 'support', 'an', 'exploratory', 'search', 'experience', 'on', 'top', 'of', 'a', 'general', 'web', 'search', 'engine', 'when', 'you', 'perform', 'a', 'query', ',', 'the', 'application', 'retrieves', 'a', 'set', 'of', 'related', 'term', 'candidates', 'using', 'yahoo', '\xe2\x80\x99s', 'key', 'terms', 'api', 'it', 'then', 'scores', 'each', 'term', 'by', 'dividing', 'it', 'is', 'occurrence', 'count', 'within', 'the', 'result', 'set', 'by', 'it', 'is', 'global', 'occurrence', 'count', '\xe2\x80\x93a', 'relevance', 'measure', 'similar', 'to', 'one', 'my', 'former', 'colleagues', 'and', 'i', 'used', 'at', 'endeca', 'in', 'enterprise', 'contexts', 'you', 'can', 'try', 'out', 'the', 'demo', 'yourself', 'at', 'http', '://www', 'ittybittysearch', 'com', '/', 'while', 'it', 'has', 'rough', 'edges', ',', 'it', 'produces', 'nice', 'results', '\xe2\x80\x93especially', 'considering', 'the', 'simplicity', 'of', 'the', 'approach', 'here', '\xe2\x80\x99s', 'an', 'example', 'of', 'how', 'i', 'used', 'the', 'application', 'to', 'explore', 'and', 'learn', 'something', 'new', 'i', 'started', 'with', '["', 'information', 'retrieval', '"]', 'i', 'noticed', '\xe2\x80\x9c', 'interactive', 'information', 'retrieval', '\xe2\x80\x9d', 'as', 'a', 'top', 'term', ',', 'so', 'i', 'used', 'it', 'to', 'refine', 'most', 'of', 'the', 'refinement', 'suggestions', 'looked', 'familiar', 'to', 'me', '\xe2\x80\x93but', 'an', 'unfamiliar', 'name', 'caught', 'my', 'attention', ':', '\xe2\x80\x9c', 'anton', 'leuski', '\xe2\x80\x9d', 'following', 'my', 'curiosity', ',', 'i', 'refined', 'again', 'looking', 'at', 'the', 'results', ',', 'i', 'immediately', 'saw', 'that', 'leuski', 'had', 'done', 'work', 'on', 'evaluating', 'document', 'clustering', 'for', 'interactive', 'information', 'retrieval', 'further', 'exploration', 'made', 'it', 'clear', 'this', 'is', 'someone', 'whose', 'work', 'i', 'should', 'get', 'to', 'know', '\xe2\x80\x93check', 'out', 'his', 'home', 'page', '!', 'i', 'can', '\xe2\x80\x99t', 'promise', 'that', 'you', '\xe2\x80\x99ll', 'have', 'as', 'productive', 'an', 'experience', 'as', 'i', 'did', ',', 'but', 'i', 'encourage', 'you', 'to', 'try', 'eric', '\xe2\x80\x99s', 'demo', 'it', '\xe2\x80\x99s', 'simple', 'examples', 'like', 'these', 'that', 'remind', 'me', 'of', 'the', 'value', 'of', 'pursuing', 'hcir', 'for', 'the', 'open', 'web', 'speaking', 'of', 'which', ',', 'hcir', '2010', 'is', 'in', 'the', 'works', 'we', '\xe2\x80\x99ll', 'flesh', 'out', 'the', 'details', 'over', 'the', 'next', 'weeks', ',', 'and', 'of', 'course', 'i', '\xe2\x80\x99ll', 'share', 'them', 'here']

Answer 1

Assuming the input is UTF-8 encoded (and that this is Python 2, where these are byte strings), you can do this:

>>> from unidecode import unidecode
>>> my_list = ['guest', 'demo', ':', 'eric', 'iverson', '\xe2\x80\x99s', 'itty', 'bitty', 'search', 'february', '16', 'th', ',', '2010', 'by', 'daniel', 'tunkelang', 'respond', 'i', '\xe2\x80\x99m', 'back', 'from', 'vacation', ',', 'and', 'still', 'digging', 'my', 'way', 'out', 'of', 'everything', 'that', '\xe2\x80\x99s', 'piled', 'up', 'while', 'i', '\xe2\x80\x99ve', 'been', 'offline', 'while', 'i', 'catch', 'up', ',', 'i', 'thought', 'i', '\xe2\x80\x99d', 'share', 'with', 'you', 'a', 'demo', 'that', 'eric', 'iverson', 'was', 'gracious', 'enough', 'to', 'share', 'with', 'me', 'it', 'uses', 'yahoo', '!', 'boss', 'to', 'support', 'an', 'exploratory', 'search', 'experience', 'on', 'top', 'of', 'a', 'general', 'web', 'search', 'engine', 'when', 'you', 'perform', 'a', 'query', ',', 'the', 'application', 'retrieves', 'a', 'set', 'of', 'related', 'term', 'candidates', 'using', 'yahoo', '\xe2\x80\x99s', 'key', 'terms', 'api', 'it', 'then', 'scores', 'each', 'term', 'by', 'dividing', 'it', 'is', 'occurrence', 'count', 'within', 'the', 'result', 'set', 'by', 'it', 'is', 'global', 'occurrence', 'count', '\xe2\x80\x93a', 'relevance', 'measure', 'similar', 'to', 'one', 'my', 'former', 'colleagues', 'and', 'i', 'used', 'at', 'endeca', 'in', 'enterprise', 'contexts', 'you', 'can', 'try', 'out', 'the', 'demo', 'yourself', 'at', 'http', '://www', 'ittybittysearch', 'com', '/', 'while', 'it', 'has', 'rough', 'edges', ',', 'it', 'produces', 'nice', 'results', '\xe2\x80\x93especially', 'considering', 'the', 'simplicity', 'of', 'the', 'approach', 'here', '\xe2\x80\x99s', 'an', 'example', 'of', 'how', 'i', 'used', 'the', 'application', 'to', 'explore', 'and', 'learn', 'something', 'new', 'i', 'started', 'with', '["', 'information', 'retrieval', '"]', 'i', 'noticed', '\xe2\x80\x9c', 'interactive', 'information', 'retrieval', '\xe2\x80\x9d', 'as', 'a', 'top', 'term', ',', 'so', 'i', 'used', 'it', 'to', 'refine', 'most', 'of', 'the', 'refinement', 'suggestions', 'looked', 'familiar', 
'to', 'me', '\xe2\x80\x93but', 'an', 'unfamiliar', 'name', 'caught', 'my', 'attention', ':', '\xe2\x80\x9c', 'anton', 'leuski', '\xe2\x80\x9d', 'following', 'my', 'curiosity', ',', 'i', 'refined', 'again', 'looking', 'at', 'the', 'results', ',', 'i', 'immediately', 'saw', 'that', 'leuski', 'had', 'done', 'work', 'on', 'evaluating', 'document', 'clustering', 'for', 'interactive', 'information', 'retrieval', 'further', 'exploration', 'made', 'it', 'clear', 'this', 'is', 'someone', 'whose', 'work', 'i', 'should', 'get', 'to', 'know', '\xe2\x80\x93check', 'out', 'his', 'home', 'page', '!', 'i', 'can', '\xe2\x80\x99t', 'promise', 'that', 'you', '\xe2\x80\x99ll', 'have', 'as', 'productive', 'an', 'experience', 'as', 'i', 'did', ',', 'but', 'i', 'encourage', 'you', 'to', 'try', 'eric', '\xe2\x80\x99s', 'demo', 'it', '\xe2\x80\x99s', 'simple', 'examples', 'like', 'these', 'that', 'remind', 'me', 'of', 'the', 'value', 'of', 'pursuing', 'hcir', 'for', 'the', 'open', 'web', 'speaking', 'of', 'which', ',', 'hcir', '2010', 'is', 'in', 'the', 'works', 'we', '\xe2\x80\x99ll', 'flesh', 'out', 'the', 'details', 'over', 'the', 'next', 'weeks', ',', 'and', 'of', 'course', 'i', '\xe2\x80\x99ll', 'share', 'them', 'here']
>>> my_clean_list = [unidecode(x.decode('utf8')) for x in my_list]
>>> my_clean_list
['guest', 'demo', ':', 'eric', 'iverson', "'s", 'itty', 'bitty', 'search', 'february', '16', 'th', ',', '2010', 'by', 'daniel', 'tunkelang', 'respond', 'i', "'m", 'back', 'from', 'vacation', ',', 'and', 'still', 'digging', 'my', 'way', 'out', 'of', 'everything', 'that', "'s", 'piled', 'up', 'while', 'i', "'ve", 'been', 'offline', 'while', 'i', 'catch', 'up', ',', 'i', 'thought', 'i', "'d", 'share', 'with', 'you', 'a', 'demo', 'that', 'eric', 'iverson', 'was', 'gracious', 'enough', 'to', 'share', 'with', 'me', 'it', 'uses', 'yahoo', '!', 'boss', 'to', 'support', 'an', 'exploratory', 'search', 'experience', 'on', 'top', 'of', 'a', 'general', 'web', 'search', 'engine', 'when', 'you', 'perform', 'a', 'query', ',', 'the', 'application', 'retrieves', 'a', 'set', 'of', 'related', 'term', 'candidates', 'using', 'yahoo', "'s", 'key', 'terms', 'api', 'it', 'then', 'scores', 'each', 'term', 'by', 'dividing', 'it', 'is', 'occurrence', 'count', 'within', 'the', 'result', 'set', 'by', 'it', 'is', 'global', 'occurrence', 'count', '-a', 'relevance', 'measure', 'similar', 'to', 'one', 'my', 'former', 'colleagues', 'and', 'i', 'used', 'at', 'endeca', 'in', 'enterprise', 'contexts', 'you', 'can', 'try', 'out', 'the', 'demo', 'yourself', 'at', 'http', '://www', 'ittybittysearch', 'com', '/', 'while', 'it', 'has', 'rough', 'edges', ',', 'it', 'produces', 'nice', 'results', '-especially', 'considering', 'the', 'simplicity', 'of', 'the', 'approach', 'here', "'s", 'an', 'example', 'of', 'how', 'i', 'used', 'the', 'application', 'to', 'explore', 'and', 'learn', 'something', 'new', 'i', 'started', 'with', '["', 'information', 'retrieval', '"]', 'i', 'noticed', '"', 'interactive', 'information', 'retrieval', '"', 'as', 'a', 'top', 'term', ',', 'so', 'i', 'used', 'it', 'to', 'refine', 'most', 'of', 'the', 'refinement', 'suggestions', 'looked', 'familiar', 'to', 'me', '-but', 'an', 'unfamiliar', 'name', 'caught', 'my', 'attention', ':', '"', 'anton', 'leuski', '"', 'following', 'my', 
'curiosity', ',', 'i', 'refined', 'again', 'looking', 'at', 'the', 'results', ',', 'i', 'immediately', 'saw', 'that', 'leuski', 'had', 'done', 'work', 'on', 'evaluating', 'document', 'clustering', 'for', 'interactive', 'information', 'retrieval', 'further', 'exploration', 'made', 'it', 'clear', 'this', 'is', 'someone', 'whose', 'work', 'i', 'should', 'get', 'to', 'know', '-check', 'out', 'his', 'home', 'page', '!', 'i', 'can', "'t", 'promise', 'that', 'you', "'ll", 'have', 'as', 'productive', 'an', 'experience', 'as', 'i', 'did', ',', 'but', 'i', 'encourage', 'you', 'to', 'try', 'eric', "'s", 'demo', 'it', "'s", 'simple', 'examples', 'like', 'these', 'that', 'remind', 'me', 'of', 'the', 'value', 'of', 'pursuing', 'hcir', 'for', 'the', 'open', 'web', 'speaking', 'of', 'which', ',', 'hcir', '2010', 'is', 'in', 'the', 'works', 'we', "'ll", 'flesh', 'out', 'the', 'details', 'over', 'the', 'next', 'weeks', ',', 'and', 'of', 'course', 'i', "'ll", 'share', 'them', 'here']

I'm using the unidecode module here to convert those "fancy" characters to their closest ASCII equivalents:

>>> for before, after in zip(my_list, my_clean_list):
...     if before != after:
...         print before, ' --> ', after
...         
’s  -->  's
’m  -->  'm
’s  -->  's
’ve  -->  've
’d  -->  'd
’s  -->  's
–a  -->  -a
–especially  -->  -especially
’s  -->  's
“  -->  "
”  -->  "
–but  -->  -but
“  -->  "
”  -->  "
–check  -->  -check
’t  -->  't
’ll  -->  'll
’s  -->  's
’s  -->  's
’ll  -->  'll
’ll  -->  'll

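As you might have guessed, it looks like some of the English text was split incorrectly at what were taken to be word boundaries (for example, 'eric' and '\xe2\x80\x99s' ended up as separate elements). If it is your own code that generates this data, I suggest fixing the problem closer to its source!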
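For anyone on Python 3, where `str.decode` no longer exists, roughly the same cleanup can be sketched with the standard library alone. This is not the unidecode approach from the session above: it assumes the items are UTF-8 `bytes`, and the `PUNCT_MAP` table and `clean_tokens` helper are hypothetical names that only cover the four characters appearing in this particular list.

```python
# Hypothetical mapping; covers only the characters seen in the question's list.
PUNCT_MAP = str.maketrans({
    "\u2019": "'",   # right single quotation mark (UTF-8 bytes e2 80 99)
    "\u201c": '"',   # left double quotation mark  (UTF-8 bytes e2 80 9c)
    "\u201d": '"',   # right double quotation mark (UTF-8 bytes e2 80 9d)
    "\u2013": "-",   # en dash                     (UTF-8 bytes e2 80 93)
})

def clean_tokens(raw_tokens):
    """Decode UTF-8 byte strings and swap typographic punctuation for ASCII."""
    return [tok.decode("utf-8").translate(PUNCT_MAP) for tok in raw_tokens]

sample = [b"eric", b"\xe2\x80\x99s", b"\xe2\x80\x9c", b"results", b"\xe2\x80\x93especially"]
cleaned = clean_tokens(sample)
print(cleaned)
```

Any character missing from the table passes through unchanged, so for arbitrary input a general transliteration library such as unidecode is probably the safer choice.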

Answer 2

It looks like you want to eliminate a bunch of Unicode strings. Just keep the elements of the list that are purely alphanumeric, like this:

>>> filter(lambda m: m.isalnum(), p)

That should eliminate the Unicode stuff...

Another option is to decode and encode the strings directly...

>>> ' '.join(p).decode('ascii', 'ignore').encode('ascii').split()

That should do a better job...
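Both snippets above are Python 2. A rough Python 3 sketch of the same two options follows, assuming the list already holds decoded `str` values; the names `tokens`, `alnum_only` and `ascii_only` are made up for illustration. Note that in Python 3 `filter()` returns an iterator, and the second option runs encode first, then decode.

```python
tokens = ["eric", "\u2019s", ",", "2010", "\u2013check", "\u201c"]

# Option 1: keep only purely alphanumeric tokens.
# filter() is lazy in Python 3, so wrap it in list() to get a list back.
alnum_only = list(filter(str.isalnum, tokens))

# Option 2: strip the non-ASCII characters but keep the rest of each token.
# errors="ignore" silently drops anything that cannot be encoded as ASCII.
ascii_only = " ".join(tokens).encode("ascii", "ignore").decode("ascii").split()

print(alnum_only)
print(ascii_only)
```

The two options differ in what survives: the first throws away whole elements (including plain punctuation such as `','`), while the second keeps the ASCII remainder of each element and drops only elements that become empty.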

Answer 3

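You can use a list comprehension, assuming you simply want to drop every list element that contains non-alphanumeric characters. If your list is in the variable a, then: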

[x for x in a if x.isalnum()]

will return the list minus elements such as \xe2\x80\x99.

This is equivalent to the filter solution mentioned by @ssm; they just got there first.
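One side effect worth keeping in mind: `isalnum()` also rejects plain punctuation elements such as `','` and `':'`. If those should stay, a possible variant (my assumption, not part of the answers here) is to test for ASCII instead. The sketch below assumes Python 3.7 or later, where `str.isascii()` is available, and uses a small made-up sample list.

```python
a = ["eric", "\u2019s", ",", "demo", "\u201c", ":"]

# The comprehension from this answer: keeps only purely alphanumeric elements.
alnum_only = [x for x in a if x.isalnum()]

# Variant: drop only the elements that contain non-ASCII characters.
# str.isascii() requires Python 3.7 or later.
ascii_only = [x for x in a if x.isascii()]

print(alnum_only)
print(ascii_only)
```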
