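Remove non-alphabetic characters from a list of lists while preserving structure

Time: 2018-02-28 16:27:34

Tags: python python-2.7

I'm working in Python 2.7. I want to remove non-alphabetic characters from every list inside a list of lists, without changing the structure of the lists.

Example starting list of lists: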

csvarticles = [['[Beta-blockers]', 'Magic!', '1980', 'Presse medicale'],['Hypertension in the pregnant woman].', '', '2010', 'Medical'],['Arterial hypertension.', '', '1920', 'La Nouvelle']]
print (csvarticles[0])

Desired output:

[['beta blockers', 'magic', '1980', 'presse medicale'], ['hypertension in the pregnant woman', '', '2010', 'medical'], ['arterial hypertension', '', '1920', 'la nouvelle']]

Code 1:

csvarticles = [[word.lower().split() for word in nodeList] for nodeList in csvarticles]

print (csvarticles[0])

Code 1 output:

['[Beta-blockers]', 'Magic!', '1980', 'Presse medicale']
[['[beta-blockers]'], ['magic!'], ['1980'], ['presse', 'medicale']]

Code 2:

csvarticles = [[word.lower().split() for word in nodeList if word.isalpha()] for nodeList in csvarticles]

Code 2 output:

[]

Code 3:

articleTitle = []
for x, y in enumerate(csvarticles):
    myString = simpleWords(csvarticles[x][0])
    if myString != '':
        myString = myString.lower()
        myString = re.sub('[\W_]+', ' ', myString, flags=re.UNICODE)
        myList = [word for word in myString.split() if len(word) > 3]
        articleTitle = ' '.join(myList)

Code 3 output:

['beta blockers', 'magic', '1980', 'presse medicale', 'hypertension pregnant woman', '2010', 'medical', 'arterial hypertension', '1920', 'nouvelle']

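Code 3 is close, but it loses the nested list structure.

3 answers:

Answer 0 (score: 3)

You want to replace every character that is neither a space nor alphanumeric, then trim and lowercase the strings. A regex substitution chained with str.strip handles this well.

Rebuild the nested list in a double list comprehension: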

import re

csvarticles = [['[Beta-blockers]', 'Magic!', '1980', 'Presse medicale'],['Hypertension in the pregnant woman].', '', '2010', 'Medical'],['Arterial hypertension.', '', '1920', 'La Nouvelle']]

result = [[re.sub("[^ \w]"," ",x).strip().lower() for x in y] for y in csvarticles]

print(result)

Prints:

[['beta blockers', 'magic', '1980', 'presse medicale'], ['hypertension in the pregnant woman', '', '2010', 'medical'], ['arterial hypertension', '', '1920', 'la nouvelle']]

If you are using Python 3, replace lower with casefold to handle locale-specific characters.
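To illustrate the difference, here is a small sketch for Python 3 only (str.casefold does not exist in Python 2.7). The sample string "Straße" is an illustrative assumption and is not part of the question's data.

```python
# Python 3 only: str.casefold() is not available in Python 2.7.
sample = "Straße"  # illustrative input, not from the question

lowered = sample.lower()        # lower() keeps the German sharp s
casefolded = sample.casefold()  # casefold() expands it to "ss"

print(lowered)
print(casefolded)
```

For plain ASCII data like the question's list, lower and casefold give the same result.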

Answer 1 (score: 1)

Use the str.isalnum() method to check whether a string consists of letters or digits.
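A minimal sketch of how an isalnum()-based filter might look. This is an assumed implementation, not necessarily the answerer's own code, and the helper name keep_alnum is hypothetical. It keeps letters, digits and whitespace, drops everything else, and lowercases the result, so a hyphenated word such as 'Beta-blockers' collapses to 'betablockers'.

```python
csvarticles = [
    ['[Beta-blockers]', 'Magic!', '1980', 'Presse medicale'],
    ['Hypertension in the pregnant woman].', '', '2010', 'Medical'],
    ['Arterial hypertension.', '', '1920', 'La Nouvelle'],
]


def keep_alnum(text):
    # Hypothetical helper: keep alphanumeric characters and whitespace,
    # drop punctuation, then lowercase.
    return ''.join(ch for ch in text if ch.isalnum() or ch.isspace()).lower()


# The nested comprehension preserves the list-of-lists structure.
result = [[keep_alnum(item) for item in row] for row in csvarticles]
print(result)
```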

Output:

[['betablockers', 'magic', '1980', 'presse medicale'], ['hypertension in the pregnant woman', '', '2010', 'medical'], ['arterial hypertension', '', '1920', 'la nouvelle']]

Answer 2 (score: 1)

If you want to do this in a single line:

INPUT:

output = [[k.lower() for k in [' '.join(re.findall(r'[^\]\[.!-][A-z0-9]+[^\]\[.!-]', j)) for j in i]] for i in csvarticles]

Output:

[['beta blockers', 'magic', '1980', 'presse  medicale'], ['hypertension  in  the  pregnant  woman', '', '2010', 'medical'], ['arterial  hypertension', '', '1920', 'la  nouvelle']]

REGEX:

[^\]\[.!-][A-z0-9]+[^\]\[.!-]
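The doubled spaces in the output above arise because a match such as 'Presse ' can include its trailing space before the pieces are joined with ' '. Note also that the range A-z spans more than letters: it includes the ASCII characters [, \, ], ^, _ and `. One possible way to normalise both points, sketched here as an assumption and not as part of the original answer, is to replace unwanted characters with a space and then collapse whitespace runs with split() and join(). The helper name clean is hypothetical.

```python
import re

csvarticles = [
    ['[Beta-blockers]', 'Magic!', '1980', 'Presse medicale'],
    ['Hypertension in the pregnant woman].', '', '2010', 'Medical'],
    ['Arterial hypertension.', '', '1920', 'La Nouvelle'],
]


def clean(text):
    # Hypothetical helper: replace anything that is not an ASCII letter,
    # digit or space with a space, then collapse whitespace and lowercase.
    spaced = re.sub(r'[^A-Za-z0-9 ]', ' ', text)
    return ' '.join(spaced.split()).lower()


output = [[clean(item) for item in row] for row in csvarticles]
print(output)
```

Because split() with no arguments discards empty pieces, empty strings in the input stay empty and the inner lists keep their original length.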