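I'm working in Python 2.7. I want to remove the non-alphabetic characters from each list inside a list of lists, without changing the structure of the nested lists.
Example of the starting list of lists: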
csvarticles = [['[Beta-blockers]', 'Magic!', '1980', 'Presse medicale'],['Hypertension in the pregnant woman].', '', '2010', 'Medical'],['Arterial hypertension.', '', '1920', 'La Nouvelle']]
print (csvarticles[0])
Desired output:
[['beta blockers', 'magic', '1980', 'presse medicale'], ['hypertension in the pregnant woman', '', '2010', 'medical'], ['arterial hypertension', '', '1920', 'la nouvelle']]
Code 1:
csvarticles = [[word.lower().split() for word in nodeList] for nodeList in csvarticles]
print (csvarticles[0])
Code 1 output:
['[Beta-blockers]', 'Magic!', '1980', 'Presse medicale']
[['[beta-blockers]'], ['magic!'], ['1980'], ['presse', 'medicale']]
Code 2:
csvarticles = [[word.lower().split() for word in nodeList if word.isalpha()] for nodeList in csvarticles]
Code 2 output:
[]
Code 3:
articleTitle = []
for x, y in enumerate(csvarticles):
    myString = simpleWords(csvarticles[x][0])
    if myString is not '':
        myString = myString.lower()
        myString = re.sub('[\W_]+', ' ', myString, flags=re.UNICODE)
        myList = [word for word in myString.split() if len(word) > 3]
        articleTitle = ' '.join(myList)
Code 3 output:
['beta blockers', 'magic', '1980', 'presse medicale', 'hypertension pregnant woman', '2010', 'medical', 'arterial hypertension', '1920', 'nouvelle']
Code 3 comes close, but it discards the nested list structure.
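Answer 0 (score: 3)
You want to replace every character that is not a space or a word character (a letter, digit, or underscore, which is what \w matches), then strip and lowercase each string. A regular expression chained with str.strip handles that replacement efficiently.
Rebuild the nested list with a double list comprehension: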
import re
csvarticles = [['[Beta-blockers]', 'Magic!', '1980', 'Presse medicale'],['Hypertension in the pregnant woman].', '', '2010', 'Medical'],['Arterial hypertension.', '', '1920', 'La Nouvelle']]
result = [[re.sub("[^ \w]"," ",x).strip().lower() for x in y] for y in csvarticles]
print(result)
Prints:
[['beta blockers', 'magic', '1980', 'presse medicale'], ['hypertension in the pregnant woman', '', '2010', 'medical'], ['arterial hypertension', '', '1920', 'la nouvelle']]
If you are using Python 3, replace lower with casefold for more aggressive, Unicode-aware case folding (for example, 'ß' becomes 'ss').
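As a hedged illustration of the casefold variant, here is a minimal sketch assuming Python 3.3 or later. The function name `clean_rows` and the sample data are hypothetical, introduced only for this example; the regular expression is the one from the answer.

```python
import re

def clean_rows(rows):
    """Return a new list of lists; the nesting of the input is preserved."""
    return [
        # Replace anything that is not a space or a word character with a
        # space, trim the ends, then case-fold (Python 3 only).
        [re.sub(r"[^ \w]", " ", cell).strip().casefold() for cell in row]
        for row in rows
    ]

# Hypothetical sample data, not taken from the question.
sample = [["[Beta-blockers]", "Magic!"], ["Straße.", ""]]
cleaned = clean_rows(sample)
print(cleaned)  # [['beta blockers', 'magic'], ['strasse', '']]
```

With lower() in place of casefold(), 'Straße' would stay 'straße', which matters if the cleaned strings are later compared case-insensitively.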
Answer 1 (score: 1)
Use the str.isalnum() method to check whether a character is a letter or a digit.
Demo output:
[['betablockers', 'magic', '1980', 'presse medicale'], ['hypertension in the pregnant woman', '', '2010', 'medical'], ['arterial hypertension', '', '1920', 'la nouvelle']]
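A minimal sketch of how the str.isalnum() approach might look, under the assumption that whitespace is kept and every other non-alphanumeric character is dropped. The helper name `keep_alnum` is hypothetical and this is not necessarily the answerer's own code.

```python
def keep_alnum(text):
    """Keep letters, digits and whitespace, drop the rest, then lowercase."""
    return "".join(ch for ch in text if ch.isalnum() or ch.isspace()).lower()

csvarticles = [
    ['[Beta-blockers]', 'Magic!', '1980', 'Presse medicale'],
    ['Hypertension in the pregnant woman].', '', '2010', 'Medical'],
    ['Arterial hypertension.', '', '1920', 'La Nouvelle'],
]

# The double comprehension keeps the list-of-lists structure intact.
result = [[keep_alnum(cell) for cell in row] for row in csvarticles]
print(result)
```

Because the hyphen is removed rather than replaced with a space, '[Beta-blockers]' becomes 'betablockers', unlike the regex-based answer, which yields 'beta blockers'.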
Answer 2 (score: 1)
If you want to do this in a single line:
INPUT:
output = [[k.lower() for k in [' '.join(re.findall(r'[^\]\[.!-][A-z0-9]+[^\]\[.!-]', j)) for j in i]] for i in csvarticles]
Output:
[['beta blockers', 'magic', '1980', 'presse medicale'], ['hypertension in the pregnant woman', '', '2010', 'medical'], ['arterial hypertension', '', '1920', 'la nouvelle']]
REGEX:
[^\]\[.!-][A-z0-9]+[^\]\[.!-]
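Two details of this pattern are worth knowing: the range A-z also covers the ASCII punctuation that sits between 'Z' and 'a' ([, \, ], ^, _ and `), and each match needs at least three characters, so one- or two-character words at the end of a string are skipped. A simpler sketch, assuming the goal is only to keep runs of ASCII letters and digits, is shown below; the helper name `alnum_words` is hypothetical.

```python
import re

def alnum_words(text):
    """Join the runs of ASCII letters and digits with single spaces, lowercased."""
    return " ".join(re.findall(r"[A-Za-z0-9]+", text)).lower()

csvarticles = [
    ['[Beta-blockers]', 'Magic!', '1980', 'Presse medicale'],
    ['Hypertension in the pregnant woman].', '', '2010', 'Medical'],
    ['Arterial hypertension.', '', '1920', 'La Nouvelle'],
]

# Same nested-comprehension shape as the one-liner above.
output = [[alnum_words(cell) for cell in row] for row in csvarticles]
print(output)
```

This variant only handles ASCII text; accented letters would be treated as separators, so the \w-based pattern from the first answer is the safer choice for non-English titles.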