Cleaning text data in Python

Date: 2019-09-05 18:10:10

Tags: python text data-science data-cleaning

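Could anyone give me a tip on cleaning text data? The data I have is in a list (master_list), and I am trying to create a loop or function that removes the extra [] brackets as well as the None entries, so that the data in master_list is basically just strings separated by commas.

Any help is greatly appreciated.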

master_list = [['the supply fan speed mean is over 90% like the fan isnt building static, mean value recorded is 94.3.', 'the supply fan is running, the VFD speed output mean value is 94.3.'], None, ['the supply fan speed mean is over 90% like the fan isnt building static, mean value recorded is 94.2.', 'the supply fan is running, the VFD speed output mean value is 94.2.'], None, ['the supply fan speed mean is over 90% like the fan isnt building static, mean value recorded is 94.1.', 'the supply fan is running, the VFD speed output mean value is 94.1.'], None, ['the supply fan speed mean is over 90% like the fan isnt building static, mean value recorded is 94.0.', 'the supply fan is running, the VFD speed output mean value is 94.0.'], None, ['the supply fan speed mean is over 90% like the fan isnt building static, mean value recorded is 93.9.', 'the supply fan is running, the VFD speed output mean value is 93.9.'], None]

4 answers:

Answer 0 (score: 1)

You want to flatten the list, so that [[1, 2], [3, 4]] becomes [1, 2, 3, 4]. One way to do that is with a list comprehension: [x for sublist in my_list for x in sublist]
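As a quick check of the comprehension above, here is a minimal sketch on made-up sample data (the values are illustrative, not taken from the question). The `for` clauses read in the same order as the equivalent nested loops: outer first, inner second.

```python
# Hypothetical sample data, only for illustration.
my_list = [[1, 2], [3, 4]]

# Outer loop over the sublists, inner loop over each sublist's items.
flat = [x for sublist in my_list for x in sublist]
print(flat)  # [1, 2, 3, 4]
```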

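However, your data also contains None where a list is expected, so those entries need to be filtered out. The sublists themselves may also contain None, which needs to be removed as well. So [[1, 2], None, [None, 3, ""]] becomes [1, 2, 3].

For the first part (dropping None where a list is expected), we can use the or operator: sublist or [] effectively replaces each None with an empty list. We cannot iterate over None, but we can iterate over an empty list.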
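A small sketch of the `sublist or []` trick on its own, using illustrative data. Since `None or []` evaluates to `[]`, the inner loop simply runs zero times for those entries.

```python
# Hypothetical sample data with None in place of some sublists.
my_list = [[1, 2], None, [3]]

# None is falsy, so `sublist or []` yields an empty list to iterate over.
flat = [x for sublist in my_list for x in (sublist or [])]
print(flat)  # [1, 2, 3]
```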

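For the second part (removing None values inside the sublists, along with other "falsy" values such as empty strings or zero), add a condition to the end of the list comprehension: [... if x].

So the final result is:

[x for sublist in my_list for x in (sublist or []) if x]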
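One thing worth checking before relying on the `if x` filter is whether dropping every falsy value is really wanted. The sketch below, on illustrative data, compares it with a stricter `is not None` test; the variable names are made up for the example.

```python
# Hypothetical sample data containing None, an empty string and a zero.
my_list = [[1, 2], None, [None, 3, "", 0]]

# `if x` drops every falsy item: None, "" and 0.
drop_falsy = [x for sublist in my_list for x in (sublist or []) if x]

# `if x is not None` drops only None and keeps "" and 0.
drop_none_only = [x for sublist in my_list for x in (sublist or []) if x is not None]

print(drop_falsy)      # [1, 2, 3]
print(drop_none_only)  # [1, 2, 3, '', 0]
```

For the question's data, which holds only non-empty strings and None, both variants give the same result.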

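Answer 1 (score: 0)

It looks like you want a flat list rather than a list of lists, and you also want to drop the None objects. The list can be flattened with the method described in this answer. All you need to do is add an if clause in the middle.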

master_list = [x for sublist in master_list if sublist is not None for x in sublist]

Output:

['the supply fan speed mean is over 90% like the fan isnt building static, mean value recorded is 94.3.',
 'the supply fan is running, the VFD speed output mean value is 94.3.',
 'the supply fan speed mean is over 90% like the fan isnt building static, mean value recorded is 94.2.',
 'the supply fan is running, the VFD speed output mean value is 94.2.',
 'the supply fan speed mean is over 90% like the fan isnt building static, mean value recorded is 94.1.',
 'the supply fan is running, the VFD speed output mean value is 94.1.',
 'the supply fan speed mean is over 90% like the fan isnt building static, mean value recorded is 94.0.',
 'the supply fan is running, the VFD speed output mean value is 94.0.',
 'the supply fan speed mean is over 90% like the fan isnt building static, mean value recorded is 93.9.',
 'the supply fan is running, the VFD speed output mean value is 93.9.']
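As an alternative that is not part of the answer above, the standard library's `itertools.chain.from_iterable` can do the same flattening. This is only a sketch, using a short made-up stand-in for master_list.

```python
from itertools import chain

# Hypothetical short stand-in for the question's master_list.
master_list = [["a", "b"], None, ["c"], None]

# Filter out the None entries first, then chain the remaining sublists together.
flat = list(chain.from_iterable(s for s in master_list if s is not None))
print(flat)  # ['a', 'b', 'c']
```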

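Answer 2 (score: 0)

By "remove the extra []" I take it you mean flattening the array. To do that, create a new empty list and append each sublist to the end of it. In Python, using the + operator on lists concatenates them.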

new_list = []
for sublist in master_list:
    if sublist is None:
        continue  # skip None entries; list(None) would raise a TypeError
    new_list += list(sublist)  # cast the sublist to a list in case it is not already



To remove unwanted objects from a list, create a remove_all function that removes every occurrence of a given value:

def remove_all(lst, val):
    return [item for item in lst if not item == val]
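One possible way to combine remove_all with the concatenation loop is sketched below: strip the None entries first, then concatenate what remains. The short master_list here is a made-up stand-in for the question's data.

```python
def remove_all(lst, val):
    return [item for item in lst if not item == val]

# Hypothetical short stand-in for the question's master_list.
master_list = [["a", "b"], None, ["c"], None]

# Remove the None entries, leaving only real sublists.
cleaned = remove_all(master_list, None)

# Concatenate the remaining sublists into one flat list.
new_list = []
for sublist in cleaned:
    new_list += sublist

print(new_list)  # ['a', 'b', 'c']
```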



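Also, this Medium article covers more text transformations you may want to apply when cleaning data.

=========================================================================
If the lists in that list contain nested lists of their own, you will need a recursive flatten function: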

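def flatten(item):
    # Base case: a non-list item becomes a one-element list
    if not isinstance(item, list):
        return [item]
    # Recursive case: flatten each element and concatenate the results
    new_list = []
    for val in item:
        new_list += flatten(val)
    return new_list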
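A usage sketch for the recursive function, on made-up nested data. Note that flatten treats None like any other non-list item, so the None values survive the flattening and have to be filtered out in a separate step.

```python
def flatten(item):
    # Base case: a non-list item becomes a one-element list
    if not isinstance(item, list):
        return [item]
    # Recursive case: flatten each element and concatenate the results
    new_list = []
    for val in item:
        new_list += flatten(val)
    return new_list

# Hypothetical nested data, deeper than the question's master_list.
nested = [["a", ["b", None]], None, "c"]

flat = flatten(nested)
print(flat)  # ['a', 'b', None, None, 'c']

# Drop the None values that survived the flattening.
cleaned = [x for x in flat if x is not None]
print(cleaned)  # ['a', 'b', 'c']
```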

Answer 3 (score: 0)

List comprehension for the win.

master_list = [['the supply fan speed mean is over 90% like the fan isnt building static, mean value recorded is 94.3.', 'the supply fan is running, the VFD speed output mean value is 94.3.'], None, ['the supply fan speed mean is over 90% like the fan isnt building static, mean value recorded is 94.2.', 'the supply fan is running, the VFD speed output mean value is 94.2.'], None, ['the supply fan speed mean is over 90% like the fan isnt building static, mean value recorded is 94.1.', 'the supply fan is running, the VFD speed output mean value is 94.1.'], None, ['the supply fan speed mean is over 90% like the fan isnt building static, mean value recorded is 94.0.', 'the supply fan is running, the VFD speed output mean value is 94.0.'], None, ['the supply fan speed mean is over 90% like the fan isnt building static, mean value recorded is 93.9.', 'the supply fan is running, the VFD speed output mean value is 93.9.'], None]
master_list = [i for x in master_list if x for i in x]