How to remove a specific list of libsvm values from a separate list

Asked: 2020-06-10 20:14:53

Tags: python list-comprehension sparse-matrix libsvm

def parseline(line):
    line = line.values.flatten().tolist() # flatten labeled point pandas dataframe to python list
    strLine1 = listToString(line) # custom function just converts list to string for regex operations.
    strLine2 = re.sub(r"^1:1 |2:\d+.\d+ ","",strLine1) # filter string to eliminate first two indices; python string
    splitLine = strLine2.replace("0    ", "").split(" ") # eliminate specific val; split on spaces; python list of strings

    positive = 0 # variable for presence/absence of something instantiated

    for feature in splitLine:
        featureIndex = feature.split(":")[0]
        featureValue = feature.split(":")[1]

        if featureIndex in toRemove: # toRemove is a list of vals to eliminate from each line; this works
            positive = 1 

        newLine = ""

        if positive == 1:
            newLine = [i for i in toRemove not in splitLine] # goal here is to remove values found in the toRemove from the newLine 
            newLine = "1" + " " + newLine
            print(newLine)
        else:
            newLine = "0" + " " + strLine2

        return newLine

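This is some code from a project I am finishing. I have successfully produced a list containing the values I do not want included in each line. That list is called 'toRemove'.

The conditional 'if featureIndex in toRemove' works. I confirmed this with a print statement that prints "this index needs to be removed from the final list" next to every 'featureIndex' found in 'toRemove'.

The problem is with the second conditional (if positive == 1, otherwise else): the 'if positive == 1' branch returns a list that is just a copy of 'toRemove'. The 'else' branch does return the correct line.

For example: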

'if positive == 1:' list output:
['20', '68', '112', '264', '384', '449', '454', '749', '839',...] #this is just a copy of the 'toRemove' list

'else:' list output:
0 3:0.0 4:1 12:1 36710:1 36725:1 36791:1 86715:1 98190:1
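For comparison, here is a minimal sketch of a comprehension that filters a line's features by index. In the listing above, the expression `toRemove not in splitLine` evaluates to a single boolean, so it cannot act as the sequence the comprehension iterates over. The sample values and the snake_case names below are hypothetical and only mirror the question's variables.

```python
# Hypothetical sample data; the names mirror the question's variables.
to_remove = ["20", "68", "112"]
split_line = ["3:0.0", "4:1", "9:1", "20:1", "40:1"]

# Iterate over the features of the line (not over to_remove) and keep
# only those whose index part is absent from to_remove.
kept = [feature for feature in split_line
        if feature.split(":")[0] not in to_remove]

# Join the surviving features back into a string before adding the label.
new_line = "1 " + " ".join(kept)
print(new_line)
```

Joining with `" ".join(...)` turns the filtered list back into a string, so the label can be prepended with ordinary string concatenation.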

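I initially tried to troubleshoot this as a data type issue, hence the bookkeeping comments next to the conversion statements.

Where am I going wrong?

EDIT: The input file sent through the 'parseline' function has the following format: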

1:1 2:00 3:00 4:1 9:1 20:1 40:1... # say index 20 is one of the indices in 'toRemove'
1:1 2:10 3:00 45:1 85:1 99:1 100:1... # say none of the index vals in this line are in 'toRemove'

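'parseline(line)' removes indices 1 and 2, then goes through the 'toRemove' list to remove those items from the line, outputting a 'newLine' string for each line in the original input file.

The 'newLine' output for the same two example inputs should be: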

1 3:00 4:1 9:1 40:1... #notice index 20 is gone, and its presence in the list is accounted for by the 1 

0 3:00 45:1 85:1 99:1 100:1... #notice since none of the indices in the original list were in the 'toRemove' list, 
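The behaviour described above could be sketched roughly as follows. This is not the asker's final code: it assumes each line arrives as a plain string rather than a pandas row, that 'toRemove' holds string indices, and it drops indices 1 and 2 by comparing index values instead of using the regex from the listing. The name `parse_line` and its `to_remove` parameter are hypothetical.

```python
def parse_line(line, to_remove):
    """Drop indices 1 and 2, strip flagged features, prefix a 0/1 label.

    Assumes `line` is a space-separated string of index:value pairs and
    `to_remove` is an iterable of string indices (both are assumptions).
    """
    flagged = set(to_remove)  # set membership is faster than list membership

    # Discard the first two indices, which are never part of the output.
    features = [token for token in line.split()
                if token.split(":")[0] not in ("1", "2")]

    # Keep only the features whose index is not flagged for removal.
    kept = [token for token in features
            if token.split(":")[0] not in flagged]

    # Label is 1 when at least one flagged index was present, else 0.
    label = "1" if len(kept) < len(features) else "0"
    return label + " " + " ".join(kept)


print(parse_line("1:1 2:00 3:00 4:1 9:1 20:1 40:1", ["20", "68"]))
print(parse_line("1:1 2:10 3:00 45:1 85:1 99:1 100:1", ["20", "68"]))
```

Unlike the listing, the function builds the whole line before returning, so the result does not depend on which feature the loop happens to reach first.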

1 Answer:

Answer 0 (score: 0)

It was a data type problem. The issue has been resolved. Thanks, everyone.
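The answer does not say which types were involved. As an illustration only, two mismatches of this kind that can occur in code like the listing are sketched below: comparing a string index against integer entries, and concatenating a string with a list. All values here are hypothetical.

```python
# Assumption: the removal list holds ints while parsed indices are strings.
to_remove_int = [20, 68, 112]
feature_index = "20:1".split(":")[0]  # str.split always yields strings

mismatch = feature_index in to_remove_int      # "20" is not equal to 20
matched = int(feature_index) in to_remove_int  # types agree after conversion
print(mismatch, matched)

# Assumption: the filtered features are still a list when the label is added.
kept = ["3:0.0", "4:1"]
try:
    "1" + " " + kept  # str + list raises TypeError
except TypeError as exc:
    print("TypeError:", exc)

joined = "1" + " " + " ".join(kept)  # convert the list to a string first
print(joined)
```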