I have a script containing two classes. (I've obviously cut out a lot of material that I believe is unrelated to the bug I'm working on.) The end goal is to build a decision tree, as I mentioned in this question.
Unfortunately, I'm getting an infinite loop, and I'm having a hard time working out why. I've identified the line of code that goes haywire, but I would have thought the iterator and the list I'm appending to would be different objects. Does a list's .append method have some side effect I'm not aware of, or am I making some other blindingly obvious mistake?
class Dataset:
    individuals = [] # Becomes a list of dictionaries, in which each dictionary is a row from the CSV with the headers as keys
    def field_set(self): # Returns a list of the fields in individuals[] that can be used to split the data (i.e. have more than one value amongst the individuals)
    def classified(self, predicted_value): # Returns True if all the individuals have the same value for predicted_value
    def fields_exhausted(self, predicted_value): # Returns True if all the individuals are identical except for predicted_value
    def lowest_entropy_value(self, predicted_value): # Returns the field that will reduce entropy (http://en.wikipedia.org/wiki/Entropy_%28information_theory%29) the most
    def __init__(self, individuals=[]):
and
class Node:
    ds = Dataset() # The data that is associated with this Node
    links = [] # List of Nodes, the offspring Nodes of this node
    level = 0 # Tree depth of this Node
    split_value = '' # Field used to split out this Node from the parent node
    node_value = '' # Value used to split out this Node from the parent Node
    def split_dataset(self, split_value): # Splits the dataset into a series of smaller datasets, each of which has a unique value for split_value. Then creates subnodes to store these datasets.
        fields = [] # List of options for split_value amongst the individuals
        datasets = {} # Dictionary of Datasets, each one with a value from fields[] as its key
        for field in self.ds.field_set()[split_value]: # Populates the keys of fields[]
            fields.append(field)
            datasets[field] = Dataset()
        for i in self.ds.individuals: # Adds individuals to the Dataset in datasets that matches their result for split_value
            datasets[i[split_value]].individuals.append(i) # <--- Causes an infinite loop on the second hit
        for field in fields: # Creates subnodes from each of the Dataset options in datasets
            self.add_subnode(datasets[field], split_value, field)
    def add_subnode(self, dataset, split_value='', node_value=''):
    def __init__(self, level, dataset=Dataset()):
My initialization code currently looks like this:
if __name__ == '__main__':
    filename = (sys.argv[1]) # Takes in a CSV file
    predicted_value = "# class" # Identifies the field from the CSV file that should be predicted
    base_dataset = parse_csv(filename) # Turns the CSV file into a list of lists
    parsed_dataset = individual_list(base_dataset) # Turns the list of lists into a list of dictionaries
    root = Node(0, Dataset(parsed_dataset)) # Creates a root node, passing it the full dataset
    root.split_dataset(root.ds.lowest_entropy_value(predicted_value)) # Performs the first split, creating multiple subnodes
    n = root.links[0]
    n.split_dataset(n.ds.lowest_entropy_value(predicted_value)) # Attempts to split the first subnode.
Answer 0 (score: 4)
I suspect that you are appending to the same list you are iterating over, causing it to grow before the iterator can reach its end. Try iterating over a copy of the list instead:
for i in list(self.ds.individuals):
    datasets[i[split_value]].individuals.append(i)
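To see why the original loop never terminates, here is a minimal, self-contained sketch of the same failure mode. The names shared and items are hypothetical stand-ins for the single individuals list that self.ds.individuals and datasets[i[split_value]].individuals can both end up pointing at:

shared = [1, 2, 3]
items = shared                # two names, one underlying list object

# for i in items:             # BROKEN: each append adds an element the iterator
#     shared.append(i)        # has not reached yet, so the loop never ends

for i in list(items):         # list(...) snapshots the current elements first
    shared.append(i)

print(shared)                 # [1, 2, 3, 1, 2, 3] -- the loop ran exactly three times

items[:] works equally well as a snapshot; the key point is that later appends to shared no longer feed the loop.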
Answer 1 (score: 4)
class Dataset:
    individuals = []
Suspicious. You shouldn't do this unless you want all instances of Dataset to share a single static member list. If you are setting self.individuals = something in __init__, then you don't need to set individuals here as well.
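A minimal illustration of the class-attribute pitfall; SharedDataset and FixedDataset are hypothetical stripped-down classes, not the asker's code:

class SharedDataset:
    individuals = []            # class attribute: one list shared by every instance

a = SharedDataset()
b = SharedDataset()
a.individuals.append('row')     # mutates the shared class-level list
print(b.individuals)            # ['row'] -- b sees a's append

class FixedDataset:
    def __init__(self):
        self.individuals = []   # instance attribute: a fresh list per object

c = FixedDataset()
d = FixedDataset()
c.individuals.append('row')
print(d.individuals)            # [] -- instances no longer share state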
def __init__(self, individuals=[]):
Still suspicious. Are you assigning the individuals argument to self.individuals? If so, the same individuals list, created once at function-definition time, will be assigned to every Dataset created with the default argument. Add an item to one Dataset's list, and all the other Datasets created without an explicit individuals argument will get that item too.
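A short, hypothetical sketch of both the shared-default behaviour and the conventional fix (the None-sentinel idiom); BrokenDataset is illustrative, not the asker's class:

class BrokenDataset:
    def __init__(self, individuals=[]):   # [] is evaluated once, when def runs
        self.individuals = individuals

a = BrokenDataset()
b = BrokenDataset()
a.individuals.append('row')
print(b.individuals)                      # ['row'] -- both defaults are the same list

class Dataset:
    def __init__(self, individuals=None):
        # None sentinel: build a fresh list on every call
        self.individuals = individuals if individuals is not None else []

c = Dataset()
d = Dataset()
c.individuals.append('row')
print(d.individuals)                      # [] -- no more sharing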
Similarly:
class Node:
    def __init__(self, level, dataset=Dataset()):
All Nodes created without an explicit dataset argument will receive the exact same default Dataset instance.
This is the mutable default argument problem, and the kind of destructive iteration it produces seems very likely to be the cause of your infinite loop.
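Applying the same sentinel pattern to Node might look like the sketch below. It assumes the constructor simply stores its arguments, which the asker's stubs don't show, so treat it as illustrative rather than a drop-in replacement:

class Node:
    def __init__(self, level, dataset=None):
        self.ds = dataset if dataset is not None else Dataset()  # fresh Dataset per Node
        self.links = []          # per-instance list, not a shared class attribute
        self.level = level
        self.split_value = ''
        self.node_value = ''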