I clustered two datasets, data1 and data2, using MATLAB's kmeans function. I have three main files, which contain the following code respectively:
result1 = kmeans(data1, 4);
result2 = kmeans(data2, 4);
r1 = kmeans(data1,4);
r2 = kmeans(data2,4);
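I noticed that result1 and r1 are identical, but result2 and r2 differ slightly. I believe this is caused by the randomness in the kmeans algorithm. In the first and second files, data1 is processed first, so kmeans starts from the same "seed". In the first and third files, data2 is processed at different stages: in the first file, the kmeans call for result1 has already advanced the random number generator, which affects the kmeans call that follows it.
My question is: can we set the seed somehow so that r2 and result2 are the same?
Answer 0 (score: 2)
You can use the rng function to control random number generation in MATLAB. With it, you can capture the state of the random number generator before running your code, then set the generator back to that state before running the code again, which ensures you get the same result. For example: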
rngState1 = rng; % Capture state before processing data1
result1 = kmeans(data1, 4);
rngState2 = rng; % Capture state before processing data2
result2 = kmeans(data2, 4);
...
rng(rngState1); % Restore state previously used for processing data1
r1 = kmeans(data1,4);
...
rng(rngState2); % Restore state previously used for processing data2
r2 = kmeans(data2,4);
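Since you process the data in separate files, this may mean saving the state variables to a MAT-file and loading them back from it in order to do what I outlined above. Another option is to simply set the seed to a given value before processing each dataset: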
rng(1); % Set seed to 1 for data1
result1 = kmeans(data1, 4);
rng(2); % Set seed to 2 for data2
result2 = kmeans(data2, 4);
...
rng(1);
r1 = kmeans(data1,4);
...
rng(2);
r2 = kmeans(data2,4);
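If the files run as separate scripts, the generator state from the first approach has to travel between them through a file. A minimal sketch of that, assuming a hypothetical file name rngStates.mat and placeholder random data standing in for data1 and data2 (kmeans requires the Statistics and Machine Learning Toolbox):

```matlab
% Placeholder data for illustration only; substitute your own data1 and data2.
data1 = rand(100, 2);
data2 = rand(100, 2);

% --- First file: capture the generator state before each call, then save ---
rngState1 = rng;                 % state in effect when data1 is clustered
result1 = kmeans(data1, 4);
rngState2 = rng;                 % state in effect when data2 is clustered
result2 = kmeans(data2, 4);
save('rngStates.mat', 'rngState1', 'rngState2');   % hypothetical file name

% --- Third file: load the saved state and restore it before clustering ---
s = load('rngStates.mat', 'rngState2');
rng(s.rngState2);
r2 = kmeans(data2, 4);

% Compare the cluster assignments from the two runs.
sameAssignments = isequal(r2, result2);
```

The saved state only reproduces the result if the third file clusters exactly the same data2 with the same kmeans options as the first file.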
Answer 1 (score: 0)
Another option is to use a non-random initialization. You may have a good strategy for initializing your means non-randomly, depending on what your data looks like. For example, for 2D data inside a rectangular domain, you could pick the four corners of the domain as the initial means.
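One way to try the four-corners idea is the 'Start' name-value argument of kmeans, which accepts a k-by-p matrix of initial centroid locations. A minimal sketch, assuming placeholder 2D data and taking the corners of the data's bounding rectangle (the variable names are illustrative, not from the original answer):

```matlab
% Placeholder 2D data for illustration only; substitute your own.
data = rand(200, 2);

% Corners of the bounding rectangle of the data, one row per initial centroid.
lo = min(data, [], 1);
hi = max(data, [], 1);
startCentroids = [lo(1) lo(2);
                  lo(1) hi(2);
                  hi(1) lo(2);
                  hi(1) hi(2)];

% With an explicit 'Start' matrix, the initial centroids are not drawn
% from the random number generator.
idxA = kmeans(data, 4, 'Start', startCentroids);
rng('shuffle');                  % deliberately disturb the generator
idxB = kmeans(data, 4, 'Start', startCentroids);

% Compare the cluster assignments from the two runs.
sameAssignments = isequal(idxA, idxB);
```

The number of rows in the 'Start' matrix must equal the number of clusters requested. Whether the corners are good starting points depends on the data; if a corner is far from every point, that cluster can end up empty, so other deterministic choices may suit your data better.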