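I have a list of strings called "str_tuple". I want to compute a similarity measure between the first element of the list and each of the remaining elements. To do that, I run the six-line snippet below.
What puzzles me is that the result seems completely random every time I run the code, yet I cannot see any source of randomness in my six lines.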
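It was pointed out that TruncatedSVD() has a "random_state" argument. Specifying "random_state" does give a fixed result (absolutely true). However, if you change "random_state", the result changes. For other strings (e.g. str2), the result is the same no matter how you change "random_state". These strings come from the HOME_DEPOT Kaggle competition. I have a pd.Series containing thousands of such strings, and most of them give non-random results and behave like str2 (regardless of the "random_state" setting). For some unknown reason, str1 is one of the examples that gives a different result every time "random_state" changes. I am starting to think that some intrinsic characteristic of str1 makes the difference.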
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
# str1 yields random results
str1 = [u'l bracket', u'simpson strong tie 12 gaug angl', u'angl make joint stronger provid consist straight corner simpson strong tie offer wide varieti angl various size thick handl light duti job project structur connect need bent skew match project outdoor project moistur present use zmax zinc coat connector provid extra resist corros look "z" end model number .versatil connector various 90 connect home repair projectsstrong angl nail screw fasten alonehelp ensur joint consist straight strongdimensions: 3 in. xbi 3 in. xbi 1 0.5 in. made 12 gaug steelgalvan extra corros resistanceinstal 10 d common nail 9 xbi 1 0.5 in. strong drive sd screw', u'simpson strong-tie', u'', u'versatile connector for various 90\xe2\xb0 connections and home repair projects stronger than angled nailing or screw fastening alone help ensure joints are consistently straight and strong dimensions: 3 in. x 3 in. x 1-1/2 in. made from 12-gauge steel galvanized for extra corrosion resistance install with 10d common nails or #9 x 1-1/2 in. strong-drive sd screws']
# str2 yields non-random result
str2 = [u'angl bracket', u'simpson strong tie 12 gaug angl', u'angl make joint stronger provid consist straight corner simpson strong tie offer wide varieti angl various size thick handl light duti job project structur connect need bent skew match project outdoor project moistur present use zmax zinc coat connector provid extra resist corros look "z" end model number .versatil connector various 90 connect home repair projectsstrong angl nail screw fasten alonehelp ensur joint consist straight strongdimensions: 3 in. xbi 3 in. xbi 1 0.5 in. made 12 gaug steelgalvan extra corros resistanceinstal 10 d common nail 9 xbi 1 0.5 in. strong drive sd screw', u'simpson strong-tie', u'', u'versatile connector for various 90\xe2\xb0 connections and home repair projects stronger than angled nailing or screw fastening alone help ensure joints are consistently straight and strong dimensions: 3 in. x 3 in. x 1-1/2 in. made from 12-gauge steel galvanized for extra corrosion resistance install with 10d common nails or #9 x 1-1/2 in. strong-drive sd screws']
vectorizer = CountVectorizer(token_pattern=r"\d+\.\d+|\d+\/\d+|\b\w+\b")
# replacing str1 with str2 gives a non-random result regardless of random_state
cmat = vectorizer.fit_transform(str1).astype(float) # sparse matrix
cmat = TruncatedSVD(2).fit_transform(cmat) # dense numpy array
cmat = Normalizer().fit_transform(cmat) # dense numpy array
sim = np.dot(cmat, cmat.T)
sim[0,1:].tolist()
Answer 0 (score: 3)
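By default, TruncatedSVD uses a randomized algorithm. To get repeatable output you therefore have to specify random_state, either as an integer seed or as a numpy.random.RandomState instance.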
cmat = TruncatedSVD(n_components=2, random_state=42).fit_transform(cmat)
Docs:
class sklearn.decomposition.TruncatedSVD(n_components=2, algorithm='randomized', n_iter=5, random_state=None, tol=0.0)
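As a minimal sketch of the point about random_state, the snippet below wraps the question's pipeline in a helper and calls it twice with the same seed. The helper name first_vs_rest and the toy corpus are hypothetical and exist only for illustration; the toy strings are not taken from the Home Depot data.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import Normalizer

# Same token pattern as in the question.
TOKEN_PATTERN = r"\d+\.\d+|\d+\/\d+|\b\w+\b"


def first_vs_rest(docs, seed):
    """Cosine similarity (in the 2-D SVD space) of docs[0] to every other doc."""
    vectorizer = CountVectorizer(token_pattern=TOKEN_PATTERN)
    counts = vectorizer.fit_transform(docs).astype(float)  # sparse matrix
    reduced = TruncatedSVD(n_components=2, random_state=seed).fit_transform(counts)
    unit = Normalizer().fit_transform(reduced)  # rows scaled to unit length
    return np.dot(unit, unit.T)[0, 1:]


# Hypothetical toy corpus, used only for illustration.
toy_docs = [
    "angl bracket",
    "strong tie 12 gaug angl",
    "angl make joint stronger",
    "strong tie",
]

run_a = first_vs_rest(toy_docs, seed=42)
run_b = first_vs_rest(toy_docs, seed=42)
print(np.allclose(run_a, run_b))  # compares two runs that used the same seed
```

Calling the helper in a loop over several seeds is a quick way to see whether a given list is sensitive to random_state at all.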
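For the output to be non-random, the leading token of the list must occur more than once. That is, if str1 started with angl, versatile, or simpson, it would give non-random results. str2 starts with angl, which is repeated at least once elsewhere in the list, so it does not return random output.
The randomness is therefore a measure of how dissimilar the occurrences of elements in the given list are. In those cases, specifying random_state is useful for producing a single, reproducible output.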
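One way to screen a list for the situation described above is to check whether the first string shares any token with the remaining strings. This is a sketch under an assumption: it reads "the leading element occurs more than once" as "some token of the first string also appears in a later string", which is an interpretation and not something stated explicitly above. The helper name shares_tokens_with_rest is hypothetical; the regex mirrors the question's token_pattern, and the lowercasing mirrors CountVectorizer's default.

```python
import re

# Mirrors the token_pattern passed to CountVectorizer in the question.
TOKEN_PATTERN = re.compile(r"\d+\.\d+|\d+/\d+|\b\w+\b")


def tokenize(text):
    """Lowercase and split a string the way the question's vectorizer would."""
    return TOKEN_PATTERN.findall(text.lower())


def shares_tokens_with_rest(docs):
    """Return True if any token of docs[0] also occurs in docs[1:]."""
    first_tokens = set(tokenize(docs[0]))
    rest_tokens = set()
    for doc in docs[1:]:
        rest_tokens.update(tokenize(doc))
    return bool(first_tokens & rest_tokens)


# Shortened, hypothetical versions of the question's two lists.
print(shares_tokens_with_rest(["l bracket", "simpson strong tie 12 gaug angl"]))
print(shares_tokens_with_rest(["angl bracket", "simpson strong tie 12 gaug angl"]))
```

Applied to the full lists from the question, this check should flag str1 (neither "l" nor "bracket" appears in a later string) and pass str2 ("angl" does).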
[Thanks to @wen for pointing this out]