Error after reinstalling sklearn

Time: 2014-07-25 17:08:50

Tags: python numpy machine-learning scipy scikit-learn

As soon as I updated sklearn to a newer version, I started getting the following error, and I don't know why.

    Traceback (most recent call last):
      File "/Users/X/Courses/Project/SupportVectorMachine/main.py", line 95, in <module>
        y, x = dmatrices(formula, data=finalDataFrame, return_type='matrix')
      File "/Library/Python/2.7/site-packages/patsy/highlevel.py", line 297, in dmatrices
        NA_action, return_type)
      File "/Library/Python/2.7/site-packages/patsy/highlevel.py", line 156, in _do_highlevel_design
        return_type=return_type)
      File "/Library/Python/2.7/site-packages/patsy/build.py", line 947, in build_design_matrices
        value, is_NA = evaluator.eval(data, NA_action)
      File "/Library/Python/2.7/site-packages/patsy/build.py", line 85, in eval
        return result, NA_action.is_numerical_NA(result)
      File "/Library/Python/2.7/site-packages/patsy/missing.py", line 135, in is_numerical_NA
        mask |= np.isnan(arr)
    TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule 'safe'

Here is the corresponding code. I reinstalled everything, from NumPy to SciPy, patsy, and so on, but nothing helped.

    # Merging the two dataframes - user and the tweets
    finalDataFrame = pandas.merge(twitterDataFrame.reset_index(), twitterUserDataFrame.reset_index(), on=['UserID'], how='inner')
    finalDataFrame = finalDataFrame.drop_duplicates()
    finalDataFrame['FrequencyOfTweets'] = numpy.all(numpy.isfinite(finalDataFrame['FrequencyOfTweets']))

    # model formula, ~ means = and C() lets the classifier know its categorical data.
    formula = 'Classifier ~ InReplyToStatusID + InReplyToUserID + RetweetCount + FavouriteCount + Hashtags + UserMentionID + URL + MediaURL + C(MediaType) + UserMentionID + C(PossiblySensitive) + C(Language) + TweetLength + Location + Description + UserAccountURL + Protected + FollowersCount + FriendsCount + ListedCount + UserAccountCreatedAt + FavouritesCount + GeoEnabled + StatusesCount + ProfileBackgroundImageURL + ProfileUseBackgroundImage + DefaultProfile + FrequencyOfTweets'

    ### create a regression friendly data frame y gives the classifiers, x gives the features and gives different columns for Categorical data depending on variables.
    y, x = dmatrices(formula, data=finalDataFrame, return_type='matrix')

    ## select which features we would like to analyze
    X = numpy.asarray(x)

2 Answers:

Answer 0 (score: 1)

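I have found that this error sometimes occurs when np.isnan is called on an array that contains strings or other non-float values. Try converting the arrays with arr.astype(float) before passing them to dmatrices.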
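A minimal sketch of that failure and the conversion, assuming the offending array has object dtype. The values in `arr` are made up for illustration and are not taken from the question's data.

```python
import numpy as np

# Hypothetical data: numeric values stored in an object-dtype array,
# which is what a mixed-type DataFrame column often turns into.
arr = np.array([1.5, 2.0, float("nan")], dtype=object)

try:
    np.isnan(arr)          # no isnan loop exists for object dtype
    raised = False
except TypeError:
    raised = True          # same TypeError as in the traceback

# Converting to float first gives np.isnan a dtype it supports.
as_float = arr.astype(float)
mask = np.isnan(as_float)
```

Note that astype(float) only works when every element can actually be converted; a column of free text or timestamps will raise instead, which at least points to the column that needs different handling.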

Also, your FrequencyOfTweets column is being set to all False or all True, because np.all returns a scalar.
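To illustrate the difference, here is a small sketch with a made-up stand-in for finalDataFrame. It assumes the intent was to keep only rows whose frequency is finite; the question does not say so explicitly.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for finalDataFrame.
df = pd.DataFrame({"FrequencyOfTweets": [0.5, np.inf, 2.0, np.nan]})

# np.all collapses the whole column to a single boolean; assigning it
# back would overwrite every row with that one value.
collapsed = np.all(np.isfinite(df["FrequencyOfTweets"]))

# Without np.all, np.isfinite yields one boolean per row, which can be
# used as a mask to drop the non-finite rows (assumed intent).
finite_rows = np.isfinite(df["FrequencyOfTweets"])
cleaned = df[finite_rows]
```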

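Answer 1 (score: 0)

After a lot of digging through the code, I found that the problem was in the formula I was passing, which tells the program to use all of the features below. The 'UserAccountCreatedAt' column has dtype datetime64[ns]. For now I have removed it from the formula and the error is gone, but I would like to know the best way to convert it to numeric data so that I can actually pass it in. The cause is that categorical data is handled by wrapping certain columns in C(), as shown below, whereas patsy treats the datetime column as numeric.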

    formula = 'Classifier ~ UserAccountCreatedAt + InReplyToStatusID + InReplyToUserID + RetweetCount + FavouriteCount + Hashtags + UserMentionID + URL + MediaURL + C(MediaType) + UserMentionID + C(PossiblySensitive) + C(Language) + TweetLength + Location + Description + UserAccountURL + Protected + FollowersCount + FriendsCount + ListedCount + FavouritesCount + GeoEnabled + StatusesCount + ProfileBackgroundImageURL + ProfileUseBackgroundImage + DefaultProfile + FrequencyOfTweets'
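One possible way to make the datetime column numeric is to replace it with the account age in days relative to a fixed reference date. This is only a sketch: the `AccountAgeDays` column name, the sample dates, and the reference date are all assumptions for illustration, and other encodings (for example a Unix timestamp) may suit the model better.

```python
import pandas as pd

# Hypothetical stand-in for the UserAccountCreatedAt column.
df = pd.DataFrame(
    {"UserAccountCreatedAt": pd.to_datetime(["2014-01-01", "2014-01-10"])}
)

# Assumed reference date; any fixed timestamp would do.
reference = pd.Timestamp("2014-01-11")

# Subtracting gives a timedelta column; .dt.days turns it into integers.
df["AccountAgeDays"] = (reference - df["UserAccountCreatedAt"]).dt.days
```

The formula could then use the hypothetical AccountAgeDays term in place of UserAccountCreatedAt.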