Question

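I am trying to run the 20 newsgroups classification demo. I downloaded the .py file from here (http://scikit-learn.org/stable/auto_examples/text/document_classification_20newsgroups.html#sphx-glr-auto-examples-text-document-classification-20newsgroups-py) and ran the Python code as usual, but I get an error indicating a network connection timeout. I am a bit confused, because I can download the data file from the provided URL (https://ndownloader.figshare.com/files/5975967) without any problem. Does anyone know how to fix this? Is there any way I can use the manually downloaded data file instead?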

Environment: Python 3.6, Anaconda 5.0.1

Answer 1

Quoting the scikit-learn docs:

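The sklearn.datasets.fetch_20newsgroups function is a data fetching / caching function that downloads the data archive from the original 20 newsgroups website, extracts the archive contents in the ~/scikit_learn_data/20news_home folder and calls sklearn.datasets.load_files on either the training or testing set folder, or both of them.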

So you can simply extract the manually downloaded archive into that folder and use it from there.
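
As a minimal sketch of the manual route, the snippet below unpacks a hand-downloaded archive into the folder named in the quote above. The function name extract_20news_archive and the archive file name 20news-bydate.tar.gz are assumptions for illustration, not part of scikit-learn; the default location simply mirrors the ~/scikit_learn_data/20news_home path from the docs. Whether fetch_20newsgroups then skips its own download may depend on the installed scikit-learn version.

```python
import os
import tarfile


def extract_20news_archive(archive_path, data_home=None):
    """Unpack a manually downloaded .tar.gz archive into <data_home>/20news_home.

    Hypothetical helper, not a scikit-learn function.
    """
    if data_home is None:
        # Assumed default, taken from the path quoted in the scikit-learn docs.
        data_home = os.path.join(os.path.expanduser("~"), "scikit_learn_data")

    target_dir = os.path.join(data_home, "20news_home")
    os.makedirs(target_dir, exist_ok=True)

    # The archive is assumed to be a gzip-compressed tarball.
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(path=target_dir)

    return target_dir
```

For example, extract_20news_archive('/path/to/20news-bydate.tar.gz', data_home='/path/to/data') would place the files under /path/to/data/20news_home, which is the layout a data_home='/path/to/data' argument would point at.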

Alternatively, you can specify the data folder yourself by passing data_home='/path/to/data' when calling the fetch_20newsgroups function. Change the function calls to:

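from sklearn.datasets import fetch_20newsgroups

# `categories` and `remove` are the variables already defined in the demo script.
# data_home points at the folder that holds the manually downloaded data.
data_train = fetch_20newsgroups(data_home='/path/to/data',
                                subset='train', categories=categories,
                                shuffle=True, random_state=42,
                                remove=remove)

data_test = fetch_20newsgroups(data_home='/path/to/data',
                               subset='test', categories=categories,
                               shuffle=True, random_state=42,
                               remove=remove)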
