After my imports, the first line of my script is a print, but it takes 10 seconds before that line prints. I'm fairly sure `use Bio::EnsEMBL::Registry;` is the culprit here, because I can take out all the hard-coded references and it still loads slowly (the rest are all very standard imports).
use strict;
use warnings;
use lib "$ENV{HOME}/Ensembl/src/bioperl-1.6.1";
use lib "$ENV{HOME}/Ensembl/src/ensembl/modules";
use lib "$ENV{HOME}/Ensembl/src/ensembl-compara/modules";
use lib "$ENV{HOME}/Ensembl/src/ensembl-variation/modules";
use lib "$ENV{HOME}/Ensembl/src/ensembl-funcgen/modules";
use Bio::EnsEMBL::Registry;
use Data::Dumper;
use Switch;
use DBI;
#****CONNECT TO ENSEMBL****
print"Establishing connection...";
Is there anything I can do to speed up the time it takes to run this script? Thanks
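One quick way to confirm that the import alone is the bottleneck (a sketch, not from the original post) is to time a single module load in isolation. `Data::Dumper` stands in for `Bio::EnsEMBL::Registry` here, since the latter needs a full Ensembl checkout; swap the `require` line to test the real module:

```perl
#!/usr/bin/env perl
# Sketch: time one module load in isolation to confirm it is the slow part.
use strict;
use warnings;
use Time::HiRes qw(time);

my $t0 = time();
require Data::Dumper;   # e.g. require Bio::EnsEMBL::Registry;
printf "module load took %.3f seconds\n", time() - $t0;
```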
Answer 0 (score: 5)
The answer probably depends on what comes after `print"Establishing connection...";`. If a network connection is involved, that could explain the lag.
Also, you don't have a newline (`"\n"`) at the end of your message, so it may be buffered, and the moment it shows up on screen may not correspond to when that statement was reached.
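To rule out buffering as a red herring, you can force STDOUT to flush immediately; a minimal sketch using `$|`, Perl's standard autoflush variable (the sample strings are placeholders):

```perl
#!/usr/bin/env perl
# Sketch: disable output buffering so a partial line appears immediately.
use strict;
use warnings;

$| = 1;                              # autoflush STDOUT
print "Establishing connection...";  # now shown even without a trailing \n
print " done\n";
```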
So you need to actually measure how long the `use Bio::EnsEMBL::Registry;` line takes to complete, and how long the "establishing connection" part takes. This is where the `$^T` variable comes in handy (see `perldoc -v '$^T'`).
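A minimal sketch of using `$^T` together with a `BEGIN` block to separate compile-time cost (the `use` lines) from run-time cost; the timestamp variable is my addition, not from the answer, and the resolution is whole seconds since `$^T` is an integer epoch time:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# $^T holds the time the script started (integer epoch seconds).
# A BEGIN block runs during the compile phase, immediately after the
# use statements above it, so the timestamps below bracket the cost
# of loading modules (coarse, whole-second resolution).

# use Bio::EnsEMBL::Registry;   # the suspect import would sit here

my $compile_done;
BEGIN { $compile_done = time() }

print "compile phase took ", $compile_done - $^T, " seconds\n";
print "elapsed since start: ", time() - $^T, " seconds\n";
```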
Of course, switching to a decent logging system might be useful as well.
If the database connection really does take that long, the first thing to check would be whether DNS is involved.
Answer 1 (score: 3)
Check the environment variable ENSEMBL_REGISTRY and the contents of the file /usr/local/share/ensembl_registry.conf: they can specify which databases to connect to and the addresses from which additional databases are loaded.
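A small sketch of checking which registry configuration, if any, is in effect; the fallback path is the one mentioned in the answer, and whether it applies on your system is an assumption:

```perl
#!/usr/bin/env perl
# Sketch: report which registry configuration file, if any, is in effect.
use strict;
use warnings;

my $conf = $ENV{ENSEMBL_REGISTRY} // '/usr/local/share/ensembl_registry.conf';
if (-e $conf) {
    print "Registry config in effect: $conf\n";
} else {
    print "No registry config found at $conf; built-in defaults apply.\n";
}
```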