Question

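I am trying to fill a DataSet from a DataTable using nested foreach loops (over rows, then columns).

The logged elapsed time keeps growing on every iteration.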

foreach (DataRow row in dt.Rows)
{
    stopwatch.Start();
    ds.Namespace = "http://www.schema.co";
    stdTable = new DataTable("param");

    col1 = new DataColumn("key");
    col2 = new DataColumn("value");
    stdTable.Columns.Add(col1);
    stdTable.Columns.Add(col2);
    ds.Tables.Add(stdTable);

    foreach (DataColumn col in dt.Columns)
    {
        stopwatchcolumn.Start();
        DataRow newRow;
        newRow = stdTable.NewRow();
        newRow["key"] = col.ColumnName;
        if (col.DataType == typeof(DateTime))
        {
            newRow["value"] = DateTime.Parse(row[col].ToString()).ToString("dd/MM/yyyy HH:mm:ss");
        }
        else
        {
            newRow["value"] = row[col].ToString();
        }
        stdTable.Rows.Add(newRow);
        newRow = stdTable.NewRow();
        stopwatchcolumn.Stop();
        mainLog.WrLInfo("ELAPSED COLUMNS", stopwatchcolumn.Elapsed.ToString());
    }
    stopwatch.Stop();
    mainLog.WrLInfo("ELAPSED", stopwatch.Elapsed.ToString());
    ds.AcceptChanges();
}

[ELAPSED COLUMNS] 00:00:00.0011310
2019-08-20 12:50:05 INFO:
[ELAPSED COLUMNS] 00:00:00.0011394
2019-08-20 12:50:05 INFO:
[ELAPSED COLUMNS] 00:00:00.0011510
2019-08-20 12:50:05 INFO:
[ELAPSED COLUMNS] 00:00:00.0011608
2019-08-20 12:50:05 INFO:
[ELAPSED COLUMNS] 00:00:00.0011701
2019-08-20 12:50:05 INFO:
[ELAPSED COLUMNS] 00:00:00.0011789
2019-08-20 12:50:05 INFO:
[ELAPSED COLUMNS] 00:00:00.0011910
2019-08-20 12:50:05 INFO:
[ELAPSED COLUMNS] 00:00:00.0011999
2019-08-20 12:50:05 INFO:
[ELAPSED COLUMNS] 00:00:00.0012306
2019-08-20 12:50:05 INFO:
[ELAPSED COLUMNS] 00:00:00.0012399
2019-08-20 12:50:05 INFO:
[ELAPSED COLUMNS] 00:00:00.0012492
2019-08-20 12:50:05 INFO:
[ELAPSED COLUMNS] 00:00:00.0012604
2019-08-20 12:50:05 INFO:
[ELAPSED COLUMNS] 00:00:00.0012697
2019-08-20 12:50:05 INFO:
[ELAPSED COLUMNS] 00:00:00.0012786
2019-08-20 12:50:05 INFO:
[ELAPSED COLUMNS] 00:00:00.0012888
2019-08-20 12:50:05 INFO:
[ELAPSED COLUMNS] 00:00:00.0013158

Answer 1

The line in your code that logs the measured time is wrong:

mainLog.WrLInfo("ELAPSED COLUMNS", stopwatch.Elapsed.ToString());

You should use stopwatchcolumn there instead of stopwatch, so just correct it to:

mainLog.WrLInfo("ELAPSED COLUMNS", stopwatchcolumn.Elapsed.ToString());

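Also call the Reset method after stopping each stopwatch. Start() resumes from the time already accumulated, so without a reset Elapsed is a running total, which is why the logged value grows on every pass.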
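
To illustrate the point about resetting, here is a minimal, self-contained sketch that compares a stopwatch that is only ever started and stopped with one that is restarted on each pass. The class name StopwatchDemo and the Thread.Sleep call (a stand-in for the real per-column work) are assumptions made for the example; they are not part of the question's code.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

static class StopwatchDemo
{
    static void Main()
    {
        var cumulative = new Stopwatch();    // Start()/Stop() only, never reset
        var perIteration = new Stopwatch();  // restarted on every pass

        for (int i = 0; i < 3; i++)
        {
            cumulative.Start();       // resumes; keeps the time from earlier passes
            perIteration.Restart();   // zeroes the elapsed time, then starts

            Thread.Sleep(20);         // placeholder for the real work being timed

            cumulative.Stop();
            perIteration.Stop();

            Console.WriteLine(
                $"pass {i}: cumulative={cumulative.Elapsed}, this pass={perIteration.Elapsed}");
        }
    }
}
```

The cumulative value increases with every pass while the per-pass value stays roughly constant. Calling Restart() at the top of the loop has the same effect as calling Reset() after Stop() and then Start() again.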

Answer 2

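I'm not sure why you are creating a separate table for every row of the original table. There are much better ways to put all the data into a single table. Here is the code I would use: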

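
The single-table idea described in this answer could look roughly like the following. This is an illustrative sketch rather than the answerer's own code: the helper name ToKeyValueTable, the extra "row" column used to tell source rows apart, the sample data in Main, and the use of InvariantCulture for formatting are all assumptions.

```csharp
using System;
using System.Data;
using System.Globalization;

static class KeyValueExport
{
    // Hypothetical helper: flattens every cell of `source` into ONE key/value table.
    static DataTable ToKeyValueTable(DataTable source)
    {
        var result = new DataTable("param");
        result.Columns.Add("row", typeof(int));     // assumed column: index of the source row
        result.Columns.Add("key", typeof(string));
        result.Columns.Add("value", typeof(string));

        // Turn off notifications, index maintenance and constraints during the bulk load.
        result.BeginLoadData();
        for (int r = 0; r < source.Rows.Count; r++)
        {
            DataRow row = source.Rows[r];
            foreach (DataColumn col in source.Columns)
            {
                object cell = row[col];
                // Format DateTime cells directly instead of round-tripping through a string.
                string text = cell is DateTime stamp
                    ? stamp.ToString("dd/MM/yyyy HH:mm:ss", CultureInfo.InvariantCulture)
                    : Convert.ToString(cell, CultureInfo.InvariantCulture);
                result.Rows.Add(r, col.ColumnName, text);
            }
        }
        result.EndLoadData();
        return result;
    }

    static void Main()
    {
        // Small sample table standing in for the question's `dt`.
        var dt = new DataTable("source");
        dt.Columns.Add("id", typeof(int));
        dt.Columns.Add("created", typeof(DateTime));
        dt.Rows.Add(1, new DateTime(2019, 8, 20, 12, 50, 5));

        var ds = new DataSet { Namespace = "http://www.schema.co" };
        ds.Tables.Add(ToKeyValueTable(dt));  // one table, added once
        ds.AcceptChanges();                  // called once, after loading

        foreach (DataRow r in ds.Tables["param"].Rows)
            Console.WriteLine($"{r["row"]} {r["key"]} = {r["value"]}");
    }
}
```

Compared with the loop in the question, the table and its columns are created once, the namespace is set once, and AcceptChanges runs a single time after all rows are loaded instead of once per source row.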

Fastest way to fill a DataSet from a DataTable
