目标变量是字符/字符串时的操作

时间:2018-05-08 11:51:57

标签: python-3.x

我想训练一个分类器,我的目标变量有300个唯一值,其类型是字符/字符串

是否有一个带有pandas的自动化流程,可以自动将每个字符串转换为数字?

非常感谢

2 个答案:

答案 0 :(得分:0)

您可以使用pandas factorize方法将字符串转换为数字。 numpy.unique也可以使用,但速度会相对较慢。

答案 1 :(得分:0)

对于数据集使用pandas

import pandas as pd
# Use numpy to convert to arrays
import numpy as np
# Using Skicit-learn to split data into training and testing sets
from sklearn.model_selection import train_test_split
# Import the model we are using
#from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier
import re
from sklearn.feature_extraction.text import CountVectorizer

from sklearn.preprocessing import OneHotEncoder
from sklearn import datasets
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

## Import tools needed for visualization
#from sklearn.tree import export_graphviz
#import pydot
model = RandomForestClassifier(n_jobs=2, random_state=0)
vectorizer=CountVectorizer()
features = pd.read_csv('C:\\randomforest\\randomforest\\training_data.csv')

#features['Actual_input'] = features['Actual_input'].str.replace(r'[^a-zA-Z\s]+', ' ').astype('str')

#print(features.head(5))
#count of ROWS x COLUMNS
print('The shape of our features is:', features.shape)
# Descriptive statistics for each column
#enc = OneHotEncoder()
#features= enc.fit(features)
# One-hot encode the data using pandas get_dummies
#features = pd.get_dummies(features)


# Display the first 5 rows of Column number 5 onwards
#print(features.iloc[:,5:].head(5))

# Labels are the values we want to predict
labels = features['target']
# Remove the labels from the features
# axis 1 refers to the columns
features= features.drop('target', axis = 1)
# Saving feature names for later use
feature_list = list(features.columns)
# Convert to numpy array
#features = np.array(features)


# Split the data into training and testing sets
train_features, test_features, train_labels, test_labels = 
train_test_split(features['Actual_input'], labels, random_state = 0)

print('Training Features Shape:', train_features.shape)
print('Training Labels Shape:', train_labels.shape)
print('Testing Features Shape:', test_features.shape)
print('Testing Labels Shape:', test_labels.shape)




X_train_counts = vectorizer.fit_transform(features['Actual_input'])
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
rf = model.fit(X_train_tfidf, labels)
#Y_train = vectorizer.fit_transform(train_labels).toarray()
# The baseline predictions are the historical averages

# Baseline errors, and display average baseline error
#baseline_errors = abs(baseline_preds - test_labels)
#print('Average baseline error: ', round(np.mean(baseline_errors), 2))

# Instantiate model with 1000 decision trees
#rf = RandomForestClassifier()
print("\t\t\t\t\t\t\t",rf)

# Train the model on training data

#$rf.fit(X_train,train_labels );
print (rf.score(X_train_tfidf, labels ))
# Use the forest's predict method on the test data
X_train_tokens = vectorizer.get_feature_names()
print(X_train_tokens[-50:])
test_features_data = 
pd.read_csv('C:\\randomforest\\randomforest\\test_data.csv')
predictions = 
rf.predict(vectorizer.transform(test_features_data['Actual_input']))
pred_data = pd.DataFrame(predictions, columns=["prediction"])
print(pred_data)
pred_output = pd.DataFrame(test_features_data, columns=feature_list)
pred_output["prediction"] = pred_data["prediction"].apply(np.round)
print(pred_output)
pred_output.to_csv("output.csv")