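I am trying to predict the number of views for OLX ads. I wrote a scraper to collect all the ads (about 50,000). When I ran linear regression on 1,400 samples I got 66% accuracy, but when I then ran it on 52,000 samples it dropped to 8%. Below are the stats for ImgCount vs Views and Price vs Views.
Is something wrong with my data? Or how should I do regression on it? I know this data is polarized.
I would like to know why my accuracy dropped when I used the larger dataset.
Thanks for your help.
Code: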
# Only the imports the script actually uses; duplicates and unused ones are dropped
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
url = '/home/msz/olx/olx/with_images.csv'
df = pd.read_csv(url, index_col='url')
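# Strip thousands separators and the currency prefix as literal strings.
# regex=False matters: read as a regular expression, '.' matches every character.
df['price'] = df['price'].str.replace('.', '', regex=False)
df['price'] = df['price'].str.replace(',', '', regex=False)
df['price'] = df['price'].str.replace('Rs', '', regex=False)
df['price'] = df['price'].astype(int)
# Flatten the ad text: commas become spaces, tabs and newlines are removed.
df['text'] = df['text'].str.replace(',', ' ', regex=False)
df['text'] = df['text'].str.replace('\t', '', regex=False)
df['text'] = df['text'].str.replace('\n', '', regex=False)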
X = df[['price', 'img']]
y = df['views']
print("X shape:", X.shape)
print("y shape:", y.shape)
df.plot(y='views', x='img', style='x')
plt.title('ImgCount vs Views')
plt.xlabel('ImgCount')
plt.ylabel('Views')
plt.show()
df.plot(y='views', x='price', style='x')
plt.title('Price vs Views')
plt.xlabel('Price')
plt.ylabel('Views')
plt.show()
# Hold out 45.1% of the rows for testing (train_test_split is already imported above)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.451, random_state=0)
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
# For a regressor, score() returns the coefficient of determination (R^2), not an accuracy
r2 = regressor.score(X_test, y_test)
print('R^2 on the test set:', r2)
Answer 0 (score: 0)
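Linear regression is the most basic algorithm and mostly works well on datasets with a linear relationship. If you have a large, nonlinear dataset, you need to use a different algorithm, such as k-nearest neighbors or perhaps a decision tree. That said, I prefer the naive Bayes classifier, among others.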
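As a rough illustration of that point, here is a minimal sketch that fits LinearRegression, KNeighborsRegressor and DecisionTreeRegressor on the same data and compares their test R^2. The data are synthetic and made up for the example (a hypothetical `price` column with a sinusoidal effect on `views`), not the OLX ads, and the default hyperparameters are an assumption, so the printed numbers say nothing about how these models would do on the real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the ads table (assumption: NOT the real OLX data)
rng = np.random.default_rng(0)
n = 2000
price = rng.uniform(0, 100, n)
img = rng.integers(0, 10, n)
# Hypothetical target: strongly nonlinear in price, linear in image count, plus noise
views = 100 * np.sin(price / 10) + 10 * img + rng.normal(0, 5, n)

X = np.column_stack([price, img])
X_train, X_test, y_train, y_test = train_test_split(
    X, views, test_size=0.3, random_state=0
)

models = {
    "linear": LinearRegression(),
    "knn": KNeighborsRegressor(n_neighbors=5),
    "tree": DecisionTreeRegressor(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # score() is R^2 on the held-out rows for all three regressors
    scores[name] = model.score(X_test, y_test)
    print(f"{name}: R^2 = {scores[name]:.3f}")
```

On the real data it would probably be worth tuning `n_neighbors` and the tree depth with cross-validation before drawing any conclusion.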