Regression on a large dataset: why does the accuracy drop?

Asked: 2018-07-16 17:01:17

Tags: matplotlib machine-learning regression linear-regression non-linear-regression

I am trying to predict the number of views of OLX ads. I wrote a scraper to collect all the data (about 50,000 ads). When I ran a linear regression on 1,400 samples I got 66% accuracy, but when I then ran it on 52,000 samples it dropped to 8%. Below are the ImgCount vs Views and Price vs Views plots.

Is there a problem with my data, or how should I run a regression on it? I know this data is highly polarized.

I would like to know why my accuracy drops when I use the larger dataset.

Thanks for your help.

Code:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

url = '/home/msz/olx/olx/with_images.csv'

df = pd.read_csv(url, index_col='url')


# Strip currency formatting; regex=False so '.' is treated as a literal dot,
# not as the regex wildcard (which would delete every character)
df['price'] = df['price'].str.replace('.', '', regex=False)
df['price'] = df['price'].str.replace(',', '', regex=False)
df['price'] = df['price'].str.replace('Rs', '', regex=False)
df['price'] = df['price'].astype(int)


# Normalize separators and whitespace in the ad text
df['text'] = df['text'].str.replace(',', ' ', regex=False)
df['text'] = df['text'].str.replace('\t', '', regex=False)
df['text'] = df['text'].str.replace('\n', '', regex=False)

X = df[['price', 'img']]
y = df['views'] 

print ("X is like ",  X.shape)
print ("Y is like ",  y.shape)

df.plot(y='views', x='img', style='x')  
plt.title('ImgCount vs Views')  
plt.xlabel('ImgCount')  
plt.ylabel('Views')  
plt.show()

df.plot(y='views', x='price', style='x')  
plt.title('Price vs Views')  
plt.xlabel('Price')  
plt.ylabel('Views')  
plt.show()



X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.451, random_state=0)

from sklearn.linear_model import LinearRegression  
regressor = LinearRegression() 
regressor.fit(X_train, y_train) 

score = regressor.score(X_test, y_test)  # R^2 on the test set, not classification accuracy

print('Accuracy is : ', score * 100)
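A note on the metric: for a regressor, `score` returns the coefficient of determination R², which is not a percentage of correct predictions and can even be negative, so the "66% vs 8%" comparison is about how much variance the line explains, not accuracy. A minimal sketch on hypothetical heavy-tailed data (all names and numbers invented, standing in for the skewed view counts) that prints R² alongside a scale-aware metric like MAE:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Hypothetical heavy-tailed target, a stand-in for the skewed view counts
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(1000, 2))
y = np.exp(0.3 * X[:, 0]) + rng.exponential(5.0, size=1000)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
reg = LinearRegression().fit(X_train, y_train)

r2 = reg.score(X_test, y_test)   # same number the question multiplies by 100
mae = mean_absolute_error(y_test, reg.predict(X_test))
print(f"R^2 = {r2:.3f}, MAE = {mae:.2f}")
```

On data like this, MAE stays interpretable ("off by N views on average") even when R² swings wildly between subsets.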

1 Answer:

Answer 0 (score: 0):

Linear regression is the most basic algorithm and works mainly on linearly related data. If you have a large, non-linear dataset, you should try another algorithm, such as k-nearest neighbours or a decision tree. Personally I prefer a Naive Bayes classifier, among others, though note that it predicts classes, so you would need to bucket the view counts into ranges first.
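The point above can be illustrated on synthetic data: when the relationship is non-linear, linear regression's R² collapses while k-nearest neighbours and a decision tree still fit well. A minimal sketch (the data and hyperparameters are invented for illustration, not tuned for the OLX data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Hypothetical non-linear data: the target has no linear trend in the feature
rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(2000, 1))
y = X[:, 0] ** 2 + rng.normal(0, 0.3, size=2000)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scores = {}
for model in (LinearRegression(),
              KNeighborsRegressor(n_neighbors=10),
              DecisionTreeRegressor(max_depth=6, random_state=0)):
    scores[type(model).__name__] = model.fit(X_train, y_train).score(X_test, y_test)

for name, s in scores.items():
    print(f"{name}: R^2 = {s:.3f}")
```

Here the linear model scores near zero while both non-linear models score close to 1, which mirrors the kind of drop seen when a linear fit is applied to a large non-linear dataset.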