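First of all, I'm new to machine learning.

I'm trying to predict the price of used cars. Each car has a make and a model, so I used MultiLabelBinarizer to build a sparse matrix for the categorical attributes. Here is the code: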

from sklearn.preprocessing import MultiLabelBinarizer
encoder = MultiLabelBinarizer()
make_cat_1hot = encoder.fit_transform(make_cat)
model_cat_1hot = encoder.fit_transform(model_cat)
type_cat_1hot = encoder.fit_transform(type_cat)

print(type(make_cat_1hot))
carInfoModHot = carsInfoMod.copy()
carInfoModHot["makeHot"] = make_cat_1hot.tolist()
carInfoModHot["modelHot"] = model_cat_1hot.tolist()
carInfoModHot["typeHot"] = type_cat_1hot.tolist()



doors   km      make        year    makeHot                         modelHot
5.0     78779   Mercedes    2012    [0, 0, 0, 0, 1, 0, 0, 0, ...    [1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, ...
5.0     25463   Bmw         2015    [0, 1, 0, 0, 0, 0, 0, ...       [1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, ...

Then I use it to fit a linear regression, make predictions, and compute the mean squared error:

import numpy as np
from sklearn import linear_model
from sklearn.metrics import mean_squared_error

lr = linear_model.LinearRegression()

carsInfoTrainHot = carInfoModHot.drop(["price"], axis=1) # drop labels for training set

df1 = carsInfoTrainHot.iloc[:30000, :]
carsLabels1 = carsInfoMod.iloc[:30000, 3]
print(carsInfoTrainHot.head())
df2 = carsInfoTrainHot.iloc[30001:60000, :]
carsLabels2 = carsInfoMod.iloc[30001:60000, 3]
df3 = carsInfoTrainHot.iloc[60001:, :]
carsLabels3 = carsInfoMod.iloc[60001:, 3]

lr.fit(df1, carsLabels1) 
print(carsInfoTrainHot.shape)
carPrediction = lr.predict(df2)

lin_mse = mean_squared_error(carsLabels2, carPrediction)

lin_rmse = np.sqrt(lin_mse)

But I get this error:

ValueError                                Traceback (most recent call last)
in ()
     12 carsLabels3 = carsInfoMod.iloc[60001:, 3]
     13
---> 14 lr.fit(df1, carsLabels1)
     15 print(carsInfoTrainHot.shape)
     16 carPrediction = lr.predict(df2)

/home/vagrant/anaconda3/lib/python3.6/site-packages/sklearn/linear_model/base.py in fit(self, X, y, sample_weight)
    510         n_jobs_ = self.n_jobs
    511         X, y = check_X_y(X, y, accept_sparse=['csr', 'csc', 'coo'],
--> 512                          y_numeric=True, multi_output=True)
    513
    514         if sample_weight is not None and np.atleast_1d(sample_weight).ndim > 1:

/home/vagrant/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)
    519     X = check_array(X, accept_sparse, dtype, order, copy, force_all_finite,
    520                     ensure_2d, allow_nd, ensure_min_samples,
--> 521                     ensure_min_features, warn_on_dtype, estimator)
    522     if multi_output:
    523         y = check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    400         # make sure we actually converted to numeric:
    401         if dtype_numeric and array.dtype.kind == "O":
--> 402             array = array.astype(np.float64)
    403         if not allow_nd and array.ndim >= 3:
    404             raise ValueError("Found array with dim %d. %s expected <= 2."

ValueError: setting an array element with a sequence.

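As far as I understand, the problem is that I'm inserting an array into each cell of the categorical columns. But how else can I turn the categorical values into a sparse matrix?

Thanks.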
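To check that understanding, here is a minimal sketch on toy data (not my real dataset). The names `km`, `make_hot`, `nested` and `flat` are made up for illustration, and I'm assuming the object array below resembles what scikit-learn sees when a DataFrame column holds lists. Casting it to float fails in the same place as the traceback, while stacking the one-hot columns next to the numeric ones gives a plain 2-D numeric array. I'm not sure this is the idiomatic fix, only that it seems to avoid the error.

```python
import numpy as np

# Hypothetical toy data: one numeric feature and one one-hot vector per car.
km = np.array([78779.0, 25463.0])
make_hot = np.array([[0, 0, 1],
                     [0, 1, 0]])

# Mimic the DataFrame layout: each one-hot vector stored as a list in ONE cell.
nested = np.empty((2, 2), dtype=object)
nested[:, 0] = km
for i, row in enumerate(make_hot.tolist()):
    nested[i, 1] = row  # a whole list inside a single cell

# This is the cast that check_array attempts on object arrays.
try:
    nested.astype(np.float64)
    cast_error = None
except ValueError as exc:
    cast_error = exc
print("cast failed:", cast_error)

# Alternative: give every one-hot position its own column.
flat = np.hstack([km.reshape(-1, 1), make_hot])
print(flat.shape, flat.dtype)
```

If this is right, the question becomes how to get the encoded columns into that flat layout for the whole DataFrame.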