Logistic Regression: Multi-Class Classification with Categorical Variables

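I am currently working with a data frame that has both categorical and continuous features.

I want to run a logistic regression to predict a target value. In this case the target is race, which can be "A", "W", "B", "H", "N", or "O", standing for "Asian", "White", "Black", "Hispanic", "Native American", or "Other".

I converted all of the features (everything except the "race" column) into dummy variables in a new data frame called "dummies". To train the model, I use the following code: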

from sklearn import linear_model, metrics
from sklearn.model_selection import train_test_split

X = dummies.drop("race", axis=1)
y = dummies["race"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)


from sklearn.linear_model import LogisticRegression

logmodel = LogisticRegression()
logmodel.fit(X_train, y_train)

predictions = logmodel.predict(X_test)

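I don't get any errors, but when I look at the classification report I get a perfect score of 1.00 for precision, recall, and f1-score. That seems a bit too good to be true... What am I doing wrong?

Here is the code I used to create the dummies: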

dummies = pd.get_dummies(df[["date", "armed", "age", "gender", "city", "state", "signs_of_mental_illness", "threat_level", "flee", "body_camera", "total_population"]], drop_first=True)
dummies = pd.concat([df, dummies], axis=1)

dummies.drop(df[["date", "armed", "age", "gender", "city", "state", "signs_of_mental_illness", "threat_level", "flee", "body_camera", "total_population"]], axis=1, inplace=True)

2 Answers

You should use LabelEncoder to convert categorical features to numbers: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
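As a minimal sketch of what LabelEncoder does, the snippet below applies it to a made-up list of the race codes from the question (the sample values are an assumption, not the asker's data). Note that the scikit-learn documentation describes LabelEncoder as intended for target labels (y) rather than input features, so it is shown here on the target.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical target values using the race codes from the question.
race = ["W", "B", "A", "H", "W", "O", "N"]

encoder = LabelEncoder()
# fit_transform learns the distinct labels (sorted) and returns one integer per row.
codes = encoder.fit_transform(race)

# classes_ holds the original labels; position i corresponds to integer code i.
labels = list(encoder.classes_)

# inverse_transform maps integer codes (e.g. model predictions) back to labels.
decoded = list(encoder.inverse_transform(codes))
```

The integer codes imply an ordering that the labels do not have, which is harmless for a classification target but can mislead a linear model if applied to input features.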

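Right now you are effectively putting the target data (albeit in a different form) into both the training and test data, which is why you get a perfect score: the model only has to convert the dummy columns back into a single column. Of course it is 100% accurate.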
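One way to check for this kind of leakage is to look for feature columns that exactly mirror one class of the target. The sketch below uses a small made-up frame and a hypothetical helper, find_leaky_columns; neither comes from the question, and the check only catches exact one-hot copies of the target.

```python
import pandas as pd

def find_leaky_columns(X, y):
    """Return names of feature columns that exactly mirror one target class."""
    leaky = []
    for col in X.columns:
        is_one = (X[col] == 1).to_numpy()
        for cls in y.unique():
            # A column leaks if it is 1 exactly on the rows belonging to this class.
            if (is_one == (y == cls).to_numpy()).all():
                leaky.append(col)
                break
    return leaky

# Hypothetical toy data; column names follow the question.
df = pd.DataFrame({
    "race": ["A", "W", "B", "W", "A", "B"],
    "gender": ["M", "F", "M", "M", "F", "F"],
    "age": [23, 35, 41, 29, 52, 37],
})

# Encoding the whole frame also encodes the target column into race_* dummies.
encoded = pd.get_dummies(df)
y = df["race"]

leaky = find_leaky_columns(encoded, y)
```

If the helper returns any columns, dropping them from X before calling train_test_split should bring the scores back to a realistic range.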

Also, see: Multi-Class Logistic Regression in SciKit Learn

Answered 2024-05-03T06:16:51+08:00

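The reason you are getting a perfect classification score of 1.0 is that you are treating numerical data as categorical data. When you use pandas.get_dummies on all columns of the data frame, you are essentially converting all of the dates, ages, etc. (i.e. numerical data) into dummy indicator variables, which is incorrect. In doing so, you create a dummy variable for every age in the data set. For your small data set that is feasible, but in a real-world scenario ages from 1 to 100 would give you up to 100 separate dummy columns! The description of pandas.get_dummies reads:

Convert categorical variable into dummy/indicator variables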

This is an incorrect way to use categoricals. I suggest you convert only the categorical variables with pandas.get_dummies() and then verify your results. As for why you get 100% accuracy: it's because you are able to account for all possible scenarios by converting even the numerical columns into categorical types using this incorrect technique (since your data set is small, this technique won't hurt much, but for real-world scenarios it is incorrect).
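The suggestion above can be sketched as follows, assuming a small made-up frame whose column names follow the question. The target is split off first, and the columns parameter of pandas.get_dummies restricts encoding to the columns listed in categorical_cols (a hypothetical name), so age stays numeric.

```python
import pandas as pd

# Hypothetical toy data; column names follow the question.
df = pd.DataFrame({
    "race": ["A", "W", "B", "W"],
    "gender": ["M", "F", "M", "M"],
    "flee": ["Car", "Not fleeing", "Foot", "Car"],
    "age": [23, 35, 41, 29],
})

# Separate the target before encoding so it cannot end up among the features.
y = df["race"]
features = df.drop(columns="race")

# Encode only the columns that are genuinely categorical.
categorical_cols = ["gender", "flee"]
X = pd.get_dummies(features, columns=categorical_cols, drop_first=True)
```

Inspecting X.columns and X.dtypes after this step is a quick way to verify which columns were turned into indicators and which were left as numbers.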

If you want to look at some other ways of encoding data, check out this link.
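As one example of an alternative encoding route (not necessarily the one the link describes), scikit-learn's ColumnTransformer can one-hot encode the categorical columns and scale the numeric ones inside a single pipeline. This is a minimal sketch on made-up data; the column names follow the question, and the choice of StandardScaler and max_iter=1000 is an assumption.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; column names follow the question.
df = pd.DataFrame({
    "race": ["A", "W", "B", "W", "A", "B", "W", "A"],
    "gender": ["M", "F", "M", "M", "F", "F", "M", "F"],
    "flee": ["Car", "Foot", "Foot", "Car", "Not fleeing", "Car", "Foot", "Car"],
    "age": [23, 35, 41, 29, 52, 37, 44, 31],
})

X = df.drop(columns="race")  # the target never enters the feature matrix
y = df["race"]

# One-hot encode categorical columns, standardize numeric ones.
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["gender", "flee"]),
    ("num", StandardScaler(), ["age"]),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

model.fit(X, y)
predictions = model.predict(X)
```

Keeping the encoding inside the pipeline means the same transformation is applied to training and test data, and the encoder is fitted only on whatever is passed to fit.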

Your data contains numerical columns too. Account for that; only then will you get correct results.

Answered 2024-05-03T06:16:51+08:00
