Unsupervised classification: assigning classes to data [closed]

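I have a set of data from a drill hole. It contains information on different geomechanical properties every 2 meters. I am trying to create geomechanical domains and assign each point to one of them.

I am trying to use random forest classification and am not sure how to relate the proximity matrix (or any other result of the randomForest function) to labels.

My crude code so far is as follows: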

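dh <- read.csv("gt_1_classification.csv", header = TRUE)

# Replace all NA values with 0
dh[is.na(dh)] <- 0
library(randomForest)
# With no response supplied, randomForest runs in unsupervised mode;
# proximity = TRUE is required for the proximity matrix to be returned
dh_rf <- randomForest(dh, importance = TRUE, proximity = TRUE, ntree = 500)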

I would like the classifier to decide the domains on its own.

Any help would be great!

1 Answer

  • 0

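    Hack-R is right: you first need to explore the data with some clustering (unsupervised learning) method. Here is some sample code that uses R's built-in mtcars data as a demonstration: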

    # Info on the data
    ?mtcars
    head(mtcars)
    pairs(mtcars)    # Matrix plot
    
    # Calculate the distance between each pair of rows (each car with its variables)
    # By default, Euclidean distance = sqrt(sum((x_i - y_i)^2))
    ?dist
    d <- dist(mtcars)
    d # Potentially huge matrix
    
    # Use the distance matrix for clustering
    # First we'll try hierarchical clustering
    ?hclust
    hc <- hclust(d)   # named hc so that base R's c() is not masked
    hc
    
    # Plot dendrogram of clusters
    plot(hc)
    
    # We might want to try 3 clusters
    # Specify either k (number of groups) or h (height at which to cut the tree)
    groups3 <- cutree(hc, k = 3)
    # cutree(hc, h = 230) will give same result
    groups3
    # Or we could do several groups at once
    groupsmultiple <- cutree(hc, k = 2:5)
    head(groupsmultiple)
    
    # Draw boxes around clusters
    rect.hclust(hc, k = 2, border = "gray")
    rect.hclust(hc, k = 3, border = "blue")
    rect.hclust(hc, k = 4, border = "green4")
    rect.hclust(hc, k = 5, border = "darkred")
    
    # Alternatively we can try k-means clustering
    ?kmeans
    set.seed(1)   # k-means starts from random centers; fix the seed for repeatable results
    km <- kmeans(mtcars, 5)
    km
    
    # Graph based on k-means
    # install.packages("cluster")   # only needed if the package is not already installed
    library(cluster)
    clusplot(mtcars, # data frame
         km$cluster, # cluster data
         color = TRUE, # color
         lines = 3, # Lines connecting centroids
         labels = 2) # Labels clusters and cases
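
    One caveat about the code above: dist() works on the raw values, so columns with large ranges (disp and hp in mtcars) dominate the Euclidean distance. Below is a minimal sketch of the same hierarchical clustering on standardised columns. The names mtcars_scaled, d_scaled, hc_scaled, groups3_scaled and groups3_raw are introduced here for illustration only, and whether scaling is appropriate depends on your own variables.

    ```r
    # Standardise each column to mean 0 and standard deviation 1
    mtcars_scaled <- scale(mtcars)

    # Distances and hierarchical clustering on the standardised data
    d_scaled <- dist(mtcars_scaled)
    hc_scaled <- hclust(d_scaled)
    groups3_scaled <- cutree(hc_scaled, k = 3)

    # Cross-tabulate against the clustering of the raw data
    # to see how much the grouping changes
    groups3_raw <- cutree(hclust(dist(mtcars)), k = 3)
    table(raw = groups3_raw, scaled = groups3_scaled)
    ```

    If the two groupings differ a lot, that suggests the raw-data clusters mostly reflect the large-valued columns.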
    

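    After running this on your own data, consider which definition of the clusters captures the degree of similarity you are interested in. You can then create a new variable with one "level" per cluster and build a supervised model for it.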
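
    As a rough sketch of that step, the cluster labels can be stored as a factor column alongside the data. The names mtcars_lab and cluster are made up for this example, and k = 3 is an arbitrary choice rather than a recommendation.

    ```r
    # Cluster as before (k = 3 is an arbitrary choice for the demo)
    groups3 <- cutree(hclust(dist(mtcars)), k = 3)

    # Copy the data and add the cluster label as a factor,
    # which gives one level per cluster
    mtcars_lab <- mtcars
    mtcars_lab$cluster <- factor(groups3)

    levels(mtcars_lab$cluster)   # one level per cluster
    table(mtcars_lab$cluster)    # number of cars in each cluster

    # Per-cluster means help judge whether the groups are meaningful
    aggregate(. ~ cluster, data = mtcars_lab, FUN = mean)
    ```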

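    Here is a decision tree example using the same mtcars data. Note that I use mpg as the response here; you will probably want to use your new cluster-based variable instead.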

    # install.packages("rpart")   # only needed if the package is not already installed
    library(rpart)
    ?rpart
    # Grow a regression tree; "anova" is also what rpart picks by default for a numeric response
    tree_mtcars <- rpart(mpg ~ ., method = "anova", data = mtcars)
    
    tree_mtcars
    
    summary(tree_mtcars) # detailed summary of splits
    
    # Get R-squared
    rsq.rpart(tree_mtcars)
    ?rsq.rpart
    
    # plot tree 
    plot(tree_mtcars, uniform = TRUE, main = "Regression Tree for mpg ")
    text(tree_mtcars, use.n = TRUE, all = TRUE, cex = .8)
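
    The tree above is a regression tree for mpg. With a cluster label as the response it becomes a classification tree. The following is only a sketch under assumptions: mtcars_lab, cluster, tree_cluster and pred are illustrative names, the labels come from hierarchical clustering with an arbitrary k = 3, and minsplit is lowered only because mtcars has 32 rows.

    ```r
    library(rpart)

    # Illustrative cluster labels from hierarchical clustering (k = 3)
    mtcars_lab <- mtcars
    mtcars_lab$cluster <- factor(cutree(hclust(dist(mtcars)), k = 3))

    # Classification tree: the response is a factor, so method = "class"
    tree_cluster <- rpart(cluster ~ ., data = mtcars_lab, method = "class",
                          control = rpart.control(minsplit = 5))
    tree_cluster

    # Compare fitted classes with the cluster labels (training data only,
    # so this is not an estimate of predictive accuracy)
    pred <- predict(tree_cluster, type = "class")
    table(cluster = mtcars_lab$cluster, predicted = pred)
    ```

    The splits show which variables separate the clusters, which can help when describing each domain in plain terms.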
    

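    Note that basic decision trees, although very informative, are often not well suited to prediction. If you need predictions, you should also explore other models.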
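
    Since the question asks how to relate the randomForest proximity matrix to labels, here is a minimal sketch of one common approach, not a definitive recipe. It runs randomForest without a response (unsupervised mode), turns the proximity matrix into a dissimilarity, and clusters on it. The names rf_unsup, d_rf, hc_rf and domain are made up for this example, mtcars stands in for the drill hole data, and the linkage method and k = 3 are arbitrary assumptions.

    ```r
    library(randomForest)

    set.seed(42)   # the forest is random; fix the seed for repeatable output

    # No response is supplied, so randomForest runs in unsupervised mode
    rf_unsup <- randomForest(x = mtcars, ntree = 500, proximity = TRUE)

    # Proximity is a similarity between 0 and 1;
    # 1 - proximity can be used as a dissimilarity
    d_rf <- as.dist(1 - rf_unsup$proximity)

    # Cluster on that dissimilarity and cut into k groups (k = 3 is arbitrary)
    hc_rf <- hclust(d_rf, method = "average")
    domain <- factor(cutree(hc_rf, k = 3))

    table(domain)   # number of rows assigned to each label
    ```

    The resulting factor has one entry per row of the input, so it can be added to the data frame as the domain label.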
