Bootstrapping with a custom distance function in pvclust does not work

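I am using sequence analysis methods to measure the similarity between different "space-use sequences", represented as character strings. Here is a theoretical example of two sequences over three classes (A: City, B: Agriculture, C: Mountain):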

t1, t2, ..., tx
Individual 1: A A A B B B C C
Individual 2: A B B B A A C C
Mismatches: 0 1 1 0 1 1 0 0 = **4**

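The distance measure we use to compare sequences is the Hamming distance, i.e. the number of positions at which a character has to be substituted to make two sequences identical. In the example above, 4 characters need to be substituted. From the distance matrix obtained by computing the Hamming distance for every possible pair of sequences, a dendrogram is created using Ward's clustering method (ward.D2).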
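As a minimal sketch of the distance described above, the following base R code counts mismatched positions for the two example sequences and builds a Ward dendrogram from the pairwise distances. The helper `hamming_dist` and the third sequence `ind3` are made up for illustration. The actual analysis uses TraMineR's `seqdist`, which, when given a substitution-cost matrix, may weight substitutions rather than simply count them.

```r
# Plain (unweighted) Hamming distance: number of positions where two
# equal-length sequences differ. `hamming_dist` is a hypothetical helper.
hamming_dist <- function(s1, s2) {
  stopifnot(length(s1) == length(s2))
  sum(s1 != s2)
}

ind1 <- c("A", "A", "A", "B", "B", "B", "C", "C")
ind2 <- c("A", "B", "B", "B", "A", "A", "C", "C")
d12 <- hamming_dist(ind1, ind2)
print(d12)  # 4, as in the worked example

# Pairwise distance matrix for a small set of sequences (rows = individuals).
# ind3 is an invented third sequence so that a tree can be built.
ind3 <- c("C", "C", "C", "C", "B", "B", "A", "A")
seqs <- rbind(ind1, ind2, ind3)
n <- nrow(seqs)
dmat <- matrix(0, n, n, dimnames = list(rownames(seqs), rownames(seqs)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    dmat[i, j] <- hamming_dist(seqs[i, ], seqs[j, ])
  }
}

# Ward clustering on the distance matrix, as described in the question
tree <- hclust(as.dist(dmat), method = "ward.D2")
```

Calling `plot(tree)` would draw the dendrogram. With only three sequences it is of course only a toy.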

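Now I would also like to include a good measure of cluster robustness in order to identify the relevant clusters. For this I tried to use pvclust, which offers several ways of computing bootstrap values but is restricted to a few distance measures. With an unreleased version of pvclust, I tried to implement the proper distance measure (i.e. the Hamming distance) and to create a bootstrapped tree. The script runs, but the result is not correct. Applied to my dataset with nboot = 1000, the "bp" values are close to 0 and all other values ("au", "se.au", "se.bp", "v", "c", "pchi") are 0, which would suggest that the clusters are artifacts.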

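Here is an example script.

The data are very homogeneous simulated sequences (e.g. each one stays almost entirely in one particular state), so every cluster should definitely be significant. I limited the number of bootstrap replicates to 10 to keep the computation time down.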

```
#####################################################################
#### Create the sequences ####
dfr = data.frame()
a = list(dfr)
b = list(dfr)
c = list(dfr)
d = list(dfr)
data = list(dfr)

for (i in c(1:10)) {
  set.seed(i)
  a[[i]] <- sample(c(rep('A', 10), rep('B', 90)))
  b[[i]] <- sample(c(rep('B', 10), rep('A', 90)))
  c[[i]] <- sample(c(rep('C', 10), rep('D', 90)))
  d[[i]] <- sample(c(rep('D', 10), rep('C', 90)))
}
a = as.data.frame(a, header = FALSE)
b = as.data.frame(b, header = FALSE)
c = as.data.frame(c, header = FALSE)
d = as.data.frame(d, header = FALSE)

colnames(a) <- paste(rep('seq_urban'), rep(1:10), sep = '')
colnames(b) <- paste(rep('seq_agric'), rep(1:10), sep = '')
colnames(c) <- paste(rep('seq_mount'), rep(1:10), sep = '')
colnames(d) <- paste(rep('seq_sea'), rep(1:10), sep = '')

# One row per sequence (40 sequences), one column per time step (100 steps)
data = rbind(t(a), t(b), t(c), t(d))
#####################################################################

#### Analysis ####
## install packages if necessary
# install.packages(c("TraMineR", "devtools"))
library(TraMineR)
library(devtools)

source_url("https://www.dropbox.com/s/9znkgks1nuttlxy/pvclust.R?dl=0") # url to my dropbox for unreleased pvclust package
source_url("https://www.dropbox.com/s/8p6n5dlzjxmd6jj/pvclust-internal.R?dl=0") # url to my dropbox for unreleased pvclust package

dev.new()
par(mfrow = c(1, 2))
## Color definitions and alphabet/labels/scodes for sequence definition
palet <- c(rgb(230, 26, 26, max = 255), rgb(230, 178, 77, max = 255), "blue", "deepskyblue2") # color palette used for the states
s.alphabet <- c("A", "B", "C", "D") # the alphabet of the sequence object
s.labels <- c("country-side", "urban", "sea", "mountains") # the labels of the sequence object
s.scodes <- c("A", "U", "S", "M") # the states of the sequence object

## Sequence definition
seq_ <- seqdef(data, # data
               1:100, # columns corresponding to the sequence data
               id = rownames(data), # id of the sequences
               alphabet = s.alphabet, states = s.scodes, labels = s.labels,
               xtstep = 6,
               cpal = palet) # color palette

## Substitution matrix used to calculate the Hamming distance
Autocor <- seqsubm(seq_, method = "TRATE", with.missing = FALSE)

# Function with the Hamming distance (i.e. counts how often a character
# needs to be substituted to equate two sequences to each other. Result is a
# distance matrix giving the distances for each pair of sequences)
hamming <- function(x, ...) {
  res <- seqdist(x, method = "HAM", sm = Autocor)
  res <- as.dist(res)
  attr(res, "method") <- "hamming"
  return(res)
}

## Perform the bootstrapping using the distance method "hamming"
result <- pvclust(seq_, method.dist = hamming, nboot = 10, method.hclust = "ward")
result$hclust$labels <- rownames(test[,1]) # `test` is not defined in this script as posted
plot(result)
```

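To do this analysis I am using an unreleased version of the R package pvclust, which allows you to use your own distance method (in this case: Hamming). Does anyone know how to solve this problem?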

1 Answer

pvclust is designed to cluster variables (or attributes), not cases. That is why your results do not make sense. You can try:
```
library(pvclust)
data(iris)
# pvclust clusters the columns, here the four measurements
res <- pvclust(iris[, 1:4])
plot(res)
```
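To make the orientation concrete, here is a small base R sketch that contrasts clustering cases (rows) with clustering variables (columns) on the same iris measurements. It illustrates the orientation only and is not pvclust's internals. It uses Euclidean distance and average linkage as arbitrary choices, whereas pvclust's default distance is correlation-based.

```r
data(iris)
x <- iris[, 1:4]

# Clustering cases: one leaf per row (150 flowers)
tree_cases <- hclust(dist(x), method = "average")
n_case_leaves <- length(tree_cases$order)

# Clustering variables: transpose first, so one leaf per column (4 measurements)
tree_vars <- hclust(dist(t(x)), method = "average")
var_leaves <- tree_vars$labels

print(n_case_leaves)
print(var_leaves)
```

The second tree has the four measurement names as its leaves, which is the kind of object a variable-oriented method evaluates.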
To test the stability of a clustering of cases, you can use clusterboot from the fpc package. See my answer here: Measuring reliability of tree/dendrogram (Traminer)

In your example, you could use:
```
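library(fpc)
# Hamming distances between the sequences (rows of seq_), using the question's cost matrix
ham <- seqdist(seq_, method = "HAM", sm = Autocor)
# Bootstrap the cases and recluster with hierarchical clustering cut into k = 4 clusters
cf2 <- clusterboot(as.dist(ham), clustermethod = disthclustCBI, k = 4, cut = "number", method = "ward.D")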
```
With k=10, for example, you would get poor results, because your data really have 4 clusters (by construction).
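The idea behind bootstrapping cases can be sketched in base R: resample the cases, recluster them, and record for each original cluster the best Jaccard similarity to any cluster found in the resample. This is a simplified illustration under several assumptions (invented numeric toy data, Euclidean distance, duplicates in the resample dropped, hypothetical helpers `cluster_k`, `jaccard` and `boot_stability`). It is not the actual implementation of `clusterboot`.

```r
set.seed(1)
# Toy data: 4 well-separated groups of 10 cases each (invented for illustration)
centers <- rep(c(0, 10, 20, 30), each = 10)
x <- cbind(centers + rnorm(40), centers + rnorm(40))
d <- as.matrix(dist(x))

# Cut a Ward tree built from a distance matrix into k clusters
cluster_k <- function(dm, k) {
  cutree(hclust(as.dist(dm), method = "ward.D2"), k = k)
}

# Jaccard similarity between two sets of case indices
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

# Mean Jaccard similarity of each original cluster over B bootstrap resamples
boot_stability <- function(dm, k, B = 50) {
  n <- nrow(dm)
  orig <- cluster_k(dm, k)
  sims <- matrix(NA_real_, B, k)
  for (b in seq_len(B)) {
    # Resample cases with replacement; duplicates are dropped (a simplification)
    idx <- unique(sample(n, n, replace = TRUE))
    boot <- cluster_k(dm[idx, idx], k)
    for (cl in seq_len(k)) {
      # Original cluster restricted to the cases that were drawn
      members <- intersect(which(orig == cl), idx)
      sims[b, cl] <- max(sapply(seq_len(k), function(g) {
        jaccard(members, idx[boot == g])
      }))
    }
  }
  colMeans(sims)
}

stab4 <- boot_stability(d, k = 4)
print(round(stab4, 2))
```

Values near 1 indicate clusters that reappear in almost every resample. Running the same function with `k = 10` forces splits inside the four constructed groups, which is the situation described above.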
Answered on 2024-05-08T03:48:19+08:00
