
Building a circular phylogenetic tree

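I have a table of related genes and their diseases. I want to build a phylogenetic tree and group the genes by their diseases. Below is a sample dataset in which the gene1 column belongs to disease1 and the gene2 column belongs to disease2. Essentially, gene1 and gene2 are related to each other and are mapped to the diseases they belong to.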

gene1   gene2   disease1           disease2
AGTR1   ACHE    cancer              tumor
AGTR1   ACHE    parkinson's         asthma
ALOX5   ADRB1   myocardial infarct  heart failure
AR      ADORA1  breast cancer       anxiety disorder

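I would like a circular phylogenetic tree for this purpose, like the ones shown at the following link: http://itol.embl.de/itol.cgi

Are there any suggestions for doing this in R or in any other software?

Thanks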

The code I am running at the moment:

library(ape)  # provides as.phylo() and the "fan" plot type
d <- read.csv("genes_disease.txt", sep = "\t", header = TRUE)
phyl_gad <- as.phylo(hclust(dist(d)))
plot(phyl_gad, type = "fan",
     edge.col = c("red", "green", "blue", "orange", "yellow", "pink", "magenta", "white"),
     show.tip.label = FALSE)

If I set show.tip.label = TRUE, too many labels are drawn and the tips become cluttered.

My modified dataset now has only two columns, one for genes and one for diseases.

2 Answers

  • 4

    Ah, I have done this before. As Brian said, you want to use the ape package. Suppose you have an hclust object. For example,

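    library(ape)
    # d must be a "dist" object, e.g. the result of dist()
    fit <- hclust(d, method = 'ward.D')  # 'ward' was renamed 'ward.D' in R 3.1.0
    plot(as.phylo(fit), type = 'fan', label.offset = 0.1, no.margin = TRUE)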
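
    The `d` passed to hclust must be a distance object, which the two-column gene/disease table does not provide directly. The following is only a minimal sketch of one way to get there: the data frame `gd`, its made-up gene/disease pairs, and the choice of a binary (Jaccard-type) distance are all assumptions for illustration, not part of the original answer.

```r
# Hypothetical two-column table; the pairs are invented for illustration
gd <- data.frame(
  gene    = c("AGTR1", "AGTR1", "ACHE", "ACHE", "ALOX5", "ALOX5",
              "ADRB1", "AR", "ADORA1"),
  disease = c("cancer", "parkinson's", "cancer", "asthma",
              "myocardial infarct", "heart failure",
              "heart failure", "breast cancer", "asthma"),
  stringsAsFactors = FALSE
)

# Gene x disease incidence matrix: 1 if the gene is linked to the disease
inc <- unclass(table(gd$gene, gd$disease)) > 0
storage.mode(inc) <- "numeric"

# Binary distance: genes that share more diseases end up closer together
d <- dist(inc, method = "binary")

fit <- hclust(d, method = "ward.D")

# Draw the fan-shaped tree only if ape is installed
if (requireNamespace("ape", quietly = TRUE)) {
  plot(ape::as.phylo(fit), type = "fan", label.offset = 0.1, no.margin = TRUE)
}
```

    With real data, the same steps would apply to the two-column file read in with read.csv; whether a binary distance is appropriate depends on what "related" should mean for the genes.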
    

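    If you want to change the colors at the tips of the tree, you can use cutree together with the tip.color argument. This creates a repeating set of colors for the different clusters (for example, color=c('red','blue') will give alternating red and blue text at the ends of the branches).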

    nclus=...#insert number of clusters you want to cut to
    color=...#insert a vector of colors here
    fit <- hclust(d, method = 'ward.D')
    # recycle the colors so that every cluster gets one, even when nclus
    # is not a multiple of length(color)
    color_list <- rep(color, length.out = nclus)
    clus <- cutree(fit, nclus)
    plot(as.phylo(fit), type = 'fan', tip.color = color_list[clus], label.offset = 0.1, no.margin = TRUE)
    

    I am not sure which clustering method you want to use (I used Ward's method), but this is how you would do it.
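
    The cluster-to-color mapping can be tried on toy data. This is a sketch under stated assumptions: the random matrix `x`, the three clusters and the two-color palette are invented for illustration. It recycles the palette with `length.out` so that every cluster receives a color even when the number of clusters is not a multiple of the palette length.

```r
# Toy data standing in for a real distance matrix (illustrative only)
set.seed(1)
x <- matrix(rnorm(20), ncol = 2, dimnames = list(paste0("g", 1:10), NULL))
fit <- hclust(dist(x), method = "ward.D")

nclus <- 3
color <- c("red", "blue")

# rep(color, nclus / length(color)) would call rep() with times = 1.5, which
# is truncated to 1 and leaves cluster 3 without a color; length.out avoids it
color_list <- rep(color, length.out = nclus)

# Cluster id per tip, in the same order as fit$labels
clus <- cutree(fit, nclus)
tip_cols <- color_list[clus]

# Draw the colored fan tree only if ape is installed
if (requireNamespace("ape", quietly = TRUE)) {
  plot(ape::as.phylo(fit), type = "fan", tip.color = tip_cols,
       label.offset = 0.1, no.margin = TRUE)
}
```

    With two colors and three clusters, clusters 1 and 3 share a color; pass at least nclus colors if each cluster should be distinguishable.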

  • 0

    I think what you want to do is not phylogenetics but distance-based clustering. Here is a reproducible example.

    library(XML)
    library(RCurl)#geturl
    library(rlist)
    library(plyr)
    library(reshape2)
    library(ggtree)
    library(data.table)# used below via data.table::
    library(ape)# used below via ape::as.phylo
    
    #get the genes/ diseases info from internet
    #example from http://www.musclegenetable.fr/
    urllist<-paste0("http://195.83.227.65/4DACTION/GS/",LETTERS[1:24] )
    theurl <- lapply(urllist, function(x) RCurl::getURL(x, .opts = list(ssl.verifypeer = TRUE)))  # slow: fetches each page in turn
    theurl2<-lapply(theurl, function(x) gsub("<span class='Style18'>","__",x))
    tables <- lapply(theurl2, function (x) XML::readHTMLTable(x) )
    tables2 <- lapply(tables, function(x) rlist::list.clean(x, fun = is.null, recursive = FALSE) )
    unlist1 = lapply(tables2, plyr::ldply)
    newdf<-do.call(rbind, unlist1)
    colnames(newdf)[4]<-"diseases"
    colnames(newdf)[2]<-"Gene"
    newdf$gene<-sub("([A-z0-9]+)(__)(.*)","\\1",newdf$Gene)
    newdf$diseases<-sub("(\\* )","",newdf$diseases, perl=T)
    
    #split info of several diseases per gene, and simplify text 
    #to allow better clustering
    newdf2<-as.data.frame(data.table::setDT(newdf)[, strsplit(as.character(diseases), "* ", fixed=TRUE), by = .(gene, diseases)
    ][,.(diseases = V1, gene)])
    newdf2$disease<-sub("([A-z0-9,\\-\\(\\)\\/ ]+)( \\- )(.*)","\\1",newdf2$diseases)
    newdf2$disease<-gsub("[0-9,]","",newdf2$disease)
    newdf2$disease<-gsub("( [A-Z]{1,2})$","",newdf2$disease)
    newdf2$disease<-gsub("(\\-)","",newdf2$disease)
    newdf2$disease<-gsub("\\s*\\([^\\)]+\\)","",newdf2$disease)
    newdf2$disease<-gsub("\\s*type.*","",newdf2$disease, ignore.case = T)
    newdf2$disease<-gsub("(X{0,3})(IX|IV|V?I{0,3})","", newdf2$disease)
    newdf2$disease<-gsub("( [A-z]{1,2})$","",newdf2$disease)
    newdf2$disease<-sub("^([a-z])(.*)","\\U\\1\\E\\2",newdf2$disease, perl=T)
    newdf2$disease<-trimws(newdf2$disease)
    newdf2<-newdf2[,c(2,3)]
    
    #make clustering and tree
    newcasted <- reshape2::dcast(newdf2, gene ~ disease)
    # cluster on the numeric count columns only (column 1 holds the gene names)
    phyl_gad <- ape::as.phylo(hclust(dist(newcasted[, -1])))
    
    #use names of genes and diseases in tree
    DT <- data.table::as.data.table(newdf2)
    newdf4<-as.data.frame(DT[, lapply(.SD, paste, collapse=","), by = gene, .SDcols = 2])
    newdf4$genemerge<-paste(newdf4$gene, newdf4$disease)
    phyl_gad$tip.label<-newdf4$genemerge
    
    #plot tree
    ggtree::ggtree(phyl_gad, layout = "circular") + ggtree::geom_tiplab2(offset = 0.1, align = FALSE, size = 4)
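
    The clustering and labelling steps can also be tried without the web scraping. The sketch below is an offline stand-in, not the answer's exact pipeline: the table `gd` and its gene/disease pairs are invented, base R's table() replaces reshape2::dcast, and geom_tiplab is used for the labels. It builds the "gene disease1,disease2" labels in the same order as the rows of the matrix so that each label lands on the right tip.

```r
# Hypothetical two-column table; the pairs are invented for illustration
gd <- data.frame(
  gene    = c("AGTR1", "AGTR1", "ACHE", "ACHE", "ALOX5", "ALOX5",
              "ADRB1", "AR", "ADORA1"),
  disease = c("cancer", "parkinson's", "cancer", "asthma",
              "myocardial infarct", "heart failure",
              "heart failure", "breast cancer", "asthma"),
  stringsAsFactors = FALSE
)

# Wide gene x disease count matrix (base-R stand-in for reshape2::dcast)
wide <- unclass(table(gd$gene, gd$disease))
fit <- hclust(dist(wide))

# One label per gene, "GENE disease1,disease2", in the row order of `wide`
dis_by_gene <- tapply(gd$disease, gd$gene, paste, collapse = ",")
stopifnot(identical(names(dis_by_gene), rownames(wide)))
fit$labels <- paste(names(dis_by_gene), dis_by_gene)

# Draw the circular tree only if both ape and ggtree are installed
if (requireNamespace("ape", quietly = TRUE) &&
    requireNamespace("ggtree", quietly = TRUE)) {
  tree <- ape::as.phylo(fit)
  p <- ggtree::ggtree(tree, layout = "circular") + ggtree::geom_tiplab(size = 3)
  print(p)
}
```

    Checking that the label vector and the matrix rows share the same gene order before overwriting the labels guards against labels being attached to the wrong tips.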
    
