
Compare two columns "by sequence" and make a new column


The problem is hard to explain, so let me show what I want to get from this data. I have a data set with 20 different columns, two of which are shown here.

Sequence             modifications
AAAAGAAAVANQGKK     [14] Acetyl (K)|[15] Acetyl (K)
AAAAGAAAVANQGKK     [14] Acetyl (K)|[15] Acetyl (K)
AAIKFIKFINPKINDGE   [4] Acetyl (K)|[7] Acetyl (K)|[12] Acetyl (K)
AAIKFIKFINPKINDGE   [4] Acetyl (K)|[7] Acetyl (K)|[12] Acetyl (K)
AAIKFIKFINPKINDGE   [7] Acetyl (K)|[12] Acetyl (K)
AAIKFIKFINPKINDGE   [4] Acetyl (K)|[7] Acetyl (K)
AAIYKLLKSHFRNE      [5] Biotin (K)|[8] Acetyl (K)
AAKKFEE             [3] Acetyl (K)|[4] Acetyl (K)

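As you can see, the same sequence can carry different modifications. Sometimes there are three acetyl groups, sometimes two, sometimes only one, and in other cases there is no modification at all. I am only interested in two modifications, "Biotin" and "Acetyl"; the others do not matter. The number of modifications depends on the number of "K"s in the sequence. For example, if a sequence contains 3 "K"s, the possible number of modifications is 0, 1, 2 or 3, and never more than 3. So I would like to group these sequences (1000 rows) by the number of "K"s in the sequence and by the number and type of modifications, without breaking the other columns.

What I would like to get from this data with R is separate groups of sequences with specified modifications. For example: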

First Group: (number of "K" in the sequence = 2, and both modified by acetyl)

Sequence             modifications
AAAAGAAAVANQGKK      [14] Acetyl (K)|[15] Acetyl (K)
AAIYKLLKSHFRNE       [5] Acetyl (K)|[8] Acetyl (K)

Second Group: (number of "K" in the sequence = 2, and one modified by acetyl, second nothing)

Third Group: (number of "K" in the sequence = 3, and one modified by acetyl, second acetyl, and last is biotin)

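I have to include all possibilities. This is what I think would work best for this "part" of the script I am trying to write. Maybe you have other suggestions for how to handle this data.

The second question: I calculated the mean of the values in 3 different columns, and I would like to put the result in the same data but in another column. How can I do that?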

tbl_imp$mean <- rowMeans(subset(tbl_imp, select = c("x", "y", "w")), na.rm = TRUE)
tbl_imp$mean <- data.frame(tbl_imp$mean)

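This is the code I used to calculate the row means. I just don't know how to create a new column in my data and put the results there. Should I use the transform function?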
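
A minimal sketch of the second question, assuming a toy data frame `tbl_imp` with numeric columns `x`, `y` and `w` (the values are invented for illustration). Assigning the result of `rowMeans()` to `tbl_imp$mean` already creates the new column in the same data frame, so wrapping it in `data.frame()` afterwards or calling `transform()` should not be necessary:

```r
# Toy stand-in for the real data; the values are made up for illustration
tbl_imp <- data.frame(x = c(1, 4, NA), y = c(2, 5, 8), w = c(3, 6, 10))

# rowMeans() returns one value per row; assigning it with `$<-`
# appends it as a new column of the same data frame
tbl_imp$mean <- rowMeans(tbl_imp[, c("x", "y", "w")], na.rm = TRUE)

print(tbl_imp)
```

With `na.rm = TRUE`, a row that has a missing value is averaged over its remaining values.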

2 Answers

  • 0

    I loaded your data as the object aa.

    mydata <- data.frame(seqs = aa$Sequence, mods = aa$modifications) # subset of aa with sequences and modifications

    ## find the number of "K"s
    spl_seqs <- strsplit(as.character(mydata$seqs), split = "")  # split all sequences (use "as.character" because they may have been turned into factors)
    where_K <- lapply(spl_seqs, grep, pattern = "K") # find positions of "K"s in each sequence
    No_K <- vapply(where_K, length, integer(1)) # count "K"s in each sequence (an integer vector, not a list)

    mydata$No_Ks <- No_K # add a column that records the number of "K"s in each sequence
    ##
    

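    I assume that every upper-case letter in the "modifications" column is either the first letter of a modification or a "K". I could not think of any other way to simplify the "modifications" column so that it can be manipulated, so I simply keep the upper-case letters that are not "K":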

    names(LETTERS) <- LETTERS  # DWin's idea in this http://stackoverflow.com/questions/4423460/is-there-a-function-to-find-all-lower-case-letters-in-a-character-vector 
    
    spl_mods <- strsplit(as.character(mydata$mods), split = "")  # split the characters in each modification row
    

    Simplify the modifications column, keeping only the first letter of each modification:

    mods_ls <- vector("list", length = nrow(mydata))  # list to fill with simplified modifications
    for(i in seq_along(spl_mods))
     {
      res <- as.character(na.omit(LETTERS[spl_mods[[i]]])) # keep only upper-case letters (reuses spl_mods instead of splitting again)

      res <- res[res != "K"]  # exclude "K"s
      res <- res[res != "M"]  # and "M"s I guessed

      mods_ls[[i]] <- res
     }
    mydata$simplified_mods <- unlist(lapply(mods_ls, paste, collapse = " ; "))
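
    As a possible alternative to keeping only first letters, the sketch below pulls out the full modification names with a regular expression. It assumes every modification is written as a single word followed by " (K)", as in the sample rows; the vector `mods` and the names `hits`, `mod_names` and `n_mods` are illustrative and not part of the original code:

    ```r
    # A few modification strings in the format shown in the question;
    # the empty string stands for a sequence with no modification
    mods <- c("[14] Acetyl (K)|[15] Acetyl (K)",
              "[5] Biotin (K)|[8] Acetyl (K)",
              "[3] Acetyl (K)",
              "")

    # Extract each word that is immediately followed by " (K)"
    hits <- regmatches(mods, gregexpr("[A-Za-z]+(?= \\(K\\))", mods, perl = TRUE))

    # Collapse the names per row and count the modifications per row
    mod_names <- vapply(hits, paste, character(1), collapse = " ; ")
    n_mods <- lengths(hits)

    print(data.frame(mods, mod_names, n_mods))
    ```

    Full names avoid collisions between modifications that happen to share a first letter.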
    

    What we have so far:

    mydata[1:10,]
    #                seqs                                          mods No_Ks simplified_mods
    #1    AAAAGAAAVANQGKK               [14] Acetyl (K)|[15] Acetyl (K)     2           A ; A
    #2    AAAAGAAAVANQGKK               [14] Acetyl (K)|[15] Acetyl (K)     2           A ; A
    #3      AAFTKLDQVWGSE                                [5] Acetyl (K)     1               A
    #4  AAIKFIKFINPKINDGE [4] Acetyl (K)|[7] Acetyl (K)|[12] Acetyl (K)     3       A ; A ; A
    #5  AAIKFIKFINPKINDGE [4] Acetyl (K)|[7] Acetyl (K)|[12] Acetyl (K)     3       A ; A ; A
    #6  AAIKFIKFINPKINDGE                [7] Acetyl (K)|[12] Acetyl (K)     3           A ; A
    #7  AAIKFIKFINPKINDGE                 [4] Acetyl (K)|[7] Acetyl (K)     3           A ; A
    #8     AAIYKLLKSHFRNE                 [5] Biotin (K)|[8] Acetyl (K)     2           B ; A
    #9            AAKKFEE                 [3] Acetyl (K)|[4] Acetyl (K)     2           A ; A
    #10           AAKYFRE                                [3] Acetyl (K)     1               A
    

    You can then subset on the number of "K"s and on the specific modifications you want. For example:

    how_many_K <- 2
    what_mods <- "A ; A"    #separated by [space];[space]

    show_rows <- which(mydata$No_Ks == how_many_K & mydata$simplified_mods == what_mods)
    mydata[show_rows,]
    #                             seqs                            mods No_Ks simplified_mods
    #1                 AAAAGAAAVANQGKK [14] Acetyl (K)|[15] Acetyl (K)     2           A ; A
    #2                 AAAAGAAAVANQGKK [14] Acetyl (K)|[15] Acetyl (K)     2           A ; A
    #9                         AAKKFEE   [3] Acetyl (K)|[4] Acetyl (K)     2           A ; A
    #11                     AANVKKTLVE   [5] Acetyl (K)|[6] Acetyl (K)     2           A ; A
    #14  AARDSKSPIILQTSNGGAAYFAGKGISNE  [6] Acetyl (K)|[24] Acetyl (K)     2           A ; A
    #20                        AEKLKAE   [3] Acetyl (K)|[5] Acetyl (K)     2           A ; A
    #21
    #....
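
    Since the question asks for every possible group rather than one chosen combination, a sketch with base R `split()` may help. It assumes a small stand-in for `mydata` with the `No_Ks` and `simplified_mods` columns built earlier; `group_key` and `groups` are illustrative names:

    ```r
    # Toy stand-in for `mydata`; values are taken from the rows printed in this answer
    mydata <- data.frame(
      seqs = c("AAAAGAAAVANQGKK", "AAIKFIKFINPKINDGE", "AAIYKLLKSHFRNE", "AAKKFEE"),
      No_Ks = c(2L, 3L, 2L, 2L),
      simplified_mods = c("A ; A", "A ; A ; A", "B ; A", "A ; A"),
      stringsAsFactors = FALSE
    )

    # One key per combination of "number of Ks" and "modification pattern"
    group_key <- paste(mydata$No_Ks, mydata$simplified_mods, sep = " | ")

    # split() returns a named list holding one data frame per group
    groups <- split(mydata, group_key)

    print(names(groups))
    print(groups[["2 | A ; A"]])
    ```

    Each element of `groups` keeps all columns of the original rows, so the other columns travel along with the grouping.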
    

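    EDIT: All of this can be wrapped in a function such as fun. x is your data.frame (the one uploaded with structure, "for Henrik"). noK is the number of "K"s you want. no_modK is the number of modified "K"s you want. mod is the modifications you want, separated by [space];[space] (e.g. "B ; A ; O"):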

    fun <- function(x, noK, no_modK = NULL, mod = NULL) #EDIT_1e: update arguments; made optional
    {
     mydata <- data.frame(seqs = x$Sequence, mods = x$modifications)

     spl_seqs <- strsplit(as.character(mydata$seqs), split = "")
     where_K <- lapply(spl_seqs, grep, pattern = "K")
     mydata$No_Ks <- vapply(where_K, length, integer(1)) # number of "K"s per sequence

     names(LETTERS) <- LETTERS

     spl_mods <- strsplit(as.character(mydata$mods), split = "")

     mods_ls <- vector("list", length = nrow(mydata))
     for(i in seq_along(spl_mods))
      {
       res <- as.character(na.omit(LETTERS[spl_mods[[i]]])) # keep only upper-case letters

       no_modedK <- sum(res == "K")   #EDIT_1a: how many "K"s are modified?

       res <- res[!res %in% c("K", "M")]  # drop "K"s and "M"s

       mods_ls[[i]] <- list(mods = res, modified_K = no_modedK) #EDIT_1b: catch number of "K"s modified (along with the actual modifications)
      }

     mydata$no_modK <- vapply(mods_ls, function(m) m$modified_K, integer(1)) #EDIT_1d: insert number of modified "K"s in "mydata"
     mydata$simplified_mods <- vapply(mods_ls, function(m) paste(m$mods, collapse = " ; "), character(1)) #EDIT_1c: insert mods in "mydata"

     #EDIT_1f: update "return"; each optional filter is applied only when it is supplied
     keep <- mydata$No_Ks == noK
     if(!is.null(no_modK)) keep <- keep & mydata$no_modK == no_modK
     if(!is.null(mod)) keep <- keep & mydata$simplified_mods == mod

     return(mydata[which(keep),])
    }
    

    For example:

    fun(aa, noK = 3) #aa is the "for Henrik" data loaded in `R` (aa <- structure( ... ))
                                  seqs                                                             mods No_Ks no_modK simplified_mods
    4                AAIKFIKFINPKINDGE                    [4] Acetyl (K)|[7] Acetyl (K)|[12] Acetyl (K)     3       3       A ; A ; A
    5                AAIKFIKFINPKINDGE                    [4] Acetyl (K)|[7] Acetyl (K)|[12] Acetyl (K)     3       3       A ; A ; A
    6                AAIKFIKFINPKINDGE                                   [7] Acetyl (K)|[12] Acetyl (K)     3       2           A ; A
    #...
    fun(aa, noK = 3, no_modK = 2)
                             seqs                                             mods No_Ks no_modK simplified_mods
    6           AAIKFIKFINPKINDGE                   [7] Acetyl (K)|[12] Acetyl (K)     3       2           A ; A
    7           AAIKFIKFINPKINDGE                    [4] Acetyl (K)|[7] Acetyl (K)     3       2           A ; A
    #...

    fun(aa, noK = 2, mod = "A ; B")
                  seqs                           mods No_Ks no_modK simplified_mods
    200    ISAMVLTKMKE [8] Acetyl (K)|[10] Biotin (K)     2       2           A ; B
    441 NLKPSKPSYYLDPE  [3] Acetyl (K)|[6] Biotin (K)     2       2           A ; B
    #...

    fun(aa, noK = 2, no_modK = 1, mod =  "A")
                                     seqs            mods No_Ks no_modK simplified_mods
    15      AARDSKSPIILQTSNGGAAYFAGKGISNE [24] Acetyl (K)     2       1               A
    27                     AKALVAQGVKFIAE  [2] Acetyl (K)     2       1               A
    #...
    

    EDIT_1: Updated fun and the examples.

  • 0

    Something like this might work for the first part. I can't download the file right now, but when I can, I'll try to answer the second part as well.

    library(data.table)
    library(stringr)
    
    # Slightly modified dataset
    dataset <- data.table(
    Sequence  = c(
    'AAAAGAAAVANQGKK'    
    ,'AAAAGAAAVANQGKK'    
    ,'AAIKFIKFINPKINDGE'  
    ,'AAIKFIKFINPKINDGE'  
    ,'AAIKFIKFINPKINDGE'  
    ,'AAIKFIKFINPKINDGE'
    ,'AAIYKLLKSHFRNE'
    ,'AAKKFEE'
    ),
     modifications = c(
    '[14] Acetyl (K)|[15] Acetyl (K)'
    ,'[14] Acetyl (K)|[15] Acetyl (K)'
    ,'[4] Acetyl (K)|[7] Something (K)|[12] Acetyl (K)'
    ,'[4] Acetyl (K)|[7] Acetyl (K)|[12] Acetyl (K)'
    ,'[7] Acetyl (K)|[12] Acetyl (K)'
    ,'[4] Acetyl (K)|[7] Acetyl (K)'
    ,'[5] Biotin (K)|[8] Acetyl (K)'
    ,'[3] Acetyl (K)'
    )
    )
    
    # get the 1st, 2nd, 3rd modifications in separate columns
    dataset <- data.table(cbind(
       dataset,
       str_split_fixed(dataset[,modifications], pattern = "\\(K\\)",3)
    ))
    
    dataset[,':='(
       V1 = as.character(V1),
       V2 = as.character(V2),
       V3 = as.character(V3)
    )]
    
    # Count of modifications    
    dataset[, NoOfKs := 3]
    dataset[V3 == "", NoOfKs := 2]
    dataset[V2 == "", NoOfKs := 1]
    dataset[V1 == "", NoOfKs := 0]
    
    # Retaining Acetyl/Biotin or no modification only
    dataset[, AB01 := TRUE]
    dataset[, AB02 := TRUE]
    dataset[, AB03 := TRUE]
    
    dataset[V1 != "",  AB01 := grepl(V1, pattern = "Acetyl|Biotin")]
    dataset[V2 != "",  AB02 := grepl(V2, pattern = "Acetyl|Biotin")]
    dataset[V3 != "",  AB03 := grepl(V3, pattern = "Acetyl|Biotin")]
    
    dataset <- dataset[AB01 & AB02 & AB03]
    
    # Marking each modification as acetyl/biotin/none
    dataset[V1 != " " & grepl(V1, pattern = "Acetyl"), AB1 := "Acetyl"]
    dataset[V1 != " " & grepl(V1, pattern = "Biotin"), AB1 := "Biotin"]
    dataset[V2 != " " & grepl(V2, pattern = "Acetyl"), AB2 := "Acetyl"]
    dataset[V2 != " " & grepl(V2, pattern = "Biotin"), AB2 := "Biotin"]
    dataset[V3 != " " & grepl(V3, pattern = "Acetyl"), AB3 := "Acetyl"]
    dataset[V3 != " " & grepl(V3, pattern = "Biotin"), AB3 := "Biotin"]
    
    dataset[
       ,
       list(
       Sequence = Sequence, 
       modifications = modifications, 
       GroupID = .GRP
       ),
       by = c('NoOfKs','AB1','AB2','AB3')
    ]
    

    This yields:

       NoOfKs    AB1    AB2    AB3          Sequence                                 modifications GroupID
    1:      2 Acetyl Acetyl     NA   AAAAGAAAVANQGKK               [14] Acetyl (K)|[15] Acetyl (K)       1
    2:      2 Acetyl Acetyl     NA   AAAAGAAAVANQGKK               [14] Acetyl (K)|[15] Acetyl (K)       1
    3:      2 Acetyl Acetyl     NA AAIKFIKFINPKINDGE                [7] Acetyl (K)|[12] Acetyl (K)       1
    4:      2 Acetyl Acetyl     NA AAIKFIKFINPKINDGE                 [4] Acetyl (K)|[7] Acetyl (K)       1
    5:      3 Acetyl Acetyl Acetyl AAIKFIKFINPKINDGE [4] Acetyl (K)|[7] Acetyl (K)|[12] Acetyl (K)       2
    6:      2 Biotin Acetyl     NA    AAIYKLLKSHFRNE                 [5] Biotin (K)|[8] Acetyl (K)       3
    7:      1 Acetyl     NA     NA           AAKKFEE                                [3] Acetyl (K)       4
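
    Note that `NoOfKs` above counts modifications, not "K"s in the sequence: AAKKFEE contains two "K"s but gets `NoOfKs` = 1. If the number of "K"s in the sequence is needed as a separate grouping key, a base R sketch could look like the following. The vector `seqs` and the name `n_K` are illustrative, and "AAFTE" is a made-up sequence without any "K":

    ```r
    # Sample sequences from the question plus one invented sequence without "K"
    seqs <- c("AAAAGAAAVANQGKK", "AAIKFIKFINPKINDGE", "AAKKFEE", "AAFTE")

    # gregexpr() finds every literal "K"; regmatches() extracts the matches,
    # and lengths() counts them per sequence (0 when there is no match)
    n_K <- lengths(regmatches(seqs, gregexpr("K", seqs, fixed = TRUE)))

    print(data.frame(Sequence = seqs, n_K = n_K))
    ```

    Such a count could be stored as its own column and added to the `by` columns of the grouping step.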
    
