
First observation has no row number


I am reading a dataset that looks like this:

enter image description here

My code is as follows:

NatPark <- read.delim (paste0(dirdata,"NatPark_Plus.dat"),
                  header= TRUE, 
                  sep = "\t",
                  quote = "\"",
                  dec = ".",
                  fill = TRUE,
                  as.is = c("ParkName", "State"))

I then get the following warnings:

Warning messages:
1: In read.table(file = file, header = header, sep = sep, quote = quote, :
  incomplete final line found by readTableHeader on '/Volumes/Elements/STAT_611/611/DATA/DATA11/NatPark_Plus.dat'
2: In read.table(file = file, header = header, sep = sep, quote = quote, :
  not all columns named in 'as.is' exist

So I changed "header = TRUE" to "header = FALSE", as follows:

NatPark <- read.delim (paste0(dirdata,"NatPark_Plus.dat"),
                    header= FALSE, 
                      sep = "\t",
                      quote = "\"",
                      dec = ".",
                      fill = TRUE,
                      as.is = c("ParkName", "State"))

I got the same warning messages:

Warning messages:
1: In read.table(file = file, header = header, sep = sep, quote = quote, :
  incomplete final line found by readTableHeader on '/Volumes/Elements/STAT_611/611/DATA/DATA11/NatPark_Plus.dat'
2: In read.table(file = file, header = header, sep = sep, quote = quote, :
  not all columns named in 'as.is' exist

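This time all the row numbers show up, as shown below. However, I don't understand what the output of str(NatPark) means. What is "V1"? And what is the "4 1 5 2 3" that follows it? Thanks for any suggestions!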

enter image description here

2 Answers

  • 1

    I haven't worked with .dat files, but if you can share a download link I can help troubleshoot further. For now, I can offer the following insights:

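    • V1 (and V2, V3, V4, ...) are the column names R assigns automatically when there is no header. Since there is only a V1, R evidently sees just one column with the current settings.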

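    • Regarding "4 1 5 2 3": as the output of str shows, V1 is a factor, and str prints a factor's integer level codes (here each entire line was read as a single value). By default, R sorts factor levels alphabetically, so the codes reflect each line's position in that sorted order. This example from the iris dataset should help clarify: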

    str(iris)
    #> 'data.frame':    150 obs. of  5 variables:
    #>  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
    #>  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
    #>  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
    #>  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
    #>  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
    head(iris$Species)
    #> [1] setosa setosa setosa setosa setosa setosa
    #> Levels: setosa versicolor virginica
    levels(iris$Species)
    #> [1] "setosa"     "versicolor" "virginica"
    

    Created on 2018-08-18 by the reprex package (v0.2.0).

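    You can see that the value setosa is coded as 1 because it is the first level, versicolor is 2, and virginica is 3. However, this should all be a moot point, since you don't want each whole line read in as a single variable.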
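
    To connect this to the question, here is a minimal sketch that reproduces both the single V1 column and the "4 1 5 2 3" codes. The `lines` vector is a hypothetical stand-in for the file: its exact spacing is an assumption, and it only relies on the lines containing no tab characters. `stringsAsFactors = TRUE` is set explicitly because it was the default before R 4.0.

    ```r
    # Hypothetical stand-in for the file contents (spacing is an assumption).
    lines <- c(
      "Yellowstone ID/MT/WY 1872 4,065,493",
      "Everglades FL 1934 1,398,800",
      "Yosemite CA 1864 760,917",
      "Great Smoky Mountains NC/TN 1926 520,269",
      "Wolf Trap Farm VA 1966 130"
    )

    # sep = "\t" finds no tabs, so every line becomes a single field.
    one_col <- read.delim(text = lines, header = FALSE, sep = "\t",
                          stringsAsFactors = TRUE)

    names(one_col)          # the auto-assigned column name: "V1"
    levels(one_col$V1)      # the five lines, sorted alphabetically
    as.integer(one_col$V1)  # 4 1 5 2 3
    ```

    Yellowstone is the fourth line in alphabetical order, Everglades the first, and so on, which is where the codes come from.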

  • 1

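    Regarding your main question, I was able to put together a custom function to parse your data. In future, things would probably be much simpler if text fields could be quoted in the source data. Anyway, I hope this works for you! You then only need to set the column names and convert some columns from character to numeric.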

    library(tidyverse)
    library(stringr)
    
    directory <- "/Users/jas/Desktop"
    filename <- "NatPark_Plus.dat"
    file <- file.path(directory, filename)
    
    # tabs
    data <- read.delim(file, header = FALSE, sep = "\t")
    #> Warning in read.table(file = file, header = header, sep = sep, quote =
    #> quote, : incomplete final line found by readTableHeader on '/Users/jas/
    #> Desktop/NatPark_Plus.dat'
    
    # We have 5 records, but the spacing between fields is uneven and some names contain spaces
    
    text <- data$V1
    
    # Parse text to make same number of columns - 4
    # Creates a separate dataframe for each row
    parse_text_to_df <- function(x) {
      # Find more than one spaces and replace with tab
      x <- gsub("[ ]{2,}", "\t", x)
      # replace remaining space with tab (cannot use comma since numbers have comma)
      x <- gsub(" ", "\t", x)
      # Should be only 3 tabs on each line - WORKS FOR THIS DATASET ONLY
      total_tabs <- stringr::str_count(x, "\t")
      # If a name contains spaces, turn the extra leading tabs back into spaces
      if (total_tabs[1] > 3) {
        num_tabs_to_remove <- total_tabs - 3
        # seq_len() loops exactly num_tabs_to_remove times; range() returns c(min, max)
        for (i in seq_len(num_tabs_to_remove)) {
          x <- sub("\t", " ", x)
        }
      }
      # Convert to an object that can be read back into a dataframe
      x <- readLines(textConnection(x))
      df <- read.delim(text = x, header = FALSE, sep = "\t") %>%
        mutate_all(as.character)
      return(df)
    }
    
    # Combine each of the 1 row dataframes into one dataframe (all character vectors)
    df <- text %>% map_df(parse_text_to_df)
    df
    #>                      V1       V2   V3        V4
    #> 1           Yellowstone ID/MT/WY 1872 4,065,493
    #> 2            Everglades       FL 1934 1,398,800
    #> 3              Yosemite       CA 1864   760,917
    #> 4 Great Smoky Mountains    NC/TN 1926   520,269
    #> 5        Wolf Trap Farm       VA 1966       130
    

    Created on 2018-08-18 by the reprex package (v0.2.0).
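
    The remaining step mentioned above (setting column names and converting columns to numeric) could look roughly like the sketch below. `ParkName` and `State` come from the question's `as.is` argument; `Year` and `Acres` are hypothetical names chosen for illustration, since the real header is not shown. The data frame is rebuilt by hand from the printed output so that the snippet runs on its own.

    ```r
    # Rebuild the parsed result printed above (all columns are character).
    df <- data.frame(
      V1 = c("Yellowstone", "Everglades", "Yosemite",
             "Great Smoky Mountains", "Wolf Trap Farm"),
      V2 = c("ID/MT/WY", "FL", "CA", "NC/TN", "VA"),
      V3 = c("1872", "1934", "1864", "1926", "1966"),
      V4 = c("4,065,493", "1,398,800", "760,917", "520,269", "130"),
      stringsAsFactors = FALSE
    )

    # "Year" and "Acres" are assumed names, not taken from the source file.
    names(df) <- c("ParkName", "State", "Year", "Acres")

    # Convert the numeric columns; strip thousands separators first.
    df$Year  <- as.integer(df$Year)
    df$Acres <- as.numeric(gsub(",", "", df$Acres, fixed = TRUE))

    str(df)
    ```

    Removing the commas before calling `as.numeric()` matters: without it, values such as "4,065,493" would be converted to NA with a warning.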
