First observation has no row number


I am reading in a dataset that looks like this:

(screenshot of the dataset; image not available)

My code is as follows:

NatPark <- read.delim (paste0(dirdata,"NatPark_Plus.dat"),
                  header= TRUE, 
                  sep = "\t",
                  quote = "\"",
                  dec = ".",
                  fill = TRUE,
                  as.is = c("ParkName", "State"))

I then get the following warnings:

Warning messages:
1: In read.table(file = file, header = header, sep = sep, quote = quote,  :
  incomplete final line found by readTableHeader on '/Volumes/Elements/STAT_611/611/DATA/DATA11/NatPark_Plus.dat'
2: In read.table(file = file, header = header, sep = sep, quote = quote,  :
  not all columns named in 'as.is' exist

So I changed "header = TRUE" to "header = FALSE", as follows:

NatPark <- read.delim (paste0(dirdata,"NatPark_Plus.dat"),
                  header= FALSE, 
                  sep = "\t",
                  quote = "\"",
                  dec = ".",
                  fill = TRUE,
                  as.is = c("ParkName", "State"))

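I got the same warnings:

Warning messages:
1: In read.table(file = file, header = header, sep = sep, quote = quote,  :
  incomplete final line found by readTableHeader on '/Volumes/Elements/STAT_611/611/DATA/DATA11/NatPark_Plus.dat'
2: In read.table(file = file, header = header, sep = sep, quote = quote,  :
  not all columns named in 'as.is' exist

Now every observation shows a row number, as shown below. However, I don't understand what the output of str(NatPark) means. What is "V1"? And what is the "4 1 5 2 3" that follows it? Thanks for any suggestions!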

(screenshot of the str(NatPark) output; image not available)

2 Answers

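I haven't worked with .dat files, but if you can share a download link I can help troubleshoot further. For now, I can offer the following insights:
- V1 (and V2, V3, V4, ...) are the column names R assigns automatically when there is no header. Since there is only V1, R thinks you have just one column with the current settings.
- Regarding "4 1 5 2 3": what you see in the output of str is a factor variable shown by its integer level codes (in this case, each entire line was read in as a single variable). By default, R sorts factor levels alphabetically. This example from the iris dataset should help clarify: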
```
str(iris)
#> 'data.frame':    150 obs. of  5 variables:
#>  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#>  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#>  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#>  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#>  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
head(iris$Species)
#> [1] setosa setosa setosa setosa setosa setosa
#> Levels: setosa versicolor virginica
levels(iris$Species)
#> [1] "setosa"     "versicolor" "virginica"
```
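Created on 2018-08-18 by the reprex package (v0.2.0).

You can see that the value setosa is coded as 1 because it is the first level, versicolor is 2, and virginica is 3. However, this should all be a moot point, since you don't want each entire line read in as a single variable.
Answered 2024-05-03T15:36:01+08:00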
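To tie this back to the question, here is a minimal sketch of how "4 1 5 2 3" can arise. The contents of the original file are not shown, so the `lines` vector below is a hypothetical stand-in built from the park names, with fields separated by spaces rather than tabs. `stringsAsFactors = TRUE` is set explicitly because R 4.0 and later no longer convert strings to factors by default.

```r
# Hypothetical stand-in for the file: fields separated by spaces, not tabs
lines <- c(
  "Yellowstone            ID/MT/WY  1872  4,065,493",
  "Everglades             FL        1934  1,398,800",
  "Yosemite               CA        1864  760,917",
  "Great Smoky Mountains  NC/TN     1926  520,269",
  "Wolf Trap Farm         VA        1966  130"
)

# With sep = "\t" and no tabs in the input, each line becomes one value
parks <- read.delim(text = lines, header = FALSE, sep = "\t",
                    stringsAsFactors = TRUE)

names(parks)          # "V1": the only column R could find
as.integer(parks$V1)  # 4 1 5 2 3: position of each line among the sorted levels
```

The lines sort as Everglades, Great Smoky Mountains, Wolf Trap Farm, Yellowstone, Yosemite, so the first line (Yellowstone) gets code 4, the second (Everglades) gets code 1, and so on.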

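Regarding your main question, I was able to put together a custom function to parse your data. In the future, things would probably be much simpler if the text fields in the source data were quoted. Anyway, hopefully this works for you! All that remains is to set the column names and convert some columns from character to numeric.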

library(tidyverse)
library(stringr)

directory <- "/Users/jas/Desktop"
filename <- "NatPark_Plus.dat"
file <- file.path(directory, filename)

# tabs
data <- read.delim(file, header = FALSE, sep = "\t")
#> Warning in read.table(file = file, header = header, sep = sep, quote =
#> quote, : incomplete final line found by readTableHeader on '/Users/jas/
#> Desktop/NatPark_Plus.dat'

# We have 5 records, but the spacing amongst them is uneven and some words with spaces

text <- data$V1

# Parse text to make same number of columns - 4
# Creates a separate dataframe for each row
parse_text_to_df <- function(x) {
  # Find more than one spaces and replace with tab
  x <- gsub("[ ]{2,}", "\t", x)
  # replace remaining space with tab (cannot use comma since numbers have comma)
  x <- gsub(" ", "\t", x)
  # Should be only 3 tabs on each line - WORKS FOR THIS DATASET ONLY
  total_tabs <- stringr::str_count(x, "\t")
  # If we have those words with spaces, we need to remove the extra tabs between them
  if (total_tabs[1] > 3) {
    num_tabs_to_remove <- total_tabs - 3
    # seq_len(n) iterates exactly n times; range(n) would yield c(n, n)
    for (i in seq_len(num_tabs_to_remove)) {
      x <- sub("\t", " ", x)
    }
  }
  # Convert to an object that can be read back into a dataframe
  x <- readLines(textConnection(x))
  df <- read.delim(text = x, header = FALSE, sep = "\t") %>%
    mutate_all(as.character)
  return(df)
}

# Combine each of the 1 row dataframes into one dataframe (all character vectors)
df <- text %>% map_df(parse_text_to_df)
df
#>                      V1       V2   V3        V4
#> 1           Yellowstone ID/MT/WY 1872 4,065,493
#> 2            Everglades       FL 1934 1,398,800
#> 3              Yosemite       CA 1864   760,917
#> 4 Great Smoky Mountains    NC/TN 1926   520,269
#> 5        Wolf Trap Farm       VA 1966       130

Created on 2018-08-18 by the reprex package (v0.2.0).

Answered 2024-05-03T15:36:01+08:00
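The remaining cleanup (naming the columns and converting types) could look roughly like the sketch below. It rebuilds the parsed result by hand so that it runs on its own. `ParkName` and `State` come from the `as.is` argument in the question, while `Year` and `Acres` are assumed names for illustration; the source never says what the last two columns are called.

```r
# Rebuild the parsed all-character result so the sketch is self-contained
df <- data.frame(
  V1 = c("Yellowstone", "Everglades", "Yosemite",
         "Great Smoky Mountains", "Wolf Trap Farm"),
  V2 = c("ID/MT/WY", "FL", "CA", "NC/TN", "VA"),
  V3 = c("1872", "1934", "1864", "1926", "1966"),
  V4 = c("4,065,493", "1,398,800", "760,917", "520,269", "130"),
  stringsAsFactors = FALSE
)

# "Year" and "Acres" are assumed names, not taken from the source file
names(df) <- c("ParkName", "State", "Year", "Acres")

# Convert the numeric-looking columns; strip thousands separators first
df$Year  <- as.integer(df$Year)
df$Acres <- as.numeric(gsub(",", "", df$Acres, fixed = TRUE))

str(df)
```

Removing the commas before calling `as.numeric()` matters: converting "4,065,493" directly would produce `NA` with a warning.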
