readr

The goal of readr is to provide a fast and friendly way to read tabular data into R. The most important functions are:

  • Read delimited files: read_delim(), read_csv(), read_tsv(), read_csv2().
  • Read fixed width files: read_fwf(), read_table().
  • Read lines: read_lines().
  • Read whole file: read_file().
  • Re-parse existing data frame: type_convert().

Installation

readr is now available from CRAN.

install.packages("readr")

You can try out the dev version with:

# install.packages("devtools")
devtools::install_github("hadley/readr")

Usage

library(readr)
library(dplyr)

mtcars_path <- tempfile(fileext = ".csv")
write_csv(mtcars, mtcars_path)

# Read a csv file into a data frame
read_csv(mtcars_path)
# Read lines into a vector
read_lines(mtcars_path)
# Read whole file into a single string
read_file(mtcars_path)

Column types

Currently, readr automatically recognises the following types of columns:

  • col_logical() [l], containing only T, F, TRUE or FALSE.
  • col_integer() [i], integers.
  • col_double() [d], doubles.
  • col_euro_double() [e], "Euro" doubles that use , as decimal separator.
  • col_character() [c], everything else.
  • col_date(format = "") [D]: Y-m-d dates.
  • col_datetime(format = "", tz = "UTC") [T]: ISO8601 date times

To recognise these columns, it reads the first 100 rows of your dataset. This is not guaranteed to be perfect, but it's fast and a reasonable heuristic. If you get a lot of parsing failures, you'll need to re-read the file, overriding the default choices as described below.
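As a rough illustration of the guessing step, the sketch below writes a small file and inspects the class readr picks for each column. The file contents and column names are made up for this example, and the exact guesses (for instance integer versus double for whole numbers) can differ between readr versions.

```r
library(readr)

# Hypothetical sample data, written to a temporary file
guess_path <- tempfile(fileext = ".csv")
writeLines(c(
  "flag,score,label,day",
  "TRUE,1.5,apple,2015-04-01",
  "FALSE,2.25,pear,2015-04-02"
), guess_path)

# No col_types supplied, so every column type is guessed from the data
guessed <- read_csv(guess_path)

# Show the first class chosen for each column
vapply(guessed, function(col) class(col)[1], character(1))
```

If a guess turns out wrong for your data, override it with col_types rather than converting after the fact.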

You can also manually specify other column types:

  • col_skip() [_], don't import this column.
  • col_date(format), dates with given format.
  • col_datetime(format, tz), date times with given format. If the timezone is UTC, this is >20x faster than loading then parsing with strptime().
  • col_numeric() [n], a sloppy numeric parser that ignores everything apart from 0-9, - and . (this is useful for parsing data formatted as currencies).
  • col_factor(levels, ordered), parse a fixed set of known values into a factor
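A minimal sketch of the manually specified types, assuming a recent readr release. The file, its column names and the date format are invented for illustration; only col_integer(), col_date(), col_factor() and col_skip() come from readr itself.

```r
library(readr)

# Hypothetical sample data with a non-ISO date and a column we do not need
typed_path <- tempfile(fileext = ".csv")
writeLines(c(
  "id,visited,grade,notes",
  "1,01/04/2015,low,first visit",
  "2,15/04/2015,high,follow-up"
), typed_path)

typed <- read_csv(typed_path, col_types = list(
  id = col_integer(),
  visited = col_date(format = "%d/%m/%Y"),        # day/month/year dates
  grade = col_factor(levels = c("low", "high")),  # fixed set of known values
  notes = col_skip()                              # dropped from the result
))

typed
```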

Use the col_types argument to override the default choices. There are two ways to use it:

  • With a string: "dc__d": read first column as double, second as character, skip the next two and read the last column as a double. (There's no way to use this form with types that need parameters like date time and factor.)

  • With a (named) list of col objects:

    read_csv("iris.csv", col_types = list(
      Sepal.Length = col_double(),
      Sepal.Width = col_double(),
      Petal.Length = col_double(),
      Petal.Width = col_double(),
      Species = col_factor(c("setosa", "versicolor", "virginica"))
    ))

    Any omitted columns will be parsed automatically, so the previous call is equivalent to:

    read_csv("iris.csv", col_types = list(
      Species = col_factor(c("setosa", "versicolor", "virginica"))
    ))
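The compact string form can be sketched the same way. The five-column file below is hypothetical and exists only so that "dc__d" has something to act on: one character per column, with _ marking the columns to skip.

```r
library(readr)

# Hypothetical five-column file; columns c and d will be skipped
compact_path <- tempfile(fileext = ".csv")
writeLines(c(
  "a,b,c,d,e",
  "1.5,x,skip,skip,2.5",
  "3.5,y,skip,skip,4.5"
), compact_path)

# d = double, c = character, _ = skip
compact <- read_csv(compact_path, col_types = "dc__d")

names(compact)
```

The string form is shorter, but it needs exactly one character per column, so the named list is usually easier to maintain for wide files.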

Output

read_csv() produces a data frame with the following properties:

  • Characters are never automatically converted to factors (i.e. no more stringsAsFactors = FALSE).

  • Column names are left as is, not munged into valid R identifiers (i.e. there is no check.names = TRUE).

  • The data frame is given class c("tbl_df", "tbl", "data.frame") so if you also use dplyr you'll get an enhanced display.

  • Row names are never set.
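These properties can be checked directly. The sketch below uses a made-up file whose first header contains a space, and reads it with both read_csv() and base read.csv() for contrast.

```r
library(readr)

# Hypothetical file with a header that is not a valid R identifier
out_path <- tempfile(fileext = ".csv")
writeLines(c(
  "my column,group",
  "1,a",
  "2,b"
), out_path)

out <- read_csv(out_path)
base_out <- read.csv(out_path)  # base R, default check.names = TRUE

names(out)               # header kept exactly as written in the file
names(base_out)          # base R rewrites it into a syntactic name
is.character(out$group)  # strings stay character vectors
inherits(out, "tbl_df")  # result carries the tbl_df class
```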

Problems

If there are any problems parsing the file, the read_ function will throw a warning telling you how many problems there are. You can then use the problems() function to access a data frame that gives information about each problem:

df <- read_csv(col_types = "dd", col_names = c("x", "y"), skip = 1, "
1,2
a,b
")
#> Warning message: There were 2 problems. See problems(x) for more details
problems(df)
#>   row col expected actual
#> 1   2   1 a double      a
#> 2   2   2 a double      b

It's likely that there will be cases that you can never load without some manual regexp-based munging in R. Load those columns with col_character(), fix them up as needed, then use type_convert() to re-run the automated conversion on every character column in the data frame. Alternatively, you can use parse_integer(), parse_numeric(), parse_date() etc. to parse a single character vector at a time.
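A sketch of that load, fix, re-convert workflow, assuming a recent readr release. The file and the currency clean-up rule are invented for this example; the final type chosen by type_convert() (integer or double) may depend on the readr version.

```r
library(readr)

# Hypothetical file where prices carry a currency symbol and thousands separator
messy_path <- tempfile(fileext = ".csv")
writeLines(c(
  "item,price",
  "a,\"$1,000\"",
  "b,$250"
), messy_path)

# Step 1: load the awkward column as plain character
raw <- read_csv(messy_path, col_types = list(
  item = col_character(),
  price = col_character()
))

# Step 2: fix it up by hand, here by stripping "$" and ","
raw$price <- gsub("[$,]", "", raw$price)

# Step 3: re-run the automatic conversion on all character columns
clean <- type_convert(raw)

# Alternative: parse one character vector at a time
ids <- parse_integer(c("1", "2", "3"))
days <- parse_date(c("2015-04-01", "2015-04-02"))
```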

Compared to base functions

Compared to the corresponding base functions, readr functions:

  • Use a consistent naming scheme for the parameters (e.g. col_names and col_types not header and colClasses).

  • Are much faster (up to 10x faster).

  • Have a helpful progress bar if loading is going to take a while.
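To make the naming difference concrete, here is the same small, made-up file read with base read.csv() and with read_csv(). This only illustrates the parameter names; it is not a benchmark.

```r
library(readr)

# Hypothetical two-column file
cmp_path <- tempfile(fileext = ".csv")
writeLines(c("x,y", "1.5,a", "2.5,b"), cmp_path)

# Base R: header and colClasses
base_df <- read.csv(cmp_path, header = TRUE,
                    colClasses = c("numeric", "character"))

# readr: col_names and col_types
readr_df <- read_csv(cmp_path, col_names = TRUE, col_types = "dc")
```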

Compared to fread()

data.table has a function similar to read_csv() called fread. Compared to fread, readr:

  • Is slower (currently ~1.2-2x slower). If you want absolutely the best performance, use data.table::fread().

  • Has a slightly more sophisticated parser, recognising both doubled ("""") and backslash ("\"") escapes. It also lets you read factors and date times directly from disk.

  • fread() saves you work by automatically guessing the delimiter, whether or not the file has a header, how many lines to skip by default and more. Readr forces you to supply these parameters.

  • The underlying designs are quite different. Readr is designed to be general, and dealing with new types of rectangular data just requires implementing a new tokenizer. fread() is designed to be as fast as possible. fread() is pure C, readr is C++ (and Rcpp).
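Supplying those parameters yourself might look like the sketch below. The pipe-delimited file and its leading comment line are hypothetical; delim, skip and col_names are the read_delim() arguments that state what fread() would otherwise guess.

```r
library(readr)

# Hypothetical export: one junk line, then a pipe-delimited table with a header
pipe_path <- tempfile(fileext = ".txt")
writeLines(c(
  "exported by a hypothetical tool",
  "name|count",
  "alpha|1",
  "beta|2"
), pipe_path)

# State the delimiter, the lines to skip and the presence of a header explicitly
piped <- read_delim(pipe_path, delim = "|", skip = 1, col_names = TRUE)

piped
```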

Acknowledgements

Thanks to:

  • Joe Cheng for showing me the beauty of deterministic finite automata for parsing, and for teaching me why I should write a tokenizer.

  • JJ Allaire for helping me come up with a design that makes very few copies, and is easy to extend.

  • Dirk Eddelbuettel for coming up with the name!
