Data come in many formats. Some formats are universal and easily imported into most statistical programs, while others are proprietary and tailored to a specific program.
While incompatibility between datafile formats and various programs used to be a considerable annoyance, most programs have made it rather easy to import data formats regardless of the format’s origin. However, in some cases, features available in one language will be unavailable in another or might require additional formatting.
The table below provides a brief summary of formats commonly used to store datasets.
Table 4.1: Common dataset formats
Format | Created In | Description |
---|---|---|
CSV | Almost all spreadsheet programs, plain text editors, and programs for statistical analysis. | CSV is a good universal format in which to store data. Any statistical program should have a built-in method by which to import CSV files. If you’re used to using text formatting in Excel (e.g., bold, italics, multiple worksheets), note that CSV does not preserve these features. If you have multiple sheets in an Excel file, you should make sure to save each to a separate CSV file. |
TXT | Almost all spreadsheet programs, plain text editors, and programs for statistical analysis. | Like CSV files, TXT files can be read by almost all statistical programs across different operating systems. Note that TXT files can use a number of different types of delimiters (column separators), including commas, spaces, and tabs. |
Rdata, RDS | R | We’ll get into the distinctions between these filetypes in a little bit. Both are specific to R, and if you want to transition smoothly between programs you plan to use for different tasks with the same dataset, it may be best to use CSV or TXT. |
sas7bdat | SAS | Probably the most commonly used SAS-based dataset format, sas7bdat is capable of storing datasets with native SAS labeling and other features. If you find yourself moving between R and SAS, R has a number of packages available capable of importing and writing to the sas7bdat format, though important SAS variable labels requires an additional step or two. |
DTA | Stata | While this format is native to Stata, both SAS and R can import the format easily. |
XLS, XLSX | Microsoft Excel | It is highly recommended that datasets not be saved in Excel format. Use a plain-text format like CSV or TXT or, alternatively, a statistical program’s native export format whenever possible. |
Before we begin, you’ll want to install two R packages:
# if you've already installed the tidyverse package, you should have readr
install.packages("readr")
install.packages("haven")
# We could also do the following to avoid having to use install.packages()
# multiple times:
# pkgs <- c("readr", "haven")
# install.packages(pkgs)
# load packages
library(readr)
library(haven)
# import a standard CSV file
readr::read_csv("data/iris.csv")
iris_csv <-
# import a tab-delimited TXT file, we specify the tab delimiter as \t
readr::read_delim("data/iris.txt", delim = "\t")
iris_txt <-
# import a sas7bdat file
haven::read_sas("data/iris.sas7bdat") iris_sas <-
Note that we’ve called our various read functions by telling R which package includes each one. Because we’ve already called library()
to load the readr and haven packages, we don’t need to use package::function()
. However, for clarity, we’ll use them from time to time to mark which package hosts a particular function. Furthermore, calling functions in this way can be done without loading the package, which can come in handy if you want to avoid loading a large package that takes up memory.
Let’s check the results for each imported version of the iris
dataset from R.
head(iris_csv)
# A tibble: 6 × 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <chr>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
head(iris_txt)
# A tibble: 6 × 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <chr>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
head(iris_sas)
# A tibble: 6 × 5
Sepal_Length Sepal_Width Petal_Length Petal_Width Species
<dbl> <dbl> <dbl> <dbl> <dbl>
1 5.1 3.5 1.4 0.2 1
2 4.9 3 1.4 0.2 1
3 4.7 3.2 1.3 0.2 1
4 4.6 3.1 1.5 0.2 1
5 5 3.6 1.4 0.2 1
6 5.4 3.9 1.7 0.4 1
Looking good, but we should notice that the variable names from the SAS import are slightly different because periods are not permissible characters in SAS variable names.
When writing data, we use the write
function for the desired output format. Typically, you’ll tell the function which data.frame
you want to save and give it the desired path and filename.
::write_csv(iris_csv, "data/iris2.csv")
readr::write_delim(iris_txt, "data/iris2.txt", delim = "\t")
readr::write_delim(iris_txt, "data/iris3.txt", delim = ",")
readr::write_sas(iris_sas, "data/iris2.sas") haven
Note that we could have passed any of the data.frames we created into any of the functions. For instance, we could have passed iris_csv
to haven::write_sas()
in order to save it in sas7bdat
format.