Working with Large Datasets
In real-world work, data analysts often encounter datasets with hundreds of thousands of records and hundreds of variables. Processing such large datasets requires substantial memory, so local analysis should, when possible, be run on a computer with ample memory. In addition, effective data-handling strategies can significantly improve analytical efficiency.
Workspace Cleanup
To maximize available memory during data analysis, it is recommended to clean the workspace before initiating any new analysis projects.
# Clear the workspace
rm(list = ls(all = TRUE))
The ls() function lists the objects in the current workspace; its all.names argument (matched here by all) defaults to FALSE, and setting it to TRUE includes hidden objects (those whose names begin with a dot), so the command above removes every object in the workspace. During analysis, use rm(object1, object2, ...) to remove temporary objects promptly once they are no longer needed.
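As a small illustration of these commands (the object names x and .tmp below are invented for the example):

```r
# A regular object and a hidden one (its name starts with a dot)
x <- 1:10
.tmp <- matrix(0, 2, 2)

ls()                  # hidden objects are omitted by default
ls(all.names = TRUE)  # now .tmp appears as well

# Remove specific temporary objects as soon as they are no longer needed
rm(x, .tmp)

# gc() triggers garbage collection, returning freed memory to the system
gc()
```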
Fast Reading of .csv Files
.csv files are commonly used for storing data because they are compact and open easily in applications such as Excel. The read.csv() function is convenient but can be slow on large datasets. In such cases, the read_csv() function from the readr package or the fread() function from the data.table package can read the same file approximately twice as fast, or faster. The data.table package also offers an enhanced version of the data frame that significantly speeds up data processing; it is particularly suitable for working with large in-memory datasets (e.g., 1–100 GB).
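A quick comparison sketch (the file contents and sizes are invented for the example; it assumes the data.table package is installed):

```r
library(data.table)

# Write a modest example file to a temporary location
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:100000, value = rnorm(100000)), tmp,
          row.names = FALSE)

# Base R reader
t_base <- system.time(d1 <- read.csv(tmp))["elapsed"]

# data.table's fread(); it also auto-detects separators and column types
t_fread <- system.time(d2 <- fread(tmp))["elapsed"]

c(read.csv = t_base, fread = t_fread)  # fread is typically the faster one
unlink(tmp)
```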
Simulating a Large Dataset
For illustration purposes, we simulate a large dataset consisting of 50,000 records and 200 variables.
dataset <- as.data.frame(matrix(rnorm(50000 * 200), ncol = 200))
varnames <- NULL
for (i in letters[1:20]) {
for (j in 1:10) {
varnames <- c(varnames, paste(i, j, sep = "_"))
}
}
names(dataset) <- varnames
names(dataset)
If you prefer to avoid nested loops, consider these vectorized alternatives.
# Note: apply() coerces the mixed data frame to a character matrix,
# which pads the numbers with extra spaces
# apply(expand.grid(1:10, letters[1:20]), 1, function(x) paste(x[2], x[1], sep = "_"))
# sprintf("%s_%s", expand.grid(1:10, letters[1:20])[, 2], expand.grid(1:10, letters[1:20])[, 1])
# Or
# as.vector(t(outer(letters[1:20], 1:10, paste, sep = "_")))
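The loop result can be checked against a vectorized version; the expand.grid() construction below reproduces the same 200 names in the same order:

```r
# Names produced by the nested loop: a_1 ... a_10, b_1 ... t_10
varnames <- NULL
for (i in letters[1:20]) {
  for (j in 1:10) {
    varnames <- c(varnames, paste(i, j, sep = "_"))
  }
}

# Vectorized equivalent: the first argument of expand.grid() varies fastest,
# so the numbers cycle within each letter, matching the loop order
grid <- expand.grid(num = 1:10, let = letters[1:20])
varnames2 <- paste(grid$let, grid$num, sep = "_")

identical(varnames, varnames2)  # TRUE
```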
Excluding Unnecessary Variables
Before conducting formal analysis, remove temporarily unused variables to reduce the memory burden. The select() function from the dplyr package is useful here, especially when combined with helpers such as starts_with(), ends_with(), and contains() from the tidyselect package.
First, load these two packages:
library(dplyr)
library(tidyselect)
Here is an example demonstrating the use of select() to choose variables:
subdata1 <- select(dataset, starts_with("a"))
names(subdata1)
subdata2 <- select(dataset, ends_with("2"))
names(subdata2)
subdata1 selects all variables starting with "a", while subdata2 selects all variables ending with "2".
To select variables starting with "a" together with those starting with "b":
subdata3 <- select_at(dataset, vars(starts_with("a"), starts_with("b")))
names(subdata3)
To select variables containing certain characters, use the contains() function. For instance, to select variables whose names contain the character "1", use:
subdata4 <- select_at(dataset, vars(contains("1")))
names(subdata4)
To exclude certain variables, prefix the starts_with(), ends_with(), and contains() helpers with a "-" sign. For example, to exclude variables containing "1", use:
subdata5 <- select_at(dataset, vars(-contains("1")))
names(subdata5)
Selecting a Random Sample of the Dataset
Processing all records in large datasets often reduces analytical efficiency. During code development, it’s helpful to test programs by extracting a subset of records to optimize code and eliminate bugs.
sampledata1 <- sample_n(subdata5, size = 500)
nrow(sampledata1)
sampledata2 <- sample_frac(subdata5, size = 0.02)
nrow(sampledata2)
The sample_n() and sample_frac() functions randomly select rows from a data frame. The size parameter of sample_n() specifies the number of rows to draw, while in sample_frac() it specifies the proportion of rows.
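Random draws differ from run to run; setting the seed makes them reproducible. A minimal sketch, assuming the dplyr package is installed (the data frame df below is invented for the example):

```r
library(dplyr)

df <- data.frame(x = 1:1000, y = rnorm(1000))

# Fix the random seed so the same rows are drawn on every run
set.seed(123)
s1 <- sample_n(df, size = 50)

set.seed(123)
s2 <- sample_n(df, size = 50)

identical(s1, s2)  # TRUE: the same seed yields the same sample
```

In recent dplyr versions, slice_sample(df, n = 50) and slice_sample(df, prop = 0.02) supersede sample_n() and sample_frac(), though the older functions still work.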
It’s important to note that the strategies discussed above for handling large datasets are suitable for datasets at the GB level. Several packages in R can handle TB-level datasets, such as RHIPE, RHadoop, and RevoScaleR. These packages have steep learning curves and require some understanding of high-performance computing.