Data analysis is a central part of the business intelligence job. Increased data availability, more powerful computing, and an emphasis on analytics-driven decision-making in business have made this a heyday for data science. According to a report from IBM published in October 2018, there were 2.35 million openings for data analytics jobs in the US, a number the report estimated would rise to 4.72 million by 2020.
A significant share of people who crunch numbers for a living use Microsoft Excel, Pentaho, other Microsoft tools, OBIEE, or spreadsheet programs such as Google Sheets. Others use proprietary statistical software such as SAS, Stata, or SPSS, which they often first learned in school.
While Excel and SAS are powerful tools, they have serious limitations. Excel cannot handle datasets above a certain size, and it does not easily allow for reproducing previously conducted analyses on new datasets. The main weakness of programs like SAS is that they were developed for very specific uses and do not have a large community of contributors constantly adding new tools.
R addresses these limitations and is one of the most popular programming languages among data analysts and data scientists. It is free and open source, and was first developed in the early 1990s. For anyone interested in machine learning, working with large datasets, or creating complex data visualizations, R is a godsend.
In a sentence: R is well suited to ad hoc analysis and exploring datasets.
Memory management
A solid understanding of R’s memory management will help you predict how much memory you’ll need for a given task and help you to make the most of the memory you have. It can even help you write faster code because accidental copies are a major cause of slow code. The goal of this chapter is to help you understand the basics of memory management in R, moving from individual objects to functions to larger blocks of code. Along the way, you’ll learn about some common myths, such as that you need to call gc() to free up memory, or that for loops are always slow.
Object size shows you how to use object_size() to see how much memory an object occupies, and uses that as a launching point to improve your understanding of how R objects are stored in memory.
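As a quick taste of what this looks like, here is a minimal sketch using base R's object.size() from the utils package; pryr::object_size() gives similar numbers but accounts more accurately for shared components. The size shown for a double vector is roughly 8 bytes per element plus a small fixed header.

```r
# Measure how much memory an object occupies.
# base R's object.size() is a built-in stand-in for pryr::object_size().
x <- numeric(1e6)                    # one million doubles, 8 bytes each
size_bytes <- as.numeric(object.size(x))
print(object.size(x), units = "MB")  # roughly 7.6 MB
```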
Memory usage and garbage collection introduces you to the mem_used() and mem_change() functions that will help you understand how R allocates and frees memory.
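To preview the idea without pryr, here is a rough base-R sketch of what mem_used() and mem_change() report, built on gc(). The second column of the matrix gc() returns is memory used, in Mb, for Ncells (fixed-size objects) and Vcells (variable-size objects); the helper name used_mb is my own.

```r
# Approximate pryr::mem_used() with base R's gc().
used_mb <- function() sum(gc()[, 2])  # total Mb used across Ncells + Vcells

before <- used_mb()
x <- numeric(1e7)          # allocate roughly 80 MB of doubles
after  <- used_mb()
delta  <- after - before   # should be on the order of 76 Mb
rm(x)                      # the memory is reclaimed at the next collection
```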
Memory profiling with lineprof shows you how to use the lineprof package to understand how memory is allocated and released in larger code blocks.
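A hedged sketch of how such a profiling run is set up: lineprof is a GitHub-only package (installed below via devtools::install_github("hadley/lineprof")) and attributes memory to individual source lines, so the profiled code must come from a file loaded with source(). The function f and the allocations inside it are illustrative, not from the original text.

```r
# Write a small function to a file so lineprof can map memory to lines.
code_file <- tempfile(fileext = ".R")
writeLines(c(
  "f <- function() {",
  "  x <- numeric(1e6)   # allocation shows up in the alloc column",
  "  y <- x * 2          # a second allocation",
  "  sum(y)",
  "}"
), code_file)

# Only run the profiler if the package is actually installed.
if (requireNamespace("lineprof", quietly = TRUE)) {
  source(code_file, keep.source = TRUE)
  prof <- lineprof::lineprof(f())
  print(prof)            # memory allocated/released per source line
}
```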
Modification in place introduces you to the address() and refs() functions so that you can understand when R modifies in place and when R modifies a copy. Understanding when objects are copied is very important for writing efficient R code.
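Copy-on-modify, which address() and refs() help you observe, can also be seen with base R alone. The sketch below uses tracemem(), which requires an R build with memory profiling enabled (true for standard CRAN binaries); the capabilities("profmem") guard skips it otherwise.

```r
x <- c(1, 2, 3)
y <- x            # no copy yet: x and y share the same underlying data
y[1] <- 10        # R copies the data before modifying, so x is unchanged

if (isTRUE(capabilities("profmem"))) {
  tracemem(x)     # from now on, R reports whenever x's data is duplicated
  z <- x
  z[1] <- 99      # modifying the shared data triggers a traced copy
  untracemem(x)
}
```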
In this chapter, we’ll use tools from the pryr and lineprof packages to understand memory usage, and a sample dataset from ggplot2. If you don’t already have them, run this code to get the packages you need:
install.packages("ggplot2")
install.packages("pryr")
install.packages("devtools")
devtools::install_github("hadley/lineprof")
The details of R’s memory management are not documented in a single place. Most of the information in this chapter was gleaned from a close reading of the documentation (particularly ?gc). The rest I figured out by reading the C source code, performing small experiments, and asking questions on R-devel. Any mistakes are entirely mine.