Hello world!

Q. What is it?
A. A tour of Data Science using R. It will familiarize you with the typical steps of a data science project: data gathering, data preparation and analysis, and deriving meaning from the results.
Q. Who is this for?
A. Anybody who wants to explore the exciting world of data science.
Q. What is this not?
A. This is not an alternative to a rigorous data science curriculum. This is only a glimpse of what data science can do.

What is Data Science?

Data Science isn’t fundamentally a new field. We’ve long had statisticians, analysts and programmers. What’s new is the way it combines several different fields into one: mathematics (primarily statistics and linear algebra), computing skills like programming, and domain knowledge, along with communication skills.
Data science is about predictions and associations but not causality.

  1. Examples of some data science projects
    • Determining which customers are most likely to repay a loan
    • Determining the optimal coupon amount and timing for an ecommerce store
    • Assisting a manufacturing company to reduce unplanned downtime by predicting machine failures before they occur
    • Recommending which models of cars a rental car company should purchase to optimize profits

Steps of a Data Science project

It all starts with asking an interesting question and then…

  1. Problem definition and planning:
    1. Identify problem
    2. List the project’s deliverables
    3. Generate success factors
    4. Understand each resource and other limitations
    5. Put together an appropriate team
    6. Create a plan
    7. Perform a cost/benefit analysis
  2. Data preparation:
    1. Access and combine data tables
    2. Summarize data
    3. Look for errors
    4. Transform data
    5. Segment data
  3. Analysis:
    1. Summarize data
    2. Explore relationships between attributes
    3. Group the data
    4. Identify non-trivial facts, patterns and trends
    5. Build regression models
    6. Build classification models
  4. Deployment:
    1. Generate report
    2. Deploy standalone or integrated decision tool
    3. Measure impact

# Meta

Data point (observation): A single instance or observation, usually represented as a row of a table.

Data set: A table is one form of data set. Data is usually represented as a table, with rows representing observations and columns representing the (raw) variables of each observation.

Random variable: A random variable (also called a random quantity, aleatory variable, or stochastic variable) is a quantity whose value depends on the outcome of a random phenomenon. As a function, a random variable is required to be measurable, which rules out certain pathological cases where the quantity the random variable returns is infinitely sensitive to small changes in the outcome.

  • Random variables are of two types:
    1. Discrete: Countable
    2. Continuous: Uncountable
  1. Types of variables: Variables can be classified in a number of ways not necessarily mutually exclusive.
    1. Discrete vs Continuous:
      1. Discrete: Can take only a countable set of values. e.g. car model
      2. Continuous: Can take any value from a continuous range. e.g. height
    2. Variables classified according to scale:
      1. Nominal/ Categorical scale: It can take only a fixed number of values and they cannot be ordered. e.g. color.
      2. Ordinal scale: It can take only a fixed number of values, but they can be ordered. e.g. low, medium, high. The magnitude of the difference is not defined: (high – low) doesn’t necessarily make sense in every context.
      3. Interval scale: An interval variable is the one whose difference between two values is meaningful. The difference between a temperature of 100 degrees and 90 degrees is the same difference as between 90 degrees and 80 degrees.
      4. Ratio scale: A ratio variable has all the properties of an interval variable and also has a clear definition of 0.0: when the variable equals 0.0, there is none of that variable. Variables like height, weight and enzyme activity are ratio variables. Temperature in Fahrenheit or Celsius is not a ratio variable, but temperature in Kelvin is.
    3. Other categories:
      1. Dichotomous: A variable that can contain only two values (e.g. On or Off)
      2. Binary: A dichotomous variable that is encoded as 0 or 1 (e.g. On = 0, Off = 1)
      3. Primary key: A variable that is a unique identifier for a particular record. (e.g. SSN may be the primary key for describing a citizen, customerId may be the primary key for describing a customer in a database)
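
These distinctions map directly onto R’s basic types. A minimal sketch with hypothetical values (the variable names color, rating, height and on_off are illustrative):

```r
color  <- factor(c("red", "green", "blue", "red"))                     # nominal / categorical
rating <- factor(c("low", "high", "medium", "low"),
                 levels = c("low", "medium", "high"), ordered = TRUE)  # ordinal
height <- c(1.62, 1.75, 1.80, 1.68)                                    # continuous (ratio scale)
on_off <- c(0L, 1L, 1L, 0L)                                            # binary / dichotomous

str(color)   # Factor w/ 3 levels: no ordering among the levels
str(rating)  # Ord.factor: "low" < "medium" < "high"
min(rating)  # comparisons are meaningful for ordinal variables
```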

Big Data vs Wide Data: As additional variables are collected, data sets become “wide” e.g. DNA sequence data having millions of columns.

4 V’s of Big Data: Volume, Variety, Veracity(trustworthiness), Velocity.

Raw data: Unaltered data sets are typically referred to as “raw data”.

Features: Features are combinations of various raw variables that determine the maximum variation in data.

Dimensionality Reduction: Process of reducing the number of random variables under consideration. It is done by taking the existing data and reducing it to its most discriminative components. These components allow us to represent most of the information in the dataset with fewer, more discriminative features. Dimensionality reduction can be divided into feature selection and feature extraction.

Feature selection: Selecting features which are highly discriminative and determine the maximum variation within the data. It requires an understanding of which aspects of the dataset are important and which aren’t. It can be done with the help of domain experts, clustering techniques or topical analysis.

Feature extraction: Building a new set of features from the original feature set.
Examples of feature extraction: extraction of contours in images, extraction of diagrams from a text, extraction of phonemes from recording of spoken text, etc.
Feature extraction usually involves generating new features which are composites of existing features. Both of these techniques fall into the category of feature engineering. Generally, feature engineering is important for obtaining the best results, as it surfaces information that is not explicit in the dataset and increases the signal-to-noise ratio. Feature extraction involves a transformation of the features, which often is not reversible because some information is lost in the process.
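
One common way to perform feature extraction is principal component analysis (PCA). A minimal R sketch on the built-in iris data, used purely as an illustration (the text above does not prescribe a specific method):

```r
# Feature extraction via PCA: build composite features from the raw numeric columns
num_vars <- iris[, 1:4]                       # original feature set (4 numeric variables)
pca      <- prcomp(num_vars, center = TRUE, scale. = TRUE)

summary(pca)          # proportion of variance explained by each component
head(pca$x[, 1:2])    # first two extracted features (principal components)
# The first two components typically retain most of the variation,
# so they can stand in for the four raw variables: dimensionality reduction.
```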

Model: A mathematical model (equations) that defines the relationship between various variables of a data set and helps in predicting the values/ behavior of variables of any future unseen data point. e.g. y = mx + c, a linear model describing the relationship between two variables x & y.

  1. Fundamentally the model building process is threefold:
    1. Model Selection: Choose or create a mathematical model.
    2. Training: Determine the parameters of the model that fit the training data as closely as possible.
    3. Testing: Evaluating the accuracy of the model on the test data so that it can be used for prediction of future unseen data.
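
A minimal R sketch of the three steps on the built-in mtcars data; the choice of a simple linear model (lm), the 70/30 split and RMSE as the accuracy measure are illustrative assumptions:

```r
set.seed(42)

# 1. Model selection: a linear model, mpg as a function of weight
# 2. Training: estimate the parameters on a random 70% of the rows
idx   <- sample(nrow(mtcars), size = round(0.7 * nrow(mtcars)))
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]
fit   <- lm(mpg ~ wt, data = train)
coef(fit)                                   # fitted parameters (intercept and slope)

# 3. Testing: evaluate on the held-out 30%
pred <- predict(fit, newdata = test)
sqrt(mean((test$mpg - pred)^2))             # root mean squared error on unseen data
```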

Training data: Part of data used to determine the parameters of the model.

Testing Data: Part of data which is used to determine the accuracy of the model generated using training data. i.e. how well it works on the future unseen data.

Model Accuracy: The percentage of unseen future cases (data) for which the generated model holds good.

Overfitting: A model overfits if it fits the training set almost perfectly but does not predict future cases accurately. A well-known xkcd cartoon strip describes overfitting in real life.

Regularization: Process of determining what features should be included or weighted in your final model to avoid overfitting.

Pruning and Selection: Determining which features contain the best signal and discarding the rest.

Shrinkage: Reducing the influence of some features to avoid overfitting. It can be done in multiple ways like assigning weights to variables or adding an overall cost function.

Cross Validation: Technique of simulating “out of sample” or unseen future tests to determine the accuracy of the model. Models are built and evaluated on different data sets. It helps avoid overfitting and build models that are hopefully generalizable.
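
A minimal base-R sketch of k-fold cross-validation for the same kind of linear model; k = 5, the mtcars data and RMSE are illustrative assumptions:

```r
set.seed(42)
k     <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))   # assign each row to a fold

rmse <- sapply(1:k, function(i) {
  train <- mtcars[folds != i, ]                         # build the model on k - 1 folds
  test  <- mtcars[folds == i, ]                         # evaluate on the held-out fold
  fit   <- lm(mpg ~ wt, data = train)
  sqrt(mean((test$mpg - predict(fit, newdata = test))^2))
})
mean(rmse)   # cross-validated estimate of out-of-sample error
```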

Out of sample performance: How well the model performs on data it has not seen before. If new data is collected from the exact same environment, we expect the model to predict outcomes with a similar degree of performance.

A brief digression into Probability & Statistics

We measure the sample using statistics in order to draw inferences about the population and its parameters.
Samples are collected through random selection from a population. This process is called sampling.

  1. Types of statistics:
    1. Descriptive: Numbers that describe or summarize the data under consideration.
    2. Inferential: Drawing conclusions about the population from a sample.

Notion of variability: Degree to which data points are different from each other or the degree to which they vary.

  1. Methods of sampling data (see the R sketch after these lists):
    1. Random sampling: Every member and set of members has an equal chance of being included in the sample.
    2. Stratified random sampling: The population is first split into groups. The overall sample consists of some members from every group. The members from each group are chosen randomly. Most widely used sampling technique.
    3. Cluster random sample: The population is first split into groups. The overall sample consists of every member from some groups. The groups are selected at random.
    4. Systematic random sample: Members of the population are put in some order. A starting point is selected at random and every nth member is selected to be in the sample.
  1. Types of Statistical Studies:
    1. Sample study(informal term): Taking random samples to generate a statistic to estimate the population parameter.
    2. Observational study: Observing a correlation but not sure of the causality.
    3. Controlled experiment: Experimenting to confirm the observation by forming a control group and treatment group. It is done by randomly assigning people or things to groups. One group receives a treatment and the other group doesn’t.
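
A minimal R sketch of the four sampling methods listed above, drawing from a hypothetical population data frame (the population, its group variable and the sample sizes are illustrative assumptions):

```r
set.seed(42)
population <- data.frame(id    = 1:1000,
                         group = sample(c("A", "B", "C"), 1000, replace = TRUE))

# 1. Simple random sample
srs <- population[sample(nrow(population), 10), ]

# 2. Stratified random sample: a few members from every group, chosen at random
strat <- do.call(rbind, lapply(split(population, population$group),
                               function(g) g[sample(nrow(g), 4), ]))

# 3. Cluster random sample: every member from some randomly chosen group(s)
chosen  <- sample(unique(population$group), 1)
cluster <- population[population$group %in% chosen, ]

# 4. Systematic random sample: random start, then every nth member
n_step     <- 100
start      <- sample(n_step, 1)
systematic <- population[seq(start, nrow(population), by = n_step), ]
```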

Running an experiment is the best way to conduct a statistical study. The purpose of a sample study is to estimate a certain population parameter, while the purpose of an observational study or a controlled experiment is to compare two population parameters.

Describing data
Central Tendency: There are different ways of understanding central tendency
Mean: Arithmetic mean value
Median: Middle value
Mode: Highest frequency value
e.g. Samples of observations of a variable \(x_i\) = 2, 4, 7, 11, 16.5, 16.5, 19
\(n\) = 7
Mean \({\left({\bar{x}} \right)}\) = \(\frac{\sum_{i=1}^{i=n} {x_i}}{n}\) = (2 + 4 + 7 + 11 + 16.5 + 16.5 + 19)/7 = 10.857
Median = 11
Mode = 16.5
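
The same summary statistics in R. Base R has no built-in function for the mode of a data vector (its mode() reports the storage type), so a small table-based expression is sketched instead:

```r
x <- c(2, 4, 7, 11, 16.5, 16.5, 19)

mean(x)                       # 10.857
median(x)                     # 11
names(which.max(table(x)))    # "16.5", the most frequent value (the mode)
```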

Median is preferred when the data is skewed or subject to outliers.
WARNING: A median value significantly different from the mean value should be investigated!

Measuring spread of data:
Range: Maximum value – Minimum value = 19 – 2 = 17
Variance \({\left({s^2_{n-1}} \right)}\) = \(\frac{\sum{\left( {x_i - \bar{x}} \right)^2}}{n-1}\)

Standard Deviation \({\left({s_{n-1}} \right)}\) = \(\sqrt{Variance}\)
Range is a quick way to get an idea of the spread. IQR takes longer to compute but it sometimes gives more useful insights like outliers or bad data points etc.

Interquartile Range: IQR is the amount of spread in the middle 50% of the data set. In the previous e.g. (taking Q1 and Q3 as the medians of the lower and upper halves of the data):
Q1 (25% of data) = median of {2, 4, 7} = 4
Q2 (50% of data) = 11
Q3 (75% of data) = median of {16.5, 16.5, 19} = 16.5
IQR = Q3 – Q1 = 16.5 – 4 = 12.5
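
The same measures of spread in R. Note that quantile() supports several quartile conventions via its type argument; its default (type = 7) interpolates, so its Q1/Q3 will not exactly match the hand calculation above:

```r
x <- c(2, 4, 7, 11, 16.5, 16.5, 19)

diff(range(x))        # range: 19 - 2 = 17
var(x)                # sample variance (n - 1 denominator)
sd(x)                 # sample standard deviation
quantile(x)           # quartiles under the default convention (type = 7)
IQR(x)                # Q3 - Q1 under the same convention
```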

When a distribution is not unimodal (it has more than one peak), questioning the underlying reason frequently leads to greater insight and improved deterministic modeling of the phenomenon under study.

    • Very common in real-world data.
    • Best practice is to look into this.
  • Plotting Data
    • Bar chart: It is made up of columns plotted on a graph and is used for a categorical variable.
    • Frequency histogram (Histogram): It is made up of columns plotted on a graph and is used for a quantitative variable. It is usually obtained by splitting the range of a continuous variable into equal-sized bins (classes).

    • Both display the relative frequencies of different values. With bar charts, the labels on the X axis are categorical; with histograms, they are quantitative. Both are useful for detecting outliers (odd data points); see the R sketch after this section.

    • Boxplots: The boxplot (a.k.a. box and whisker diagram) is a standardized way of displaying the distribution of data based on the five number summary: minimum, first quartile, median, third quartile, and maximum. In the simplest boxplot the central rectangle spans the first quartile to the third quartile (the interquartile range or IQR). A segment inside the rectangle shows the median, and “whiskers” above and below the box show the locations of the minimum and maximum. The whiskers extend to the most extreme values that lie within 1.5 times the IQR of the upper or lower quartile. Points farther than 1.5 times the IQR from the upper or lower quartile are plotted individually as asterisks. These points represent potential outliers.

  • Shape
    • Skewness: Measure of degree of asymmetry of a variable.

      Skewness = \({\frac{n\sqrt{n-1}}{n-2}\frac{\sum_{i=1}^{n}\left(x_i\,-\,\bar{x}\right)^3}{\left({\sum_{i=1}^{n}\left(x_i\,-\,\bar{x}\right)^2}\right)^{3/2}}}\)

      • Value of 0 indicates a perfectly symmetric variable.
      • Positive skewness: The majority of observations lie to the left of the mean (the long tail extends to the right).
      • Negative skewness: The majority of observations lie to the right of the mean (the long tail extends to the left).
    • Kurtosis: A measure of how “tailed” a variable is.
      • Variables with a pronounced peak near the mean have high kurtosis.
      • Variables with a flat peak have a low kurtosis.

Values for skewness and kurtosis near zero indicate the variable approximates a normal distribution.
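
A minimal R sketch of the plots and shape measures above, using hypothetical right-skewed data; skewness is computed directly from the formula above and kurtosis from one common sample formula for excess kurtosis, rather than via an add-on package:

```r
set.seed(42)
x <- rexp(500)                       # hypothetical right-skewed data

barplot(table(mtcars$cyl))           # bar chart of a categorical variable
hist(x, breaks = 20)                 # frequency histogram of a continuous variable
boxplot(x)                           # five number summary; potential outliers drawn individually

n <- length(x); d <- x - mean(x)
skewness <- (n * sqrt(n - 1) / (n - 2)) * sum(d^3) / (sum(d^2))^(3/2)
kurtosis <- n * sum(d^4) / (sum(d^2))^2 - 3      # excess kurtosis (near 0 for a normal)
c(skewness = skewness, kurtosis = kurtosis)      # both well above 0: skewed, heavy-tailed
```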

Sample Statistic and Population Parameter
Each sample statistic has a corresponding unknown population value called a parameter. e.g. population mean, variance etc. are called parameter whereas sample mean, variance etc. are called statistic.

|  | Population (Parameter) | Sample (Statistic) |
| --- | --- | --- |
| Mean | \({\mu}={\frac{\sum_{i=1}^{N}{x_i}}{N}}\) | \({\bar{x}}={\frac{\sum_{i=1}^{n}{x_i}}{n}}\) |
| Variance | \({\sigma^2} = {\frac{\sum {\left(x_i - \mu\right)^2}}{N}}\) | \({s^2_{n-1}} = {\frac{\sum {\left(x_i - \bar{x}\right)^2}}{n-1}}\) |
| Standard Deviation | \({\sigma}\) | \({s}\) or \({s_{n-1}}\) |

There are many more sample statistics and their corresponding population parameters.

Probability
Probability: \(P\left(A\right)\) = \(\frac{\# \, of\,favourable\,outcomes }{Total\,\#\, of\,outcomes}\)

Conditional Probability: \(P\left(A\,|\,B\right)\) = \(\frac{P\left(A\,\cap\,B\right)}{P\left(B\right)}\,\Rightarrow\) the probability of A given that B has occurred

Bayes Theorem: \(P\left(A\,|\,B\right)\) = \(\frac{P\left(B\,|\,A\right)\,P\left(A\right)}{P\left(B\right)}\)

Probability Distribution
A mathematical function that, stated in simple terms, can be thought of as providing the probability of occurrence of different possible outcomes in an experiment.
Let’s say we have a random variable \(X\) = # of HEADS from flipping a fair coin 5 times.

\(P\left(X=0\right)\) = \(\frac{\binom{5}{0}}{32}\) = \({\frac{1}{32}}\), \(P\left(X=1\right)\) = \(\frac{\binom{5}{1}}{32}\) = \({\frac{5}{32}}\)

\(P\left(X=2\right)\) = \(\frac{\binom{5}{2}}{32}\) = \({\frac{10}{32}}\), \(P\left(X=3\right)\) = \(\frac{\binom{5}{3}}{32}\) = \({\frac{10}{32}}\)

\(P\left(X=4\right)\) = \(\frac{\binom{5}{4}}{32}\) = \({\frac{5}{32}}\), \(P\left(X=5\right)\) = \(\frac{\binom{5}{5}}{32}\) = \({\frac{1}{32}}\)
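
These probabilities can be checked in R with the binomial distribution (size 5, a fair coin so prob = 0.5):

```r
dbinom(0:5, size = 5, prob = 0.5)
# 0.03125 0.15625 0.31250 0.31250 0.15625 0.03125  ->  1/32, 5/32, 10/32, 10/32, 5/32, 1/32
choose(5, 0:5) / 2^5                                # same values from the counting argument
```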

Central Limit Theorem
Suppose that a sample is obtained containing a large number of observations, each observation being randomly generated in a way that does not depend on the values of other observations and arithmetic average of the observations is computed. If this procedure of random sampling and computing the average of observations is performed many times, the central limit theorem says that the computed values of the average will be distributed according to the normal distribution (commonly known as a “bell curve”). A simple example of this is that if one flips a coin many times the probability of getting a given number of heads in a series of flips should follow a normal curve, with mean equal to half the total number of flips in each series as shown previously.

Sampling distribution of the sample mean
Random variables can have different distribution patterns. They can be, for example, normal or multi-modal.

To plot the sampling distribution of the sample mean (one could similarly use the median, mode, etc.), we repeatedly draw samples of a certain size (say 3) from a distribution and compute the mean of each sample.

Note: The mean of the sampling distribution (the mean of means) is the same as the population mean \((\mu_{\bar{x}}\) = \(\,\mu)\). As the sample size grows the sampling distribution approaches a normal distribution, and drawing more samples \(\left(S_i\right)\) traces out this curve more and more closely, as in the simulation sketch below.
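
A simulation sketch of this in R, using a deliberately skewed population so the effect is visible; the exponential population, sample size of 30 and number of repetitions are illustrative assumptions:

```r
set.seed(42)
population <- rexp(1e5)                                # skewed population with mean ~ 1

sample_means <- replicate(10000, mean(sample(population, 30)))

mean(sample_means)            # ~ mean(population): the mean of means matches mu
hist(sample_means)            # approximately bell-shaped even though the population is skewed
sd(sample_means)              # close to sd(population)/sqrt(30), the standard error discussed next
sd(population) / sqrt(30)
```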

Standard Error
The standard deviation of the sampling distribution of the sample mean. It describes how much sample means are expected to vary from sample to sample, and it is estimated from a single sample as:
\({SE}_{\bar{x}}^2\) = \(\frac{s^2}{n}\) \(\Rightarrow\) the larger the sample size, the lower the variance of the sample mean.
s is the standard deviation of the sample, n is the sample size.

WARNING: \({SE}_{\bar{x}}\) = sampling distribution standard deviation (not sample standard deviation).

  • Confidence Interval
    • Each sample statistic has a corresponding unknown population value called parameter.
    • How well do these sample statistics estimate the underlying population values?
    • Confidence interval is the range which is likely to contain the population parameter of interest.
    • Confidence intervals can be 1 sided or 2 sided. We choose the type of confidence interval based on the type of test we want to perform.
    • A two-sided confidence interval (say at a 95% confidence level) can be interpreted as follows: if the same population is sampled on numerous occasions, the resulting intervals would contain the true population parameter 95% of the time.
    • We can only estimate a range in which the population parameter is likely to fall, not its exact value.
    • 3 \(\sigma\) or 68-95-99.7 rule: The 68–95–99.7 rule is a shorthand used to remember the percentage of values that lie within a band around the mean in a normal distribution with a width of two, four and six standard deviations, respectively. In mathematical notation, these facts can be expressed as follows, where X is an observation from a normally distributed random variable, \(\mu\) is the mean of the distribution, and \(\sigma\) is its standard deviation:

      \(P\left(\mu-\,\,\,\sigma\leq\,X\leq\,\mu+\,\,\sigma\right)\approx\,0.6827\)
      \(P\left(\mu-2\sigma\leq\,X\leq\,\mu+2\sigma\right)\approx\,0.9545\)
      \(P\left(\mu-3\sigma\leq\,X\leq\,\mu+3\sigma\right)\approx\,0.9973\)

    • Significance level or \(\alpha\) level: The alpha level is the probability/ percentage of values that lie outside the confidence interval.

NOTE: In the case of a two-tailed test, the area under the sampling distribution curve over an interval \((\mu - n\sigma,\,\mu + n\sigma),\,\,n\in\mathbf{R}\), gives the probability of finding the statistic \((X)\) in that interval. Therefore, as the confidence level increases the interval widens and the precision of the estimated parameter goes down. We usually do a two-tailed test. For details on one- and two-tailed tests: One-Two tailed tests

  • How to compute a confidence interval (when population std. deviation is known and sample size is larger than ~30)
    1. Compute the standard error of the sampling distribution \(\frac{\sigma}{\sqrt{n}}\).
    2. Choose the desired confidence level and its corresponding significance level or alpha value.
    3. Determine the value of \(z_{\alpha/2}\) (for two sided confidence interval) also called the \(z\)-score.
    4. Compute the confidence interval \({\bar{x}}\pm {{z_{\alpha/2}}\!\frac{\sigma}{\sqrt{n}}}\).
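
A minimal R sketch of these four steps with hypothetical numbers (known \(\sigma\) = 2, n = 100, \(\bar{x}\) = 50, 95% confidence level):

```r
xbar  <- 50                      # sample mean (hypothetical)
sigma <- 2                       # known population standard deviation (hypothetical)
n     <- 100
alpha <- 0.05                    # 2. chosen significance level for 95% confidence

se <- sigma / sqrt(n)            # 1. standard error of the sampling distribution
z  <- qnorm(1 - alpha / 2)       # 3. z_(alpha/2) ~= 1.96 for a two-sided interval
xbar + c(-1, 1) * z * se         # 4. confidence interval: roughly 49.61 to 50.39
```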

NOTE:
\(z\)-score or \(standard\)-score = \({\frac{x\,-\,\mu}{\sigma}}\,\,{\Rightarrow}\) Number of standard deviations away \(x\) is from its mean.

\(\alpha\) = 1 – \(\frac{confidence\,level}{100}\). We use \(\alpha\) for one sided test and \(\alpha\)/2 for two sided test to compute the z-score.

\(\alpha\) = \(significance\) level = \(type\,I\,\)error rate

Hypothesis
A statistical hypothesis is a hypothesis that is testable on the basis of observing a process that is modeled via a set of random variables. Testing such hypotheses is sometimes called confirmatory data analysis.

  • Hypothesis Testing
    A formal process to determine whether to reject or fail to reject the null hypothesis based on statistical inference.

    • Null hypothesis: What would be expected if there was nothing unusual about the data. Can be thought of as the absence of effect or generally accepted wisdom.
    • Alternative hypothesis: Opposite of the null hypothesis.
      The test can be one-tailed or two-tailed.
    • There are five steps in hypothesis testing:
      1. Making assumptions
      2. Stating the research, null and alternate hypotheses and selecting (setting) alpha
      3. Selecting the sampling distribution and specifying the test statistic
      4. Computing the test statistic
      5. Making a decision and interpreting the results

We will discuss these test statistics in detail as we go along.

  • Type I & Type II errors
    • Type 1 Error: The type I error rate or significance level is the probability of rejecting the null hypothesis given that it is true. It is denoted by the Greek letter \(\alpha\) (alpha) and is also called the alpha level. Often, the significance level is set to 0.05 (5%), implying that it is acceptable to have a 5% probability of incorrectly rejecting the null hypothesis.
    • Type 2 Error: Failing to reject the null hypothesis when in reality it should be rejected.
Table of error types:

|  | Null hypothesis (\(H_0\)) is True | Null hypothesis (\(H_0\)) is False |
| --- | --- | --- |
| Reject \(H_0\) | Type I error (False Positive) | Correct inference (True Positive) |
| Fail to reject \(H_0\) | Correct inference (True Negative) | Type II error (False Negative) |

NOTE: “failing to reject the null hypothesis” is NOT the same as “accepting the null hypothesis”. It simply means that the data are not sufficiently persuasive for us to prefer the alternative hypothesis over the null hypothesis. Always take the conclusions with a grain of salt.

# Problem
Assume we sample 10 (n=10) widgets and measure their thickness. The mean thickness of the widgets sampled is 7.55 units (\(\bar{x}\)=7.55) with a standard deviation of 0.1027 (s=0.1027). But we want to have widgets that are 7.5 units thick. Compute the confidence interval for the mean for a given level of Type I error (significance or alpha level or probability of incorrectly rejecting the null hypothesis).

# Solution
Let’s assume that \(\alpha\)= 0.05 or 5%

    • Null hypothesis: Mean thickness \(=\) 7.5 units
    • Alternate hypothesis: Mean thickness \(\neq\) 7.5 units(two sided test)
    • Compute the confidence interval for a given level of Type I error or alpha level.
    • If the hypothesized mean falls within the confidence interval around \(\bar{x}\), we fail to reject the null hypothesis; if it falls outside, the null hypothesis is rejected in favor of the alternative hypothesis.

NOTE: Since the sample size is small and the population std. deviation is unknown, we can’t use the normal-distribution z-score to compute the confidence interval. Instead we will use the t-distribution t-score, discussed later, to compute the confidence interval. The statistic is different but the approach to computing the confidence interval is the same.
For details on confidence interval and how to compute it: Confidence interval.
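
A sketch of the widget calculation in R using the t-distribution; since only summary statistics are available, the interval is computed by hand with qt() rather than by calling t.test() on raw data:

```r
xbar  <- 7.55       # sample mean from the problem statement
s     <- 0.1027     # sample standard deviation
n     <- 10
alpha <- 0.05

se <- s / sqrt(n)
t  <- qt(1 - alpha / 2, df = n - 1)   # t-score for a two-sided 95% interval, 9 df
xbar + c(-1, 1) * t * se              # roughly 7.477 to 7.623
# 7.5 lies inside this interval, so we fail to reject the null hypothesis at alpha = 0.05.
```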

Relationship between Variables

Scatterplot

    • Used to determine relationship between two continuous variables.
    • One variable plotted on x-axis, another on y-axis.
    • Positive Correlation: Higher x-values correspond to higher y-values.
    • Negative Correlation: Higher x-values correspond to lower y-values.
    • Examples:
      • Body weight and BMI
      • Height and Pressure etc.
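
A minimal R sketch using the built-in mtcars data (car weight vs fuel efficiency, a negative correlation); pairs() produces the scatterplot matrix mentioned next:

```r
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon",
     main = "Heavier cars tend to have lower mpg")        # scatterplot of two continuous variables

pairs(mtcars[, c("mpg", "wt", "hp", "disp")])              # scatterplot matrix of several variables
```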

Scatterplot Matrix

Linear vs Nonlinear Relationship


Variables that change proportionately in response to each other show a linear relationship. Whether a relationship counts as linear depends on the context: a relationship may be approximately linear locally while being nonlinear globally.

Summary Tables
Method for understanding the relationship between two variables when at least one of the variables is discrete.
Example: Summary information about ages of active psychologists by demographics.

| Ages | (1) Total Active Psychologists | (2) Female | (3) Male | (4) Asian | (5) Black/ African American | (6) Hispanic | (7) White |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Mean | 50.5 | 47.9 | 55.1 | 46.5 | 47.9 | 46.4 | 51.1 |
| Median | 51 | 48 | 57 | 43 | 46 | 44 | 53 |
| Std. Dev. | 12.5 | 12.4 | 11.4 | 13.3 | 10.3 | 11.2 | 12.6 |

Columns (2)–(3) break active psychologists down by gender; columns (4)–(7) by race/ethnicity.

Discrete Variable(s): Demography: (1), (2), (3), (4), (5), (6), (7)
Continuous Variable: Age
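
A minimal R sketch of building such a summary table with aggregate(); the data frame and its age and gender columns are hypothetical:

```r
set.seed(42)
psych <- data.frame(age    = round(rnorm(200, mean = 50, sd = 12)),
                    gender = sample(c("Female", "Male"), 200, replace = TRUE))

# Summarize the continuous variable (age) by the discrete variable (gender)
aggregate(age ~ gender, data = psych,
          FUN = function(a) c(mean = mean(a), median = median(a), sd = sd(a)))
```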

  • Cross-Tabulation Tables/ Crosstabs/ Contingency Tables
    • Method for summarizing two categorical variables
    • In practice, continuous variables may at times be summarized as categorical variables.
    • Example: Age could be divided into categories as young, adult and senior citizen, etc. Income could be divided into categories as poor, middle class, upper middle class, wealthy, etc.
  • Correlation Coefficient
    • A quantification of the linear relationship between two variables
    • Ranges from -1 to +1
    • Used for variables on an interval or ratio scale

\(r_{xy} = \frac{\sum_{i=1}^{n}\left({x_i\,-\,\bar{x}}\right)\left({y_i\,-\,\bar{y}}\right)}{\left(n\,-\,1\right)s_{x}s_{y}} = \frac{\sum{\left( {x_i\,-\,\bar{x}} \right)}\left( {y_i\,-\,\bar{y}} \right)}{\sqrt{\sum\left({x_i\,-\,\bar{x}}\right)^2\sum\left({y_i\,-\,\bar{y}}\right)^2}}\)

NOTE: The correlation coefficient does not capture nonlinear relationships. Many nonlinear relationships exist that are not captured (\(r\) = 0) by the correlation coefficient.
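
A minimal R sketch of a crosstab and of the correlation coefficient on the built-in mtcars data, including a check of cor() against the formula above:

```r
# Cross-tabulation of two categorical variables
table(cyl = mtcars$cyl, gear = mtcars$gear)

# Correlation coefficient between two continuous variables
x <- mtcars$wt; y <- mtcars$mpg
cor(x, y)                                                                # about -0.87
sum((x - mean(x)) * (y - mean(y))) / ((length(x) - 1) * sd(x) * sd(y))   # same value, from the formula
```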

Hypothesis Testing revisited

There are two equivalent approaches to hypothesis testing:


Critical Value approach: Critical values for a test of hypothesis depend upon a test statistic, which is specific to the type of test, and the significance level, \(\alpha\), which defines the sensitivity of the test. A value of \(\alpha\) = 0.05 implies that the null hypothesis is rejected 5 % of the time when it is in fact true. The choice of \(\alpha\) is somewhat arbitrary, although in practice values of 0.1, 0.05, and 0.01 are common. Critical values are essentially cut-off values that define regions where the test statistic is unlikely to lie; for example, a region where the critical value is exceeded with probability \(\alpha\) if the null hypothesis is true. The null hypothesis is rejected if the test statistic lies within this region which is often referred to as the rejection region(s).

  • Steps for critical value approach:
    1. Specify the null and alternative hypothesis.
    2. Using the sample data and assuming the null hypothesis is true, calculate the value of the test statistic.
    3. Using the distribution of the test statistic, look up the critical value such that the probability of making a Type I error is equal to alpha (which you specify).
    4. Compare the test statistic to the critical value. If the test statistic is more extreme in the direction of the alternative than the critical value, reject the null hypothesis. Otherwise, do not reject the null hypothesis.

P-value approach: The p-value is the probability of the test statistic being at least as extreme as the one observed, given that the null hypothesis is true. A small p-value is evidence against the null hypothesis.

NOTE: It is good practice to decide in advance of the test how small a p-value is required to reject the null hypothesis. This is exactly analogous to choosing a significance level, \(\alpha\), for the test. For example, we decide either to reject the null hypothesis if the test statistic exceeds the critical value (for \(\alpha\) = 0.05) or, analogously, to reject the null hypothesis if the p-value is smaller than 0.05.

The p-value approach is the one most commonly cited in the literature, but that is a matter of convention.
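
A minimal R sketch contrasting the two approaches on the earlier widget example (one-sample t-test of \(H_0\): \(\mu\) = 7.5, using the same summary statistics):

```r
xbar <- 7.55; s <- 0.1027; n <- 10; alpha <- 0.05

t_stat <- (xbar - 7.5) / (s / sqrt(n))            # observed test statistic, ~ 1.54

# Critical value approach: reject if |t| exceeds the cutoff
t_crit <- qt(1 - alpha / 2, df = n - 1)           # ~ 2.26
abs(t_stat) > t_crit                              # FALSE -> fail to reject

# P-value approach: reject if the p-value is below alpha
p_value <- 2 * pt(-abs(t_stat), df = n - 1)       # ~ 0.16
p_value < alpha                                   # FALSE -> same conclusion
```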