Data point(observation): A single instance or observation, usually represented as a row of a table.
Data set: A table is a form of data set. Data is usually represented in form of a table having multiple rows representing multiple observations and columns representing variables(raw) of an observation.
Random variable: A random variable, random quantity, aleatory variable, or stochastic variable is a variable quantity whose value depends on possible outcome. As a function, a random variable is required to be measurable, which rules out certain pathological cases where the quantity which the random variable returns is infinitely sensitive to small changes in the outcome.
- Random variables are of two types:
- Discrete: Countable
- Continuous: Uncountable
- Types of variables: Variables can be classified in a number of ways not necessarily mutually exclusive.
- Discrete vs Continuous:
- Discrete: Can only take a fixed number of values. e.g. car model
- Continuous: Can take any value from a continuous set. e.g. height
- Variables classified according to scale:
- Nominal/ Categorical scale: It can only take fixed number of values and can not be ordered. e.g. color.
- Ordinal scale: It can only take fixed number of values but can be ordered. e.g. low, medium, high. It is impossible to determine the magnitude of difference e.g. (high – low) doesn’t necessarily make any sense in every context.
- Interval scale: An interval variable is the one whose difference between two values is meaningful. The difference between a temperature of 100 degrees and 90 degrees is the same difference as between 90 degrees and 80 degrees.
- Ratio scale: A ratio variable, has all the properties of an interval variable, and also has a clear definition of 0.0. When the variable equals 0.0, there is none of that variable. Variables like height, weight, enzyme activity are ratio variables. F & C are not ratio variables but K is.
- Other categories:
- Dichotomous: A variable that can contain only two values (ie: On or Off)
- Binary: A dichotomous variable that is encoded as 0 or 1 (ie: On = 0, Off = 1)
- Primary key: A variable that is a unique identifier for a particular record. (e.g. SSN may be the primary key for describing a citizen, customerId may be the primary key for describing a customer in a database)
- Discrete vs Continuous:
Big Data vs Wide Data: As additional variables are collected, data sets become “wide” e.g. DNA sequence data having millions of columns.
4 V’s of Big Data: Volume, Variety, Veracity(trustworthiness), Velocity.
Raw data: Unaltered data sets are typically referred to as “raw data”.
Features: Features are combinations of various raw variables that determine the maximum variation in data.
Dimensionality Reduction: Process of reducing the number of random variables under consideration. It is done by taking existing data and reducing it to the most discriminative components. These components allow to represent most of the information in the dataset under consideration with fewer, more discriminative features. This can be divided into feature selection and feature extraction.
Feature selection: Selecting features which are highly discriminative and determine the maximum variations within data. It requires an understanding of what aspects of dataset are important and which aren’t. Can be done by the help domain experts, clustering techniques or topical analysis.
Feature extraction: Building a new set of features from the original feature set.
Examples of feature extraction: extraction of contours in images, extraction of diagrams from a text, extraction of phonemes from recording of spoken text, etc.
Feature extraction usually involves generating new features which are composites of existing features. Both of these techniques fall into the category of feature engineering. Generally feature engineering is important to obtain the best results as it involves creating information that may not exist in the dataset, and increasing signal to noise ratio. Feature extraction involves a transformation of the features, which often is not reversible because some information is lost in the process.
Model: Mathematical model(equations) that defines the relationship between various variables of a data set that helps in predicting the values/ behavior of variables of any future unseen data point. e.g. y = mx + c, a linear model describing relationship b/w two variables x & y.
- Fundamentally the model building process is threefold:
- Model Selection: Choose or create a mathematical model.
- Training: Determine the parameters of the model that fit the training data as closely as possible.
- Testing: Evaluating the accuracy of the model on the test data so that it can be used for prediction of future unseen data.
Training data: Part of data used to determine the parameters of the model.
Testing Data: Part of data which is used to determine the accuracy of the model generated using training data. i.e. how well it works on the future unseen data.
Model Accuracy: The percentage of the unseen future cases(data) the generated model holds good for.
Overfitting: A model overfits if it works on the training set perfectly but does not predict the future cases accurately. This xkcd cartoon strip describes overfitting in real life:
Regularization: Process of determining what features should be included or weighted in your final model to avoid overfitting.
Pruning and Selection: Determining what features contain the best signal and discard the rest.
Shrinkage: Reducing the influence of some features to avoid overfitting. It can be done in multiple ways like assigning weights to variables or adding an overall cost function.
Cross Validation: Technique of simulating “out of sample” or unseen future tests to determine the accuracy of the model. Models are built and evaluated on different data sets. It helps avoid overfitting and build models that are hopefully generalizable.
Out of sample performance: If we collect data from the exact same environment, the model will be able to predict outcomes with the same degree of performance.