Aha! What an opportunity to put my skills to good use. The task was pretty simple: choose a random episode from a random season and play it in VLC media player. So I researched how to choose a random file from a random folder on the command line, and implemented the solution as a small bash script using the sort command
#!/bin/bash
# Pick a random season folder, then a random .mkv episode from it
dir="../Friends"
season=$(ls "${dir}" | sort -R | head -1)
ep=$(ls "${dir}/${season}/"*.mkv | sort -R | head -1)
/Applications/VLC.app/Contents/MacOS/VLC "$ep"
and set its alias to friends, so that I could play a random episode just by typing friends
in the command line. I was happy that a random Friends episode was just a couple of keystrokes away, and I started enjoying episodes from this self-written script. But my happiness didn't last long: the more episodes I played, the more they kept repeating. Some repetition ought to happen, since under the hood the script in effect throws a 234-faced die, one face per episode, and will sometimes land on the same number more than once. But that wasn't the whole story: some episodes were repeating far too often and some not at all.
So, to understand what was happening, let's recall some high school probability. If you throw an unbiased die, there is an equal chance of getting any number from 1 to 6, because every number has the same probability of 1/6; ideally you will get each number once in 6 throws. The outcomes have a uniform distribution. With a biased die, however, that may not be the case. This made me suspicious about the workings of my script. To investigate, I ran the script a million times, recorded the output, and visualised it, and I found something unexpected.
For the data nerds out there: this process of running an experiment many times and aggregating the outcomes is called a Monte Carlo simulation.
Apparently the 234-faced die that the script's sort -R step threw was not unbiased, as I had begun to suspect. As you can see, it has a non-uniform distribution across episodes, which explains why it ended up playing some episodes more often than others. (Part of this is structural: picking a random season first and then a random episode within it gives every season equal weight, so an episode in a season with fewer episodes is more likely to be picked than one in a longer season.)
So I had to throw away the whole script and think of an alternative starting from the underlying distribution. Since I didn't know how to generate uniform random numbers from the command line, Python came to the rescue. I wrote another script, this time in Python, that draws a number uniformly at random, which is like rolling an unbiased 234-faced die, and plays the corresponding episode.
#!/usr/bin/env python
import pandas as pd
import numpy as np
import subprocess

# CSV with one row per episode; the second column holds the file path
data = pd.read_csv('../Friends/frnds.csv')
# Draw uniformly from [0, 234) and truncate: an unbiased 234-faced die
# (np.random.randint(0, 234) would be an equivalent, more direct choice)
p = int(np.random.uniform(0, 234))
episode = data.iloc[p, 1]
proc = subprocess.Popen(
    ["/Applications/VLC.app/Contents/MacOS/VLC", episode],
    stdout=subprocess.PIPE)
proc.communicate()
This worked like a charm and gave the following distribution, which looked pretty uniform over a million trials. Thanks to the uniform distribution life was sweet again, and that's how I learnt why it's important to choose your distribution wisely.
Critical Value approach: Critical values for a hypothesis test depend on the test statistic, which is specific to the type of test, and on the significance level 𝛼, which defines the sensitivity of the test. A value of 𝛼 = 0.05 implies that the null hypothesis is rejected 5% of the time when it is in fact true. The choice of 𝛼 is somewhat arbitrary, although in practice values of 0.1, 0.05, and 0.01 are common. Critical values are essentially cut-off values that define regions where the test statistic is unlikely to lie; for example, a region where the critical value is exceeded with probability 𝛼 if the null hypothesis is true. The null hypothesis is rejected if the test statistic lies within this region, which is often referred to as the rejection region(s).
Steps for the critical value approach:
1. State the null and alternative hypotheses.
2. Choose the significance level 𝛼.
3. Compute the test statistic from the sample data.
4. Find the critical value(s) for the chosen 𝛼 and test.
5. Reject the null hypothesis if the test statistic falls in the rejection region; otherwise, fail to reject it.
P-value approach: The p-value is the probability of obtaining a test statistic at least as extreme as the one observed, given that the null hypothesis is true. A small p-value is evidence against the null hypothesis.
NOTE: It is good practice to decide in advance of the test how small a p-value is required to reject the null hypothesis. This is exactly analogous to choosing a significance level 𝛼 for the test. For example, we decide either to reject the null hypothesis if the test statistic exceeds the critical value (for 𝛼 = 0.05) or, analogously, to reject the null hypothesis if the p-value is smaller than 0.05.
The p-value approach is the one most commonly cited in the literature, but that is a matter of convention.
Variables that change proportionately in response to each other show a linear relationship. Linearity is an abstract, context-dependent notion: what can be called linear depends on the setting, and a relationship may be linear locally while being nonlinear globally.
Summary Tables
A method for understanding the relationship between two variables when at least one of the variables is discrete. Example: summary information about the ages of active psychologists by demographics.
Ages | (1) Total Active Psychologists | (2) Female | (3) Male | (4) Asian | (5) Black/African American | (6) Hispanic | (7) White
---|---|---|---|---|---|---|---
Mean | 50.5 | 47.9 | 55.1 | 46.5 | 47.9 | 46.4 | 51.1
Median | 51 | 48 | 57 | 43 | 46 | 44 | 53
Std. Dev. | 12.5 | 12.4 | 11.4 | 13.3 | 10.3 | 11.2 | 12.6

Columns (2)-(3) break down active psychologists by gender; columns (4)-(7) by race/ethnicity.
Discrete variable(s): demography, columns (1)-(7). Continuous variable: age.
Cross-Tabulation Tables/ Crosstabs/ Contingency Tables
Correlation Coefficient
NOTE: The correlation coefficient does not capture nonlinear relationships. Many nonlinear relationships exist that yield \(r = 0\) even though the variables are strongly related.
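A quick numerical illustration of that note: a perfect nonlinear relationship, \(y = x^2\) on a symmetric range, has zero correlation.

```python
import numpy as np

# A perfect nonlinear relationship: y = x**2 on a range symmetric about 0
x = np.array([-3, -2, -1, 0, 1, 2, 3], dtype=float)
y = x ** 2

# Pearson correlation coefficient between x and y
r = np.corrcoef(x, y)[0, 1]
print(r)  # 0.0 -- yet y is completely determined by x
```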
Types of statistics: descriptive statistics, which summarise the data at hand, and inferential statistics, which use a sample to draw conclusions about a population.
Methods of sampling data: simple random sampling, stratified sampling, and cluster sampling.
Types of statistical studies: sample studies, observational studies, and experiments.
Running an experiment is the best way to conduct a statistical study. The purpose of a sample study is to estimate a certain population parameter, while the purpose of an observational study or an experiment is to compare two population parameters.
Describing data
Central Tendency: There are different ways of measuring central tendency. Mean: arithmetic mean value. Median: middle value. Mode: highest-frequency value. e.g. sample observations of a variable
\(x_i\) = 2, 4, 7, 11, 16.5, 16.5, 19
\(n\) = 7
Mean \(\bar{x} = \sum_{i=1}^n \frac{x_i}{n}\) = (2 + 4 + 7 + 11 + 16.5 + 16.5 + 19)/7 ≈ 10.857
Median = 11
Mode = 16.5
The median is preferred when the data is skewed or subject to outliers.
WARNING: A median significantly different from the mean is a sign of skew or outliers and should be investigated!
Measuring spread of data:
Range: Maximum value – Minimum value = 19 – 2 = 17
Variance: \(s_{n-1}^2 = \sum_{i=1}^n\frac{\left({x_i} - \bar{x}\right)^2}{n-1}\)
Standard Deviation: \(s_{n-1} = \sqrt{Variance}\)
Range is a quick way to get an idea of the spread. The IQR takes longer to compute, but it often gives more useful insight, such as revealing outliers or bad data points.
Interquartile Range: the IQR is the amount of spread in the middle 50% of the data set. For the previous example (2, 4, 7, 11, 16.5, 16.5, 19):
Q1 (25th percentile) = median of the lower half {2, 4, 7} = 4
Q2 (50th percentile) = 11
Q3 (75th percentile) = median of the upper half {16.5, 16.5, 19} = 16.5
IQR = Q3 – Q1 = 16.5 – 4 = 12.5
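These summary statistics can be reproduced with Python's standard statistics module (note that quartile conventions differ slightly between textbooks and libraries, so hand computations may not match a given library exactly):

```python
import statistics

data = [2, 4, 7, 11, 16.5, 16.5, 19]

mean = statistics.mean(data)      # arithmetic average
median = statistics.median(data)  # middle value
mode = statistics.mode(data)      # most frequent value

# Quartiles with the default 'exclusive' (median-excluding) convention
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

print(mean, median, mode)  # 10.857142857142858 11 16.5
print(q1, q2, q3, iqr)
```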
Questioning the underlying reason for a non-unimodal distribution frequently leads to greater insight and improved deterministic modeling of the phenomenon under study.
Plotting Data
Both display relative frequencies of different variables. With bar charts the labels on the X axis are categorical; with histograms they are quantitative. Both are useful for detecting outliers (odd data points).
Shape
Skewness = \(\frac{1}{N}\sum_{i=1}^{N}\frac{(x_i-\overline{x})^3}{\sigma^3}\)
Kurtosis = \(\frac{1}{N}\sum_{i=1}^{N}\frac{(x_i-\overline{x})^4}{\sigma^4}\)
Sample Statistic and Population Parameter
Each sample statistic has a corresponding unknown population value called a parameter, e.g. the population mean and variance are called parameters, whereas the sample mean and variance are called statistics.
 | Sample Statistic | Population Parameter
---|---|---
Mean | \(\bar{x}=\sum_{i=1}^n\frac{x_i}{n}\) | \(\mu=\sum_{i=1}^N\frac{x_i}{N}\)
Variance | \(s_{n-1}^2=\sum_{i=1}^n\frac{({x_i-\bar{x}})^2}{n-1}\) | \(\sigma^2=\sum_{i=1}^N\frac{({x_i-\mu})^2}{N}\)
Standard Deviation | \(s\) or \(s_{n-1}\) | \(\sigma\)
There are many more sample statistics and their corresponding population parameters.
Probability
Probability: The likelihood of an event occurring.
Probability of an event = \(\frac{\text{# of favourable outcomes}}{\text{Total # of possible outcomes}}\)
Conditional Probability: The probability of an event occurring given that another event has occurred.
Conditional Probability of an event: \(P\left(A\vert{B}\right) = \frac{P\left(A\cap{B}\right)}{P\left(B\right)}\), the probability of A given that B has occurred (when A depends on B).
Bayes Theorem: \(P\left(A\vert{B}\right) = \frac{P\left(B\vert{A}\right)P\left(A\right)}{P\left(B\right)}\)
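A quick sketch of Bayes' theorem in Python, on the classic disease-testing example (all rates here are hypothetical, chosen for illustration):

```python
# Bayes' theorem: P(A|B) = P(B|A) P(A) / P(B)
# Example: a disease with 1% prevalence, a test with 99% sensitivity
# and a 5% false-positive rate.
p_d = 0.01            # P(disease)
p_pos_d = 0.99        # P(positive | disease)
p_pos_no_d = 0.05     # P(positive | no disease)

# Total probability of a positive test, P(B)
p_pos = p_pos_d * p_d + p_pos_no_d * (1 - p_d)

# P(disease | positive)
p_d_pos = p_pos_d * p_d / p_pos
print(round(p_d_pos, 3))  # 0.167 -- surprisingly low despite an accurate test
```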
Probability Distribution:
A mathematical function that, in simple terms, gives the probability of occurrence of the different possible outcomes of an experiment. For example, let the random variable 𝑋 = the number of HEADS from flipping a coin 5 times.
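For this coin-flip example the distribution can be computed exactly: each of the \(2^5\) flip sequences is equally likely, so \(P(X = k) = \binom{5}{k}/2^5\).

```python
from math import comb

# Distribution of X = number of HEADS in 5 fair coin flips
pmf = {k: comb(5, k) / 2**5 for k in range(6)}
for k, p in pmf.items():
    print(k, p)
# The probabilities sum to 1 and the distribution is symmetric around 2.5
```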
Central Limit Theorem
Suppose we obtain a sample containing a large number of observations, each generated randomly in a way that does not depend on the values of the other observations, and compute the arithmetic average of the observations. If this procedure of random sampling and averaging is repeated many times, the central limit theorem says that the computed averages will be distributed according to the normal distribution (commonly known as a "bell curve"). A simple example: if one flips a coin many times, the number of heads in a series of flips approximately follows a normal curve, with mean equal to half the total number of flips in each series, as shown previously.
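A quick Monte Carlo sketch of the theorem, averaging samples drawn from a decidedly non-normal (uniform) distribution (the sample size and trial count here are arbitrary):

```python
import random
import statistics

random.seed(0)

# Average n uniform draws, and repeat the whole procedure many times
n, trials = 30, 10_000
means = [statistics.mean(random.random() for _ in range(n))
         for _ in range(trials)]

# The averages cluster around the population mean 0.5, with standard
# deviation close to sigma / sqrt(n) = sqrt(1/12) / sqrt(30) ~ 0.053
print(round(statistics.mean(means), 2))   # ~0.5
print(round(statistics.stdev(means), 3))  # ~0.053
```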
Sampling distribution of the sample mean
Random variables can have different distribution patterns. They can be normal or multi-modal as shown below.
To plot a sampling distribution of sample means (the same can be done for the median, mode, etc.), we repeatedly draw samples of a certain size (say 3) from a distribution and compute the mean of each sample.
Note: The mean of the sampling distribution (the mean of means) is the same as the population mean \((\mu_{\bar{x}} = \mu)\). As the number of samples \((S_{i})\) approaches infinity, the curve approximates a normal distribution.
Standard Error
The standard deviation of the sampling distribution of the sample mean. It is estimated from a single sample as:
\(SE_{\bar{x}}^2 = \frac{s^2}{n}\) \(\implies\) the larger the sample size, the lower the variance of the sample mean. Here \(s\) is the standard deviation of the sample and \(n\) is the sample size.
WARNING:
\(SE_{\bar{x}}\) = sampling distribution standard deviation (not sample standard deviation).
Confidence Interval
𝑃(𝜇 − 𝜎 ≤ 𝑋 ≤ 𝜇 + 𝜎) ≈ 0.6827
𝑃(𝜇 − 2𝜎 ≤ 𝑋 ≤ 𝜇 + 2𝜎) ≈ 0.9545
𝑃(𝜇 − 3𝜎 ≤ 𝑋 ≤ 𝜇 + 3𝜎) ≈ 0.9973
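These probabilities, the well-known 68-95-99.7 rule, can be verified with the standard library's statistics.NormalDist:

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mu = 0, sigma = 1
for k in (1, 2, 3):
    # P(mu - k*sigma <= X <= mu + k*sigma)
    p = z.cdf(k) - z.cdf(-k)
    print(k, round(p, 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```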
NOTE: In the case of a two tailed test, the area under the curve (AUC) of the sampling distribution gives the probability of finding a specific value of the statistic (𝑋) in a particular interval (𝜇 – 𝑛𝜎, 𝜇 + 𝑛𝜎), 𝑛 ∈ 𝐑. As the confidence level increases the interval widens, so the precision of the estimated parameter goes down. We usually do a two tailed test. For details on one- and two tailed tests: One-Two tailed tests
How to compute a confidence interval (when the population std. deviation is known and the sample size is larger than ~30):
1. Compute the standard error of the sampling distribution, \(\frac{\sigma}{\sqrt{n}}\).
2. Choose the desired confidence level and its corresponding significance level, or alpha value.
3. Determine the value of \(z_{\alpha \over {2}}\) (for a two sided confidence interval), also called the 𝑧-score.
4. Compute the confidence interval \(\bar{x}{\pm}{z_{\alpha/2}}\frac{\sigma}{\sqrt{n}}\).
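The steps above can be sketched in Python using the standard library's statistics.NormalDist for the z-score (the sample mean, sigma, and n here are made-up numbers):

```python
from math import sqrt
from statistics import NormalDist

# 95% confidence interval for the mean, sigma known, n > 30
x_bar, sigma, n = 10.5, 2.0, 100   # hypothetical sample values
alpha = 0.05

z = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2} ~ 1.96
half_width = z * sigma / sqrt(n)         # z * standard error

lo, hi = x_bar - half_width, x_bar + half_width
print(round(lo, 3), round(hi, 3))  # 10.108 10.892
```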
NOTE: 𝑧-score or standard score = (𝑥−𝜇)/𝜎, i.e. the number of standard deviations 𝑥 lies away from its mean.
𝛼 = 1 − (confidence level / 100). We use 𝛼 for a one sided test and 𝛼/2 for a two sided test to compute the z-score.
𝛼 = 𝑠𝑖𝑔𝑛𝑖𝑓𝑖𝑐𝑎𝑛𝑐𝑒 level = 𝑡𝑦𝑝𝑒 𝐼 error rate
Hypothesis A statistical hypothesis, sometimes called confirmatory data analysis, is a hypothesis that is testable on the basis of observing a process that is modeled via a set of random variables.
Hypothesis Testing A formal process to determine whether to reject or fail to reject the null hypothesis based on statistical inference.
We will discuss these test statistics in detail as we go along.
Type I & Type II errors
Table of error types | Null hypothesis \((H_0)\) is True | Null hypothesis \((H_0)\) is False
---|---|---
Reject \(H_0\) | Type I error (False Positive) | Correct inference (True Positive)
Fail to reject \(H_0\) | Correct inference (True Negative) | Type II error (False Negative)
NOTE: “failing to reject the null hypothesis” is NOT the same as “accepting the null hypothesis”. It simply means that the data are not sufficiently persuasive for us to prefer the alternative hypothesis over the null hypothesis. Always take the conclusions with a grain of salt.
Problem Assume we sample 10 (n=10) widgets and measure their thickness. The mean thickness of the widgets sampled is 7.55 units \((\bar{x}=7.55)\) with a standard deviation of 0.1027 (s=0.1027). But we want to have widgets that are 7.5 units thick. Compute the confidence interval for the mean for a given level of Type I error (significance or alpha level or probability of incorrectly rejecting the null hypothesis).
Solution Let’s assume that 𝛼 = 0.05 or 5%
NOTE: Since the sample size is small and the population std. deviation is unknown, we can't use the normal distribution z-score to compute the confidence interval. Instead we use the t-distribution t-score (discussed later) to compute the confidence interval. The statistic is different, but the approach to computing the confidence interval is the same. For details on the confidence interval and how to compute it: Confidence interval.
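A sketch of the computation for this problem, taking the critical value \(t_{0.025,\,9} = 2.262\) from a t-table (9 degrees of freedom, since n − 1 = 9):

```python
from math import sqrt

# Widget example: n = 10, x_bar = 7.55, s = 0.1027, alpha = 0.05
n, x_bar, s = 10, 7.55, 0.1027
t_crit = 2.262  # t_{alpha/2, n-1} = t_{0.025, 9}, from a t-table

half_width = t_crit * s / sqrt(n)
lo, hi = x_bar - half_width, x_bar + half_width
print(round(lo, 3), round(hi, 3))  # 7.477 7.623

# 7.5 lies inside the interval, so at alpha = 0.05 we fail to reject
# the hypothesis that the true mean thickness is 7.5 units.
```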
Data set: A table is a common form of data set. Data is usually represented as a table with multiple rows representing observations and columns representing the (raw) variables of each observation.
Random variable: A random variable, random quantity, aleatory variable, or stochastic variable is a quantity whose value depends on the outcome of a random phenomenon. As a function, a random variable is required to be measurable, which rules out certain pathological cases where the quantity returned is infinitely sensitive to small changes in the outcome.
Random variables are of two types: discrete (taking countably many values, e.g. the number of heads in 5 coin flips) and continuous (taking values in a continuum, e.g. height).
Types of variables: Variables can be classified in a number of ways, not necessarily mutually exclusive, e.g. quantitative vs. categorical, discrete vs. continuous, independent vs. dependent.
4 V’s of Big Data: Volume, Variety, Veracity(trustworthiness), Velocity.
Raw data: Unaltered data sets are typically referred to as “raw data”.
Features: Features are combinations of various raw variables that determine the maximum variation in data.
Dimensionality Reduction: Process of reducing the number of random variables under consideration. It is done by taking existing data and reducing it to the most discriminative components. These components allow to represent most of the information in the dataset under consideration with fewer, more discriminative features. This can be divided into feature selection and feature extraction.
Feature selection: Selecting features which are highly discriminative and determine the maximum variation within the data. It requires an understanding of which aspects of the dataset are important and which aren't. It can be done with the help of domain experts, clustering techniques, or topical analysis.
Feature extraction: Building a new set of features from the original feature set. Examples: extraction of contours from images, diagrams from a text, phonemes from recordings of spoken text, etc. Feature extraction usually involves generating new features which are composites of existing features. Both techniques fall under feature engineering. Feature engineering is generally important for obtaining the best results, as it involves creating information that may not exist in the dataset as-is and increasing the signal-to-noise ratio. Note that feature extraction involves a transformation of the features which is often not reversible, because some information is lost in the process.
Model: Mathematical model(equations) that defines the relationship between various variables of a data set that helps in predicting the values/ behavior of variables of any future unseen data point. e.g. y = mx + c, a linear model describing relationship b/w two variables x & y.
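The linear model above, y = mx + c, can be fitted from data with ordinary least squares. Here is a minimal pure-Python sketch using the closed-form formulas (the example data points are made up):

```python
# Fit y = m*x + c by ordinary least squares
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]  # roughly y = 2x

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Slope: m = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
m = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
     / sum((x - mean_x) ** 2 for x in xs))
# Intercept: the fitted line passes through (x_bar, y_bar)
c = mean_y - m * mean_x

def predict(x):
    return m * x + c

print(round(m, 2), round(c, 2))  # 1.96 0.14
```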
Fundamentally the model building process is threefold: choose the form of the model, estimate its parameters from training data, and validate the model on testing data.
Training data: Part of data used to determine the parameters of the model.
Testing Data: Part of data which is used to determine the accuracy of the model generated using training data. i.e. how well it works on the future unseen data.
Model Accuracy: The percentage of unseen future cases (data) for which the generated model holds good.
Overfitting: A model overfits if it works on the training set perfectly but does not predict the future cases accurately. This xkcd cartoon strip describes overfitting in real life:
Regularization: Process of determining what features should be included or weighted in your final model to avoid overfitting.
Pruning and Selection: Determining which features contain the best signal and discarding the rest.
Shrinkage: Reducing the influence of some features to avoid overfitting. It can be done in multiple ways like assigning weights to variables or adding an overall cost function.
Cross Validation: Technique of simulating “out of sample” or unseen future tests to determine the accuracy of the model. Models are built and evaluated on different data sets. It helps avoid overfitting and build models that are hopefully generalizable.
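The idea can be sketched as follows: a toy k-fold cross-validation, where train and evaluate are hypothetical stand-ins for a real model's fitting and scoring functions.

```python
import random

def k_fold_indices(n, k):
    """Randomly partition the indices 0..n-1 into k disjoint folds."""
    idx = list(range(n))
    random.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(data, k, train, evaluate):
    """Average the model's out-of-sample score over k held-out folds."""
    folds = k_fold_indices(len(data), k)
    scores = []
    for i in range(k):
        held_out = [data[j] for j in folds[i]]
        training = [data[j] for f in folds[:i] + folds[i + 1:] for j in f]
        model = train(training)            # fit on k-1 folds
        scores.append(evaluate(model, held_out))  # score on the held-out fold
    return sum(scores) / k
```

For example, with a model that simply predicts the training mean, train could be `lambda d: sum(d) / len(d)` and evaluate the mean squared error on the held-out fold.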
Out of sample performance: How well the model predicts outcomes on unseen data. If we collect new data from the exact same environment, a model with good out-of-sample performance will predict outcomes with roughly the same accuracy as it did on the training data.