Statistics Tutorial for Interviews
Summary
📊 This text covers topics related to statistics, including descriptive and inferential statistics, sampling techniques, and definitions of population and sample.
Facts
- The text discusses the difference between descriptive and inferential statistics.
- Descriptive statistics involve organizing and summarizing data, including measures of central tendency and measures of dispersion.
- Inferential statistics are used to form conclusions based on data and include tests like z-test, t-test, chi-square test, and ANOVA.
- The concepts of population (size denoted by capital N) and sample (size denoted by small n) are introduced.
- Simple random sampling involves randomly selecting members of the population.
- Stratified sampling divides the population into non-overlapping groups (strata) for sampling.
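- Examples of stratified sampling include gender-based and age-based sampling.
- 📊 Professions can overlap (one person may belong to two groups), which makes them a poor basis for strata, since stratified sampling needs non-overlapping groups.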
- 🎯 Stratified sampling involves dividing a population into different layers.
- 🧑‍⚕️ Doctors and engineers may require different survey techniques.
- 🧮 Systematic sampling involves selecting every nth individual from a population.
- 🤔 Thanos may have used random sampling.
- 🙋‍♂️ Convenience sampling involves surveying whoever is easiest to reach, for example readily available domain experts.
- 🗳️ Exit polls typically use random sampling.
- 🏦 RBI household surveys may use stratified random sampling or convenience sampling.
- 💉 Drug testing may involve stratified or other sampling techniques based on the use case.
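The three sampling techniques described above (simple random, systematic, and stratified) can be sketched in a few lines of Python. This is only an illustration: the `people` list, the `age_group` labels, and the helper names are hypothetical and do not come from the tutorial.

```python
import random

def simple_random_sample(population, k, seed=0):
    """Pick k members uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(population, k)

def systematic_sample(population, step, start=0):
    """Pick every `step`-th member, beginning at index `start`."""
    return population[start::step]

def stratified_sample(population, key, k_per_stratum, seed=0):
    """Split into non-overlapping strata by `key`, then sample each stratum."""
    rng = random.Random(seed)
    strata = {}
    for member in population:
        strata.setdefault(key(member), []).append(member)
    return {
        name: rng.sample(group, min(k_per_stratum, len(group)))
        for name, group in strata.items()
    }

# Hypothetical population: 20 people, each tagged with one age group.
people = [
    {"id": i, "age_group": "under_35" if i % 2 == 0 else "35_plus"}
    for i in range(20)
]

random_pick = simple_random_sample(people, 5)
every_fourth = systematic_sample(people, step=4)
by_age = stratified_sample(people, key=lambda p: p["age_group"], k_per_stratum=3)
```

Each person belongs to exactly one age group here, which is what makes the groups usable as strata.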
- 📊 Variables can be quantitative (measured numerically) or qualitative (categorical).
- 🔢 Quantitative variables can be discrete or continuous.
- ⚖️ Continuous variables can have decimal values, while discrete variables have whole numbers.
- 🧮 Nominal variables are categorical data with no inherent order (e.g., colors, gender).
- 🥇 Ordinal data have an order but no meaningful numerical difference (e.g., ranks).
- 🌡️ Interval data have an order, values matter, but a natural zero point is absent (e.g., temperature in Fahrenheit).
- 📏 Interval data can be used for applications like ride-sharing services.
- 🚖 Booking a cab for six hours with variable pricing.
- 📊 Frequency distribution for different flower types.
- 📈 Frequency distribution used for creating bar and pie charts.
- 🔄 Cumulative frequency for calculating total occurrences.
- 📊 Histograms for representing continuous data.
- 🔍 Kernel density estimator for smoothing histograms.
- 🧮 Central tendency includes mean, median, and mode.
- 📈 Mean is influenced by outliers.
- 📊 Outliers can significantly affect the mean.
- 🧮 Median is less affected by outliers.
- 📈 Mode identifies the most frequent value.
- 📊 Mode can handle multimodal distributions.
- 📈 Central element of sorted data for finding median.
- 📈 Median calculation for odd and even data points.
- 😄 The text discusses the importance of using median with outliers in calculating central tendency.
- 📊 It explains the use of mode in handling missing values for categorical variables.
- 📈 Variance is discussed as a measure of dispersion, with an example calculation.
- 📏 Standard deviation is introduced as the square root of variance and its significance in understanding data spread.
- 📊 Percentiles are explained as a way to represent data in terms of percentages.
- 🧐 The concept of quartiles is mentioned as a step towards finding outliers.
📊 Distribution of Data
- Percentiles: A value below which a certain percentage of observations lie (e.g., 80th percentile means 80% of data is below that value).
- Calculation Example: The percentile rank of a value x is (number of values below x / sample size) * 100; the text works this out for x = 10.
- Five Number Summary: Minimum, Q1 (1st Quartile), Median, Q3 (3rd Quartile), Maximum.
- Box Plot: Visualization of the Five Number Summary, useful for identifying outliers.
- Outlier Removal: Using Interquartile Range (IQR) and lower/upper fences to detect and remove outliers.
- Variance: Formula for sample variance and its use in statistics.
- Standard Deviation: Measure of data dispersion.
- Histograms: Graphical representation of data distribution.
- Probability Density Functions (PDFs): Describing how data is distributed.
- Mean, Median, Mode: Measures of central tendency.
- Python Programming: Practical implementation of statistics concepts.
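As a rough illustration of the measures listed above, the sketch below uses Python's built-in `statistics` module. The `data` values are hypothetical, and the "inclusive" quartile method is just one of several common conventions, so other tools may report slightly different quartiles.

```python
import statistics

data = [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 8, 9, 10]  # hypothetical values

# Measures of central tendency
mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)

# Adding one extreme value shifts the mean but leaves the median in place.
with_outlier = data + [100]
mean_out = statistics.mean(with_outlier)
median_out = statistics.median(with_outlier)

# Measures of dispersion: sample variance (n - 1 denominator) and its square root
variance = statistics.variance(data)
std_dev = statistics.stdev(data)

def percentile_rank(values, x):
    """Percentage of values strictly below x."""
    below = sum(1 for v in values if v < x)
    return below / len(values) * 100

rank_of_10 = percentile_rank(data, 10)  # 12 of 13 values lie below 10

# Five-number summary: minimum, Q1, median, Q3, maximum
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
five_number_summary = (min(data), q1, q2, q3, max(data))
```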
📈 Distributions Covered:
- Normal (Gaussian) Distribution
- Standard Normal Distribution
- Z-Scores
- Log-Normal Distribution
- Bernoulli Distribution
- Binomial Distribution
📊 Data Visualization Tools:
- Bar Plot
- Violin Plot
The text discusses various statistical concepts and their practical applications, including data distribution visualization, outlier detection, and statistical measures, with an emphasis on using Python for implementation.
- 📊 The text discusses the concept of distributions, particularly Gaussian or normal distributions.
- 🛎️ A Gaussian distribution is characterized by a bell curve, with symmetrical sides.
- 🧮 Standard deviation is discussed, and the text mentions the empirical rule (68-95-99.7): about 68%, 95%, and 99.7% of values fall within 1, 2, and 3 standard deviations of the mean.
- 📈 Z-scores are introduced as a way to determine how many standard deviations a value is from the mean.
- 📉 Standardization is explained as converting data to have a mean of 0 and a standard deviation of 1.
- 🔄 Normalization is mentioned as a process to scale data between a specified range, such as 0 to 1.
- 🖥️ Practical applications of standardization and normalization in machine learning are mentioned.
- 💡 Explanation: The text discusses the concept of pixels and their normalization using min-max scaling and Z-scores.
- 💻 Pixel Value Range: Each pixel in a 4x4 image has a value ranging from 0 to 255.
- 📊 Min-Max Scaling: Min-max scaling converts pixel values to the range 0 to 1 using (x - min) / (max - min), so the minimum value (0) maps to 0 and the maximum value (255) maps to 1.
- 📈 Normalization: The text mentions dividing each pixel value by 255, which gives the same 0-to-1 result because the minimum pixel value is 0 and the maximum is 255.
- 🧮 Z-Score Calculation: The text introduces the Z-score formula (Z = (X - μ) / σ) and applies it to data from cricket matches in 2020 and 2021.
- 🏏 Cricket Analysis: It discusses how Z-scores can be used to compare performance in cricket matches and how to interpret Z-score values.
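A minimal sketch of the two rescaling ideas above, applied to a made-up 4x4 grayscale image. The pixel values and the function names `min_max_scale` and `standardize` are assumptions for illustration, not code from the tutorial.

```python
import statistics

# Hypothetical 4x4 grayscale image, each pixel in the range 0..255
image = [
    [0, 64, 128, 255],
    [32, 96, 160, 224],
    [16, 80, 144, 208],
    [48, 112, 176, 240],
]
pixels = [p for row in image for p in row]

def min_max_scale(values, lo=None, hi=None):
    """Map values to 0..1 with (x - min) / (max - min); assumes hi > lo."""
    lo = min(values) if lo is None else lo
    hi = max(values) if hi is None else hi
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Z = (X - mu) / sigma, giving mean 0 and standard deviation 1."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

scaled = min_max_scale(pixels, lo=0, hi=255)  # identical to dividing by 255
z_scores = standardize(pixels)
```

Passing the known pixel range (0 and 255) rather than the observed minimum and maximum keeps the scaling consistent across images.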
- 📊 Z-Table: The text briefly discusses how to use a Z-table to find the area under the normal distribution curve, indicating the percentage of scores falling above a certain threshold (4.25 in this example).
🔍 In this text, the following points are discussed:
- The importance of understanding the right table for obtaining specific information.
- The absence of information in the right table.
- The need to use the left table for certain information.
- An example related to z-score standardization.
- Calculating the z-score for a given IQ value.
- Explaining the concept of standard deviation.
- Identifying outliers using z-scores.
- Implementing a function to detect outliers.
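One possible shape for such an outlier-detection function is sketched below. The threshold of 3 standard deviations is a common convention assumed here, and the `readings` data and the name `detect_outliers_zscore` are hypothetical.

```python
import statistics

def detect_outliers_zscore(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds `threshold`."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        return []  # all values identical, so nothing can be an outlier
    return [v for v in values if abs((v - mu) / sigma) > threshold]

# Hypothetical data: 19 values between 10 and 13, plus one extreme value
readings = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
            12, 10, 11, 13, 12, 11, 12, 10, 13, 100]

outliers = detect_outliers_zscore(readings)
```

Extreme values inflate the standard deviation they are measured against, so with small samples or several outliers this method can miss points that the IQR approach would flag.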
Please note that the text contains both technical information and tutorial-like explanations.
📊 Data Analysis:
- The speaker discusses data analysis, mentioning terms like "threshold," "standard deviation," "z score," and "outliers."
📈 Z Score Computation:
- Z score computation is explained, including sorting data, calculating q1 and q3 percentiles, and finding outliers based on z scores.
📊 Interquartile Range (IQR):
- The speaker discusses calculating the IQR, lower fence, and upper fence for outlier detection.
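The fence calculation can be sketched as follows, assuming the usual multiplier of 1.5 times the IQR. The `scores` data and the function names are hypothetical, and the quartile method is one convention among several.

```python
import statistics

def iqr_fences(values, k=1.5):
    """Return (lower fence, upper fence) = (Q1 - k*IQR, Q3 + k*IQR)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def remove_outliers_iqr(values, k=1.5):
    """Keep only the values that lie between the two fences."""
    lower, upper = iqr_fences(values, k)
    return [v for v in values if lower <= v <= upper]

scores = [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 8, 9, 100]  # hypothetical; 100 is extreme

lower_fence, upper_fence = iqr_fences(scores)  # Q1 = 3, Q3 = 7, IQR = 4
cleaned = remove_outliers_iqr(scores)
```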
📉 Probability:
- The concept of probability is introduced, emphasizing its importance in various fields like machine learning.
🔗 Probability Definition:
- Probability is defined as the likelihood of an event occurring, with an example involving rolling dice and coin tossing.
📈 Addition Rule for Mutually Exclusive Events:
- The addition rule for mutually exclusive events is explained, with examples of coin tossing and dice rolling.
📈 Addition Rule for Non-Mutually Exclusive Events:
- The addition rule for non-mutually exclusive events is discussed, with an example involving drawing cards from a deck.
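The addition rule for events that are not mutually exclusive can be checked by enumerating a standard 52-card deck. The sketch below uses "Queen" and "Heart" as the two events; the helper name `probability` is hypothetical.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(ranks, suits))  # 52 (rank, suit) pairs

def probability(event):
    """Fraction of cards in the deck for which `event(card)` is true."""
    favourable = sum(1 for card in deck if event(card))
    return Fraction(favourable, len(deck))

def is_queen(card):
    return card[0] == "Q"

def is_heart(card):
    return card[1] == "hearts"

p_queen = probability(is_queen)
p_heart = probability(is_heart)
p_both = probability(lambda c: is_queen(c) and is_heart(c))

# P(A or B) = P(A) + P(B) - P(A and B)
p_either = p_queen + p_heart - p_both
```

For mutually exclusive events the last term is zero, so the rule reduces to P(A or B) = P(A) + P(B).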
These topics cover discussions on data analysis, outlier detection, probability, and addition rules for both mutually exclusive and non-mutually exclusive events.
- 🃏 There are 52 cards in a deck.
- 🎴 Probability of getting a Queen: 4/52
- ❤️ Probability of getting a Heart card: 13/52
- 🃏❤️ Probability of getting a Queen and a Heart card: 1/52
- 🧮 Addition Rule for Non-Mutually Exclusive Events:
- Probability of Queen or Heart = Probability of Queen + Probability of Heart - Probability of Queen and Heart
- (4/52) + (13/52) - (1/52) = 16/52
- 🎲 Probability can be divided into Independent and Dependent Events:
- Independent events are not influenced by previous events.
- Dependent events are influenced by previous events.
- 🎲 For independent events, the probability of each outcome stays the same from trial to trial (e.g., each face of a fair die keeps probability 1/6 on every roll).
- 🎲 Dependent events involve conditional probabilities.
- 🎯 Permutation: Arranging objects where order matters.
- Example: Arranging chocolates in a specific order.
- Formula: nPr = n! / (n - r)!; for example, 6P3 = 6! / 3! = 120
- 🤝 Combination: Selecting objects where order doesn't matter.
- Example: Selecting unique combinations of chocolates.
- Formula: nCr = n! / (r! * (n - r)!); for example, 6C3 = 6! / (3! * 3!) = 20
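Both counts can be confirmed with Python's standard library, either from the formulas or by listing every arrangement and selection. The chocolate labels below are hypothetical placeholders.

```python
import math
from itertools import combinations, permutations

chocolates = ["A", "B", "C", "D", "E", "F"]  # hypothetical labels, n = 6
r = 3

# Counts from the formulas
n_p_r = math.perm(len(chocolates), r)  # n! / (n - r)!
n_c_r = math.comb(len(chocolates), r)  # n! / (r! * (n - r)!)

# The same counts by explicit enumeration
arrangements = list(permutations(chocolates, r))  # order matters
selections = list(combinations(chocolates, r))    # order does not matter
```

Each selection of 3 chocolates can be ordered in 3! = 6 ways, which is why the permutation count is 6 times the combination count.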
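- 📊 P-value is the probability of observing a result at least as extreme as the one obtained, assuming the null hypothesis is true.
- Higher p-values indicate that the observed result is consistent with the null hypothesis.
- Lower p-values indicate that the observed result would be unlikely under the null hypothesis.
- A p-value of 0.8 means a result at least this extreme would occur about 80% of the time if the null hypothesis were true.
- A p-value of 0.01 means a result at least this extreme would occur about 1% of the time if the null hypothesis were true.
- P-value helps assess the significance of results in statistical analysis.
- 🧪 Hypothesis testing involves: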
- Combining topics such as confidence intervals and significance values.
- Assessing if a coin is fair through experiments and probability.
- Null and alternate hypotheses are defined in hypothesis testing.
- Experiments are performed, and the null hypothesis is either rejected or not rejected (informally, "accepted").
- 📊 Confidence Intervals:
- The confidence interval is defined using significance value (alpha).
- It represents the range within which a result is considered acceptable.
- A significance value of 0.05 corresponds to a 95% confidence interval.
- 📉 Significance Value:
- Significance value (alpha) determines the width of the confidence interval.
- If the experiment falls within the interval, the null hypothesis is accepted.
- If outside the interval, the null hypothesis is rejected.
- 🧮 Type 1 and Type 2 Errors:
- Type 1 error occurs when the null hypothesis is rejected when it is true.
- Type 2 error occurs when the null hypothesis is accepted when it is false.
- These errors are important in hypothesis testing and are part of a confusion matrix.
- 📌 A Type 2 error is also known as a false negative.
- 📌 There are four possible outcomes when evaluating hypotheses.
- 📌 Outcome four involves accepting the null hypothesis when it is true, which is a good scenario.
- 📌 Confusion matrices in real-world scenarios help define true positives, true negatives, false positives, and false negatives.
- 📌 Whether a false positive counts as a Type 1 or a Type 2 error depends on context, that is, on how the null hypothesis is framed.
- 📌 One-tailed and two-tailed tests are important concepts.
- 📌 In a one-tailed test, you focus on one direction (e.g., greater than), while in a two-tailed test, you consider both directions (e.g., greater than or less than).
- 📌 Confidence intervals help estimate population parameters.
- 📌 A point estimate is the value of a statistic that estimates a parameter.
- 📌 Confidence intervals consist of a point estimate plus or minus a margin of error.
- 📌 When the population standard deviation is known, a z-test is used to find the confidence interval.
- 📌 The formula for the confidence interval is Point Estimate ± Z(α/2) * (Standard Deviation / √Sample Size).
- 📌 This formula is typically used when the sample size is greater than or equal to 30 and the population standard deviation is known.
- 📌 Sample size and population standard deviation influence the choice of formula for confidence intervals.
📊 Summary of the Text:
- The text discusses various statistical calculations and hypothesis testing procedures, particularly focusing on z-tests and confidence intervals.
- It begins by explaining how to find the z-score using a z-table.
- The text then provides an example of calculating a confidence interval with a given alpha level.
- It delves into hypothesis testing, defining the null and alternate hypotheses and setting the alpha level.
- The decision rule for a two-tailed test is explained, along with the calculation of test statistics using the z-test formula.
- The text briefly mentions the importance of the standard error in larger sample sizes and hints at the central limit theorem.
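A z-based confidence interval of the form point estimate ± Z(α/2) × (σ / √n) might be computed as in the sketch below. The input numbers (sample mean 80, σ = 15, n = 36) are hypothetical, and `NormalDist().inv_cdf` stands in for the z-table lookup.

```python
import math
from statistics import NormalDist

def z_confidence_interval(sample_mean, sigma, n, alpha=0.05):
    """Two-sided interval: sample_mean +/- z(alpha/2) * sigma / sqrt(n)."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    margin = z_crit * sigma / math.sqrt(n)        # margin of error
    return sample_mean - margin, sample_mean + margin

# Hypothetical inputs: known population sigma and a sample of size 36
low, high = z_confidence_interval(sample_mean=80, sigma=15, n=36)
```

The term sigma / sqrt(n) is the standard error, so the interval narrows as the sample size grows.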
Please note that this summary includes technical content from the text and may not be easily understandable without prior knowledge of statistics.
📊 Chi-Square Test
Population Information (2000 Census):
- Less than 18 years: 20%
- 18 to 35 years: 30%
- Greater than 35 years: 50%
Observed Distribution (2010 Sample, n = 500):
- Less than 18 years: 121
- 18 to 35 years: 288
- Greater than 35 years: 91
Expected Distribution (2000 Census Proportions Applied to n = 500):
- Less than 18 years: 100
- 18 to 35 years: 150
- Greater than 35 years: 250
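The goodness-of-fit comparison of the observed and expected counts above can be sketched as follows. The variable names are hypothetical, and the closed-form critical value used here is valid only for 2 degrees of freedom; other cases would need a chi-square table or a statistics library.

```python
import math

proportions_2000 = {"<18": 0.20, "18-35": 0.30, ">35": 0.50}
observed_2010 = {"<18": 121, "18-35": 288, ">35": 91}

n = sum(observed_2010.values())  # 500
expected = {group: p * n for group, p in proportions_2000.items()}

# Chi-square statistic: sum of (observed - expected)^2 / expected
chi_square = sum(
    (observed_2010[g] - expected[g]) ** 2 / expected[g] for g in observed_2010
)

alpha = 0.05
df = len(observed_2010) - 1  # number of categories minus one = 2

# For df = 2 only, the chi-square CDF is 1 - exp(-x / 2),
# so the critical value is -2 * ln(alpha), about 5.99.
critical_value = -2 * math.log(alpha)

reject_null = chi_square > critical_value
```

The statistic comes out at roughly 232.5, far beyond the critical value.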
Conclusion:
- There is a significant difference between the expected and observed distributions.
- Using alpha = 0.05, we conclude that the population distribution of ages has changed in the last 10 years.
- 📊 The text discusses data analysis and hypothesis testing.
- 📈 It mentions the importance of defining null and alternate hypotheses.
- 📏 It specifies an alpha value of 0.05 for a 95% confidence interval.
- 📊 The text explains the calculation of degrees of freedom for a chi-square test.
- 📈 It discusses chi-square tests and decision boundaries.
- 📉 It calculates the chi-square test statistic using observed and expected values.
- 📊 The text mentions the significance level (alpha) and p-values in hypothesis testing.
- 📈 It introduces covariance as a measure of the relationship between two variables.
- 📉 It explains positive, negative, and zero covariance values.
- 📊 The text highlights the limitation of covariance in not providing a fixed magnitude for correlation.
- 📈 It hints at the need for a correlation coefficient like Pearson correlation to measure correlation strength.
- 📊 The Pearson correlation coefficient restricts values between -1 and +1.
- 🧮 It measures the degree of correlation between two variables.
- ➡️ A positive correlation (towards +1) means variables move together.
- ⬅️ A negative correlation (towards -1) means variables move opposite.
- ✍️ Formula: Pearson correlation = Covariance(X, Y) / (Std Dev(X) * Std Dev(Y))
- 📈 Correlation values range from -1 to +1.
- 📉 Negative correlation when X decreases, Y increases.
- 📊 Positive correlation when X increases, Y increases.
- ⚖️ Values lying exactly on a sloped straight line have a correlation of -1 or +1.
- 🧐 Non-linear but monotonic relationships are better captured by Spearman rank correlation.
- 📝 Spearman formula: Covariance(rank(X), rank(Y)) / (Std Dev(rank(X)) * Std Dev(rank(Y)))
- 📊 Spearman rank correlation captures monotonic non-linear relationships because it works on ranks rather than raw values.
- 🧑‍🎓 Understanding rank: Assign ranks to data points to compute Spearman correlation.
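The two formulas above can be written out directly, as in this sketch. The function names are hypothetical, tied values are given average ranks (a common convention, assumed here), and the cubic data set is made up to show a monotonic but non-linear relationship.

```python
import statistics

def pearson(x, y):
    """Covariance(X, Y) / (Std Dev(X) * Std Dev(Y)), using sample statistics."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (statistics.stdev(x) * statistics.stdev(y))

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def spearman(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    return pearson(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]  # hypothetical data: increasing, but not linear

r_pearson = pearson(x, y)    # below 1, because the points are not on a line
r_spearman = spearman(x, y)  # 1 up to rounding, because the order is preserved
```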
- 📊 T-test used to compare sample mean with population mean.
- 📊 If p-value < 0.05, reject the null hypothesis.
- 📊 Visualization tools like pair plots and correlation matrices help analyze correlations.
- 📊 Correlation can be positive (variables move together) or negative (variables move oppositely).
- 📊 Spearman rank correlation is used when non-linear relationships are expected.
📊 Summary of the text:
- 💡 Explains the significance of p-values in statistical testing.
- 🧪 Discusses how p-values relate to null hypothesis testing.
- 📝 Provides an example of a z-test problem with calculations.
- 📈 Demonstrates the calculation of p-values based on z-scores.
- 🤝 Emphasizes the importance of comparing p-values to significance levels (alpha) for hypothesis testing.
- 📚 Mentions topics to be covered in future sessions, including distributions, central limit theorem, and F-tests.
- 🔀 Describes the process of rejecting or failing to reject the null hypothesis based on p-values and significance levels.
- 🔍 The problem involves hypothesis testing and statistical analysis.
- 📊 Average age of students at a college is 24 years with a standard deviation of 1.5.
- 🧪 A sample of 36 students is taken.
- 📈 The sample mean age is 25 years.
- 📊 Hypotheses:
- H0 (null hypothesis): Mean age = 24 years.
- H1 (alternative hypothesis): Mean age ≠ 24 years.
- 🧮 Standard deviation (σ) is 1.5, sample size (n) is 36, and sample mean (x̄) is 25.
- 📝 Significance level (alpha) is 0.05.
- 🧮 It's a two-tailed test.
- 📉 Calculate the z-score: (25 - 24) / (1.5 / √36) = 1 / 0.25 = 4.0
- 📈 Decision boundary is ±1.96 for a 95% confidence interval.
- 🚫 4.0 > 1.96, so reject the null hypothesis.
- 📊 Calculate the p-value: two-tailed p ≈ 0.00006.
- 🚫 The p-value is less than alpha (0.05), so reject the null hypothesis.
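Evaluating the z-test with the stated inputs (population mean 24, σ = 1.5, n = 36, sample mean 25, alpha = 0.05) can be sketched as below. The variable names are hypothetical, and `NormalDist` replaces the z-table lookup.

```python
import math
from statistics import NormalDist

mu0 = 24          # mean under the null hypothesis
sigma = 1.5       # population standard deviation
n = 36            # sample size
sample_mean = 25
alpha = 0.05

standard_error = sigma / math.sqrt(n)     # 1.5 / 6 = 0.25
z = (sample_mean - mu0) / standard_error  # 1 / 0.25 = 4.0

# Two-tailed test: compare |z| with the critical value, or p with alpha
z_crit = NormalDist().inv_cdf(1 - alpha / 2)    # about 1.96
p_value = 2 * (1 - NormalDist().cdf(abs(z)))    # area in both tails

reject_by_z = abs(z) > z_crit
reject_by_p = p_value < alpha
```

The two decision rules always agree: the test statistic lies beyond the critical value exactly when the p-value is below alpha.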
- 📊 Discusses various probability distributions: Bernoulli, Binomial, and Pareto.
- 📉 Explains the relationship between power law (Pareto) and log-normal distributions.
- 📦 Mentions data transformations for normalizing distributions, including Box-Cox transformation.
- 📄 Central Limit Theorem is briefly mentioned as applicable to various distributions.
- 📊 When multiple samples of size n ≥ 30 are drawn from data, their sample means tend to follow a normal distribution due to the Central Limit Theorem.
- 📈 The more samples (m) are drawn, the more clearly the distribution of sample means shows the normal shape.
- 🧮 Sample size (n ≥ 30) and the number of samples (m) are crucial for the Central Limit Theorem.
- 📈 Plotting all sample means results in an approximately normal distribution, regardless of the original data distribution (provided the population variance is finite).
- 📊 Some data distributions mentioned include normal, Poisson, and Pareto distributions.
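The Central Limit Theorem behaviour described above can be explored with a small simulation. This is a sketch under stated assumptions: the exponential population, the seed, and the values n = 30 and m = 2000 are arbitrary choices for illustration.

```python
import random
import statistics

rng = random.Random(42)

# Skewed, clearly non-normal population (exponential with mean 1)
population = [rng.expovariate(1.0) for _ in range(20_000)]

def sample_means(population, n, m, rng):
    """Draw m samples of size n and return the mean of each sample."""
    return [statistics.mean(rng.sample(population, n)) for _ in range(m)]

means = sample_means(population, n=30, m=2_000, rng=rng)

# The sample means centre on the population mean ...
mean_of_means = statistics.mean(means)
population_mean = statistics.mean(population)

# ... and their spread is close to sigma / sqrt(n), the standard error.
spread_of_means = statistics.stdev(means)
expected_spread = statistics.pstdev(population) / 30 ** 0.5
```

A histogram of `means` would look roughly bell-shaped even though the population itself is strongly right-skewed.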