The Basics of Statistics for Data Science By Statisticians

Data science is booming across the industry and is one of the most popular fields these days. Many statistics students want to learn data science, because statistics is the building block of machine learning algorithms. But most students don’t know how much statistics they need before they start. To solve this problem, we are going to share the statistics concepts that matter most for data science. In this blog, you will see which statistics topics are crucial to get started.

Introduction to Statistics

Statistics is one of the most crucial subjects for students. It offers methods that help solve complex real-life problems, and it is used almost everywhere. Data scientists and data analysts use it to find meaningful trends in the world. Besides, statistics has the power to derive meaningful insights from data.

Statistics offers a variety of functions, principles, and algorithms that help us analyze raw data, build a statistical model, and infer or predict a result.

Terminologies in Statistics

Before getting started with data science, we have to be well aware of the key statistical terminologies.

Population: It is the complete set of sources from which data is to be collected. A population can be very large.

Sample: It is a subset of the population, and it is the part we actually collect data from.

Variable: It is any characteristic, number, or quantity that can be measured or counted. In other words, a variable is a data item.

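Statistical parameter: A statistical parameter, also known as a population parameter, is a quantity that describes the population, such as its mean or standard deviation. A statistical model is built around such parameters.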

Types of Analysis

Statistics has two types of analysis.

Quantitative Analysis: Quantitative analysis is also known as statistical analysis. It is the science of collecting and interpreting data with numbers and graphs in order to identify patterns and trends.

Qualitative Analysis: Qualitative analysis is also known as non-statistical analysis. It gives generic information and uses text, sound, and other forms of media.

Data Types

Numerical: Numerical data is expressed with digits and is measurable. There are two major types of numerical data: discrete (countable values, such as the number of orders) and continuous (any value in a range, such as height).

Categorical: Categorical data is qualitative data that is classified into categories. There are two major types of categorical data: nominal (no order) and ordinal (ordered).

Measures of Central Tendency

Mean: The mean is the average of the given dataset.

Median: Median is the middle of the given ordered dataset.

Mode: Mode is the most common value in a given dataset. It is most useful for discrete or categorical data.
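To make the three measures concrete, here is a minimal sketch using Python's built-in statistics module. The ticket counts are made-up illustrative data, not taken from any real dataset.

```python
import statistics

# Hypothetical sample: support tickets closed per day over one week.
tickets = [3, 5, 5, 6, 8, 9, 27]

mean_value = statistics.mean(tickets)      # sum of values divided by count
median_value = statistics.median(tickets)  # middle value of the sorted data
mode_value = statistics.mode(tickets)      # most frequent value

print("mean:", mean_value)
print("median:", median_value)
print("mode:", mode_value)
```

Notice that the single large value (27) pulls the mean above the median, which is why the median is often preferred when the data contains outliers.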

Measures of Variability

Range: Range is the difference between the maximum and minimum value in a given dataset.

Variance (σ²): Variance measures how spread out the data is relative to the mean. It is the average of the squared deviations from the mean.

Standard Deviation (σ): It is also a measure of how spread out the numbers in a dataset are. It is the square root of the variance, so it is expressed in the same units as the data.

Z-score: The z-score tells us how many standard deviations a data point is from the mean: z = (x − μ) / σ.
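The spread measures above can be sketched in a few lines of Python. The scores are made-up data, and the helper name z_score is our own. Note the assumption that we use the population formulas (dividing by n); for a sample you would typically use statistics.variance and statistics.stdev, which divide by n - 1.

```python
import statistics

# Hypothetical data, treated here as a full population.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(scores) - min(scores)    # maximum minus minimum
mu = statistics.mean(scores)              # population mean
variance = statistics.pvariance(scores)   # mean squared deviation (divides by n)
std_dev = statistics.pstdev(scores)       # square root of the variance

def z_score(x, mean, std):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / std

z_of_nine = z_score(9, mu, std_dev)

print("range:", data_range)
print("variance:", variance)
print("standard deviation:", std_dev)
print("z-score of 9:", z_of_nine)
```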

R-Squared: R-squared is a statistical measure of fit. It indicates how much of the variation of a dependent variable is explained by the independent variable(s). It can be used for both simple and multiple linear regression, but it never decreases when more predictors are added.

Adjusted R-squared: It is a modified version of R-squared that has been adjusted for the number of predictors in the model. It increases if a new term improves the model more than would be expected by chance, and it decreases if the new term improves the model less than expected.
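Here is a small sketch of both measures using their standard formulas. The function names and the actual/predicted values are our own illustrative assumptions; in practice the predictions would come from a fitted regression model.

```python
def r_squared(actual, predicted):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n_samples, n_predictors):
    """Penalize R-squared for the number of predictors in the model."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)

# Hypothetical observed values and model predictions.
actual = [3.0, 5.0, 7.0, 9.0, 11.0]
predicted = [2.8, 5.1, 7.2, 8.9, 11.0]

r2 = r_squared(actual, predicted)
adj_one = adjusted_r_squared(r2, n_samples=5, n_predictors=1)
adj_three = adjusted_r_squared(r2, n_samples=5, n_predictors=3)

print("R-squared:", r2)
print("adjusted, 1 predictor:", adj_one)
print("adjusted, 3 predictors:", adj_three)
```

For the same R-squared, the adjusted value drops as the number of predictors grows, which is exactly the penalty for adding terms that do not earn their place.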

Measurements of Relationships between Variables

Covariance: Covariance measures how two variables vary together. If it is positive, the variables tend to move in the same direction. If it is negative, they tend to move in opposite directions. If it is zero, there is no linear relationship between them.

Correlation: Correlation measures the strength of the linear relationship between two variables. It is the normalized version of covariance and ranges from -1 to 1. As a rule of thumb, a correlation beyond +/- 0.7 represents a strong relationship between two variables. On the other hand, a correlation between -0.3 and 0.3 indicates little or no linear relationship.
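The relationship between the two measures can be sketched in plain Python. The function names and the study-hours data are illustrative assumptions, and the sample formula (dividing by n - 1) is used for covariance.

```python
import math

def covariance(xs, ys):
    """Sample covariance: average product of deviations (divides by n - 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

# Hypothetical data: hours studied and exam score.
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 68]

cov_value = covariance(hours, score)
corr_value = correlation(hours, score)
corr_negative = correlation(hours, [10, 8, 6, 4, 2])  # perfectly decreasing

print("covariance:", cov_value)
print("correlation:", corr_value)
print("correlation with a decreasing series:", corr_negative)
```

Covariance depends on the units of the data, while correlation is unit-free, which is why correlation is easier to compare across datasets.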

Probability Distribution Functions

Probability Density Function (PDF): It is used for continuous data. The value of the PDF at any point can be interpreted as the relative likelihood that the random variable takes a value near that point. The probability of a range of values is the area under the PDF over that range.

Probability Mass Function (PMF): It is used for discrete data. It gives the probability that the random variable is exactly equal to a given value.

Cumulative Distribution Function (CDF): The cumulative distribution function, sometimes called the cumulative density function, tells us the probability that the random variable is less than or equal to a certain value. For continuous data, it is the integral of the PDF.
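As a rough illustration of how the PDF and CDF fit together, the sketch below uses the normal distribution as an example. The function names are our own, and the numerical integration is a simple midpoint sum chosen for readability, not accuracy or speed.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the normal distribution at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X <= x) for a normal variable, via the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

def integrate_pdf(lower, upper, steps=10_000):
    """Approximate the area under the standard normal PDF (midpoint rule)."""
    width = (upper - lower) / steps
    return sum(normal_pdf(lower + (i + 0.5) * width) for i in range(steps)) * width

# The area under the PDF up to 1.0 should match the CDF at 1.0.
# -8.0 stands in for minus infinity, since the density is negligible there.
area = integrate_pdf(-8.0, 1.0)

print("CDF at 1.0:", normal_cdf(1.0))
print("integrated PDF up to 1.0:", area)
```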

Continuous Data Distributions

Continuous Distribution: A continuous distribution is a probability distribution for a variable that can take any value within a range. Probabilities are assigned to intervals of values rather than to single points.

Normal/Gaussian Distribution: The normal distribution is commonly referred to as the bell curve. It is closely related to the central limit theorem, which says that averages of many independent samples tend toward a normal distribution. The standard normal distribution has a mean of 0 and a standard deviation of 1.

T-Distribution: The t-distribution is another probability distribution. It is used to estimate population parameters when the sample size is small or the population variance is unknown.

Uniform Distribution: In this probability distribution, every value within a certain range is equally likely, and the probability of any value outside this range is 0. It is also known as an on-off distribution.

Poisson Distribution: It can look quite similar to the normal distribution, but with the added factor of skewness. When the skewness is low, the distribution is spread relatively evenly on both sides of its center, much like the normal. When the skewness is high, the data spreads out unequally in different directions. It is actually a discrete distribution, so it appears again in the next section.

Discrete Data Distributions

Poisson Distribution: It is one of the most common discrete probability distributions. It expresses the probability of a given number of events occurring in a fixed time period, assuming the events happen independently at a constant average rate.

Binomial Distribution: It is the probability distribution of the number of successes in a sequence of n independent experiments, each with a Boolean-valued outcome: success with probability p or failure with probability 1-p.
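Both distributions have short closed-form probability mass functions, sketched below with Python's math module (math.comb needs Python 3.8 or later). The function names and the example numbers are illustrative assumptions.

```python
import math

def poisson_pmf(k, rate):
    """Probability of exactly k events when the average count is `rate`."""
    return math.exp(-rate) * rate ** k / math.factorial(k)

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

# Hypothetical questions:
# a help desk receives 3 calls per hour on average; chance of exactly 5 calls?
calls = poisson_pmf(5, rate=3)
# a fair coin is tossed 4 times; chance of exactly 2 heads?
heads = binomial_pmf(2, n=4, p=0.5)

print("P(5 calls in an hour):", calls)
print("P(2 heads in 4 tosses):", heads)
```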

Moments

Moments describe different aspects of the nature and shape of a distribution. They come in sequence: the mean is the first moment, the variance is the second, skewness is the third, and kurtosis is the fourth. Strictly speaking, the variance is a central moment (measured around the mean), while skewness and kurtosis are standardized moments (also divided by a power of the standard deviation).
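The four moments can be computed directly from their definitions. This is a minimal sketch with our own function names and made-up data; it assumes population-style formulas (dividing by n) and reports plain kurtosis rather than excess kurtosis.

```python
import math

def central_moment(data, order):
    """Average of (x - mean) raised to the given power."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** order for x in data) / len(data)

def describe_moments(data):
    """Return (mean, variance, skewness, kurtosis) for the data."""
    mean = sum(data) / len(data)
    variance = central_moment(data, 2)
    std = math.sqrt(variance)
    skewness = central_moment(data, 3) / std ** 3   # third standardized moment
    kurtosis = central_moment(data, 4) / std ** 4   # fourth standardized moment
    return mean, variance, skewness, kurtosis

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]

print("symmetric:", describe_moments(symmetric))
print("right-skewed:", describe_moments(right_skewed))
```

A symmetric dataset has a skewness of zero, while a long tail to the right gives a positive skewness.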

Probability

Probability is the likelihood that an event will occur, expressed as a number between 0 and 1.

Conditional Probability: The conditional probability P(A|B) is the likelihood of event A occurring, given that event B has already occurred.

Bayes’ Theorem: Bayes’ theorem is one of the best-known formulas in probability. It is used to determine conditional probability. It states that the probability of A given B is equal to the probability of B given A times the probability of A, divided by the probability of B: P(A|B) = P(B|A) × P(A) / P(B).
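A worked example helps here. The sketch below applies Bayes' theorem to a made-up medical test; the function name, the 1% prevalence, and the test rates are all illustrative assumptions. The denominator P(B) is expanded with the law of total probability.

```python
def bayes_posterior(prior_a, prob_b_given_a, prob_b_given_not_a):
    """P(A|B) = P(B|A) * P(A) / P(B), with P(B) summed over A and not-A."""
    evidence = prob_b_given_a * prior_a + prob_b_given_not_a * (1 - prior_a)
    return prob_b_given_a * prior_a / evidence

# Hypothetical numbers: A = "has the condition", B = "test is positive".
posterior = bayes_posterior(
    prior_a=0.01,             # 1% of people have the condition
    prob_b_given_a=0.95,      # the test is positive for 95% of those who have it
    prob_b_given_not_a=0.05,  # the test is positive for 5% of those who do not
)

print("P(condition | positive test):", posterior)
```

Even with a fairly good test, the chance of having the condition after a positive result is only about 16% here, because the condition is rare to begin with.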

Accuracy

True positive: The test detects the condition when the condition is present.

True negative: The test does not detect the condition when the condition is absent.

False positive: The test detects the condition when the condition is absent.

False negative: The test does not detect the condition when the condition is present.

Sensitivity: It measures the ability of a test to detect the condition when the condition is present. Sensitivity = TP/(TP+FN)

Specificity: It measures the ability of a test to correctly exclude the condition when the condition is absent. Specificity = TN/(TN+FP)

Predictive value positive: Predictive value positive is also called precision. It is the proportion of positive results that correspond to the presence of the condition. PVP = TP/(TP+FP)

Predictive value negative: It is the proportion of negative results that correspond to the absence of the condition. PVN = TN/(TN+FN)
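The four formulas above translate directly into code. This is a minimal sketch with our own function names and a tiny made-up set of labels, where 1 means the condition is present (or detected) and 0 means it is absent (or not detected).

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives from paired 0/1 labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return tp, tn, fp, fn

def diagnostic_metrics(tp, tn, fp, fn):
    """Apply the sensitivity, specificity, PVP, and PVN formulas."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "pvp": tp / (tp + fp),
        "pvn": tn / (tn + fn),
    }

# Hypothetical test results for ten people.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, tn, fp, fn = confusion_counts(actual, predicted)
metrics = diagnostic_metrics(tp, tn, fp, fn)

print("TP, TN, FP, FN:", tp, tn, fp, fn)
print(metrics)
```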

Conclusion

We have now gone through the basic concepts of statistics for data science. If you are going to start with data science, you should try to get a good command of all these concepts, because the rest of data science builds on them. So what are you waiting for? Grab the best statistics books and start learning these concepts.
