A probability distribution is a statistical function that describes all the possible values and likelihoods that a random variable can take within a given range. This range is bounded by the minimum and maximum possible values, but precisely where a given value is likely to fall within that range depends on a number of factors. These factors include the distribution’s mean (average), standard deviation, skewness, and kurtosis.
The most commonly used types of distributions are the following:
The Bernoulli distribution is one of the easiest distributions to understand and can be used as a starting point to derive more complex…
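As a small illustration in plain Python, a minimal sketch of the Bernoulli distribution: its probability mass function and a simple way to draw samples (the parameter p = 0.3 is an arbitrary example):

```python
import random

def bernoulli_pmf(k, p):
    """P(X = k) for a Bernoulli(p) variable: p if k == 1, 1 - p if k == 0."""
    if k not in (0, 1):
        raise ValueError("a Bernoulli variable only takes the values 0 or 1")
    return p if k == 1 else 1 - p

def bernoulli_sample(p, n, seed=0):
    """Draw n independent Bernoulli(p) samples."""
    rng = random.Random(seed)
    return [1 if rng.random() < p else 0 for _ in range(n)]

samples = bernoulli_sample(0.3, 10_000)
print(bernoulli_pmf(1, 0.3))        # 0.3
print(sum(samples) / len(samples))  # empirical mean, close to 0.3
```

The empirical mean of the samples converges to p as the sample size grows, which is exactly what the distribution's mean predicts.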
Principal Component Analysis, or PCA, is a dimensionality-reduction method that is often used to reduce the dimensionality of large data sets, by transforming a large set of variables into a smaller one that still contains most of the information in the large set. It is the process of computing the principal components and using them to perform a change of basis on the data, sometimes using only the first few principal components and ignoring the rest.
PCA is used in exploratory data analysis and for making predictive models. It is commonly used for dimensionality reduction by projecting each data point…
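The change-of-basis computation described above can be sketched with NumPy (assuming NumPy is available); the synthetic 2-D data below is illustrative, chosen so that most of the variance lies along one direction:

```python
import numpy as np

rng = np.random.default_rng(0)
# 200 points lying mostly along one direction in 2-D
x = rng.normal(size=200)
data = np.column_stack([x, 0.5 * x + rng.normal(scale=0.1, size=200)])

# 1. center the data
centered = data - data.mean(axis=0)
# 2. covariance matrix and its eigendecomposition
cov = np.cov(centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
# 3. sort components by explained variance (largest first)
order = np.argsort(eigvals)[::-1]
components = eigvecs[:, order]
# 4. change of basis: keep only the first principal component
projected = centered @ components[:, :1]

explained = eigvals[order] / eigvals.sum()
print(projected.shape)          # (200, 1) — dimensionality reduced from 2 to 1
print(round(explained[0], 3))   # fraction of variance kept by the first component
```

Because the second coordinate is almost a linear function of the first, the first principal component captures nearly all of the variance, so dropping the second loses very little information.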
Naive Bayes Classifier is a classification algorithm that uses the Bayes theorem for classification. It gives very good results on NLP tasks such as sentiment analysis, and it is a fast and uncomplicated classification algorithm, which is why it is quite popular in NLP.
To understand the Naive Bayes Classifier, we need to understand the Bayes theorem.
Bayes Theorem is a principled way of calculating a conditional probability. Conditional probability is the probability that something will happen, given that something else has already occurred. …
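Bayes' theorem can be illustrated in a few lines of Python; the spam-filter probabilities below are hypothetical numbers chosen for the example, not real data:

```python
def bayes_theorem(p_b_given_a, p_a, p_b):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# Hypothetical numbers: 20% of emails are spam, half of spam emails contain
# the word "free", and 18% of all emails contain it.
p_spam_given_free = bayes_theorem(p_b_given_a=0.5, p_a=0.2, p_b=0.18)
print(round(p_spam_given_free, 3))  # 0.556
```

So even though only 20% of emails are spam, seeing the word "free" raises the conditional probability of spam to about 56% — that update from prior to posterior is the essence of the theorem.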
Extremely Randomized Trees, or Extra Trees for short, is an ensemble machine learning algorithm based on decision trees. The Extra Trees algorithm works by creating a large number of unpruned decision trees from the training dataset. The predictions of the individual trees are then aggregated to yield the final prediction: majority voting in classification problems and the arithmetic average in regression problems.
There are three main hyperparameters to tune in the algorithm; they are the number of…
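A minimal sketch using scikit-learn's `ExtraTreesClassifier` (assuming scikit-learn is installed). The hyperparameters shown — number of trees, features considered per split, and minimum samples per split — are the ones commonly tuned for this algorithm, an assumption on my part since the list above is truncated; the dataset is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

# synthetic classification data for illustration
X, y = make_classification(n_samples=500, n_features=10, random_state=7)

model = ExtraTreesClassifier(
    n_estimators=100,      # number of trees in the ensemble
    max_features="sqrt",   # features considered at each (random) split
    min_samples_split=2,   # samples required to split an internal node
    random_state=7,
)
scores = cross_val_score(model, X, y, cv=5)
print(round(scores.mean(), 3))  # mean cross-validated accuracy
```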
In order to understand semi-supervised learning, we should understand supervised and unsupervised learning first.
A supervised learning algorithm learns from labeled training data and helps you predict outcomes for unseen data.
An unsupervised learning algorithm learns patterns in unlabeled data.
Semi-supervised machine learning combines supervised and unsupervised methods: it uses a small amount of labeled data together with a large amount of unlabeled data during training. It therefore falls between unsupervised learning (with no labeled training data) and supervised learning (with only labeled training data). …
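A minimal sketch of the idea using scikit-learn's `SelfTrainingClassifier` (assuming scikit-learn is installed); the base estimator and the choice to hide 90% of the labels are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = make_classification(n_samples=300, random_state=0)

# hide 90% of the labels; -1 marks "unlabeled" in scikit-learn
y_semi = y.copy()
rng = np.random.default_rng(0)
y_semi[rng.random(len(y)) < 0.9] = -1

# self-training: fit on the labeled points, then iteratively
# pseudo-label confident unlabeled points and refit
model = SelfTrainingClassifier(LogisticRegression(max_iter=1000))
model.fit(X, y_semi)
acc = model.score(X, y)
print(round(acc, 3))
```

The classifier sees only the small labeled subset directly, yet the unlabeled points still shape the final decision boundary through pseudo-labeling.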
XGBoost stands for eXtreme Gradient Boosting.
XGBoost is a decision-tree-based ensemble Machine Learning algorithm that uses a gradient boosting framework. It is an implementation of gradient boosted decision trees designed for speed and performance. XGBoost has been dominating machine learning and Kaggle competitions for structured or tabular data.
The XGBoost algorithm was developed as a research project at the University of Washington. Tianqi Chen and Carlos Guestrin presented their paper at the SIGKDD conference in 2016, and it took the Machine Learning world by storm.
To understand XGBoost, we must first understand Gradient Descent and Gradient Boosting. Gradient Descent is an iterative optimization…
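The core idea of gradient descent can be sketched in a few lines of plain Python; the quadratic function, learning rate, and step count below are arbitrary illustrative choices:

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Repeatedly step against the gradient to minimize a function."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3);
# the minimum is at x = 3
minimum = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(round(minimum, 4))  # ≈ 3.0
```

Gradient boosting applies the same idea in function space: each new tree is fit to point in the direction that most reduces the loss of the ensemble so far.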
Apache Spark is an open-source parallel processing framework for storing and processing Big Data across clustered computers. Spark can perform computations much faster than Hadoop; moreover, Hadoop and Spark can be used together efficiently. Spark is written in Scala, which is considered the primary language for interacting with the Spark Core engine and executes inside a Java Virtual Machine (JVM), but developers are not required to know Scala. APIs for Java, Python, R, and Scala put Spark within reach of a wide audience of developers, and they have embraced the software.
Spark uses a master-slave…
Neural networks are among the most powerful and widely used algorithms in machine learning. A neural network works similarly to the human brain’s neural network.
Neural networks are sets of layers of highly interconnected processing elements (neurons) that apply a series of transformations to the data to generate their own understanding of it (what we commonly call features).
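A forward pass through a tiny network can illustrate the "series of transformations" idea (assuming NumPy is available); the layer sizes and random weights are arbitrary, and no training is shown:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    """A common nonlinearity: keep positive values, zero out the rest."""
    return np.maximum(0, x)

# a tiny 2-layer network: each layer is a linear map plus a nonlinearity
w1, b1 = rng.normal(size=(4, 3)), np.zeros(3)   # input (4) -> hidden (3)
w2, b2 = rng.normal(size=(3, 2)), np.zeros(2)   # hidden (3) -> output (2)

def forward(x):
    hidden = relu(x @ w1 + b1)   # first transformation: learned "features"
    return hidden @ w2 + b2      # second transformation: outputs

x = rng.normal(size=(5, 4))      # a batch of 5 inputs with 4 values each
print(forward(x).shape)          # (5, 2)
```

Training would adjust the weights `w1, b1, w2, b2` so that the transformations produce useful features; here they are random, which is exactly how real networks start.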
Ensemble methods are techniques that create multiple models and then combine them to produce improved results. Ensemble methods usually produce more accurate solutions than a single model would.
The main causes of error in learning are noise, bias, and variance. Ensembles help to minimize these factors, and these methods are designed to improve the stability and the accuracy of Machine Learning algorithms.
The two main types of Ensemble methods are Bagging and Boosting.
In this blog, I will explain the difference between Bagging and Boosting ensemble methods.
Bagging is a parallel ensemble method (the name stands for Bootstrap Aggregating); it is a…
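A small side-by-side sketch using scikit-learn (assuming it is installed); `BaggingClassifier` and `AdaBoostClassifier` stand in for the two families here, and the dataset is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=1)

# bagging: trees fit independently (in parallel) on bootstrap samples
bagging = BaggingClassifier(n_estimators=50, random_state=1)
# boosting: learners fit sequentially, each focusing on previous errors
boosting = AdaBoostClassifier(n_estimators=50, random_state=1)

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    score = cross_val_score(model, X, y, cv=5).mean()
    print(name, round(score, 3))
```

The key structural difference is visible in how the models train: bagging's members never see each other's mistakes, while each boosting round reweights the data based on the previous round.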
MapReduce is a programming model that allows you to process your data across an entire cluster. It is accessible through high-level tools such as Hive and Pig, and through programming languages such as Scala and Python.
MapReduce consists of Mappers and Reducers: different scripts you might write, or different functions you might use, when writing a MapReduce program. MapReduce makes use of two functions.
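The classic word-count example can be sketched in plain Python to show the two roles; this mimics the model in a single process, whereas a real MapReduce job distributes the mappers and reducers across a cluster:

```python
from collections import defaultdict
from itertools import chain

def mapper(line):
    """Map: emit a (word, 1) pair for every word in the line."""
    return [(word.lower(), 1) for word in line.split()]

def reducer(pairs):
    """Reduce: sum the counts for each key.

    In a real job, the shuffle phase groups pairs by key between the
    map and reduce stages; a dict plays that role here.
    """
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["the cat sat", "the cat ran"]
word_counts = reducer(chain.from_iterable(mapper(l) for l in lines))
print(word_counts)  # {'the': 2, 'cat': 2, 'sat': 1, 'ran': 1}
```

Because each mapper only sees its own line and each key is reduced independently, both stages parallelize naturally — which is the whole point of the model.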