
XGBoost stands for eXtreme Gradient Boosting.

XGBoost is a decision-tree-based ensemble Machine Learning algorithm that uses a gradient boosting framework. It is an implementation of gradient boosted decision trees designed for speed and performance. XGBoost has been dominating machine learning and Kaggle competitions for structured or tabular data.

The XGBoost algorithm was developed as a research project at the University of Washington. Tianqi Chen and Carlos Guestrin presented their paper at the SIGKDD Conference in 2016 and took the Machine Learning world by storm.

How does it work?

To understand XGBoost, we must first understand Gradient Descent and Gradient Boosting. Gradient Descent is an iterative optimization…
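To make this concrete, here is a minimal, hedged sketch of training an XGBoost classifier with the xgboost Python package on a small scikit-learn dataset; the hyperparameter values are illustrative assumptions, not tuned recommendations.

```python
# Minimal XGBoost classification sketch (assumes xgboost and scikit-learn are installed).
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Each boosting round fits a new tree to the gradients of the loss from the previous rounds.
model = XGBClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
model.fit(X_train, y_train)

print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))
```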



Apache Spark is an open-source parallel processing framework for processing Big Data across clusters of computers. Spark can perform computations much faster than Hadoop's MapReduce, and the two can also be used together efficiently. Spark is written in Scala, which is considered the primary language for interacting with the Spark Core engine and which runs inside a Java Virtual Machine (JVM), but Spark does not require developers to know Scala. APIs for Java, Python, R, and Scala put Spark within reach of a wide audience of developers, and they have embraced it.

Structure

Spark uses a master-slave…
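As a rough illustration of that structure, here is a minimal PySpark sketch (assuming pyspark is installed and a local session is acceptable): the driver coordinates executors that process partitions of the data in parallel. The word-count data is made up for the example.

```python
# Minimal PySpark sketch: the driver program coordinates executors that work on data partitions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("word-count-sketch").getOrCreate()

# Distribute a small collection and run a parallel word count on it.
lines = spark.sparkContext.parallelize(["spark is fast", "spark runs on a cluster"])
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

print(counts.collect())
spark.stop()
```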



Neural networks are among the most powerful and widely used algorithms in machine learning. A neural network works in a way loosely modeled on the human brain's network of neurons.

Neural networks are sets of layers of highly interconnected processing elements (neurons) that apply a series of transformations to the data to generate their own understanding of it (what we commonly call features). A minimal sketch with all three layer types follows the list below.

Types of Layers in Neural Network

  • Input layer: It is used to pass in our input (an image, text, or any other suitable type of data for a neural network).
  • Hidden Layer: These are the layers in between the input and output layers. …
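Here is the minimal sketch referred to above, written with Keras (assuming TensorFlow is installed); the layer sizes, activations, and random training data are illustrative assumptions only.

```python
# Minimal feed-forward network sketch with an input, a hidden, and an output layer.
import numpy as np
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(4,)),                      # input layer: 4 features per example
    keras.layers.Dense(8, activation="relu"),     # hidden layer: learns intermediate features
    keras.layers.Dense(3, activation="softmax"),  # output layer: probabilities for 3 classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

# Tiny random data just to show the training call; real data would replace this.
X = np.random.rand(100, 4)
y = np.random.randint(0, 3, size=100)
model.fit(X, y, epochs=5, verbose=0)
```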


Ensemble Methods

Ensemble methods are techniques that create multiple models and then combine them to produce improved results. Ensemble methods usually produce more accurate solutions than a single model would.

The main causes of error in learning are noise, bias, and variance. Ensembles help to minimize these factors, and these methods are designed to improve the stability and the accuracy of Machine Learning algorithms.

The two main types of Ensemble methods are Bagging and Boosting.

In this blog, I will explain the difference between Bagging and Boosting ensemble methods.

Bagging

Bagging (which stands for Bootstrap Aggregating) is a parallel ensemble method; it is a…
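As a rough sketch of bagging in practice, the snippet below compares a single decision tree with a bagged ensemble of trees using scikit-learn's BaggingClassifier (whose default base estimator is a decision tree); the dataset and parameters are illustrative.

```python
# Minimal bagging sketch: many trees trained on bootstrap samples, predictions aggregated.
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

single_tree = DecisionTreeClassifier(random_state=0)
bagged_trees = BaggingClassifier(n_estimators=50, random_state=0)  # default base estimator is a tree

print("single tree :", cross_val_score(single_tree, X, y, cv=5).mean())
print("bagged trees:", cross_val_score(bagged_trees, X, y, cv=5).mean())
```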



MapReduce is a programming model that allows you to process your data across an entire cluster. It is accessible from high-level tools such as Hive and Pig, as well as from programming languages such as Scala and Python.

MapReduce consists of Mappers and Reducers, which are separate scripts or functions you write when building a MapReduce program. MapReduce makes use of two functions, sketched after the list below.

  • Map() performs sorting and filtering of the data, thereby organizing it into groups. …
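Here is the sketch referred to above: a word count written in plain Python to show the Map, shuffle, and Reduce steps without a cluster. The helper names (mapper, reducer) and the data are made up for the example; a real job would run on a framework such as Hadoop.

```python
# Minimal word-count sketch of the MapReduce pattern in plain Python.
from collections import defaultdict

def mapper(line):
    # Map(): emit a (key, value) pair for every word in the input line.
    for word in line.split():
        yield word.lower(), 1

def reducer(word, counts):
    # Reduce(): aggregate all values that share the same key.
    return word, sum(counts)

lines = ["Spark and Hadoop", "Hadoop runs MapReduce", "MapReduce maps and reduces"]

# Shuffle step: group the mapped pairs by key before reducing.
groups = defaultdict(list)
for line in lines:
    for word, count in mapper(line):
        groups[word].append(count)

print(dict(reducer(w, c) for w, c in groups.items()))
```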


Reinforcement learning is an approach to machine learning that is inspired by behaviorist psychology. Reinforcement learning contrasts with other machine learning approaches in that the algorithm is not explicitly told how to perform a task, but works through the problem on its own.

Reinforcement learning differs from supervised learning in that, in supervised learning, the training data comes with the answer key, so the model is trained on the correct answers themselves, whereas in reinforcement learning there is no answer key and the reinforcement agent decides what to do to perform the given task. …
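As a rough illustration of an agent learning from rewards alone, here is a minimal tabular Q-learning sketch in plain Python; the toy environment, reward scheme, and hyperparameters are all assumptions made for the example.

```python
# Minimal Q-learning sketch: the agent is never told the correct action; it only sees rewards.
import random

n_states, n_actions = 5, 2
alpha, gamma, epsilon = 0.1, 0.9, 0.1            # learning rate, discount, exploration rate
Q = [[0.0] * n_actions for _ in range(n_states)]

def step(state, action):
    # Toy environment (assumed for the sketch): action 1 moves right, reward at the last state.
    next_state = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
    reward = 1.0 if next_state == n_states - 1 else 0.0
    return next_state, reward

for episode in range(500):
    state = 0
    for _ in range(20):
        # Epsilon-greedy: mostly exploit the best-known action, occasionally explore.
        if random.random() < epsilon:
            action = random.randrange(n_actions)
        else:
            action = Q[state].index(max(Q[state]))
        next_state, reward = step(state, action)
        # Q-learning update: nudge the estimate toward reward + discounted best future value.
        Q[state][action] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][action])
        state = next_state

print(Q)
```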



Support Vector Machine (SVM) is a supervised learning algorithm that can be used for classification and regression problems. Support Vector Machine for classification is called Support Vector Classification (SVC), and for regression it is called Support Vector Regression (SVR).

How does SVM work?

SVM works on the idea of finding the hyperplane that best separates the data points into different classes. Let's work through an example to fully understand how SVM works. Let's imagine we have two tags, red and blue, and our data has two features, x and y.
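To mirror that red/blue example, here is a minimal scikit-learn sketch that fits a linear SVC on a handful of made-up (x, y) points; the data is invented purely for illustration.

```python
# Minimal SVC sketch: two features (x, y) and two classes ("red" = 0, "blue" = 1).
import numpy as np
from sklearn.svm import SVC

# Toy points: class 0 clustered low, class 1 clustered high (illustrative data).
X = np.array([[1, 2], [2, 1], [2, 3], [6, 5], [7, 7], [8, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear")  # find the hyperplane that maximizes the margin between the classes
clf.fit(X, y)

print("support vectors:", clf.support_vectors_)
print("prediction for (3, 3):", clf.predict([[3, 3]]))
```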



Before talking about Hadoop, I think we should talk about Big Data. Big data is a term used for incredibly large datasets that cannot be stored or processed efficiently with traditional methods.

Hadoop is an open-source framework used to solve big data problems efficiently. Hadoop allows distributed processing of large data sets across clusters of computers using simple programming models.

Hadoop Ecosystem is a platform or a suite that provides various services to solve big data problems. It includes Apache projects and various commercial tools and solutions.

In this blog, I will introduce all the components of the Hadoop ecosystem.

HDFS (Hadoop Distributed File System)



Decision Tree is a supervised machine learning algorithm. Decision trees can be used for both regression and classification tasks. A decision tree is a tree-structured classifier in which the data is repeatedly split according to a certain parameter; a minimal sketch follows the terminology list below.

Terminology:

  • Node: Each object in the tree. A node holds a subset of the data and, except for leaf nodes, a question that splits that subset.
  • Root Node: It represents the entire population or sample and this further gets divided into two or more homogeneous sets.
  • Splitting: It is a process of dividing a node into two or more sub-nodes.
  • Decision Node: When a sub-node splits into further sub-nodes…
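Here is the minimal sketch referred to above: it fits a shallow scikit-learn decision tree and prints the root node, decision nodes, and leaves it learned. The dataset and depth are illustrative.

```python
# Minimal decision tree sketch: the data is repeatedly split on feature thresholds.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Print the learned structure: root node, decision nodes, and leaf nodes.
print(export_text(tree, feature_names=load_iris().feature_names))
```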


In Data Science, Clustering is the most common form of unsupervised learning. Clustering is a Machine Learning technique that involves the grouping of data points. Unlike Regression and Classification, we don’t have a target variable in Clustering. Since Clustering is unsupervised, we cannot calculate errors or accuracy or any of those metrics. In this blog, I will talk about different metrics to evaluate Clustering algorithms.

Clustering is evaluated based on some similarity or dissimilarity measures such as distance between cluster points. …
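As one concrete example of such a measure, here is a minimal sketch that scores KMeans clusterings with the silhouette score from scikit-learn; the synthetic data and the values of k are illustrative.

```python
# Minimal clustering-evaluation sketch: the silhouette score compares how close each point is
# to its own cluster versus the nearest other cluster (no ground-truth labels needed).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(f"k={k}  silhouette={silhouette_score(X, labels):.3f}")
```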

Jagandeep Singh
