Reinforcement learning is an approach to machine learning inspired by behaviorist psychology. It contrasts with other machine learning approaches in that the algorithm is not explicitly told how to perform a task; instead, it works through the problem on its own.

Reinforcement learning differs from supervised learning in that supervised training data comes with an answer key, so the model is trained on the correct answers; in reinforcement learning there is no answer key, and the reinforcement agent decides what to do to perform the given task. …
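That trial-and-error idea can be sketched with a minimal tabular Q-learning loop. The environment below (a five-state corridor with a reward at the far end) and all hyperparameters are made up purely for illustration, not taken from the original post:

```python
# Minimal tabular Q-learning sketch on a hypothetical toy environment:
# a 5-state corridor; the agent starts at state 0 and earns reward 1
# for reaching state 4. No answer key is given — the agent learns by acting.
import random

random.seed(0)

n_states, n_actions = 5, 2            # actions: 0 = left, 1 = right
Q = [[0.0, 0.0] for _ in range(n_states)]
alpha, gamma, epsilon = 0.5, 0.9, 0.1

def greedy(values):
    """Argmax with random tie-breaking."""
    best = max(values)
    return random.choice([i for i, v in enumerate(values) if v == best])

for episode in range(200):
    s = 0
    while s != 4:                     # state 4 is terminal
        a = random.randrange(n_actions) if random.random() < epsilon else greedy(Q[s])
        s_next = min(s + 1, 4) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == 4 else 0.0
        # Q-learning update: nudge Q(s, a) toward r + gamma * max_a' Q(s', a')
        Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
        s = s_next

# After training, the greedy policy should move right in every non-terminal state.
policy = [greedy(Q[s]) for s in range(4)]
print(policy)
```

After enough episodes the agent discovers the rewarded behavior on its own, which is exactly the contrast with supervised learning described above.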

Support Vector Machine (SVM) is a supervised learning algorithm that can be used for both classification and regression problems. SVM for classification is called Support Vector Classification (SVC), and for regression it is called Support Vector Regression (SVR).

SVM works on the idea of finding the hyperplane that best separates the data points into different classes. Let's work through an example to fully understand how SVM works. Imagine we have two tags: red and blue, and our data has two features: x and y.
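That setup can be sketched with scikit-learn's `SVC`. The six points below are made-up illustrative data, chosen to be linearly separable:

```python
# Toy sketch (hypothetical data): separate "red" from "blue" points
# using two features, x and y, with a linear SVM.
from sklearn.svm import SVC

X = [[1, 2], [2, 3], [2, 1],      # red points
     [6, 5], [7, 7], [8, 6]]      # blue points
y = ["red", "red", "red", "blue", "blue", "blue"]

clf = SVC(kernel="linear")        # find the maximum-margin hyperplane
clf.fit(X, y)

print(clf.predict([[1.5, 2.0], [7.0, 6.0]]))   # new points on each side
```

The fitted hyperplane sits between the two groups, so new points are classified by which side of it they fall on.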

Before talking about Hadoop, I think we should talk about Big Data. Big data is a term used for incredibly large datasets that cannot be stored or processed efficiently with traditional methods.

Hadoop is an open-source framework used to solve big data problems efficiently. Hadoop allows distributed processing of large data sets across clusters of computers using simple programming models.

The **Hadoop Ecosystem** is a platform or suite that provides various services to solve big data problems. It includes Apache projects and various commercial tools and solutions.

In this blog, I will introduce all the components of the Hadoop ecosystem.

HDFS is the major component of the Hadoop ecosystem. HDFS allows you to store large datasets across a cluster of computers. The datasets in big data problems are too big to store on a single computer, so HDFS stores a dataset on several computers rather than one. HDFS also keeps redundant copies of the data, so if an error occurs on one of the computers, HDFS can recover the data. …

Decision Tree is a supervised machine learning algorithm. Decision trees can be used for both regression and classification tasks. A decision tree is a tree-structured classifier in which the data is repeatedly split according to a certain parameter.

- **Node:** Each object in the tree. Nodes contain subsets of the data, and in every node except a leaf, a question splits that subset.
- **Root Node:** Represents the entire population or sample; it gets divided into two or more homogeneous sets.
- **Splitting:** The process of dividing a node into two or more sub-nodes.
- **Decision Node:** A sub-node that splits into further sub-nodes. …
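These terms can be seen in a fitted tree. The tiny dataset below is hypothetical (classifying whether to play outside from temperature and rain), invented just to produce a readable tree:

```python
# Sketch: fit a small decision tree on made-up data and print its
# structure — root node, splits, decision nodes, and leaves.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[30, 0], [25, 0], [15, 1], [10, 1], [28, 1], [12, 0]]  # [temp °C, rain?]
y = ["yes", "yes", "no", "no", "no", "yes"]                 # play outside?

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# The root node holds all six samples; each split divides a node's
# subset by a question on one feature.
print(export_text(tree, feature_names=["temperature", "rain"]))
```

In this toy data the rain feature separates the classes perfectly, so the root node's question splits directly into two pure leaves.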

In Data Science, Clustering is the most common form of unsupervised learning. Clustering is a Machine Learning technique that involves grouping data points. Unlike Regression and Classification, we don't have a target variable in Clustering. Since Clustering is unsupervised, we cannot calculate error or accuracy against ground-truth labels. In this blog, I will talk about different metrics to evaluate Clustering algorithms.

Clustering is evaluated based on some similarity or dissimilarity measures such as distance between cluster points. …
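One such distance-based measure is the silhouette coefficient, which needs no ground-truth labels. A sketch with scikit-learn, on made-up two-cluster data:

```python
# Sketch: score a KMeans clustering with the silhouette coefficient,
# which compares within-cluster distances to nearest-other-cluster
# distances for each point.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = [[1, 1], [1.5, 1], [1, 1.5],      # one tight group
     [8, 8], [8.5, 8], [8, 8.5]]      # another tight group

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

score = silhouette_score(X, labels)   # ranges from -1 (bad) to +1 (good)
print(round(score, 2))
```

A score close to +1 means points are much nearer to their own cluster than to the next one, which is the case for these well-separated groups.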

Random forest is a supervised learning algorithm. It builds a "forest" from an ensemble of decision trees. It is an easy-to-use machine learning algorithm that produces a great result most of the time, even without hyperparameter tuning.

In this post, I will discuss the pros and cons of using Random forest:

- Random Forests can be used for both classification and regression tasks.
- Random Forests work well with both categorical and numerical data. No scaling or transformation of variables is usually necessary.
- Random Forests implicitly perform feature selection and generate uncorrelated decision trees. It does this by choosing a random set of features to build each decision tree. …
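The points above can be sketched in a few lines of scikit-learn. The iris dataset stands in for any tabular problem; note that the raw, unscaled features go straight into the model, and feature importances come for free:

```python
# Sketch: a small random forest on the iris data — no feature scaling,
# and an implicit feature ranking via feature_importances_.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)              # raw features, no transformation

print(rf.score(X_test, y_test))       # held-out accuracy
print(rf.feature_importances_)        # one importance per feature, sums to 1
```

Each tree in the forest sees a random subset of features at each split, which is what decorrelates the trees and yields the importance scores.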

Missing data is a well-known problem in Data Science. Missing data can cause problems in data analysis and modeling. Therefore, rows with missing values need to be deleted, or the missing values should be filled with reasonable values. The process of filling in missing values is called Imputation; when dealing with time series, this process is often referred to as Interpolation.

In this blog, I will talk about some ways to fill missing values in Time Series.

Mean Interpolation is one of the simplest methods used to fill in missing values. …
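A sketch with pandas on a made-up six-day series: first filling gaps with the series mean, then with linear interpolation, which respects the ordering of a time series:

```python
# Sketch: two ways to fill missing values in a toy daily time series.
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, 4.0, np.nan, 6.0],
              index=pd.date_range("2021-01-01", periods=6, freq="D"))

mean_filled = s.fillna(s.mean())               # mean imputation: every gap gets 3.25
interpolated = s.interpolate(method="linear")  # each gap gets the midpoint of its neighbors

print(interpolated.tolist())
```

For a trending series like this one, interpolation recovers the underlying pattern (3.0 and 5.0), while the flat mean value would distort it.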

In a classification problem (binary or multiclass), when the number of instances of one class is far smaller than the number of instances of another class, this is called class imbalance. For example, you may have a 2-class problem with 100 instances (rows): 80 instances are labeled with class 1 and the remaining 20 with class 2.

Some ways to deal with class imbalance are:

Upsampling means making copies of the minority class to handle class imbalance, as shown in the figure below, where the orange class is the minority and the imbalance is handled by creating copies of the orange class until it equals the blue class in size. …
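A sketch of upsampling with `sklearn.utils.resample`, using the 80/20 example from above (the feature column is a placeholder):

```python
# Sketch: upsample the minority class (label 2) by sampling it with
# replacement until both classes have the same number of rows.
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({"feature": range(100),
                   "label": [1] * 80 + [2] * 20})   # 80 vs 20: imbalanced

majority = df[df.label == 1]
minority = df[df.label == 2]

minority_up = resample(minority,
                       replace=True,                # sample with replacement
                       n_samples=len(majority),     # match the majority count
                       random_state=42)

balanced = pd.concat([majority, minority_up])
print(balanced.label.value_counts().to_dict())      # both classes now equal
```

Sampling with replacement means some minority rows appear several times, which is exactly the "making copies" described above.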

Overfitting refers to a model learning the training data too well. It happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the model's performance on new data. In other words, if your model performs really well on the training data but badly on unseen test data, your model is overfitting.

Overfitting in supervised machine learning algorithms happens when a model has low bias and high variance.
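This can be demonstrated with a deliberately extreme sketch: an unconstrained decision tree (low bias, high variance) fitted to labels that are pure random noise. The data is synthetic and chosen so there is nothing real to learn:

```python
# Sketch: an unconstrained decision tree memorizes noise — perfect on
# the training data, near chance level on unseen data.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 2, size=200)      # labels are pure noise

X_train, X_test = X[:100], X[100:]
y_train, y_test = y[:100], y[100:]

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

print(tree.score(X_train, y_train))   # 1.0 — the tree memorized the noise
print(tree.score(X_test, y_test))     # roughly chance — nothing generalizes
```

The large gap between training and test accuracy is the signature of overfitting described above.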

Linear regression models are used to show or predict the relationship between two variables or factors. The factor that is being predicted is called the dependent variable and the factors that are used to predict the value of the dependent variable are called independent variables.

Evaluating a machine learning model is as important as building it. We create models to perform on new and unseen data, so we need to evaluate whether our model is performing correctly. Evaluating a Linear Regression model is not straightforward, because there are many evaluation metrics to choose from. …
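A sketch of three common regression metrics (MAE, RMSE, and R²) on synthetic data generated from a known line plus noise, so we know how well the model should do:

```python
# Sketch: fit a linear regression and evaluate it with MAE, RMSE, and R².
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X.ravel() + 2 + rng.normal(0, 1, size=100)   # y ≈ 3x + 2 + noise

model = LinearRegression().fit(X, y)
pred = model.predict(X)

print("MAE :", mean_absolute_error(y, pred))         # average absolute error
print("RMSE:", np.sqrt(mean_squared_error(y, pred))) # penalizes large errors more
print("R²  :", r2_score(y, pred))                    # fraction of variance explained
```

MAE and RMSE are in the units of the dependent variable, while R² is unitless, which is one reason no single metric tells the whole story.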