Regression in Machine Learning

Last Updated: 3rd November, 2024

Regression is a predictive modelling technique used in machine learning. It predicts a continuous value, such as a price or a probability, from a given set of independent variables. It is a supervised learning technique, meaning that it requires labelled training data to build accurate models. Regression models can be linear or nonlinear, and some algorithms that support regression, such as decision trees and random forests, can also be used for classification tasks. Regression can be used to identify patterns in data, reveal relationships between variables, and make long-term predictions.

What is Regression in Machine Learning?

Regression in machine learning is a technique used to predict the output for a given input. It is a supervised learning technique, meaning the model is trained using labelled data.

A common industry example of regression is predicting the price of a house. In this scenario, we would train a machine learning model on labelled data of house prices and their related characteristics, such as square footage, number of rooms, number of bathrooms, location, etc. Once the machine learning model is trained, we can then input the characteristics of a new house and the model will predict its price. This can be used by real estate agents to help set prices for their clients.

Regression has also been used by companies to predict the demand for their products. By training a machine learning model with labelled data of sales and associated characteristics such as advertising spend, seasonality, etc., companies can predict how much demand there will be for their products. This can help them better manage their inventory and set prices accordingly.

Regression in machine learning is a process of predicting a continuous or real-valued output, such as stock prices, house prices or GDP growth, based on independent variables or features. As a supervised learning problem, it involves finding a function that best maps the input features to the output variable.

Formula for Regression

The most basic form of a regression model is Linear Regression, where the relationship between the dependent variable (Y) and one or more independent variables (X1, X2, ..., Xn) is represented by a linear equation:

Y = β0 + β1X1 + β2X2 + ... + βnXn + ϵ

where:

Y = predicted or dependent variable
X1, X2, ..., Xn = independent variables (predictors)
β0 = intercept term, representing the value of Y when all X variables are zero
β1, β2, ..., βn = coefficients representing the relationship strength of each predictor with Y
ϵ = error term, accounting for the variability in Y not explained by X
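
To make the equation concrete, here is a minimal NumPy sketch that evaluates the systematic part of the model (everything except ϵ) for a few observations. The coefficient values and the feature matrix are made up purely for illustration.

```python
import numpy as np

# Hypothetical coefficients for a two-feature model (illustrative values only)
beta_0 = 1.0                    # intercept
beta = np.array([2.0, -0.5])    # beta_1, beta_2

# Three made-up observations with two features each
X = np.array([
    [1.0, 2.0],
    [0.0, 0.0],
    [3.0, 4.0],
])

# Systematic part of the model: y_hat = beta_0 + beta_1*X1 + beta_2*X2
y_hat = beta_0 + X @ beta
print(y_hat)  # [2. 1. 5.]
```

The second row shows the role of the intercept: when every feature is zero, the prediction equals β0. The error term ϵ is not part of the prediction; it represents whatever gap remains between these values and the observed Y.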

Definition and Purpose of Regression

Regression is a statistical analysis technique used to determine the relationships between a dependent variable and one or more independent variables. It is used to analyze the effects of multiple variables on a single outcome variable. It is commonly used in forecasting, for example of financial markets, and in investigating the factors behind a particular phenomenon. Regression can help identify trends, relationships, and patterns that can provide insight into the data and its underlying structure.

Key Terms in Regression Analysis

Dependent Variable (Target Variable): The outcome or variable that the model is trying to predict. It is often denoted as “Y” in equations. In a regression problem, this variable is continuous.

Independent Variables (Predictors/Features): The variables used to predict the dependent variable. They are denoted as “X” and can be continuous or categorical. Multiple independent variables can influence the target variable.

Coefficient: A value that represents the relationship strength between an independent variable and the dependent variable in the model. In linear regression, for example, the coefficient indicates how much the dependent variable changes with a one-unit change in an independent variable.

Intercept: The value of the dependent variable when all independent variables are zero. In a regression line equation, the intercept is the point where the line crosses the Y-axis.

Residuals (Errors): The difference between observed and predicted values of the dependent variable. Residuals represent the error in the model's predictions, with smaller residuals indicating better accuracy.

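R-squared (Coefficient of Determination): A statistical measure that represents the proportion of the variance in the dependent variable that is predictable from the independent variables. For a linear model with an intercept evaluated on its training data, R-squared values range from 0 to 1, with higher values indicating a better fit. On unseen data the value can be negative when the model predicts worse than simply using the mean of the target.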

Overfitting and Underfitting: Overfitting occurs when a model learns noise in the training data, leading to poor generalization on new data. Underfitting happens when the model fails to capture the underlying trend, leading to poor performance both on training and unseen data.
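
Several of these terms can be seen together in a small sketch. The five data points below are made up, and `np.polyfit` with `deg=1` is used here simply as a convenient way to fit a straight line by least squares.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])    # independent variable
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])   # dependent variable

# Fit y = intercept + coefficient * x by ordinary least squares
coefficient, intercept = np.polyfit(x, y, deg=1)

y_hat = intercept + coefficient * x   # predicted values
residuals = y - y_hat                 # observed minus predicted

print(f"coefficient: {coefficient:.2f}")   # 1.99
print(f"intercept:   {intercept:.2f}")     # 0.05
print("residuals:  ", np.round(residuals, 2))
```

Here the coefficient says that Y rises by about 1.99 for each one-unit increase in X, and the residuals of a least-squares line with an intercept sum to zero up to floating-point rounding.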

Characteristics of Regression in Machine Learning

Predictive Analysis: Regression models focus on predicting continuous outcomes based on one or more independent variables, making them ideal for applications like forecasting prices or assessing trends.

Linearity: Most traditional regression models assume a linear relationship between the dependent and independent variables, although certain models (e.g., polynomial regression) can capture non-linear relationships.

Functional Relationships: Regression seeks to establish a functional relationship between variables, describing how the expected value of the target changes as the predictors change. This is useful for identifying patterns and dependencies, although a fitted relationship alone does not prove that a predictor causes the change.

Interpretable Results: Regression models, especially linear regression, are highly interpretable and allow us to understand how changes in each feature impact the target. This interpretability makes them valuable for decision-making in fields such as finance and healthcare.

Quantitative Assessment of Relationships: By providing a mathematical representation, regression allows a quantitative assessment of the strength and nature of relationships between variables, often represented by coefficients and R-squared values.
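
The point about linearity can be illustrated with a short sketch, assuming scikit-learn is available. The data is synthetic and noise-free (y = x²), chosen so that the contrast between a straight line and a polynomial fit is as clear as possible.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Made-up, noise-free data with a purely quadratic relationship: y = x^2
x = np.linspace(-3, 3, 50).reshape(-1, 1)
y = (x ** 2).ravel()

# A straight line cannot follow the curve
linear = LinearRegression().fit(x, y)

# Adding the squared term as a feature keeps the model linear in its
# coefficients while letting it fit the curve
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(x, y)

r2_linear = linear.score(x, y)
r2_poly = poly.score(x, y)
print(f"Linear R²:     {r2_linear:.3f}")
print(f"Polynomial R²: {r2_poly:.3f}")
```

On this data the straight line explains essentially none of the variance, while the degree-2 model fits it almost perfectly. Real data is noisier, so the gap is usually smaller.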

Assumptions of Regression in Machine Learning

Linearity of the Model: Regression assumes a linear relationship between the independent and dependent variables. This means the change in the target variable is directly proportional to changes in the predictors.

Independence of Errors: The residuals (errors) should be independent of each other, meaning that the error in one observation should not correlate with the error in another. This is particularly important in time-series data to avoid autocorrelation.

Homoscedasticity: Homoscedasticity implies that the variance of errors is constant across all levels of the independent variables. When this assumption is violated (heteroscedasticity), the standard errors of the coefficients become unreliable, which can distort hypothesis tests and confidence intervals.

Normality of Errors: Regression assumes that residuals follow a normal distribution, especially for smaller datasets. This is essential for hypothesis testing and constructing confidence intervals.

No Multicollinearity: In multiple regression, independent variables should not be highly correlated with each other. High multicollinearity (correlation between predictors) can lead to instability in coefficient estimates and reduce the interpretability of the model.
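
One common way to check the multicollinearity assumption is the variance inflation factor (VIF), computed as 1 / (1 - R²) where R² comes from regressing one predictor on all the others. The sketch below is a hand-rolled version on synthetic data; the helper name `vif` is hypothetical, and the threshold of roughly 5 to 10 often quoted for "high" VIF is a rule of thumb rather than a fixed standard.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)   # nearly a copy of x1, so collinear
x3 = rng.normal(size=n)              # unrelated to the others
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF of column j: 1 / (1 - R^2) from regressing column j on the rest."""
    others = np.delete(X, j, axis=1)
    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    return 1.0 / (1.0 - r2)

vifs = [vif(X, j) for j in range(X.shape[1])]
for name, value in zip(["x1", "x2", "x3"], vifs):
    print(f"{name}: VIF = {value:.1f}")
```

The two near-duplicate predictors get very large VIF values, while the independent one stays close to 1.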

Types of Regression in Machine Learning

Multiple Linear Regression: This type of regression uses multiple independent variables to predict the value of one dependent variable.

Polynomial Regression: This type of regression is used to model nonlinear relationships between the independent and dependent variables.

Logistic Regression: This type of regression is used to predict a binary (yes/no) outcome based on one or more independent variables. Despite its name, it is used for classification.

Ridge Regression: This type of regression adds an L2 penalty on the coefficients to reduce the complexity of a model and avoid overfitting.

Lasso Regression: This type of regression adds an L1 penalty on the coefficients to reduce the complexity of a model and improve its accuracy. It can shrink some coefficients to exactly zero.
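
The effect of the Ridge and Lasso penalties can be seen by comparing their coefficients with those of plain linear regression. This is a sketch on synthetic data where only the first two of five features matter; the `alpha` values are arbitrary choices for illustration, not recommended settings.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
# Only the first two features affect the target; the other three are noise
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)   # L2 penalty: shrinks coefficients
lasso = Lasso(alpha=0.5).fit(X, y)    # L1 penalty: can zero coefficients out

for name, m in [("OLS", ols), ("Ridge", ridge), ("Lasso", lasso)]:
    print(f"{name:>5}: {np.round(m.coef_, 2)}")
```

Ridge pulls all coefficients toward zero without eliminating any, while Lasso sets the coefficients of the three irrelevant features to exactly zero, which is why it is sometimes used for feature selection.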

Dataset Structure of Regression Model

The structure of a regression model dataset typically includes the following columns:

A target column, which contains the outcome or dependent variable that the model is attempting to predict.
An ID column, which contains a unique identifier for each observation in the dataset.
A set of feature columns, which contain the independent variables that the model uses to make predictions.
A timestamp column, which contains the time at which each observation was recorded.
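
As a rough illustration of this layout, the sketch below builds a tiny pandas DataFrame and separates it into features and target. All column names and values are hypothetical.

```python
import pandas as pd

# Made-up housing records; every column name here is hypothetical
df = pd.DataFrame({
    "id": [101, 102, 103, 104],
    "timestamp": pd.to_datetime(
        ["2024-01-05", "2024-01-12", "2024-02-02", "2024-02-20"]
    ),
    "square_feet": [850, 1200, 1500, 2000],
    "bedrooms": [2, 3, 3, 4],
    "price": [150_000, 210_000, 255_000, 340_000],
})

# The ID identifies a row and carries no predictive signal, so it is left
# out of the features. The raw timestamp is dropped here as well.
feature_columns = ["square_feet", "bedrooms"]
X = df[feature_columns]   # feature columns (independent variables)
y = df["price"]           # target column (dependent variable)

print(X.shape, y.shape)  # (4, 2) (4,)
```

Whether a timestamp should be turned into features (for example month or day of week) depends on the problem; this sketch simply leaves it out.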

Regression Model Example in Machine Learning

import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt

# Load the California Housing dataset
housing = fetch_california_housing()
df = pd.DataFrame(housing.data, columns=housing.feature_names)
df['PRICE'] = housing.target  # Target: median house value, in units of $100,000

X = df.drop('PRICE', axis=1)  # Independent variables
y = df['PRICE']               # Dependent variable

# Split data (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict the prices for the test set
y_pred = model.predict(X_test)

# Calculate MSE and R-squared on the test set
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error (MSE): {mse}")
print(f"R-squared (R²): {r2}")

# Scatter plot of actual vs predicted prices
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, alpha=0.7, color="blue", label="Predicted vs Actual")
# Points on the dashed diagonal would be perfect predictions
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], color="red", linestyle="--", label="Ideal fit")
plt.xlabel("Actual House Prices")
plt.ylabel("Predicted House Prices")
plt.title("Actual vs Predicted House Prices")
plt.legend()
plt.show()

Output:

Mean Squared Error (MSE): 0.5558915986952444
R-squared (R²): 0.5757877060324508

[Figure: Linear Regression Model Output]

In this example, the linear regression model learns to predict PRICE from the other housing attributes in the California Housing dataset. The formula for prediction is:

PRICE = β0 + β1×MedInc + β2×HouseAge + ... + β8×Longitude + ϵ

where:

β0 is the intercept.
β1, β2, ..., β8 are coefficients for each of the eight features (MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude).

The coefficients learned by the model indicate the relationship strength between each predictor and the target variable, allowing us to interpret how each feature impacts house prices.
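
The fitted intercept and coefficients of a scikit-learn `LinearRegression` are available as `intercept_` and `coef_`. The sketch below shows how to read them; it uses small made-up data generated from a known rule rather than the housing dataset, so that the recovered values can be checked against the rule. The column names `area` and `age` are hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up, noise-free data generated from a known rule:
#   price = 10 + 0.5 * area - 0.2 * age
X = pd.DataFrame({
    "area": [50.0, 80.0, 120.0, 65.0, 100.0],
    "age": [10.0, 5.0, 20.0, 30.0, 2.0],
})
y = 10 + 0.5 * X["area"] - 0.2 * X["age"]

model = LinearRegression().fit(X, y)

# Pair each coefficient with the feature it belongs to
print(f"intercept: {model.intercept_:.2f}")
for name, coef in zip(X.columns, model.coef_):
    print(f"{name}: {coef:.2f}")
```

Because the data contains no noise, the model should recover the intercept and both coefficients of the generating rule up to floating-point rounding. Note that the size of a coefficient depends on the scale of its feature, so coefficients of features measured in different units are not directly comparable.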

Regression Evaluation Metrics

Mean Absolute Error (MAE)

The average of the absolute differences between the predicted and actual values. MAE provides a straightforward measure of prediction accuracy and is less sensitive to large errors than MSE, because the errors are not squared. However, it doesn't indicate whether errors are positive or negative.

Formula:

MAE = (1/n) * Σ |y_i - ŷ_i|

where:

y_i = actual value
ŷ_i = predicted value
n = total number of observations
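
A minimal worked example of the MAE formula, using four made-up actual and predicted values:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])   # actual values
y_pred = np.array([2.5, 5.0, 4.0, 8.0])   # predicted values

# Absolute errors are 0.5, 0.0, 1.5 and 1.0; their mean is 3.0 / 4
mae = np.mean(np.abs(y_true - y_pred))
print(mae)  # 0.75
```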

Mean Squared Error (MSE)

The average of the squared differences between predicted and actual values. By squaring the errors, MSE penalizes large errors more heavily, making it useful when large deviations from actual values are particularly undesirable.

Formula:

MSE = (1/n) * Σ (y_i - ŷ_i)^2

where:

y_i = actual value
ŷ_i = predicted value
n = total number of observations
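
The way MSE penalizes large errors can be seen by changing a single prediction. In this sketch with made-up values, the helper functions `mae` and `mse` are defined locally for illustration.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred_small = np.array([2.5, 5.0, 4.0, 8.0])    # errors: 0.5, 0, 1.5, 1
y_pred_large = np.array([2.5, 5.0, 4.0, 12.0])   # last error grows to 5

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

print(mae(y_true, y_pred_small), mse(y_true, y_pred_small))  # 0.75 0.875
print(mae(y_true, y_pred_large), mse(y_true, y_pred_large))  # 1.75 6.875
```

One large error a little more than doubles the MAE but increases the MSE almost eightfold.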

Root Mean Squared Error (RMSE)

The square root of MSE, providing an error metric in the same units as the dependent variable. RMSE emphasizes large errors due to the squaring in MSE and is commonly used in many applications due to its interpretability.

Formula:

RMSE = √((1/n) * Σ (y_i - ŷ_i)^2)

where:

y_i = actual value
ŷ_i = predicted value
n = total number of observations
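
A short sketch of the RMSE calculation on made-up values. If the target were measured in dollars, the MSE would be in squared dollars, while the RMSE is back in dollars.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

mse = np.mean((y_true - y_pred) ** 2)   # 0.875, in squared units
rmse = np.sqrt(mse)                     # same units as y_true
print(f"{rmse:.4f}")  # 0.9354
```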

R-squared (R²)

Indicates the proportion of the variance in the dependent variable explained by the independent variables. R² values close to 1 suggest a strong fit, while values close to 0 indicate a weak model. However, R² does not account for model complexity and can be artificially high in overfitted models.

Formula:

R² = 1 - (Σ (y_i - ŷ_i)^2 / Σ (y_i - ȳ)^2)

where:

y_i = actual value
ŷ_i = predicted value
ȳ = mean of actual values
n = total number of observations
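
The R² formula can be computed by hand and compared with scikit-learn's `r2_score`. The values below are made up; the second set of predictions is deliberately poor to show that R² can be negative when predictions are worse than always predicting the mean.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot
print(f"{r2:.4f}")  # 0.7241

# The manual value agrees with scikit-learn's implementation
r2_sklearn = r2_score(y_true, y_pred)

# Predictions far from the data give a negative R²
y_pred_bad = np.full(4, 10.0)
r2_bad = r2_score(y_true, y_pred_bad)
print(r2_bad < 0)  # True
```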

Adjusted R-squared

An adjusted version of R² that penalizes for the number of predictors, preventing overestimation of model fit when unnecessary features are added. It is more reliable than R² when comparing models with different numbers of predictors.

Formula:

Adjusted R² = 1 - ((1 - R²) * (n - 1) / (n - k - 1))

where:

R² = R-squared value
n = total number of observations
k = number of predictors
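
The formula translates directly into a small function. The helper name `adjusted_r2` and the example numbers are illustrative assumptions.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R² for n observations and k predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Same R² and sample size, different numbers of predictors
few = adjusted_r2(0.80, n=50, k=5)
many = adjusted_r2(0.80, n=50, k=20)
print(f"{few:.4f}")   # 0.7773
print(f"{many:.4f}")  # 0.6621
```

With the same raw R² of 0.80, the model that needed 20 predictors is penalized more heavily than the one that needed only 5.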

Mean Absolute Percentage Error (MAPE)

Measures the average percentage error between predicted and actual values, often used when dealing with relative comparisons across datasets of different scales. It is useful for interpretability in business and finance, where percentage errors are more meaningful. MAPE is undefined when an actual value is zero and becomes unstable when actual values are close to zero.

Formula:

MAPE = (1/n) * Σ |(y_i - ŷ_i) / y_i| * 100

where:

y_i = actual value
ŷ_i = predicted value
n = total number of observations
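
A minimal sketch of the MAPE formula on made-up values in which every prediction is off by 10% of the actual value:

```python
import numpy as np

y_true = np.array([100.0, 200.0, 50.0])
y_pred = np.array([110.0, 180.0, 55.0])

# Each prediction misses by 10% of the actual value.
# The division requires that no actual value is zero.
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
print(f"{mape:.1f}%")  # 10.0%
```

Although the absolute errors differ (10, 20 and 5), the relative error is the same for all three, which is what makes MAPE comparable across scales.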

Applications of Regression in Machine Learning

Economics: Regression is used to analyze economic data and identify patterns and relationships between different variables. For example, economists might use regression analysis to investigate the relationship between GDP and employment or the impact of taxes on consumer spending.
Psychology: Regression is used to analyze data from psychological tests and better understand how certain variables relate to one another. For example, researchers might use regression to investigate the relationships between IQ, educational attainment, and job performance.
Public Health: Regression is used to understand the relationships between different health outcomes and risk factors. For example, epidemiologists might use regression to analyze the relationship between obesity and heart disease or smoking and lung cancer.

Advantages of Regression

Helps in Predictive Analysis: Regression analysis is useful for predictive analysis as it helps in predicting the value of the dependent variable based on the values of the independent variables.
Helps in Identifying the Relationship: Regression analysis helps in identifying the nature and strength of the relationship between the dependent and independent variables.
Useful in Decision Making: Regression analysis is useful in decision making as it provides a quantitative assessment of the relationship between variables.
Helps in Finding the Best Fit: Regression analysis helps in finding the best fit between the independent and dependent variables, which can help in understanding the underlying mechanisms of the relationship.

Disadvantages of Regression

Sensitive to Outliers: Regression analysis is sensitive to outliers, which can affect the results of the analysis.
Linearity Assumption: Regression analysis assumes a linear relationship between the dependent and independent variables. If this assumption is violated, the results of the analysis may not be accurate.
Overfitting: Regression analysis can suffer from overfitting, which occurs when the model is too complex and fits the training data too closely, resulting in poor performance on new data.
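Requires a Continuous Target: Regression analysis predicts a continuous dependent variable, so it is not suitable when the outcome is categorical or binary; those cases call for classification methods such as logistic regression. Categorical predictors can still be used, but they must first be encoded as numbers.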
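
The sensitivity to outliers listed above is easy to demonstrate. In this sketch the data is synthetic: ten points lying exactly on the line y = 2x, with one observation then corrupted.

```python
import numpy as np

x = np.arange(10, dtype=float)
y_clean = 2.0 * x                 # points lie exactly on y = 2x

y_outlier = y_clean.copy()
y_outlier[-1] = 100.0             # corrupt a single observation (was 18)

slope_clean, _ = np.polyfit(x, y_clean, deg=1)
slope_outlier, _ = np.polyfit(x, y_outlier, deg=1)

print(f"slope without outlier: {slope_clean:.2f}")    # 2.00
print(f"slope with outlier:    {slope_outlier:.2f}")  # 6.47
```

A single bad point out of ten more than triples the fitted slope, because least squares penalizes the squared distance to that point.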

Regression Algorithms in Machine Learning

Linear Regression: Linear regression in machine learning is one of the most commonly used algorithms for regression problems. It is used to estimate the relationship between a dependent variable and one or more independent variables by fitting a linear equation to observed data.
Logistic Regression: Logistic regression in machine learning is a method used to fit a regression model when the dependent variable is binary or ordinal. It is used to predict the probability of an event occurring, such as the probability of a person being diagnosed with a disease or the probability of a person buying a product.
Polynomial Regression: Polynomial regression in machine learning is a type of regression analysis in which a polynomial function is used to fit a given set of data points. It is used to model non-linear relationships between the independent and dependent variables.
Decision Tree: A decision tree is a supervised learning algorithm that can be used for both regression and classification problems. For regression, it builds a model in the form of a tree structure, repeatedly splitting the data into subsets based on the most informative independent variables and predicting a value for each resulting subset.
Support Vector Machine: Support Vector Regression (SVR) is the form of Support Vector Machine (SVM) used for regression problems. Instead of finding a hyperplane that separates classes, it fits a function that keeps as many data points as possible within a margin around it and penalizes only the points that fall outside that margin.
Random Forest: Random forest is an ensemble learning method that combines multiple decision tree models to create a more powerful model. It is a supervised learning algorithm that uses multiple decision trees to create an aggregate model that is more accurate than any of the individual decision trees.
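
Several of these algorithms share the same scikit-learn interface, so they can be compared in a loop. The sketch below uses synthetic nonlinear data (a noisy sine curve) and mostly default hyperparameters; the numbers it prints say nothing about how these models rank on other datasets.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic nonlinear data: y = sin(x) plus a little noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 2 * np.pi, size=(300, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "Linear regression": LinearRegression(),
    "Decision tree": DecisionTreeRegressor(max_depth=5, random_state=0),
    "Support vector regression": SVR(kernel="rbf"),
    "Random forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)   # R² on held-out data
    print(f"{name}: R² = {scores[name]:.3f}")
```

On this curved data the straight-line model scores noticeably lower than the nonlinear models, which is the situation where tree-based or kernel methods are worth trying.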

Conclusion

By using regression, companies can predict the price of a house based on its characteristics, as well as forecast the demand for their products based on related factors such as advertising spend and seasonality. This allows them to better manage their inventory and set prices accordingly.

Key Takeaways

Regression is a supervised learning technique used to predict a continuous numerical outcome.
It is based on the relationship between the independent and dependent variables.
Linear regression is the most common type of regression and is used to model linear relationships between a dependent variable and one or more independent variables.
Regularization techniques such as L1, L2, and Elastic Net can be used to improve the performance of linear regression models.
Nonlinear regression models such as polynomial regression and support vector regression can be used to model nonlinear relationships between the independent and dependent variables.
Evaluating the performance of a regression model is important to ensure that it is able to accurately predict the desired outcome.
Cross-validation is a common method used to evaluate the performance of regression models.
Feature selection and engineering can be used to improve the performance of regression models by reducing the number of input features and transforming the data.
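
Two of these takeaways, regularization and cross-validation, can be combined in a few lines. This is a sketch on data generated with `make_regression`; the choice of `Ridge` with `alpha=1.0` and of 5 folds is arbitrary.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression problem (generated data, not a real dataset)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

model = Ridge(alpha=1.0)   # linear regression with L2 regularization

# 5-fold cross-validation; for regressors the default score is R²
scores = cross_val_score(model, X, y, cv=5)

print("R² per fold:", scores.round(3))
print(f"mean R²: {scores.mean():.3f}")
```

Each fold is held out once while the model is trained on the other four, so the mean score is a less optimistic estimate than a score computed on the training data.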

Quiz

What is the most popular method of evaluating the accuracy of a regression model?
1. Root Mean Square Error
2. Mean Absolute Error
3. R-squared
4. Adjusted R-squared

Answer: 3. R-squared

What is the goal of linear regression?
1. To minimize the data points
2. To minimize the error
3. To maximize the error
4. To maximize the correlation between the independent and dependent variables

Answer: 2. To minimize the error

What type of supervised learning problem is linear regression?
1. Classification
2. Clustering
3. Regression
4. Dimensionality Reduction

Answer: 3. Regression

What is the most common form of regularization used in linear regression?
1. L1 regularization
2. L2 regularization
3. Dropout
4. Early stopping

Answer: 2. L2 regularization
