By Gabriel Torres Gamez in Regression — Mar 16, 2022

Simple Linear Regression using sklearn

Creating a linear regression model in Python can be intimidating at first, but with some practice it’s easier than you might expect. First, we need to import some key libraries:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from scipy.stats import norm
from sklearn.linear_model import LinearRegression as lir

Now we can import some data with pandas:

data = pd.read_excel("data/grades.xlsx")
data.head()

We will be using the columns “Stunden” (hours studied) and “Punkte” (points scored) for our example.

Now we can prepare our data and get a quick overview by plotting it:

x = data["Stunden"].values.reshape(-1, 1)
y = data["Punkte"].values.reshape(-1, 1)

plt.scatter(x, y)

Our data looks like this:

Next, we can create our model and read off the coefficient (a) and intercept (b) of the fitted line:

y = a * x + b

# y has shape (n, 1), so coef_ has shape (1, 1) and intercept_ has shape (1,)
model = lir().fit(x, y)
coef = model.coef_[0]
intercept = model.intercept_
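For a single feature, the least-squares line also has a simple closed form: a is the covariance of x and y divided by the variance of x, and b is mean(y) - a * mean(x). The sketch below checks this against sklearn. It uses synthetic stand-in data, since the grades file is not available here; the names `hours`, `points`, and the chosen slope, intercept, and noise level are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (an assumption, NOT the data from grades.xlsx)
rng = np.random.default_rng(0)
hours = rng.uniform(0, 10, size=200)
points = 5.0 * hours + 20.0 + rng.normal(0, 8, size=200)

# Same shapes as in the article: column vectors of shape (n, 1)
x = hours.reshape(-1, 1)
y = points.reshape(-1, 1)

# Closed-form least squares for one feature
a = np.cov(hours, points, ddof=1)[0, 1] / np.var(hours, ddof=1)
b = points.mean() - a * hours.mean()

# The same line fitted by sklearn
model = LinearRegression().fit(x, y)
sk_a = model.coef_[0, 0]      # coef_ has shape (1, 1) because y is 2-D
sk_b = model.intercept_[0]    # intercept_ has shape (1,)

print("closed form:", a, b)
print("sklearn:    ", sk_a, sk_b)
```

Both lines of output should agree up to floating-point rounding.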

Now we have everything to plot our model:

plt.figure(figsize=(13,8))
plt.scatter(x, y, color = "pink")
plt.plot(x, coef * x + intercept, color = "fuchsia")
plt.show()

Now that we have our model, we can judge whether it’s a good model by checking these three conditions:

The errors should be independent of x (the residuals show no visible pattern)
The expected value of the residuals should be around 0
The residuals should follow a normal distribution

We can check the first two conditions by plotting the residuals (observed minus predicted values) in a scatter plot:

plt.scatter(x, y - (coef * x + intercept), color = "pink")
plt.plot(x, 0 * x, color = "fuchsia")
plt.show()

As we can see, the residuals show no clear pattern along x, and they are scattered around 0.
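The visual check can be backed up with a few numbers. One caveat: for a least-squares fit with an intercept, the overall mean of the residuals is zero by construction, so it says little on its own. A more telling check is whether the mean and spread of the residuals stay similar across different ranges of x. The sketch below does this on synthetic stand-in data (an assumption, not the grades data); the choice of 5 equal-width bins is arbitrary.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (an assumption, NOT the data from grades.xlsx)
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=300).reshape(-1, 1)
y = 5.0 * x + 20.0 + rng.normal(0, 8, size=(300, 1))

model = LinearRegression().fit(x, y)
residuals = (y - model.predict(x)).ravel()

# Zero up to rounding error, by construction of least squares
overall_mean = residuals.mean()
print(f"overall mean: {overall_mean:.2e}")

# Split the x range into 5 equal-width bins (arbitrary choice)
edges = np.linspace(x.min(), x.max(), 6)
bin_index = np.digitize(x.ravel(), edges[1:-1])  # values 0..4

# Mean and spread of the residuals per bin should look similar
for i in range(5):
    in_bin = residuals[bin_index == i]
    print(f"bin {i}: n={in_bin.size}, mean={in_bin.mean():.2f}, std={in_bin.std():.2f}")
```

Bin means that drift steadily away from 0, or a spread that grows with x, would hint that the first two conditions are violated.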

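We can check the last condition by plotting the residuals as a histogram and overlaying a normal density with the same mean and standard deviation:

# Residuals: observed minus fitted values
residuals = y - (coef * x + intercept)

n, bins, patches = plt.hist(residuals, bins = 30, color = "pink", density = True)

# Normal density with the residuals' mean and standard deviation
mue = np.mean(residuals)
sigma = np.std(residuals)
normd = norm.pdf(bins, mue, sigma)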

plt.plot(bins, normd, color = "fuchsia")

plt.show()

The residuals seem to follow a normal distribution reasonably well.
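A histogram is a judgment call, so it can help to add a formal test. This is an addition that the analysis above does not use: the Shapiro-Wilk test from `scipy.stats`, whose null hypothesis is that the sample comes from a normal distribution. The sketch runs it on synthetic stand-in data (an assumption, not the grades data).

```python
import numpy as np
from scipy.stats import shapiro
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (an assumption, NOT the data from grades.xlsx)
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=200).reshape(-1, 1)
y = 5.0 * x + 20.0 + rng.normal(0, 8, size=(200, 1))

model = LinearRegression().fit(x, y)
residuals = (y - model.predict(x)).ravel()

# Null hypothesis: the residuals are drawn from a normal distribution
stat, p_value = shapiro(residuals)
print(f"W = {stat:.3f}, p = {p_value:.3f}")
```

A small p-value (commonly below 0.05) is evidence against normality. A large one does not prove normality; it only means the test found no clear deviation.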

Lastly, to score our model we can use its score method, which returns the coefficient of determination, R²:

print("R2:", model.score(x,y))

R2: 0.5835324959871742

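Judging by the three conditions and the R² score, the model is only moderately reliable: an R² of about 0.58 means it explains roughly 58% of the variance in the points, so it captures the overall trend, but individual predictions can be well off.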
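To see what the score measures, R² can be computed by hand as 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the sum of squared deviations of y from its mean. The sketch below compares this with `model.score` on synthetic stand-in data (an assumption, not the grades data).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (an assumption, NOT the data from grades.xlsx)
rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=200).reshape(-1, 1)
y = 5.0 * x + 20.0 + rng.normal(0, 8, size=(200, 1))

model = LinearRegression().fit(x, y)
y_hat = model.predict(x)

ss_res = np.sum((y - y_hat) ** 2)     # variation left unexplained by the line
ss_tot = np.sum((y - y.mean()) ** 2)  # total variation of y around its mean
r2_manual = 1 - ss_res / ss_tot

print("manual R2: ", r2_manual)
print("model.score:", model.score(x, y))
```

The two values should match up to floating-point rounding.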

Bonus

To predict values for new inputs, we can use the predict method of our model:

# Predicting 2 and 8 hours
model.predict([[2],[8]])

array([[33.8597135 ],
       [66.91445534]])

Interpretation: the model predicts around 34 points for 2 hours of studying and around 67 points for 8 hours.
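Because the model is a straight line, the two predictions above are enough to recover its slope and intercept, which makes clear that predict simply evaluates a * x + b. The sketch below uses only the two numbers printed above; the variable names and the 5-hour example are illustrative.

```python
# The two predictions reported above: (hours, predicted points)
x1, y1 = 2, 33.8597135
x2, y2 = 8, 66.91445534

# Slope and intercept of the line through both points
slope = (y2 - y1) / (x2 - x1)
intercept = y1 - slope * x1
print(f"slope: {slope:.2f} points per hour")   # about 5.51
print(f"intercept: {intercept:.2f} points")    # about 22.84

# Any other prediction follows from a * x + b, e.g. 5 hours
points_at_5 = slope * 5 + intercept
print(f"5 hours: {points_at_5:.1f} points")    # about 50.4
```

In other words, each additional hour of studying adds about 5.5 points to the prediction, starting from about 23 points at 0 hours.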
