Simple Logistic Regression using sklearn

Building a logistic regression in Python can be intimidating at first, but with some practice it is easier than you might imagine. First, we need to import some key libraries:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn import metrics
from sklearn.linear_model import LogisticRegression as lor
from sklearn.model_selection import train_test_split as sp

Now we can import some data with pandas:

data = pd.read_excel("data/grades.xlsx")
data.head()

Next, we need to convert our categorical variable to numeric values so that it can be used in the logistic regression:

data["Bestande"] = data["Bestande"].map({"WHACK":0,"Durchschnitt":1})
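One thing worth checking after this step: `Series.map` with a dict turns any label that is not a key of the dict into NaN, and the model would later refuse to fit on NaN targets. The sketch below uses a small made-up column (the label "Unbekannt" is hypothetical and not from the grades file) to show how such unmapped labels could be spotted.

```python
import pandas as pd

# Hypothetical toy column; "Unbekannt" is an invented label missing from the mapping.
labels = pd.Series(["WHACK", "Durchschnitt", "Unbekannt", "Durchschnitt"])
mapping = {"WHACK": 0, "Durchschnitt": 1}

encoded = labels.map(mapping)

# Series.map returns NaN for values that are not keys of the dict.
unmapped = list(labels[encoded.isna()].unique())
n_unmapped = int(encoded.isna().sum())

print(encoded.tolist())
print("unmapped labels:", unmapped)
```

If `n_unmapped` is greater than zero, the mapping dict is probably missing a category or a label is misspelled in the data.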

Now we can split our data into training and test sets and get a quick overview of it:

train, test = sp(data, test_size = .3)

x = data["Stunden"].values.reshape(-1, 1)
y = data["Bestande"].values

x_train = train["Stunden"].values.reshape(-1, 1)
y_train = train["Bestande"].values

x_test = test["Stunden"].values.reshape(-1, 1)
y_test = test["Bestande"].values

plt.scatter(x, y)

Our data looks like this: [scatter plot of Stunden (hours) on the x-axis against Bestande (0 or 1) on the y-axis]
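The split above draws a different random sample on every run, so the scores further down will vary between runs. A minimal sketch of a reproducible variant follows. It assumes synthetic stand-in data rather than the grades file, and the `random_state` and `stratify` arguments are additions that the original code does not use.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the grades data (assumption, not the real file):
# more hours studied makes passing (1) more likely.
rng = np.random.default_rng(0)
hours = rng.uniform(0, 12, size=200).reshape(-1, 1)
passed = (hours.ravel() + rng.normal(0, 2, size=200) > 5).astype(int)

# random_state fixes the shuffle; stratify keeps the share of 0s and 1s
# roughly equal in the training and test sets.
x_train, x_test, y_train, y_test = train_test_split(
    hours, passed, test_size=0.3, random_state=42, stratify=passed
)

print(x_train.shape, x_test.shape)
print("share of 1s (all / train / test):",
      passed.mean(), y_train.mean(), y_test.mean())
```

Passing the arrays directly also returns the four pieces in one call, which avoids slicing the train and test frames by hand.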

Next up we can create our model with the training data,

model = lor().fit(x_train, y_train)

create predictions

x_predict = np.linspace(0, 12, 20).reshape(-1, 1)
y_predict = model.predict_proba(x_predict)[:,1]

and plot our regression:

plt.figure(figsize=(13,8))
plt.scatter(x, y, color = "pink")
plt.plot(x_predict, y_predict, color = "fuchsia")
plt.show()

With our test data we can score our model. There are two ways to do this. The first is to build a confusion matrix and compute the accuracy from it:

y_predict_score = model.predict(x_test)

confm = metrics.confusion_matrix(y_test, y_predict_score)
print("confusion matrix:", "\n", confm, "\n")

score = (confm[0][0] + confm[1][1]) / np.sum(confm)
print("score:", score)
confusion matrix: 
 [[21  8]
 [ 4 31]] 

score: 0.8125

The second one is with the score function:

print("score:", model.score(x_test, y_test))
score: 0.8125
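Both approaches report accuracy, the share of correct predictions. The confusion matrix carries more information than that single number. As a small sketch, the values printed above can be unpacked by hand, assuming scikit-learn's convention that rows are true classes and columns are predicted classes. Precision and recall are extra metrics that the original text does not compute.

```python
# Confusion matrix copied from the output above.
# Rows = true class (0, 1), columns = predicted class (0, 1).
confm = [[21, 8],
         [4, 31]]

tn, fp = confm[0]  # true 0: predicted 0 / predicted 1
fn, tp = confm[1]  # true 1: predicted 0 / predicted 1
total = tn + fp + fn + tp

accuracy = (tn + tp) / total   # same value as model.score
precision = tp / (tp + fp)     # of all predicted 1s, how many were really 1
recall = tp / (tp + fn)        # of all real 1s, how many were found

print(f"accuracy:  {accuracy:.4f}")
print(f"precision: {precision:.4f}")
print(f"recall:    {recall:.4f}")
```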

Bonus

Like the LinearRegression model, a fitted LogisticRegression model exposes its coefficient (a) and intercept (b) as attributes:

a = model.coef_
b = model.intercept_

We can use these values to calculate the probability of y = 1 for a given x with the logistic (sigmoid) function:

$$p(x) = \frac{1}{1 + e^{-(a x + b)}}$$
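To make this concrete, here is a minimal sketch that evaluates the logistic function 1 / (1 + exp(-(a*x + b))) by hand and compares the result with the model's own output. It assumes synthetic stand-in data rather than the grades file, so the numbers will differ from the ones in this post.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the grades data (assumption, not the real file).
rng = np.random.default_rng(1)
hours = rng.uniform(0, 12, size=200).reshape(-1, 1)
passed = (hours.ravel() + rng.normal(0, 2, size=200) > 5).astype(int)

model = LogisticRegression().fit(hours, passed)

a = model.coef_[0, 0]    # coef_ has shape (1, n_features)
b = model.intercept_[0]  # intercept_ has shape (1,)

x_new = np.array([2.0, 8.0])

# Logistic function applied by hand: p = 1 / (1 + exp(-(a*x + b)))
manual = 1.0 / (1.0 + np.exp(-(a * x_new + b)))

# Column 1 of predict_proba holds P(y = 1).
from_model = model.predict_proba(x_new.reshape(-1, 1))[:, 1]

print("manual:    ", manual)
print("from model:", from_model)
```

Note that `coef_` and `intercept_` are arrays, which is why the sketch indexes into them to get plain numbers.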

To predict the probability of a value we can also use the predict_proba function of our model:

# Predicting 2 and 8 hours
model.predict_proba([[2],[8]])
array([[0.81942292, 0.18057708],
       [0.17493011, 0.82506989]])

Interpretation: Studying 2 hours gives you an 82% chance of failing (y = 0) and an 18% chance of passing (y = 1). Studying 8 hours gives you a 17% chance of failing and an 83% chance of passing.
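A related question is where the model switches from predicting 0 to predicting 1, which happens when the probability crosses 50%, i.e. at x = -b / a. The fitted a and b are not printed in this post, so the sketch below reconstructs rough estimates from the two probabilities shown above. The names `a_est` and `b_est` are illustrative and the values are a back-calculation, not the model's stored attributes.

```python
import math

def logit(p):
    """Inverse of the logistic function: log-odds of probability p."""
    return math.log(p / (1.0 - p))

def sigmoid(z):
    """Logistic function: maps log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# P(y = 1) values copied from the predict_proba output above.
p_at_2, p_at_8 = 0.18057708, 0.82506989

# logit(p) = a * x + b is a straight line, so two points determine it.
a_est = (logit(p_at_8) - logit(p_at_2)) / (8 - 2)
b_est = logit(p_at_2) - a_est * 2

# The probability equals 0.5 where a * x + b = 0.
boundary_hours = -b_est / a_est

print(f"estimated a = {a_est:.3f}, b = {b_est:.3f}")
print(f"50% threshold at about {boundary_hours:.2f} hours")
```

With the real model at hand, the same threshold would come directly from `-model.intercept_[0] / model.coef_[0, 0]`.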