Logistic Regression

Beginner 35 min 14 sections

Discover binary classification with logistic regression, a fundamental machine learning algorithm.


Learning objectives

Understand the difference between regression and classification
Implement logistic regression with scikit-learn
Interpret a confusion matrix
Evaluate a model with accuracy, precision, and recall

Prerequisites

Linear Regression module (recommended)

Theory

Classification vs. Regression

Unlike regression, which predicts a continuous value (e.g., a price), classification predicts a category (e.g., spam/not spam, sick/healthy).

Logistic regression:

Despite its name, it is a CLASSIFICATION algorithm. It uses the sigmoid function to convert a linear output into a probability.

Sigmoid function:

$$P(y=1) = \frac{1}{1 + e^{-z}}$$

where $z = a \cdot x + b$ (the same linear form as in linear regression).

The result is always strictly between 0 and 1, so it can be read as a probability.

If $P > 0.5$, we predict class 1; otherwise we predict class 0.
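The formula and the decision rule above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function names `sigmoid` and `predict_class` are chosen here for the example and do not come from the course material.

```python
import math

def sigmoid(z: float) -> float:
    """Map a real-valued score z to a probability strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(z: float, threshold: float = 0.5) -> int:
    """Return 1 if the probability exceeds the threshold, otherwise 0."""
    return 1 if sigmoid(z) > threshold else 0

# Negative scores give P < 0.5, positive scores give P > 0.5.
for z in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"z = {z:+.1f} -> P = {sigmoid(z):.3f} -> class {predict_class(z)}")
```

With the strict rule $P > 0.5$, a score of exactly $z = 0$ falls into class 0.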


Diagram: Classification and the sigmoid function

Regression vs. Classification:

REGRESSION	CLASSIFICATION
Output: continuous value	Output: category
Price: 250,000 EUR	Spam / Not spam
Temperature: 23.5 °C	Sick / Healthy
Age: 35 years	Fraud / Legitimate
Linear Regression	Logistic Regression
y = ax + b	P(class=1) = sigmoid(ax+b)

The sigmoid function: converting a score into a probability

The sigmoid squashes any real value into the interval between 0 and 1:

If z < 0 → P < 0.5 → Class 0
If z > 0 → P > 0.5 → Class 1

The decision threshold is at P = 0.5 (equivalently, z = 0).

Decision boundary:

Logistic regression draws a line (or, with more features, a hyperplane) that separates the two classes in feature space.
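To make the decision boundary concrete, here is a rough sketch that fits scikit-learn's `LogisticRegression` on a tiny dataset and recovers the boundary line from the learned parameters. The six data points are made up for illustration, and `boundary_x2` is a helper name introduced here; the boundary is the set of points where $z = a_1 x_1 + a_2 x_2 + b = 0$.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: two features per sample, two well-separated groups.
X = np.array([[0, 0], [1, 1], [2, 1], [8, 9], [9, 8], [10, 10]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()  # default settings
model.fit(X, y)

# Learned parameters: one coefficient per feature, plus the intercept.
a1, a2 = model.coef_[0]
b = model.intercept_[0]

# Recompute the probabilities by hand: z = a1*x1 + a2*x2 + b, then sigmoid.
z = X @ model.coef_[0] + b
p_manual = 1.0 / (1.0 + np.exp(-z))

def boundary_x2(x1: float) -> float:
    """x2 value on the decision boundary (where z = 0) for a given x1."""
    return -(a1 * x1 + b) / a2

print("coefficients:", a1, a2, "intercept:", b)
print("matches predict_proba:", np.allclose(p_manual, model.predict_proba(X)[:, 1]))
print("boundary point at x1 = 5:", (5.0, boundary_x2(5.0)))
```

Points on one side of this line get $z > 0$ and are assigned class 1; points on the other side get class 0.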

Prediction pipeline:

flowchart LR
    F["Features<br/>(x1, x2)"]
    S["Score<br/>z = ax + b"]
    Sig["Sigmoid<br/>P = 1/(1+e^-z)"]
    D{"P > 0.5?"}
    C1["Class 1"]
    C0["Class 0"]
    F --> S --> Sig --> D
    D -->|Yes| C1
    D -->|No| C0
    style F fill:#E5D7F5,color:#1A1A1A
    style C1 fill:#F7E64D,color:#1A1A1A
    style C0 fill:#E5D7F5,color:#1A1A1A

Concrete example: spam detection

Consider a spam classifier with 2 features:

$\textcolor{#3498db}{x_1}$ = number of words in capital letters (e.g., $\textcolor{#3498db}{15}$)
$\textcolor{#e67e22}{x_2}$ = number of links in the email (e.g., $\textcolor{#e67e22}{8}$)

The model has learned: $\textcolor{#3498db}{a_1 = 0.1}$, $\textcolor{#e67e22}{a_2 = 0.2}$, $\textcolor{#9B7AC4}{b = -2.5}$

Step 1: Compute the score z:

$$z = \textcolor{#3498db}{0.1 \times 15} + \textcolor{#e67e22}{0.2 \times 8} + \textcolor{#9B7AC4}{(-2.5)} = \textcolor{#3498db}{1.5} + \textcolor{#e67e22}{1.6} \textcolor{#9B7AC4}{- 2.5} = \textcolor{#27ae60}{\mathbf{0.6}}$$

Step 2: Convert to a probability P:

$$P = \frac{1}{1 + e^{-\textcolor{#27ae60}{0.6}}} \approx \frac{1}{1.55} \approx \textcolor{#e74c3c}{\mathbf{0.65}}$$

Step 3: Decision:

$\textcolor{#e74c3c}{0.65}$ > 0.5 → Class 1 (SPAM)

Color legend:

$\textcolor{#3498db}{Blue}$: contribution of capitalized words ($\textcolor{#3498db}{+1.5}$)
$\textcolor{#e67e22}{Orange}$: contribution of links ($\textcolor{#e67e22}{+1.6}$)
$\textcolor{#9B7AC4}{Purple}$: model bias ($\textcolor{#9B7AC4}{-2.5}$)
$\textcolor{#27ae60}{Green}$: final score z ($\textcolor{#27ae60}{0.6}$)
$\textcolor{#e74c3c}{Red}$: probability P ($\textcolor{#e74c3c}{65\%}$)
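The three steps of this example can be checked with a short script. It is only a sketch: the variable names are illustrative, and the labels "SPAM" and "HAM" (not spam) stand for class 1 and class 0.

```python
import math

# Parameters and feature values taken from the spam example above.
a1, a2, b = 0.1, 0.2, -2.5
x1, x2 = 15, 8   # capitalized words, links

z = a1 * x1 + a2 * x2 + b           # Step 1: linear score
p = 1.0 / (1.0 + math.exp(-z))      # Step 2: sigmoid
label = "SPAM" if p > 0.5 else "HAM"  # Step 3: decision at threshold 0.5

print(f"z = {z:.2f}, P = {p:.3f}, decision = {label}")
# prints: z = 0.60, P = 0.646, decision = SPAM
```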

Advanced Manual exercise: your turn to calculate!

Objective: Master the hand calculations of logistic regression (sigmoid, probability, decision).

Grab a sheet of paper and a pen. Solve each part BEFORE looking at the solution!

CONTEXT

You are working on a spam detector. The model has learned the following parameters:

Coefficient for words in capital letters: $a_1 = 0.15$
Coefficient for links in the email: $a_2 = 0.25$
Bias (intercept): $b = -3.0$

Model equation:

$$z = a_1 \cdot x_1 + a_2 \cdot x_2 + b$$

$$P(\text{spam}) = \frac{1}{1 + e^{-z}}$$

Here are 4 emails to classify:

Email	Capitalized words (x1)	Links (x2)
A	10	8
B	5	4
C	20	12
D	2	2

PART 1: Computing the scores z

For each email, compute the linear score z.

1.1) Compute $z_A$ for email A

1.2) Compute $z_B$ for email B

1.3) Compute $z_C$ for email C

1.4) Compute $z_D$ for email D

PART 2: Converting to probabilities

Use the sigmoid function to convert z into the probability P(spam).

Reminder of useful values:

$e^0 = 1$
$e^{-1} \approx 0.368$
$e^{-2} \approx 0.135$
$e^1 \approx 2.718$
$e^2 \approx 7.389$

2.1) Compute $P_A$ = P(spam) for email A

2.2) Compute $P_B$ = P(spam) for email B

2.3) Compute $P_C$ = P(spam) for email C

2.4) Compute $P_D$ = P(spam) for email D

PART 3: Classification decisions

Decision threshold: if $P \geq 0.5$, the email is classified as SPAM (class 1); otherwise as HAM (class 0).

3.1) What is the decision for each email?

3.2) Which emails are classified as SPAM?

3.3) Which emails are classified as HAM (not spam)?

PART 4: Performance metrics

The true labels are:

Email	True class
A	SPAM (1)
B	HAM (0)
C	SPAM (1)
D	HAM (0)

4.1) Build the confusion matrix (TP, TN, FP, FN)

4.2) Compute the accuracy

4.3) Compute the precision (of the emails predicted SPAM, how many are correct?)

4.4) Compute the recall (of the true SPAM emails, how many are detected?)

PART 5: Interpretation

5.1) Which coefficient has the greatest impact on the classification? Why?

5.2) If an email has many links but few capitalized words, is it more or less likely to be classified as spam?

5.3) For which score z do we get exactly P = 50%?

Advanced Solution to the manual exercise

DETAILED SOLUTION

Take the time to compare with your answers. Check each step!

MODEL RECAP

$$z = \textcolor{#3498db}{0.15} \cdot x_1 + \textcolor{#e67e22}{0.25} \cdot x_2 + \textcolor{#9B7AC4}{(-3.0)}$$

$$P(\text{spam}) = \frac{1}{1 + e^{-z}}$$

PART 1: Computing the scores z

1.1) Email A ($x_1 = 10$, $x_2 = 8$):

$$z_A = \textcolor{#3498db}{0.15 \times 10} + \textcolor{#e67e22}{0.25 \times 8} + \textcolor{#9B7AC4}{(-3.0)}$$

$$z_A = \textcolor{#3498db}{1.5} + \textcolor{#e67e22}{2.0} \textcolor{#9B7AC4}{- 3.0} = \textcolor{#27ae60}{\mathbf{0.5}}$$

1.2) Email B ($x_1 = 5$, $x_2 = 4$):

$$z_B = \textcolor{#3498db}{0.15 \times 5} + \textcolor{#e67e22}{0.25 \times 4} + \textcolor{#9B7AC4}{(-3.0)}$$

$$z_B = \textcolor{#3498db}{0.75} + \textcolor{#e67e22}{1.0} \textcolor{#9B7AC4}{- 3.0} = \textcolor{#27ae60}{\mathbf{-1.25}}$$

1.3) Email C ($x_1 = 20$, $x_2 = 12$):

$$z_C = \textcolor{#3498db}{0.15 \times 20} + \textcolor{#e67e22}{0.25 \times 12} + \textcolor{#9B7AC4}{(-3.0)}$$

$$z_C = \textcolor{#3498db}{3.0} + \textcolor{#e67e22}{3.0} \textcolor{#9B7AC4}{- 3.0} = \textcolor{#27ae60}{\mathbf{3.0}}$$

1.4) Email D ($x_1 = 2$, $x_2 = 2$):

$$z_D = \textcolor{#3498db}{0.15 \times 2} + \textcolor{#e67e22}{0.25 \times 2} + \textcolor{#9B7AC4}{(-3.0)}$$

$$z_D = \textcolor{#3498db}{0.3} + \textcolor{#e67e22}{0.5} \textcolor{#9B7AC4}{- 3.0} = \textcolor{#27ae60}{\mathbf{-2.2}}$$

$\boxed{z_A = 0.5 \quad z_B = -1.25 \quad z_C = 3.0 \quad z_D = -2.2}$
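As a cross-check of Part 1, the four scores can be computed in a loop. The snippet below is a small sketch; the dictionary layout and the `score` helper are choices made for this example.

```python
# Model parameters from the exercise statement.
A1, A2, B = 0.15, 0.25, -3.0

# Email name -> (capitalized words x1, links x2)
emails = {"A": (10, 8), "B": (5, 4), "C": (20, 12), "D": (2, 2)}

def score(x1: float, x2: float) -> float:
    """Linear score z = a1*x1 + a2*x2 + b."""
    return A1 * x1 + A2 * x2 + B

scores = {name: score(x1, x2) for name, (x1, x2) in emails.items()}

for name, z in scores.items():
    print(f"z_{name} = {z:+.2f}")
```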

PART 2: Converting to probabilities

2.1) Email A ($z_A = 0.5$):

$$P_A = \frac{1}{1 + e^{-0.5}} = \frac{1}{1 + 0.607} = \frac{1}{1.607} = \textcolor{#F7E64D}{\mathbf{0.622}}$$

2.2) Email B ($z_B = -1.25$):

$$P_B = \frac{1}{1 + e^{1.25}} = \frac{1}{1 + 3.49} = \frac{1}{4.49} = \textcolor{#F7E64D}{\mathbf{0.223}}$$

2.3) Email C ($z_C = 3.0$):

$$P_C = \frac{1}{1 + e^{-3}} = \frac{1}{1 + 0.0498} = \frac{1}{1.0498} = \textcolor{#F7E64D}{\mathbf{0.953}}$$

2.4) Email D ($z_D = -2.2$):

$$P_D = \frac{1}{1 + e^{2.2}} = \frac{1}{1 + 9.03} = \frac{1}{10.03} = \textcolor{#F7E64D}{\mathbf{0.100}}$$

$\boxed{P_A = 62.2\% \quad P_B = 22.3\% \quad P_C = 95.3\% \quad P_D = 10.0\%}$

PART 3: Classification decisions

3.1) Decisions (threshold = 0.5):

Email A : $P_A = 62.2\% \geq 50\%$ → $\textcolor{#e74c3c}{\text{SPAM}}$
Email B : $P_B = 22.3\% < 50\%$ → $\textcolor{#27ae60}{\text{HAM}}$
Email C : $P_C = 95.3\% \geq 50\%$ → $\textcolor{#e74c3c}{\text{SPAM}}$
Email D : $P_D = 10.0\% < 50\%$ → $\textcolor{#27ae60}{\text{HAM}}$

3.2) Emails classified as SPAM: A and C

3.3) Emails classified as HAM: B and D
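Parts 2 and 3 can be verified together: starting from the scores found in Part 1, apply the sigmoid and then the exercise's rule $P \geq 0.5$. This is a minimal sketch with illustrative names.

```python
import math

# Scores z from Part 1 of the solution.
scores = {"A": 0.5, "B": -1.25, "C": 3.0, "D": -2.2}

def sigmoid(z: float) -> float:
    """Convert a score into a probability."""
    return 1.0 / (1.0 + math.exp(-z))

probabilities = {name: sigmoid(z) for name, z in scores.items()}

# Exercise rule: P >= 0.5 means SPAM (class 1), otherwise HAM (class 0).
decisions = {name: "SPAM" if p >= 0.5 else "HAM" for name, p in probabilities.items()}

for name in scores:
    print(f"Email {name}: P = {probabilities[name]:.3f} -> {decisions[name]}")
```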

PART 4: Performance metrics

4.1) Confusion matrix:

Let's compare the predictions with reality:

Email A: Predicted SPAM, Actual SPAM → $\textcolor{#27ae60}{TP}$ (True Positive)
Email B: Predicted HAM, Actual HAM → $\textcolor{#27ae60}{TN}$ (True Negative)
Email C: Predicted SPAM, Actual SPAM → $\textcolor{#27ae60}{TP}$ (True Positive)
Email D: Predicted HAM, Actual HAM → $\textcolor{#27ae60}{TN}$ (True Negative)

$\boxed{TP = 2 \quad TN = 2 \quad FP = 0 \quad FN = 0}$
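The same counting can be done by pairing each true label with its prediction. The sketch below uses plain Python and encodes SPAM as 1 and HAM as 0, in the order A, B, C, D.

```python
# True labels and predicted labels for emails A, B, C, D (1 = SPAM, 0 = HAM).
y_true = [1, 0, 1, 0]
y_pred = [1, 0, 1, 0]

pairs = list(zip(y_true, y_pred))

tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam correctly flagged
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # ham correctly passed
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # ham wrongly flagged
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam missed

print(f"TP = {tp}, TN = {tn}, FP = {fp}, FN = {fn}")
```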

4.2) Accuracy:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} = \frac{2 + 2}{2 + 2 + 0 + 0} = \frac{4}{4}$$

$\boxed{\text{Accuracy} = \textcolor{#27ae60}{100\%}}$

4.3) Precision:

$$\text{Precision} = \frac{TP}{TP + FP} = \frac{2}{2 + 0} = \frac{2}{2}$$

$\boxed{\text{Precision} = \textcolor{#27ae60}{100\%}}$

4.4) Recall:

$$\text{Recall} = \frac{TP}{TP + FN} = \frac{2}{2 + 0} = \frac{2}{2}$$

$\boxed{\text{Recall} = \textcolor{#27ae60}{100\%}}$
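With scikit-learn, these three metrics are available in `sklearn.metrics`. Because every prediction in the exercise is correct, all three come out at 100%, which hides the difference between them. The sketch below therefore adds a hypothetical variant, not part of the exercise, in which email B is wrongly flagged as SPAM.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

y_true = [1, 0, 1, 0]      # emails A, B, C, D (1 = SPAM, 0 = HAM)
y_pred = [1, 0, 1, 0]      # decisions from Part 3 of the exercise
y_pred_alt = [1, 1, 1, 0]  # hypothetical: email B wrongly flagged as SPAM

for name, pred in (("exercise", y_pred), ("hypothetical", y_pred_alt)):
    # For labels 0/1, the flattened matrix is ordered TN, FP, FN, TP.
    tn, fp, fn, tp = confusion_matrix(y_true, pred).ravel()
    print(f"{name}: TP={tp} TN={tn} FP={fp} FN={fn}")
    print(f"  accuracy  = {accuracy_score(y_true, pred):.2f}")
    print(f"  precision = {precision_score(y_true, pred):.2f}")
    print(f"  recall    = {recall_score(y_true, pred):.2f}")
```

In the hypothetical variant, the false positive lowers accuracy and precision while recall stays at 100%, since no true spam is missed.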

PART 5: Interpretation

5.1) Most influential coefficient:

$a_1 = 0.15$ (capitalized words)
$a_2 = 0.25$ (links)

$\textcolor{#e67e22}{a_2}$ has the greater impact because $0.25 > 0.15$: each additional link raises the score z more than each additional capitalized word does. This per-unit comparison is fair here because both features are counts on a similar scale.

5.2) Many links, few capitalized words:

The email is more likely to be classified as spam, because $a_2 = 0.25$ has a stronger effect. For example, 0 capitalized words + 12 links gives:

$$z = 0 + 0.25 \times 12 - 3.0 = 0$$

$$P = 50\%$$

Whereas 12 capitalized words + 0 links gives:

$$z = 0.15 \times 12 + 0 - 3.0 = -1.2$$

$$P \approx 23\%$$
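The two scenarios can be compared directly in code. This is a small sketch using the exercise's parameters; `spam_probability` is a helper name introduced for the example.

```python
import math

# Model parameters from the exercise statement.
A1, A2, B = 0.15, 0.25, -3.0

def spam_probability(x1: float, x2: float) -> float:
    """P(spam) for x1 capitalized words and x2 links."""
    z = A1 * x1 + A2 * x2 + B
    return 1.0 / (1.0 + math.exp(-z))

p_links = spam_probability(0, 12)  # 0 capitalized words, 12 links
p_caps = spam_probability(12, 0)   # 12 capitalized words, 0 links

print(f"12 links only:             P = {p_links:.3f}")
print(f"12 capitalized words only: P = {p_caps:.3f}")
```

The same count of 12 pushes the probability noticeably higher when it comes from links than when it comes from capitalized words.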

5.3) Score z for P = 50%:

$$P = 0.5 = \frac{1}{1 + e^{-z}}$$

$$1 + e^{-z} = 2$$

$$e^{-z} = 1$$

$$-z = 0$$

$\boxed{z = 0}$

When $z = 0$, we get exactly $P = 50\%$ (the point of indecision).
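The same algebra works for any probability, not only 50%: solving $P = \frac{1}{1 + e^{-z}}$ for z gives $z = \ln\frac{P}{1 - P}$, often called the logit. The sketch below is an illustration that goes slightly beyond the exercise; the function names are chosen for the example.

```python
import math

def sigmoid(z: float) -> float:
    """Score -> probability."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p: float) -> float:
    """Probability -> score (inverse of the sigmoid), valid for 0 < p < 1."""
    return math.log(p / (1.0 - p))

print(logit(0.5))            # score at which P = 50%
print(round(logit(0.9), 3))  # score needed to reach P = 90%
print(sigmoid(logit(0.9)))   # applying the sigmoid recovers the probability
```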

SUMMARY OF RESULTS

Email	z	P(spam)	Decision	Actual	Result
A	+0.5	62.2%	SPAM	SPAM	Correct (TP)
B	-1.25	22.3%	HAM	HAM	Correct (TN)
C	+3.0	95.3%	SPAM	SPAM	Correct (TP)
D	-2.2	10.0%	HAM	HAM	Correct (TN)

Color legend:

$\textcolor{#3498db}{Blue}$: contribution of capitalized words ($a_1 \cdot x_1$)
$\textcolor{#e67e22}{Orange}$: contribution of links ($a_2 \cdot x_2$)
$\textcolor{#9B7AC4}{Purple}$: model bias ($b = -3.0$)
$\textcolor{#27ae60}{Green}$: score z and correct results
$\textcolor{#F7E64D}{Yellow}$: computed probabilities
$\textcolor{#e74c3c}{Red}$: SPAM classification

Code

Explore the data
