Naive Bayes

Beginner 35 min 13 sections

Learn Naive Bayes, a simple, fast, and surprisingly effective probabilistic classification algorithm.

Learning objectives

Understand Bayes' theorem and the naive assumption
Know the variants (Gaussian, Multinomial, Bernoulli)
Implement Naive Bayes with scikit-learn
Know when to use this algorithm

Prerequisites

Basic notions of probability

Theory

Bayes' theorem

Naive Bayes is based on Bayes' theorem, which gives the probability of a class given the observed features.

Bayes' theorem:

$$P(y|X) = \frac{P(X|y) \cdot P(y)}{P(X)}$$

Where:

$P(y|X)$: probability of class $y$ given the features $X$ (the posterior, which is what we are looking for)
$P(X|y)$: probability of the features given the class (the likelihood)
$P(y)$: prior probability of the class
$P(X)$: probability of the features (the evidence, identical for every class)

The "naive" assumption:

We assume that the features are conditionally independent of one another given the class:

$$P(X|y) = P(x_1|y) \cdot P(x_2|y) \cdots P(x_n|y)$$

This assumption is often false in practice, yet the classifier frequently works well anyway: the predicted class can be correct even when the estimated probabilities themselves are inaccurate.
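Combining Bayes' theorem with the naive assumption gives the decision rule the classifier applies. The notation $\hat{y}$ for the predicted class is introduced here for illustration and does not appear elsewhere in the lesson. Because $P(X)$ is the same for every class, it can be dropped from the comparison:

$$\hat{y} = \arg\max_{y} P(y|X) = \arg\max_{y} \frac{P(y) \prod_{i=1}^{n} P(x_i|y)}{P(X)} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i|y)$$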

Advantages:

Very fast (training and prediction)
Works well with little data
No complex hyperparameters
Interpretable (outputs probabilities)

Theory

Diagram: how Naive Bayes decides

Classification process:

flowchart TD
    X["New features<br/>(x1=0.5, x2=1.2)"]
    subgraph Calcul ["Computation for each class"]
        C0["P(Class 0) × P(x1|0) × P(x2|0)<br/>= 0.4 × 0.6 × 0.3 = 0.072"]
        C1["P(Class 1) × P(x1|1) × P(x2|1)<br/>= 0.6 × 0.8 × 0.7 = 0.336"]
    end
    X --> C0
    X --> C1
    C0 --> D{"Compare"}
    C1 --> D
    D --> R["Prediction: Class 1<br/>(more probable)"]
    style R fill:#F7E64D,color:#1A1A1A
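The computation in the flowchart can be sketched in a few lines of plain Python. The numbers come from the diagram; the dictionary layout and the name `class_scores` are illustrative assumptions, not part of the lesson's script.

```python
from math import prod

# Values read from the flowchart; the data layout is an assumption.
priors = {0: 0.4, 1: 0.6}
likelihoods = {
    0: [0.6, 0.3],  # P(x1|0), P(x2|0)
    1: [0.8, 0.7],  # P(x1|1), P(x2|1)
}

def class_scores(priors, likelihoods):
    """Return the unnormalized score P(y) * prod_i P(x_i|y) for each class."""
    return {y: priors[y] * prod(likelihoods[y]) for y in priors}

scores = class_scores(priors, likelihoods)

# The prediction is the class with the highest score.
prediction = max(scores, key=scores.get)

print({y: round(s, 3) for y, s in scores.items()})
print("Predicted class:", prediction)
```

The scores are not probabilities (they do not sum to 1), but comparing them is enough to pick the class.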

The 3 variants of Naive Bayes:

flowchart LR
    NB["Naive Bayes"]
    G["GaussianNB<br/>Continuous features<br/>(normal distribution)"]
    M["MultinomialNB<br/>Counts<br/>(e.g. text, TF-IDF)"]
    B["BernoulliNB<br/>Binary features<br/>(presence/absence)"]
    NB --> G
    NB --> M
    NB --> B
    style G fill:#9B7AC4,color:#FFFFFF
    style M fill:#C09CF0,color:#1A1A1A
    style B fill:#E5D7F5,color:#1A1A1A
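As a rough illustration of matching each variant to its feature type, the sketch below fits the three scikit-learn classifiers on small synthetic datasets. The data, the random seed, and the variable names are assumptions made up for this example; they are not the lesson's CSV.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

rng = np.random.default_rng(0)
y = np.array([0] * 50 + [1] * 50)  # two classes, 50 samples each

# Continuous features: two Gaussian clusters -> GaussianNB
X_cont = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
                    rng.normal(3.0, 1.0, size=(50, 2))])

# Count features: Poisson counts with class-dependent rates -> MultinomialNB
X_counts = np.vstack([rng.poisson([5, 1, 1], size=(50, 3)),
                      rng.poisson([1, 1, 5], size=(50, 3))])

# Binary features: presence/absence derived from the counts -> BernoulliNB
X_bin = (X_counts > 2).astype(int)

models = {
    "GaussianNB": (GaussianNB(), X_cont),
    "MultinomialNB": (MultinomialNB(), X_counts),
    "BernoulliNB": (BernoulliNB(), X_bin),
}

accuracies = {}
for name, (model, X) in models.items():
    model.fit(X, y)
    accuracies[name] = model.score(X, y)  # training accuracy, for illustration only
    print(f"{name}: {accuracies[name]:.2f}")
```

All three classes share the same `fit` / `predict` / `predict_proba` interface, so switching variants is mostly a matter of choosing the one whose distribution matches the features.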

Concrete example: classifying an email (spam or not):

New email with features: $\textcolor{#3498db}{x_1 = 0.8}$ (suspicious words), $\textcolor{#e67e22}{x_2 = 0.6}$ (links)

The model has learned:

$\textcolor{#9B7AC4}{P(\text{Spam}) = 0.4}$, $\textcolor{#9B7AC4}{P(\text{Normal}) = 0.6}$
$\textcolor{#3498db}{P(x_1|\text{Spam}) = 0.9}$, $\textcolor{#3498db}{P(x_1|\text{Normal}) = 0.2}$
$\textcolor{#e67e22}{P(x_2|\text{Spam}) = 0.8}$, $\textcolor{#e67e22}{P(x_2|\text{Normal}) = 0.3}$

Computation for each class:

$$P(\text{Spam}|X) \propto \textcolor{#9B7AC4}{0.4} \times \textcolor{#3498db}{0.9} \times \textcolor{#e67e22}{0.8} = \textcolor{#e74c3c}{\mathbf{0.288}}$$

$$P(\text{Normal}|X) \propto \textcolor{#9B7AC4}{0.6} \times \textcolor{#3498db}{0.2} \times \textcolor{#e67e22}{0.3} = \textcolor{#27ae60}{\mathbf{0.036}}$$

Decision: $\textcolor{#e74c3c}{0.288}$ > $\textcolor{#27ae60}{0.036}$ → SPAM

Color legend:

$\textcolor{#9B7AC4}{\text{Purple}}$: prior probabilities $P(\text{class})$
$\textcolor{#3498db}{\text{Blue}}$: likelihood of $x_1$ (suspicious words)
$\textcolor{#e67e22}{\text{Orange}}$: likelihood of $x_2$ (links)
$\textcolor{#e74c3c}{\text{Red}}$ / $\textcolor{#27ae60}{\text{Green}}$: final scores

Interpretation: The observed values of both features are far more likely under Spam than under Normal ($\textcolor{#3498db}{0.9}$ vs $\textcolor{#3498db}{0.2}$ for suspicious words, $\textcolor{#e67e22}{0.8}$ vs $\textcolor{#e67e22}{0.3}$ for links). Despite the lower prior for Spam, the spam score is 8 times the normal score ($0.288 / 0.036 = 8$).
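The two scores are proportional to the posteriors but are not probabilities themselves. Dividing each by their sum, which plays the role of $P(X)$, recovers the actual posterior probabilities:

$$P(\text{Spam}|X) = \frac{0.288}{0.288 + 0.036} = \frac{0.288}{0.324} \approx 0.889$$

$$P(\text{Normal}|X) = \frac{0.036}{0.324} \approx 0.111$$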

Advanced Manual exercise: your turn to calculate!

Goal: Apply Bayes' theorem by hand to a classification problem.

CONTEXT

Spam classification with Naive Bayes. We observe the word "gratuit" (French for "free") in an email.

Corpus statistics:

P(SPAM) = 0.3 (30% of emails are spam)
P(HAM) = 0.7
P("gratuit" | SPAM) = 0.8 (80% of spam emails contain "gratuit")
P("gratuit" | HAM) = 0.1

Bayes' theorem:

$$P(\text{SPAM}|\text{gratuit}) = \frac{P(\text{gratuit}|\text{SPAM}) \cdot P(\text{SPAM})}{P(\text{gratuit})}$$

PART 1: Total probability

1.1) Compute P("gratuit") = P("gratuit"|SPAM)P(SPAM) + P("gratuit"|HAM)P(HAM)

PART 2: Applying Bayes

2.1) Compute P(SPAM | "gratuit")

2.2) Compute P(HAM | "gratuit")

2.3) What is the classification?

PART 3: With a second word

The email also contains "urgent". P("urgent"|SPAM) = 0.6, P("urgent"|HAM) = 0.2

3.1) Compute P(SPAM | "gratuit", "urgent") (naive assumption: the words are independent given the class)

3.2) How does the second word affect the confidence?

Advanced Solution to the manual exercise

DETAILED SOLUTION

PART 1: Total probability

1.1) P("gratuit"):

$$P(\text{gratuit}) = 0.8 \times 0.3 + 0.1 \times 0.7$$

$$= 0.24 + 0.07 = \textcolor{#9B7AC4}{\mathbf{0.31}}$$

PART 2: Applying Bayes

2.1) P(SPAM | "gratuit"):

$$P(\text{SPAM}|\text{gratuit}) = \frac{0.8 \times 0.3}{0.31} = \frac{0.24}{0.31} = \textcolor{#e74c3c}{\mathbf{0.774}}$$

2.2) P(HAM | "gratuit"):

$$P(\text{HAM}|\text{gratuit}) = \frac{0.1 \times 0.7}{0.31} = \frac{0.07}{0.31} = \textcolor{#27ae60}{\mathbf{0.226}}$$

Check: $0.774 + 0.226 = 1$ ✓

2.3) Classification:

$\boxed{\text{SPAM with 77.4\% confidence}}$

PART 3: With a second word

3.1) With "urgent" (naive assumption), writing $g$ = "gratuit", $u$ = "urgent", $S$ = SPAM, $H$ = HAM:

SPAM numerator: $P(g|S) \cdot P(u|S) \cdot P(S) = 0.8 \times 0.6 \times 0.3 = 0.144$

HAM numerator: $P(g|H) \cdot P(u|H) \cdot P(H) = 0.1 \times 0.2 \times 0.7 = 0.014$

Total: $0.144 + 0.014 = 0.158$

$$P(S|g,u) = \frac{0.144}{0.158} = \textcolor{#e74c3c}{\mathbf{0.911}}$$

3.2) The second word raises the confidence from 77.4% to 91.1%, because "urgent" is three times more likely in spam than in ham (0.6 vs 0.2).

$\boxed{\text{Each discriminative word reinforces the classification}}$

Legend: $\textcolor{#e74c3c}{\text{Red}}$: SPAM, $\textcolor{#27ae60}{\text{Green}}$: HAM, $\textcolor{#9B7AC4}{\text{Purple}}$: total probabilities
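The hand calculation can be checked with a short script. The helper name `posterior` and the dictionaries are illustrative assumptions; the numbers are the corpus statistics from the exercise. The sketch also shows that, under the naive assumption, processing the two words one after the other gives the same result as processing them together.

```python
def posterior(priors, likelihoods):
    """Normalize prior * likelihood over all classes (Bayes' theorem)."""
    joint = {c: priors[c] * likelihoods[c] for c in priors}
    evidence = sum(joint.values())  # law of total probability
    return {c: joint[c] / evidence for c in joint}

priors = {"SPAM": 0.3, "HAM": 0.7}
p_gratuit = {"SPAM": 0.8, "HAM": 0.1}
p_urgent = {"SPAM": 0.6, "HAM": 0.2}

# Part 2: one word
after_gratuit = posterior(priors, p_gratuit)

# Part 3, both words at once: the naive assumption multiplies the likelihoods
both = {c: p_gratuit[c] * p_urgent[c] for c in priors}
after_both = posterior(priors, both)

# Part 3, sequentially: the posterior after "gratuit" becomes the new prior
sequential = posterior(after_gratuit, p_urgent)

print(round(after_gratuit["SPAM"], 3))  # 0.774
print(round(after_both["SPAM"], 3))     # 0.911
print(round(sequential["SPAM"], 3))     # 0.911
```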

Code

Explore the data