#5 C4.5 Algorithm in Decision Tree

C4.5 is an improved version of the ID3 Decision Tree algorithm. It was introduced by Ross Quinlan to overcome the limitations of ID3.
This algorithm builds the decision tree using - "Entropy", "Information gain" and "gain ratio".
C4.5 helps the model:
handle continuous data
reduce overfitting using pruning
handle missing values
choose better splitting features
Entropy
Entropy measures impurity/randomness in data.
High entropy → mixed classes
Low entropy → pure classes
$$\textbf{Entropy}\ H(S) =-\sum p_i\log_2(p_i)$$
where pi = probability of class i
Information Gain
Information Gain tells us how much entropy decreases after splitting.
$$\textbf{IG}(S,F)=\textbf{H}(S)-\sum \frac{|S_v|}{|S|}\textbf{H}(S_v)$$
Problem with ID3
ID3 only uses Information Gain.
Sometimes it prefers attributes with many unique values
Example:
Student ID
Phone Number
This creates unnecessary splits.
Gain Ratio
C4.5 solves this problem using Gain Ratio.
$$\textbf{Gain Ratio} =\frac{\textbf{Information Gain}}{\textbf{Split Information}}$$
Split Information formula
$$\textbf{SplitInfo}(S,F) =-\sum \frac{|S_v|}{|S|} \log_2 \left( \frac{|S_v|}{|S|} \right)$$
The feature with the highest Gain Ratio becomes the root node.
Example with dataset
| # | Outlook | Temperature | Humidity | Wind | Play Tennis |
|---|---|---|---|---|---|
| 1 | Sunny | Hot | High | Weak | No |
| 2 | Sunny | Hot | High | Strong | No |
| 3 | Overcast | Hot | High | Weak | Yes |
| 4 | Rain | Mild | High | Weak | Yes |
| 5 | Rain | Cool | Normal | Weak | Yes |
| 6 | Rain | Cool | Normal | Strong | No |
| 7 | Overcast | Cool | Normal | Strong | Yes |
| 8 | Sunny | Mild | High | Weak | No |
| 9 | Sunny | Cool | Normal | Weak | Yes |
| 10 | Rain | Mild | Normal | Weak | Yes |
| 11 | Sunny | Mild | Normal | Strong | Yes |
| 12 | Overcast | Mild | High | Strong | Yes |
| 13 | Overcast | Hot | Normal | Weak | Yes |
| 14 | Rain | Mild | High | Strong | No |
| 15 | Sunny | Hot | Normal | Weak | Yes |
Total = 15 | Yes = 9 | No = 6
Step 1 — Calculate Base Entropy
$$\textbf{H}(S) =-(0.6\log_2 0.6) -(0.4\log_2 0.4) = 0.971$$
Step2 - Information Gain All Features (previously calculated in previous blog)
| Feature | Information Gain |
|---|---|
| Outlook | 0.247 |
| Temperature | 0.064 |
| Humidity | 0.078 |
| Wind | 0.020 |
Step 3 — Split Information for all the features
A. Outlook
Sunny = 6, Overcast =4, Rain = 5
$$\textbf{SplitInfo}(\text{Outlook}) =-\left( \frac{6}{15}\log_2\frac{6}{15} + \frac{4}{15}\log_2\frac{4}{15} + \frac{5}{15}\log_2\frac{5}{15} \right)$$
$$ = 1.565$$
B. Temperature
Hot = 5, Mild = 6, Cold = 4
$$\textbf{SplitInfo}(\text{Temperature}) =-\left( \frac{5}{15}\log_2\frac{5}{15} + \frac{6}{15}\log_2\frac{6}{15} + \frac{4}{15}\log_2\frac{4}{15} \right)$$
$$ = 1.565$$
C. Humidity
High = 7, Normal = 8
$$\textbf{SplitInfo}(\text{Humidity}) =-\left( \frac{7}{15}\log_2\frac{7}{15} + \frac{8}{15}\log_2\frac{8}{15} \right) = 0.997$$
D. Wind
Weak = 9, Strong = 6
$$\textbf{SplitInfo}(\text{Wind}) =-\left( \frac{9}{15}\log_2\frac{9}{15} + \frac{6}{15}\log_2\frac{6}{15} \right) = 0.971$$
Step 4 — Gain Ratio Calculation (GR)
$$\textbf{GR(Outlook)} = \frac{0.247}{1.565} =0.158$$
$$\textbf{GR(Temperature)} = \frac{0.064}{1.565} =0.041$$
$$ \textbf{GR(Humidity)} = \frac{0.078}{0.997} =0.078$$
$$ \textbf{GR(Wind)} = \frac{0.020}{0.971} =0.021$$
Final Decision
Since Outlook has Highest Gain Ratio
Root node = outlook






