Skip to main content

Command Palette

Search for a command to run...

#5 C4.5 Algorithm in Decision Tree

Updated
4 min read
#5 C4.5 Algorithm in Decision Tree
A
Machine Learning Engineer and open-source developer focused on NLP, LLM applications, Retrieval-Augmented Generation (RAG), semantic search, and AI infrastructure. I enjoy building developer tools, portable AI systems, and production-ready ML pipelines using Python, FastAPI, FAISS, LangChain, TensorFlow, and PyTorch. Creator of: • RagBucket — portable executable RAG artifacts for Python • LazyTune — fast hyperparameter optimization library • AkBOT — AI portfolio chatbot using RAG Contributor to open-source projects including NumPy and LocalStack.

C4.5 is an improved version of the ID3 Decision Tree algorithm. It was introduced by Ross Quinlan to overcome the limitations of ID3.

This algorithm builds the decision tree using - "Entropy", "Information gain" and "gain ratio".

C4.5 helps the model:

  • handle continuous data

  • reduce overfitting using pruning

  • handle missing values

  • choose better splitting features

Entropy

Entropy measures impurity/randomness in data.

  • High entropy → mixed classes

  • Low entropy → pure classes

$$\textbf{Entropy}\ H(S) =-\sum p_i\log_2(p_i)$$

where pi​ = probability of class i

Information Gain

Information Gain tells us how much entropy decreases after splitting.

$$\textbf{IG}(S,F)=\textbf{H}(S)-\sum \frac{|S_v|}{|S|}\textbf{H}(S_v)$$

Problem with ID3

ID3 only uses Information Gain.

Sometimes it prefers attributes with many unique values

Example:

  • Student ID

  • Phone Number

This creates unnecessary splits.

Gain Ratio

C4.5 solves this problem using Gain Ratio.

$$\textbf{Gain Ratio} =\frac{\textbf{Information Gain}}{\textbf{Split Information}}$$

Split Information formula

$$\textbf{SplitInfo}(S,F) =-\sum \frac{|S_v|}{|S|} \log_2 \left( \frac{|S_v|}{|S|} \right)$$

The feature with the highest Gain Ratio becomes the root node.

Example with dataset

# Outlook Temperature Humidity Wind Play Tennis
1 Sunny Hot High Weak No
2 Sunny Hot High Strong No
3 Overcast Hot High Weak Yes
4 Rain Mild High Weak Yes
5 Rain Cool Normal Weak Yes
6 Rain Cool Normal Strong No
7 Overcast Cool Normal Strong Yes
8 Sunny Mild High Weak No
9 Sunny Cool Normal Weak Yes
10 Rain Mild Normal Weak Yes
11 Sunny Mild Normal Strong Yes
12 Overcast Mild High Strong Yes
13 Overcast Hot Normal Weak Yes
14 Rain Mild High Strong No
15 Sunny Hot Normal Weak Yes

Total = 15 | Yes = 9 | No = 6

Step 1 — Calculate Base Entropy

$$\textbf{H}(S) =-(0.6\log_2 0.6) -(0.4\log_2 0.4) = 0.971$$

Step2 - Information Gain All Features (previously calculated in previous blog)

Feature Information Gain
Outlook 0.247
Temperature 0.064
Humidity 0.078
Wind 0.020

Step 3 — Split Information for all the features

A. Outlook

Sunny = 6, Overcast =4, Rain = 5

$$\textbf{SplitInfo}(\text{Outlook}) =-\left( \frac{6}{15}\log_2\frac{6}{15} + \frac{4}{15}\log_2\frac{4}{15} + \frac{5}{15}\log_2\frac{5}{15} \right)$$

$$ = 1.565$$

B. Temperature

Hot = 5, Mild = 6, Cold = 4

$$\textbf{SplitInfo}(\text{Temperature}) =-\left( \frac{5}{15}\log_2\frac{5}{15} + \frac{6}{15}\log_2\frac{6}{15} + \frac{4}{15}\log_2\frac{4}{15} \right)$$

$$ = 1.565$$

C. Humidity

High = 7, Normal = 8

$$\textbf{SplitInfo}(\text{Humidity}) =-\left( \frac{7}{15}\log_2\frac{7}{15} + \frac{8}{15}\log_2\frac{8}{15} \right) = 0.997$$

D. Wind

Weak = 9, Strong = 6

$$\textbf{SplitInfo}(\text{Wind}) =-\left( \frac{9}{15}\log_2\frac{9}{15} + \frac{6}{15}\log_2\frac{6}{15} \right) = 0.971$$

Step 4 — Gain Ratio Calculation (GR)

$$\textbf{GR(Outlook)} = \frac{0.247}{1.565} =0.158$$

$$\textbf{GR(Temperature)} = \frac{0.064}{1.565} =0.041$$

$$ \textbf{GR(Humidity)} = \frac{0.078}{0.997} =0.078$$

$$ \textbf{GR(Wind)} = \frac{0.020}{0.971} =0.021$$

Final Decision

Since Outlook has Highest Gain Ratio

Root node = outlook