Wine Classification

Sep 2024

Analyses

Classification

Overview

This project explores both supervised and unsupervised learning techniques for classifying and grouping wine samples by their chemical composition. It first trains a supervised K-Nearest Neighbors (KNN) classifier, then applies KMeans clustering to identify natural groupings in the data without using labels. The two approaches are compared in terms of effectiveness and accuracy.

Objective

  • Classify wine samples based on physicochemical data using KNN

  • Apply KMeans to group samples into clusters without using labels

  • Evaluate and compare both approaches

  • Measure model effectiveness with accuracy and silhouette score

Dataset Description

  • Contains 13 physicochemical features of wines, including:

    • Alcohol, Malic Acid, Ash, Magnesium, Total Phenols, Color Intensity, Proline, etc.

  • Target variable: class (wine type), used for KNN training and KMeans validation (a loading sketch follows this list)
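The write-up does not name the data source, but the 13 physicochemical features and three-class target match the classic Wine dataset bundled with scikit-learn. A minimal loading sketch under that assumption:

```python
# Hedged sketch: assumes the scikit-learn copy of the classic Wine dataset;
# the write-up does not state where its data actually came from.
from sklearn.datasets import load_wine
import pandas as pd

wine = load_wine()
df = pd.DataFrame(wine.data, columns=wine.feature_names)
df["class"] = wine.target  # 0, 1, 2: the three wine types

print(df.shape)                    # (178, 14) for the scikit-learn version
print(df["class"].value_counts())  # class balance across the three types
```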

Data Preprocessing

  • Data inspected using .info() and .describe(), then cleaned as needed

  • All features standardized with StandardScaler (zero mean, unit variance)

  • Dataset split into training and testing subsets for KNN

  • class column removed for the KMeans clustering phase (a preprocessing sketch follows this list)
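A minimal preprocessing sketch following the steps above; the split ratio and random seed are illustrative assumptions, since the write-up does not state them:

```python
# Sketch of the preprocessing steps; test_size and random_state are
# assumptions, not values taken from the notebook.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = df.drop(columns="class")  # features only
y = df["class"]               # labels, used for KNN and later for validation

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the scaler on the training data only, then apply it to both splits,
# so no test-set statistics leak into the model.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```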

KNN Classification

  • A KNN model (either custom-built or imported) was initialized with:

    • k = 17

    • Distance metric = Euclidean

  • The dataset was split into training and testing sets

  • Predictions were made using .predict()

  • Results were stored in a new dataframe for evaluation (see the sketch after this list)
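A sketch of this step with the stated hyperparameters (k = 17, Euclidean distance), continuing from the preprocessing sketch above; the use of scikit-learn's KNeighborsClassifier is an assumption, since the model may have been custom-built:

```python
# k = 17 and the Euclidean metric come from the write-up; the estimator
# choice is an assumption.
from sklearn.neighbors import KNeighborsClassifier
import pandas as pd

knn = KNeighborsClassifier(n_neighbors=17, metric="euclidean")
knn.fit(X_train_scaled, y_train)

y_pred = knn.predict(X_test_scaled)

# Store actual vs. predicted labels in a new dataframe, as described above.
results = pd.DataFrame({"actual": y_test.to_numpy(), "predicted": y_pred})
print(results.head())
```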

Evaluation Metric for KNN

  • Accuracy Score was computed for the model:

    • Indicates the fraction of test samples correctly classified by the KNN model (computed as sketched below)
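Continuing from the sketch above, the score is a one-liner with scikit-learn:

```python
from sklearn.metrics import accuracy_score

# Fraction of test samples whose predicted class matches the true class.
acc = accuracy_score(y_test, y_pred)
print(f"KNN accuracy: {acc:.3f}")
```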

KMeans Clustering

  • KMeans clustering was applied to the standardized dataset with:

    • Number of clusters: 3 (matching the number of known classes, for comparison)

  • The resulting cluster labels were compared against the actual class labels via visualization

  • PCA was used to reduce the feature space to two dimensions for visual inspection (see the sketch after this list)
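A sketch of the clustering and PCA steps, continuing from the loading sketch; n_clusters=3 comes from the write-up, while the random seed and plotting details are illustrative:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Labels are dropped: clustering sees only the standardized features.
X_scaled = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(X_scaled)

# Project to two principal components purely for visual inspection.
X_2d = PCA(n_components=2).fit_transform(X_scaled)

# Side-by-side comparison of cluster assignments and true classes.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(X_2d[:, 0], X_2d[:, 1], c=cluster_labels)
ax1.set_title("KMeans clusters")
ax2.scatter(X_2d[:, 0], X_2d[:, 1], c=y)
ax2.set_title("Actual classes")
plt.show()
```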

Evaluation Metric for KMeans

  • Silhouette Score was calculated (see the computation after this list):

    • Measures how close each sample is to its own cluster compared with the nearest other cluster; scores range from -1 to 1

    • Higher score = better-defined clusters
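Continuing from the clustering sketch, the score is computed on the standardized features and the KMeans labels:

```python
from sklearn.metrics import silhouette_score

# Mean silhouette over all samples; values near 1 indicate tight,
# well-separated clusters, values near 0 indicate overlapping ones.
sil = silhouette_score(X_scaled, cluster_labels)
print(f"Silhouette score: {sil:.3f}")
```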

Tools and Technologies Used

  • Jupyter Notebook

  • Pandas, NumPy, Scikit-learn, Seaborn, Matplotlib

Key Insights

  • KNN classification performed well with Euclidean distance and 17 neighbors, confirming that the dataset is well-structured for supervised classification.

  • KMeans showed visually coherent clusters, although some overlap remained between predicted clusters and actual classes.

  • PCA proved to be a useful tool for dimensionality reduction and helped assess both supervised and unsupervised results.

Visualizations