
Overview
This project explores both supervised and unsupervised learning techniques to classify and group wine samples based on their chemical composition. It begins with a supervised model using the K-Nearest Neighbors (KNN) algorithm for classification, then applies KMeans clustering to identify natural groupings in the data without using labels. The two approaches are compared in terms of effectiveness, using accuracy for KNN and silhouette score for KMeans.
Objective
Classify wine samples based on physicochemical data using KNN
Apply KMeans to group samples into clusters without using labels
Evaluate and compare both approaches
Measure model effectiveness with accuracy and silhouette score
Dataset Description
Contains 13 physicochemical features of wines, including:
Alcohol, Malic Acid, Ash, Magnesium, Total Phenols, Color Intensity, Proline, etc.
Target variable: class (wine type), used for KNN training and KMeans validation
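The 13-feature Wine dataset ships with scikit-learn, so one possible way to load it looks like the sketch below (the project may instead read a CSV; the load_wine source and the df variable name are assumptions here):

```python
from sklearn.datasets import load_wine

# Load the built-in Wine dataset as a DataFrame (178 samples, 13 features)
wine = load_wine(as_frame=True)
df = wine.frame.rename(columns={"target": "class"})  # match the 'class' name used in this project

print(df.shape)                    # (178, 14): 13 features + the class column
print(df["class"].value_counts())  # three wine types
```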
Data Preprocessing
Data inspected and cleaned using .info() and .describe()
Standardized using StandardScaler to normalize all features
Dataset split into training and testing subsets for KNN
class column removed during the KMeans clustering phase
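A minimal sketch of these preprocessing steps, continuing from the loading snippet above (the split ratio and random seed are illustrative assumptions, not taken from the original notebook):

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df.info()        # column dtypes and non-null counts
df.describe()    # per-feature summary statistics

X = df.drop(columns=["class"])  # features only; 'class' is also withheld from KMeans later
y = df["class"]

# Split for KNN; test_size and random_state are illustrative choices
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Standardize features; fitting the scaler on the training split avoids test-set leakage
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```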
KNN Classification
A KNN model (either custom-built or imported from a library) was initialized with:
k = 17
Distance metric = Euclidean
The dataset was split into training and testing sets
Predictions were made using .predict()
Results were stored in a new dataframe for evaluation
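Assuming scikit-learn's KNeighborsClassifier (the write-up notes the model may also have been custom-built), the stated settings translate to the following sketch, reusing the splits from the preprocessing step:

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# k = 17 neighbors with Euclidean distance, as stated above
knn = KNeighborsClassifier(n_neighbors=17, metric="euclidean")
knn.fit(X_train_scaled, y_train)
y_pred = knn.predict(X_test_scaled)

# Store predictions alongside the true labels for evaluation
results = pd.DataFrame({"actual": y_test.to_numpy(), "predicted": y_pred})
```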
Evaluation Metric for KNN
Accuracy Score was computed for the model:
Indicates the proportion of test samples correctly classified by the KNN model
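With the y_test and y_pred arrays from the sketch above, computing the metric is a one-liner:

```python
from sklearn.metrics import accuracy_score

# Fraction of test samples whose predicted class matches the true class
acc = accuracy_score(y_test, y_pred)
print(f"KNN test accuracy: {acc:.3f}")
```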
KMeans Clustering
KMeans clustering was applied on the normalized dataset with:
Number of clusters: 3 (based on known classes for comparison)
The clustering labels were generated and compared against actual class labels via visualization
PCA was used to reduce the feature space to two dimensions for visual inspection
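A sketch of this clustering-and-visualization step, reusing X and y from the preprocessing snippet (the n_init and random_state values are assumptions added for reproducibility):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the full feature matrix; the class column is already excluded from X
X_scaled = StandardScaler().fit_transform(X)

# Three clusters, matching the number of known wine classes
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(X_scaled)

# Project to two principal components for a side-by-side visual comparison
X_2d = PCA(n_components=2).fit_transform(X_scaled)
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(X_2d[:, 0], X_2d[:, 1], c=cluster_labels)
axes[0].set_title("KMeans clusters")
axes[1].scatter(X_2d[:, 0], X_2d[:, 1], c=y)
axes[1].set_title("Actual classes")
plt.show()
```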
Evaluation Metric for KMeans
Silhouette Score was calculated:
Measures how close each sample is to its own cluster vs others
Higher score = better-defined clusters
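Using the scaled features and cluster labels from the KMeans sketch above:

```python
from sklearn.metrics import silhouette_score

# Mean silhouette over all samples: ranges from -1 to 1, higher is better
score = silhouette_score(X_scaled, cluster_labels)
print(f"Silhouette score: {score:.3f}")
```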
Tools and Technologies Used
Jupyter Notebook
Pandas, NumPy, Scikit-learn, Seaborn, Matplotlib
Key Insights
KNN classification performed well with Euclidean distance and 17 neighbors, confirming that the dataset is well-structured for supervised classification.
KMeans showed visually coherent clusters, although some overlap remained between predicted clusters and actual classes.
PCA proved to be a useful tool for dimensionality reduction and helped assess both supervised and unsupervised results.