Wine Classification

Sep 2024

Analyses

Classification

Overview

This project explores both supervised and unsupervised learning techniques for classifying and grouping wine samples by their chemical composition. It first trains a supervised K-Nearest Neighbors (KNN) classifier, then applies KMeans clustering to identify natural groupings in the data without using labels. The two approaches are compared in terms of effectiveness and accuracy.

Objective

  • Classify wine samples based on physicochemical data using KNN

  • Apply KMeans to group samples into clusters without using labels

  • Evaluate and compare both approaches

  • Measure model effectiveness with accuracy and silhouette score

Dataset Description

  • Contains 13 physicochemical features of wines, including:

    • Alcohol, Malic Acid, Ash, Magnesium, Total Phenols, Color Intensity, Proline, etc.

  • Target variable: class (wine type), used for KNN training and KMeans validation (a loading sketch follows this list)
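The write-up does not name the data source, but the 13 physicochemical features and three-class target match the classic Wine dataset bundled with scikit-learn. A minimal loading sketch under that assumption:

```python
# Hedged sketch: assumes the scikit-learn copy of the classic Wine dataset;
# the write-up does not state where its data actually came from.
from sklearn.datasets import load_wine
import pandas as pd

wine = load_wine()
df = pd.DataFrame(wine.data, columns=wine.feature_names)
df["class"] = wine.target  # 0, 1, 2: the three wine types

print(df.shape)                    # (178, 14) for the scikit-learn version
print(df["class"].value_counts())  # class balance across the three types
```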

Data Preprocessing

  • Data inspected using .info() and .describe(), then cleaned as needed

  • All features standardized with StandardScaler (zero mean, unit variance)

  • Dataset split into training and testing subsets for KNN

  • class column removed for the KMeans clustering phase (a preprocessing sketch follows this list)
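A minimal preprocessing sketch following the steps above; the split ratio and random seed are illustrative assumptions, since the write-up does not state them:

```python
# Sketch of the preprocessing steps; test_size and random_state are
# assumptions, not values taken from the notebook.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = df.drop(columns="class")  # features only
y = df["class"]               # labels, used for KNN and later for validation

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the scaler on the training data only, then apply it to both splits,
# so no test-set statistics leak into the model.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```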

KNN Classification

  • A KNN model (either custom-built or imported) was initialized with:

    • k = 17

    • Distance metric = Euclidean

  • The dataset was split into training and testing sets

  • Predictions were made using .predict()

  • Results were stored in a new dataframe for evaluation (see the sketch after this list)
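A sketch of this step with the stated hyperparameters (k = 17, Euclidean distance), continuing from the preprocessing sketch above; the use of scikit-learn's KNeighborsClassifier is an assumption, since the model may have been custom-built:

```python
# k = 17 and the Euclidean metric come from the write-up; the estimator
# choice is an assumption.
from sklearn.neighbors import KNeighborsClassifier
import pandas as pd

knn = KNeighborsClassifier(n_neighbors=17, metric="euclidean")
knn.fit(X_train_scaled, y_train)

y_pred = knn.predict(X_test_scaled)

# Store actual vs. predicted labels in a new dataframe, as described above.
results = pd.DataFrame({"actual": y_test.to_numpy(), "predicted": y_pred})
print(results.head())
```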

Evaluation Metric for KNN

  • Accuracy Score was computed for the model:

    • Indicates the fraction of test samples correctly classified by the KNN model (computed as sketched below)
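Continuing from the sketch above, the score is a one-liner with scikit-learn:

```python
from sklearn.metrics import accuracy_score

# Fraction of test samples whose predicted class matches the true class.
acc = accuracy_score(y_test, y_pred)
print(f"KNN accuracy: {acc:.3f}")
```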

KMeans Clustering

  • KMeans clustering was applied to the standardized dataset with:

    • Number of clusters: 3 (matching the number of known classes, for comparison)

  • The resulting cluster labels were compared against the actual class labels via visualization

  • PCA was used to reduce the feature space to two dimensions for visual inspection (see the sketch after this list)
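A sketch of the clustering and PCA steps, continuing from the loading sketch; n_clusters=3 comes from the write-up, while the random seed and plotting details are illustrative:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Labels are dropped: clustering sees only the standardized features.
X_scaled = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(X_scaled)

# Project to two principal components purely for visual inspection.
X_2d = PCA(n_components=2).fit_transform(X_scaled)

# Side-by-side comparison of cluster assignments and true classes.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(X_2d[:, 0], X_2d[:, 1], c=cluster_labels)
ax1.set_title("KMeans clusters")
ax2.scatter(X_2d[:, 0], X_2d[:, 1], c=y)
ax2.set_title("Actual classes")
plt.show()
```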

Evaluation Metric for KMeans

  • Silhouette Score was calculated (see the computation after this list):

    • Measures how close each sample is to its own cluster compared with the nearest other cluster; scores range from -1 to 1

    • Higher score = better-defined clusters
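Continuing from the clustering sketch, the score is computed on the standardized features and the KMeans labels:

```python
from sklearn.metrics import silhouette_score

# Mean silhouette over all samples; values near 1 indicate tight,
# well-separated clusters, values near 0 indicate overlapping ones.
sil = silhouette_score(X_scaled, cluster_labels)
print(f"Silhouette score: {sil:.3f}")
```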

Tools and Technologies Used

  • Jupyter Notebook

  • Pandas, NumPy, Scikit-learn, Seaborn, Matplotlib

Key Insights

  • KNN classification performed well with Euclidean distance and 17 neighbors, confirming that the dataset is well-structured for supervised classification.

  • KMeans showed visually coherent clusters, although some overlap remained between predicted clusters and actual classes.

  • PCA proved to be a useful tool for dimensionality reduction and helped assess both supervised and unsupervised results.

Visualizations