Cadu Portfolio

Indoor Plants Studies

May 2025

Overview

This project explores and classifies plant growth conditions based on several environmental and cultivation variables. It is structured into two main stages:

- Data treatment and feature engineering
- Machine learning classification

The dataset was sourced from CSV files containing details on plant type, sunlight exposure, soil preferences, and watering frequency. The final goal is to predict the most suitable conditions for plant growth using classification models.

Problem Statement

How can we classify plant growth conditions based on features like sunlight, soil type, water needs, and environmental factors? This classification can aid gardeners and agriculture professionals in making data-driven decisions for optimal plant cultivation.

Dataset Description

Source: plants.csv and plants_model.csv
Key Features:
- Sunlight exposure
- Soil type
- Watering frequency
- Environmental context indicators
Type: Categorical and numeric attributes related to cultivation

Data Treatment

The data was cleaned and preprocessed in the following steps:

- Handling missing values
- Standardizing and formatting categorical variables
- Basic outlier detection
- Encoding string-based features as numeric types

A feature dictionary was also created to document the meaning of each column.
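
The cleaning steps listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the column names (`sunlight`, `soil_type`, `watering_per_week`), the toy rows, the mode/median imputation, and the 1.5 × IQR outlier rule are all assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the real columns in plants.csv may differ.
raw = pd.DataFrame({
    "sunlight": ["Full Sun", "partial shade ", None, "full sun", "Shade"],
    "soil_type": ["Loamy", "sandy", "Loamy", None, "clay"],
    "watering_per_week": [2.0, 3.0, np.nan, 1.0, 40.0],
})


def clean_plants(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the four cleaning steps to a copy of the input frame."""
    out = df.copy()
    cat_cols = ["sunlight", "soil_type"]
    num_col = "watering_per_week"

    for col in cat_cols:
        # Standardize categorical text: trim whitespace and lowercase.
        out[col] = out[col].str.strip().str.lower()
        # Fill missing categories with the most frequent value (assumed strategy).
        out[col] = out[col].fillna(out[col].mode().iloc[0])

    # Fill missing numeric values with the median (assumed strategy).
    out[num_col] = out[num_col].fillna(out[num_col].median())

    # Basic outlier flag using the 1.5 * IQR rule (assumed rule).
    q1, q3 = out[num_col].quantile([0.25, 0.75])
    iqr = q3 - q1
    in_range = out[num_col].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    out["watering_outlier"] = ~in_range

    # Encode string categories as integer codes.
    for col in cat_cols:
        out[col + "_code"] = out[col].astype("category").cat.codes

    return out


clean = clean_plants(raw)
print(clean)
```

Integer codes are convenient for tree-based models. For distance-based or linear models, one-hot encoding (for example with `pd.get_dummies`) is often a safer choice, because integer codes imply an ordering that the categories may not have.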

Feature Engineering

New variables were derived, such as:

- Combined experience indicators based on environmental compatibility
- Scaled indexes to standardize watering and sunlight attributes

These new features were critical for improving model interpretability and performance.
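
As an illustration of the scaled indexes, the sketch below rescales two numeric attributes to the [0, 1] range with scikit-learn's `MinMaxScaler` and averages them into one combined indicator. The column names (`sunlight_hours`, `watering_per_week`), the `care_index` feature, and the choice of min-max scaling are assumptions for the example; the project may have used a different scaling or combination rule.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical numeric columns; the real dataset may use other names and units.
plants = pd.DataFrame({
    "sunlight_hours": [2.0, 4.0, 6.0, 8.0],
    "watering_per_week": [1.0, 2.0, 3.0, 5.0],
})

# Rescale each attribute to [0, 1] so the two are directly comparable.
scaled = MinMaxScaler().fit_transform(plants[["sunlight_hours", "watering_per_week"]])
plants["sunlight_index"] = scaled[:, 0]
plants["watering_index"] = scaled[:, 1]

# Illustrative combined indicator: the mean of the two scaled indexes.
plants["care_index"] = plants[["sunlight_index", "watering_index"]].mean(axis=1)

print(plants)
```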

Modeling and Classification

Multiple classification algorithms were tested:

- Decision Tree
- Random Forest
- K-Nearest Neighbors
- Logistic Regression
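
One way to compare these four algorithms is with k-fold cross-validation, sketched below. The data here is synthetic (generated with `make_classification`) and stands in for the prepared plant features. The hyperparameters, the 5-fold setting, and the scaling pipelines are assumptions for the example, not the project's actual configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the engineered plant features and condition labels.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=0
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    # KNN and logistic regression are scale-sensitive, so they get a scaler.
    "K-Nearest Neighbors": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
}

# Mean accuracy over 5 cross-validation folds for each model.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Putting the scaler inside the pipeline means it is refit on each training fold, which avoids leaking statistics from the validation fold into the model.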

Clustering with KMeans

The KMeans algorithm was applied to identify plant condition groupings.

Process Highlights

- Selecting the number of clusters (k) with the elbow method
- Fitting the model to the transformed features
- Labeling data points according to cluster membership
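
The steps above can be sketched as follows. The data is synthetic (`make_blobs` with three centers) and stands in for the transformed plant features. The range of k values tried and the final `chosen_k = 3` are assumptions for the example; in practice k is read off the elbow plot.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the transformed plant features.
X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# Elbow method: record the within-cluster sum of squares (inertia) for each k.
inertias = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    inertias[k] = km.inertia_

# Plot inertia against k; the "elbow" is where the curve flattens out.
plt.plot(list(inertias), list(inertias.values()), marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Inertia")
plt.savefig("elbow.png")

# Fit the final model with the chosen k (assumed here) and label each point.
chosen_k = 3
labels = KMeans(n_clusters=chosen_k, n_init=10, random_state=0).fit_predict(X_scaled)
```

Inertia always decreases as k grows, so the useful signal is the point where adding another cluster stops giving a large drop.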

Conclusions

- The model successfully classifies plant condition categories with a high accuracy rate.
- Feature engineering based on sunlight and soil played a key role in model performance.
- The workflow is modular and easily expandable to other similar datasets.

Tools and Libraries Used

- Jupyter Notebook
- Pandas, NumPy, Scikit-learn, Matplotlib, Seaborn