Cadu Portfolio

Expenses Classification

Nov 2024

Analyses

Classification


Overview

This project focuses on preparing and organizing a dataset of financial transactions to enable classification based on the transaction descriptions. It involves cleaning textual data, encoding categorical information, and structuring the dataset for use in machine learning models.

Problem Statement

The project aims to classify financial transactions into categories by analyzing the content of their description field. This approach is useful for automating financial organization and identifying spending patterns without manual labeling.

Dataset Description

The dataset contains various fields related to financial records, including a description of each transaction and its assigned category.
The primary feature used for classification is the transaction description.
The target variable is the transaction category.

Data Preprocessing

A preprocessing step was applied to standardize the transaction descriptions:

All text was formatted uniformly (e.g., normalization and cleaning).
The goal was to make the descriptions more consistent and suitable for further text analysis.

Additionally, the cleaned dataset was saved to a new file for reuse.
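The write-up does not list the exact normalization steps, so the following is only a minimal sketch of what such a cleaning function might look like. The function name `normalize_description` and the specific rules (accent stripping, lowercasing, punctuation removal, whitespace collapsing) are assumptions for illustration, not the project's confirmed implementation.

```python
import re
import unicodedata


def normalize_description(text: str) -> str:
    """Return a lowercase, accent-free, single-spaced version of `text`.

    Hypothetical helper: the rules below are assumed, not taken from the project.
    """
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Lowercase, then replace anything that is not a letter, digit or space.
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", without_accents.lower())
    # Collapse runs of whitespace and trim both ends.
    return re.sub(r"\s+", " ", cleaned).strip()


example = normalize_description("  PAGTO*Café   São Paulo 12/11 ")
print(example)
```

With pandas, a function like this could be applied through `df["description"].map(normalize_description)` and the result written to a hypothetical output path with `df.to_csv(path, index=False)`; the column name `description` is also an assumption.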

Feature Engineering

The category labels were converted into numeric codes to prepare the data for machine learning algorithms.
Irrelevant columns such as document numbers and raw credit/debit values were removed to focus the model on textual features.
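As a rough illustration of these two steps, the sketch below encodes category names as integer codes with pandas and drops the non-text columns. The sample rows and all column names (`document_number`, `description`, `credit`, `debit`, `category`, `category_code`) are hypothetical, since the write-up does not give the actual schema.

```python
import pandas as pd

# Hypothetical sample; the real column names and values are not given in the write-up.
df = pd.DataFrame({
    "document_number": ["001", "002", "003", "004"],
    "description": ["uber trip", "supermarket", "uber trip", "pharmacy"],
    "credit": [0.0, 0.0, 0.0, 0.0],
    "debit": [23.5, 120.0, 18.9, 42.3],
    "category": ["transport", "groceries", "transport", "health"],
})

# Map each category name to an integer code (codes follow sorted category order).
categories = df["category"].astype("category")
df["category_code"] = categories.cat.codes

# Keep the reverse mapping so predictions can be turned back into names.
code_to_label = dict(enumerate(categories.cat.categories))

# Drop the columns that are not used as model inputs.
features = df.drop(columns=["document_number", "credit", "debit", "category"])
print(features)
```

Keeping `code_to_label` around is a design choice that makes predicted codes readable later; scikit-learn's `LabelEncoder` would be an equally reasonable alternative.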

Dataset Preparation for Modeling

The cleaned dataset was split into training and testing sets so that a classification model could be trained on one part and evaluated on the other.
At this stage the data is structured for supervised learning on the text inputs.
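A split of this kind can be sketched with scikit-learn's `train_test_split`. The toy data, the 80/20 ratio, the stratification and the `random_state` value are all assumptions; the write-up does not state the settings that were actually used.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: ten descriptions, two balanced classes.
descriptions = [
    "uber trip", "bus ticket", "taxi ride", "metro card", "fuel station",
    "supermarket", "bakery", "butcher shop", "fruit market", "grocery store",
]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# Hold out 20% for testing; stratify so both classes appear in each part.
X_train, X_test, y_train, y_test = train_test_split(
    descriptions,
    labels,
    test_size=0.2,
    stratify=labels,
    random_state=42,
)
print(len(X_train), len(X_test))
```

Stratifying is optional, but with small or imbalanced category counts it helps keep rare categories represented in the test set.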

Model Used

The classification model applied in this project is a Random Forest classifier. This ensemble method copes well with high-dimensional data and, by averaging many decision trees, is less prone to overfitting than a single tree. It was trained on the cleaned and preprocessed dataset to learn patterns in transaction descriptions that correlate with the predefined categories.

The model was evaluated on the held-out test set to check how well it generalizes to unseen transactions.
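A Random Forest needs numeric inputs, and the write-up does not say how the descriptions were turned into features. The sketch below assumes a TF-IDF representation via scikit-learn's `TfidfVectorizer`, chained to a `RandomForestClassifier` in a pipeline; the vectorizer choice, the hyperparameters and the toy data are illustrative assumptions rather than the project's actual setup.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical training data: already-normalized descriptions and categories.
train_texts = [
    "uber trip",
    "uber ride downtown",
    "supermarket purchase",
    "supermarket weekly groceries",
    "pharmacy medicine",
    "pharmacy purchase",
]
train_labels = [
    "transport", "transport",
    "groceries", "groceries",
    "health", "health",
]

# Assumed setup: TF-IDF features feeding a Random Forest.
model = make_pipeline(
    TfidfVectorizer(),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
model.fit(train_texts, train_labels)

# Classify a new, unseen description.
predictions = model.predict(["uber trip airport"])
print(predictions[0])
```

Wrapping both steps in one pipeline means the vectorizer is fitted only on the training texts, which avoids leaking vocabulary statistics from the test set.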

Evaluation Metrics

The model was evaluated on 158 test transactions, with the following results:

Accuracy: 0.97
Precision (macro avg): 0.99
Recall (macro avg): 0.96
F1-score (macro avg): 0.97
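Macro averaging computes each metric per category and then takes the unweighted mean, so every category counts equally regardless of how many transactions it has. The sketch below shows how such figures can be computed with scikit-learn; the labels are toy values chosen for illustration and are not the project's test data.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Toy labels for illustration only (not the project's 158 test transactions).
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]

accuracy = accuracy_score(y_true, y_pred)

# average="macro": unweighted mean of the per-class scores.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro"
)

print(f"Accuracy: {accuracy:.2f}")
print(f"Precision (macro avg): {precision:.2f}")
print(f"Recall (macro avg): {recall:.2f}")
print(f"F1-score (macro avg): {f1:.2f}")
```

`sklearn.metrics.classification_report` prints the same macro averages alongside the per-category breakdown, which is useful for spotting a weak category hidden behind a high overall score.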

Tools and Technologies Used

Jupyter Notebook
NumPy, pandas, scikit-learn

Output and Artifacts

A preprocessed CSV file containing normalized text and encoded category labels, ready for modeling.