
Introduction and Objective
This project demonstrates the creation and validation of a machine learning model designed to automate the classification of financial expenses. The primary objective was to build a system that can accurately assign a category to a transaction based on its textual description. The model was trained on a dataset of expenses pre-classified into 13 distinct categories, including "Aluguel" (Rent), "Saúde" (Health), "Educação" (Education), and "Transporte" (Transportation). The end goal is a robust and reliable tool that can process new, unclassified expense sheets, saving time and improving the accuracy of financial tracking.
Methodology
The project followed a systematic workflow, from data preparation to model deployment.
Data Preprocessing and Cleaning: The initial dataset was loaded from a CSV file using Pandas. A dedicated Python function (tratar_dados) was created to standardize the expense descriptions. This function converted all text to lowercase and used regular expressions (re) to remove punctuation, numerical digits, and extraneous whitespace, ensuring a clean and consistent text format for analysis.
Feature Engineering and Encoding: The cleaned textual descriptions served as the primary feature. These were transformed into a numerical format using Scikit-learn's TfidfVectorizer, which weights each word by how often it appears in a description and how rare it is across all descriptions. The categorical labels (e.g., "Aluguel") were converted into numerical codes (0-12) using LabelEncoder to make them suitable for model training.
Model Training: A RandomForestClassifier, an ensemble of decision trees, was selected for this classification task. The model was configured with 500 estimators (n_estimators=500); a larger ensemble generally gives more stable predictions at the cost of longer training. The dataset was split into an 80% training set and a 20% testing set to train and then evaluate the model.
Model Validation: The model's performance was validated using two methods:
Hold-out Validation: An initial evaluation was performed on the 20% test set, which the model had not seen during training.
Stratified K-Fold Cross-Validation: To check the model's stability and reduce dependence on a single data split, a 10-fold stratified cross-validation was conducted. Each fold serves once as the test set while the other nine are used for training, and stratification keeps the category proportions similar in every fold. This gives a more reliable estimate of the model's performance on new data than a single split.
Model Persistence: The finalized, trained RandomForestClassifier and the TfidfVectorizer were serialized and saved into .pkl files using joblib. This allows the model to be loaded and used for future predictions without retraining.
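The workflow above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the toy descriptions stand in for the real CSV, only 3 of the 13 categories are used, the body of tratar_dados is a guess at the cleaning rules described, and wrapping the vectorizer and classifier in a Pipeline is an assumption made here so that TF-IDF is fitted only on the training portion of each fold.

```python
import re

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder


def tratar_dados(text: str) -> str:
    """Lowercase, replace digits and punctuation with spaces, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]|[\d_]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical toy data (the real project loads descriptions from a CSV file).
templates = {
    "Aluguel": "Aluguel apartamento ref. {i:02d}/2024",
    "Saúde": "Consulta médica - recibo {i}",
    "Transporte": "Passagem ônibus linha {i}",
}
descriptions, categories = [], []
for category, template in templates.items():
    for i in range(1, 21):
        descriptions.append(template.format(i=i))
        categories.append(category)

# Clean the text and encode the category names as integer codes.
texts = [tratar_dados(d) for d in descriptions]
encoder = LabelEncoder()
y = encoder.fit_transform(categories)

# 80/20 hold-out split, stratified so each category appears in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    texts, y, test_size=0.2, stratify=y, random_state=42
)

# TF-IDF features feeding a 500-tree random forest.
model = make_pipeline(
    TfidfVectorizer(),
    RandomForestClassifier(n_estimators=500, random_state=42),
)
model.fit(X_train, y_train)
holdout_accuracy = accuracy_score(y_test, model.predict(X_test))

# 10-fold stratified cross-validation over the full toy dataset.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(model, texts, y, cv=cv, scoring="accuracy")

print(f"hold-out accuracy: {holdout_accuracy:.4f}")
print(f"10-fold CV: mean={scores.mean():.4f}, std={scores.std():.4f}")
```

The toy data is trivially separable, so the numbers it prints say nothing about the real model; the point is only the order of the steps.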
Tools and Technologies
Language: Python
Libraries:
Pandas: For data loading, manipulation, and analysis.
Scikit-learn: For implementing the machine learning pipeline, including RandomForestClassifier, TfidfVectorizer, LabelEncoder, train_test_split, StratifiedKFold, and evaluation metrics (classification_report, accuracy_score).
Re (Regular Expressions): For text cleaning and standardization.
Joblib: For saving and loading the trained model and vectorizer.
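As an illustration of the persistence step, the sketch below saves a fitted model and vectorizer with joblib and loads them back for prediction. The file names, the temporary directory, and the tiny training set are hypothetical; the source only states that .pkl files were written.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical, already-cleaned training data.
texts = ["aluguel apartamento", "consulta medica", "passagem onibus", "aluguel casa"]
labels = ["Aluguel", "Saúde", "Transporte", "Aluguel"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, labels)

# Save both objects: the vectorizer is needed to transform new text
# into the same feature space the model was trained on.
folder = Path(tempfile.mkdtemp())
joblib.dump(model, folder / "model.pkl")            # hypothetical file name
joblib.dump(vectorizer, folder / "vectorizer.pkl")  # hypothetical file name

# Later, possibly in another process: load and predict without retraining.
loaded_model = joblib.load(folder / "model.pkl")
loaded_vectorizer = joblib.load(folder / "vectorizer.pkl")

new_texts = ["aluguel apartamento centro"]
prediction = loaded_model.predict(loaded_vectorizer.transform(new_texts))
print(prediction[0])
```

Saving the vectorizer alongside the classifier matters because a freshly fitted vectorizer would assign different column indices to words, and the loaded model would then read the wrong features.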
Results and Analysis
The model showed high accuracy on held-out data and consistent results across cross-validation folds.
Test Accuracy: On the initial hold-out test set, the model achieved an accuracy of 97.47%.
Classification Report: The detailed report showed consistently high precision, recall, and f1-score values across all expense categories, indicating that the model distinguishes well between the classes.
Cross-Validation Score: The 10-fold stratified cross-validation yielded a mean accuracy of 98.99% with a low standard deviation of 0.0095. This indicates that the model is both accurate and stable, performing consistently across different data splits.
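For readers unfamiliar with the per-category metrics mentioned above, the following sketch shows how classification_report produces them. The true and predicted labels are invented for illustration and are not the project's results.

```python
from sklearn.metrics import classification_report

# Hypothetical labels, for illustration only (not the project's data).
y_true = ["Aluguel", "Aluguel", "Saúde", "Saúde", "Transporte", "Transporte"]
y_pred = ["Aluguel", "Aluguel", "Saúde", "Transporte", "Transporte", "Transporte"]

# Human-readable table: one row per category plus overall averages.
print(classification_report(y_true, y_pred, zero_division=0))

# The same numbers as a nested dict, convenient for programmatic checks.
report = classification_report(y_true, y_pred, output_dict=True, zero_division=0)
print(report["Saúde"]["recall"])          # share of true "Saúde" rows that were found
print(report["Transporte"]["precision"])  # share of "Transporte" predictions that were right
```

In this toy case one "Saúde" expense is mislabeled as "Transporte", which lowers recall for "Saúde" and precision for "Transporte" while leaving "Aluguel" untouched. Reading the report per category in this way reveals weaknesses that a single accuracy figure can hide.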
Conclusion
This project successfully developed an automated, high-accuracy expense classification system. The chosen RandomForestClassifier proved to be exceptionally well-suited for this text-based classification task. The final, persistent model is capable of processing new datasets and accurately assigning expense categories, providing a practical and efficient solution for financial management. The project covers the end-to-end machine learning lifecycle, from data cleaning and feature engineering to rigorous validation and model deployment.
