Cadu Portfolio

Music Streams Analyses

Sep 2024

Overview

This project is an exploratory data analysis (EDA) of a 2023 Spotify dataset. The dataset records musical and contextual attributes for each song (e.g., tempo, energy, danceability, valence), which makes it possible to examine trends, patterns, and distributions across tracks. The work covers data cleaning, descriptive statistics, correlation analysis, group comparisons, and sample-based studies.

Objective

The main goal is to explore the data in depth to:

Understand how musical attributes behave and relate to each other
Identify patterns based on release years, artists, and popularity (streams)
Prepare the dataset for future modeling or clustering
Evaluate statistical measures and their consistency using sampling techniques

Dataset Description

The dataset includes attributes such as:

name, artist, released_year, bpm, energy, danceability_%, valence_%, liveness_%, streams, and genre.
The dataset was sourced from a CSV file (spotify-2023.csv) and loaded in a Google Colab notebook.

Data Cleaning and Preparation

Key steps included:

Detection of missing values using heatmaps
Conversion of string-based numeric fields (e.g. streams) to proper numeric types
Filtering and correcting inconsistencies in musical metrics
Use of .info() and .describe() to inspect the data, followed by data type conversions to structure it
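
The cleaning steps above can be sketched roughly as follows. This is a minimal illustration and not the project's actual notebook code: the helper names clean_streams and missing_report and the toy rows are hypothetical, and only the streams and bpm column names come from the dataset description.

```python
import pandas as pd


def clean_streams(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the `streams` column coerced to numbers."""
    out = df.copy()
    # Drop thousands separators, then coerce; unparseable entries become NaN
    # so they show up in the missing-value report instead of raising.
    as_text = out["streams"].astype(str).str.replace(",", "", regex=False)
    out["streams"] = pd.to_numeric(as_text, errors="coerce")
    return out


def missing_report(df: pd.DataFrame) -> pd.Series:
    """Count missing values per column (the numbers behind a missing-value heatmap)."""
    return df.isna().sum().sort_values(ascending=False)


# Hypothetical toy rows standing in for spotify-2023.csv.
raw = pd.DataFrame(
    {
        "name": ["Song A", "Song B", "Song C"],
        "streams": ["1,200", "3400", "not available"],
        "bpm": [120, None, 95],
    }
)

tidy = clean_streams(raw)
report = missing_report(tidy)
print(tidy.dtypes)
print(report)
```

The same boolean mask, tidy.isna(), is what a heatmap of missing values would be drawn from, for example with seaborn's heatmap function.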

Descriptive Statistical Analysis

Applied .describe() to evaluate distribution, central tendency, and spread for key variables
Plotted histograms and density curves for metrics like BPM, valence, and streams
Used violin plots, boxplots and pairplots for distribution visualization
Compared means and medians between the full dataset and 30% random samples
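
One way to extend .describe() with the median and skewness is sketched below, since a gap between mean and median is a quick signal of a skewed variable. The summarize helper and the toy values are assumptions made for illustration; only the bpm and streams column names come from the dataset description.

```python
import pandas as pd


def summarize(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """describe() transposed to one row per variable, plus median and skewness."""
    stats = df[columns].describe().T
    stats["median"] = df[columns].median()
    stats["skew"] = df[columns].skew()
    return stats


# Hypothetical toy data: a symmetric bpm column and a right-skewed streams column.
toy = pd.DataFrame(
    {
        "bpm": [90, 100, 110, 120, 130],
        "streams": [1, 2, 3, 4, 1000],
    }
)

summary = summarize(toy, ["bpm", "streams"])
print(summary[["mean", "median", "std", "skew"]])
```

In the toy data the mean and median of bpm coincide, while a single large value pulls the mean of streams far above its median.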

Correlation and Relationship Analysis

Calculated Pearson correlation between attributes, such as bpm and streams
Found that some features, such as liveness and valence, correlate only weakly with popularity (streams)
Explored temporal trends in attributes using groupby('released_year')
Created scatter plots to visualize bivariate relationships
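
The correlation and yearly-trend steps above might look like the sketch below. The function names and the toy values are hypothetical and chosen only to make the arithmetic easy to follow; the column names follow the dataset description.

```python
import pandas as pd


def pearson_with_target(df: pd.DataFrame, target: str, features: list[str]) -> pd.Series:
    """Pearson r between each feature column and the target column."""
    return df[features].corrwith(df[target], method="pearson")


def yearly_means(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Average of each column per release year."""
    return df.groupby("released_year")[columns].mean()


# Hypothetical toy data, not values from the real dataset.
toy = pd.DataFrame(
    {
        "released_year": [2021, 2021, 2022, 2022],
        "bpm": [100, 120, 90, 130],
        "valence_%": [10, 20, 30, 40],
        "streams": [20, 40, 60, 80],
    }
)

r = pearson_with_target(toy, "streams", ["bpm", "valence_%"])
trend = yearly_means(toy, ["bpm", "valence_%"])
print(r.round(3))
print(trend)
```

Here valence_% is an exact linear function of streams, so its coefficient is 1, while bpm gives a weaker positive value. Pearson r captures only linear association, which is why the scatter plots remain a useful check.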

Artist and Year-Based Aggregations

Grouped data by artist and released_year to calculate average metrics
Identified the most streamed artists and the most popular release years
Generated visualizations for top-performing songs and artists using bar charts and scatter plots
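
The aggregations above can be approximated with two small groupby helpers. This is a sketch under assumptions: total_streams_by and mean_metrics_by are illustrative names, the rows are made up, and streams is assumed to be numeric already.

```python
import pandas as pd


def total_streams_by(df: pd.DataFrame, key: str, n: int = 10) -> pd.Series:
    """Total streams per group (e.g. artist or released_year), largest first."""
    return df.groupby(key)["streams"].sum().nlargest(n)


def mean_metrics_by(df: pd.DataFrame, key: str, columns: list[str]) -> pd.DataFrame:
    """Average of the given metric columns per group."""
    return df.groupby(key)[columns].mean()


# Hypothetical toy rows.
toy = pd.DataFrame(
    {
        "artist": ["A", "B", "A", "C"],
        "released_year": [2020, 2021, 2021, 2021],
        "bpm": [100, 120, 140, 90],
        "streams": [10, 50, 30, 5],
    }
)

top_artists = total_streams_by(toy, "artist", n=2)
top_years = total_streams_by(toy, "released_year", n=1)
artist_bpm = mean_metrics_by(toy, "artist", ["bpm"])
print(top_artists)
print(top_years)
print(artist_bpm)
```

Calling top_artists.plot.bar() on the result would give a bar chart of the kind described above, provided matplotlib is installed.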

Sampling and Estimation Analysis

Drew random samples (30%) to compare with full population metrics
Analyzed stability of means for variables like bpm, valence_%, and danceability_%
Plotted histograms and bivariate densities for sample subsets
Demonstrated how representative samples can approximate population behavior using mean estimation
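
A sketch of the sample-versus-population comparison follows. It runs on synthetic data, because the real values are not reproduced here: the normal distribution for bpm, the uniform distribution for valence_%, the seeds, and both helper names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def compare_sample_to_population(
    df: pd.DataFrame, columns: list[str], frac: float = 0.30, seed: int = 0
) -> pd.DataFrame:
    """Means of the full data next to the means of one random sample."""
    sample = df.sample(frac=frac, random_state=seed)
    table = pd.DataFrame(
        {
            "population_mean": df[columns].mean(),
            "sample_mean": sample[columns].mean(),
        }
    )
    table["abs_diff"] = (table["population_mean"] - table["sample_mean"]).abs()
    return table


def sample_mean_spread(
    df: pd.DataFrame, column: str, frac: float = 0.30, repeats: int = 200
) -> pd.Series:
    """Sample means from repeated random draws, to gauge how stable the estimate is."""
    means = [df[column].sample(frac=frac, random_state=i).mean() for i in range(repeats)]
    return pd.Series(means, name=f"{column}_sample_mean")


# Synthetic stand-in for the dataset (assumed distributions, not real values).
rng = np.random.default_rng(42)
population = pd.DataFrame(
    {
        "bpm": rng.normal(120, 25, size=1000),
        "valence_%": rng.uniform(0, 100, size=1000),
    }
)

comparison = compare_sample_to_population(population, ["bpm", "valence_%"])
bpm_means = sample_mean_spread(population, "bpm")
print(comparison.round(2))
print("std of repeated sample means:", round(bpm_means.std(), 2))
```

Fixing random_state makes each draw reproducible. Repeating the draw with different seeds shows how far a 30% sample mean typically lands from the population mean.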

Key Insights

Danceability and energy are among the most balanced features across genres
Streams are highly skewed and dominated by a few artists
BPM shows moderate variation but lacks strong correlation with popularity
Random sampling gives good estimates of population metrics when the sample is drawn at random and is large enough