
Overview
This project performs an extensive exploratory data analysis (EDA) on a Spotify dataset from 2023. The dataset contains multiple musical and contextual attributes for songs (e.g., tempo, energy, danceability, valence), allowing for deep insights into trends, patterns, and distributions within the music industry. The work spans from basic cleaning to advanced descriptive statistics, correlation analysis, group comparisons, and sample-based studies.
Objective
The main goal is to explore the data in depth to:
Understand how musical attributes behave and relate to each other
Identify patterns based on release years, artists, and popularity (streams)
Prepare the dataset for future modeling or clustering
Evaluate statistical measures and their consistency using sampling techniques
Dataset Description
The dataset includes attributes such as name, artist, released_year, bpm, energy, danceability_%, valence_%, liveness_%, streams, and genre.
The dataset was sourced from a CSV file (spotify-2023.csv) and loaded via Google Colab.
Data Cleaning and Preparation
Key steps included:
Detection of missing values using heatmaps
Conversion of string-based numeric fields (e.g. streams) to proper numeric types
Filtering and correcting inconsistencies in musical metrics
Use of .info(), .describe(), and data type conversions for structuring
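The type-conversion and missing-value steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the helper name clean_numeric_columns and the three-row frame are hypothetical, and the column names simply follow the attribute list in this README.

```python
import pandas as pd


def clean_numeric_columns(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return a copy of df with the given columns coerced to a numeric dtype."""
    cleaned = df.copy()
    for col in columns:
        # Drop thousands separators, then coerce; unparseable values become NaN.
        as_text = cleaned[col].astype(str).str.replace(",", "", regex=False)
        cleaned[col] = pd.to_numeric(as_text, errors="coerce")
    return cleaned


# Made-up rows for illustration only (not taken from spotify-2023.csv).
raw = pd.DataFrame({
    "name": ["a", "b", "c"],
    "streams": ["1,000", "2500", "not available"],
    "bpm": [120, 95, 140],
})

clean = clean_numeric_columns(raw, ["streams"])

# Count of missing values per column after coercion.
missing_per_column = clean.isna().sum()
```

Coercing with errors="coerce" turns malformed entries into NaN instead of raising, so they show up in the same missing-value check (or heatmap) as genuinely empty cells.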
Descriptive Statistical Analysis
Applied .describe() to evaluate distribution, central tendency, and spread for key variables
Plotted histograms and density curves for metrics like BPM, valence, and streams
Used violin plots, boxplots and pairplots for distribution visualization
Compared means and medians between total population and random samples (30%)
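As a rough sketch of the descriptive step, the snippet below collects mean, median, standard deviation, and skewness side by side. The helper center_and_spread and the demo values are hypothetical. A mean far above the median, together with positive skewness, is the usual signature of a right-skewed variable.

```python
import pandas as pd


def center_and_spread(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Summarize each column with mean, median, standard deviation, and skewness."""
    subset = df[columns]
    return pd.DataFrame({
        "mean": subset.mean(),
        "median": subset.median(),
        "std": subset.std(),
        "skew": subset.skew(),
    })


# Made-up values for illustration only; one large outlier in "streams".
demo = pd.DataFrame({
    "bpm": [90, 100, 110, 120, 130],
    "streams": [10, 20, 30, 40, 1000],
})

summary = center_and_spread(demo, ["bpm", "streams"])
```

In this toy frame, bpm is symmetric, so its mean and median coincide, while the single large streams value pulls the mean well above the median.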
Correlation and Relationship Analysis
Calculated Pearson correlation between attributes, such as bpm and streams
Identified that some features like liveness and valence had weak correlation to popularity
Explored temporal trends in attributes using groupby('released_year')
groupby('released_year')
Created scatter plots to visualize bivariate relationships
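The correlation and temporal-trend steps might look something like the sketch below. The data is invented for illustration and the column names are assumed from the attribute list; only standard pandas calls (DataFrame.corr, Series.corr, groupby) are used.

```python
import pandas as pd

# Made-up rows for illustration only (not taken from spotify-2023.csv).
demo = pd.DataFrame({
    "released_year": [2021, 2021, 2022, 2022, 2023, 2023],
    "bpm": [100, 120, 90, 110, 130, 150],
    "valence_%": [40, 60, 50, 70, 30, 50],
    "streams": [500, 300, 800, 200, 400, 600],
})

numeric_cols = ["bpm", "valence_%", "streams"]

# Full Pearson correlation matrix across the selected numeric attributes.
corr_matrix = demo[numeric_cols].corr(method="pearson")

# A single pairwise coefficient, e.g. bpm against streams.
bpm_vs_streams = demo["bpm"].corr(demo["streams"])

# Average attribute values per release year, for inspecting temporal trends.
yearly_means = demo.groupby("released_year")[["bpm", "valence_%"]].mean()
```

Pearson correlation only measures linear association, so for a heavily skewed variable such as streams a rank-based alternative (method="spearman") may be worth checking alongside it.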
Artist and Year-Based Aggregations
Grouped data by artist and released_year to calculate average metrics
Identified the most streamed artists and the most popular release years
Generated visualizations for top-performing songs and artists using bar charts and scatter plots
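A minimal sketch of these aggregations, assuming the column names artist, released_year, streams, and bpm from the attribute list. The rows are made up, and the original notebook may rank artists differently (for example by mean rather than total streams).

```python
import pandas as pd

# Made-up rows for illustration only (not taken from spotify-2023.csv).
demo = pd.DataFrame({
    "artist": ["A", "A", "B", "C", "C", "C"],
    "released_year": [2022, 2023, 2023, 2021, 2022, 2023],
    "streams": [100, 200, 500, 50, 60, 70],
    "bpm": [120, 100, 90, 140, 130, 150],
})

# Total streams per artist, keeping the two largest.
top_artists = demo.groupby("artist")["streams"].sum().nlargest(2)

# Total streams per release year, most-streamed year first.
streams_by_year = (
    demo.groupby("released_year")["streams"].sum().sort_values(ascending=False)
)

# Average bpm for each (artist, released_year) combination.
artist_year_means = demo.groupby(["artist", "released_year"])[["bpm"]].mean()
```

The resulting Series can be passed directly to a bar chart, for example top_artists.plot(kind="bar") with Matplotlib installed.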
Sampling and Estimation Analysis
Drew random samples (30%) to compare with full population metrics
Analyzed stability of means for variables like bpm, valence_%, and danceability_%
Plotted histograms and bivariate densities for sample subsets
Demonstrated how representative samples can approximate population behavior using mean estimation
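The sample-versus-population comparison can be sketched as below. Because the real CSV is not bundled here, the population is synthetic (randomly generated with NumPy) and stands in for the dataset. The seeds, the distribution parameters, and the choice of 200 repetitions are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the dataset; values are randomly generated, not real.
rng = np.random.default_rng(0)
population = pd.DataFrame({
    "bpm": rng.normal(120, 25, size=1000),
    "valence_%": rng.uniform(0, 100, size=1000),
    "danceability_%": rng.uniform(20, 95, size=1000),
})

# Draw a 30% simple random sample without replacement.
sample = population.sample(frac=0.30, random_state=42)

# Compare the sample means against the population means.
comparison = pd.DataFrame({
    "population_mean": population.mean(),
    "sample_mean": sample.mean(),
})
comparison["abs_diff"] = (
    comparison["sample_mean"] - comparison["population_mean"]
).abs()

# Repeat the draw with different seeds to see how much the sample mean varies.
sample_means = pd.Series([
    population["bpm"].sample(frac=0.30, random_state=seed).mean()
    for seed in range(200)
])
spread_of_sample_means = sample_means.std()
```

A small spread_of_sample_means relative to the population standard deviation is what "stability of means" amounts to in practice: most 30% samples land close to the population mean.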
Key Insights
Danceability and energy are among the most balanced features across genres
Streams are highly skewed and dominated by a few artists
BPM shows moderate variation but lacks strong correlation with popularity
Sampling provides good estimations for population metrics when randomness and size are controlled
Tools and Technologies
Jupyter Notebook
Pandas, NumPy, Seaborn, Matplotlib, Plotly