Music Streams Analyses

Sep 2024

Analyses

Overview

This project is an exploratory data analysis (EDA) of a 2023 Spotify dataset. The dataset contains musical and contextual attributes for each song (e.g., tempo, energy, danceability, valence), which makes it possible to study trends, patterns, and distributions across songs, artists, and release years. The work ranges from basic cleaning to descriptive statistics, correlation analysis, group comparisons, and sample-based studies.

Objective

The main goal is to explore the data in depth to:

  • Understand how musical attributes behave and relate to each other

  • Identify patterns based on release years, artists, and popularity (streams)

  • Prepare the dataset for future modeling or clustering

  • Evaluate statistical measures and their consistency using sampling techniques

Dataset Description

The dataset includes attributes such as name, artist, released_year, bpm, energy, danceability_%, valence_%, liveness_%, streams, and genre.

The data was sourced from a CSV file (spotify-2023.csv) and loaded in Google Colab.

Data Cleaning and Preparation

Key steps included:

  • Detection of missing values using heatmaps

  • Conversion of string-based numeric fields (e.g. streams) to proper numeric types

  • Filtering and correcting inconsistencies in musical metrics

  • Use of .info(), .describe() and data type conversions for structuring
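
The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual notebook code: the rows are made up to stand in for spotify-2023.csv, and the specific problems shown (thousands separators, a non-numeric string, a missing value in streams) are assumptions about what the conversion had to handle.

```python
import pandas as pd

# Hypothetical rows standing in for spotify-2023.csv; all values are made up.
raw = pd.DataFrame({
    "name": ["Song A", "Song B", "Song C", "Song D"],
    "streams": ["1,200,000", "850000", "not available", None],
    "bpm": [120, 95, 140, 110],
})

# Count missing values per column. The boolean mask raw.isna() is also
# what a missing-value heatmap would be drawn from.
missing_before = raw.isna().sum()

# Strip thousands separators, then coerce to numbers.
# errors="coerce" turns unparseable strings into NaN instead of raising.
cleaned = raw.copy()
cleaned["streams"] = pd.to_numeric(
    cleaned["streams"].str.replace(",", "", regex=False),
    errors="coerce",
)

# Drop rows whose stream count could not be recovered (an assumed policy).
cleaned = cleaned.dropna(subset=["streams"]).reset_index(drop=True)

# Inspect the resulting structure and summary statistics.
cleaned.info()
print(cleaned.describe())
```

Coercing with errors="coerce" is one possible choice: it keeps the pipeline running on dirty input, at the cost of silently turning bad values into NaN, so it is worth checking how many rows are affected before dropping them.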

Descriptive Statistical Analysis

  • Applied .describe() to evaluate distribution, central tendency, and spread for key variables

  • Plotted histograms and density curves for metrics like BPM, valence, and streams

  • Used violin plots, boxplots and pairplots for distribution visualization

  • Compared means and medians between total population and random samples (30%)
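
As a rough sketch of the descriptive step, the snippet below builds a compact summary table with the mean, median, spread, and skewness of a few variables. The data is synthetic (generated with NumPy under assumed distributions), so the numbers say nothing about the real dataset; only the column names come from this project.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; the distributions are assumptions for illustration.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "bpm": rng.normal(120, 25, n).round(),
    "valence_%": rng.integers(0, 101, n),
    "streams": rng.lognormal(mean=18, sigma=1.2, size=n),
})

# describe() reports count, mean, std, min, quartiles, and max per column.
# Transposing puts one variable per row, which is easier to scan.
summary = df.describe().T[["mean", "50%", "std", "min", "max"]]

# Skewness is not part of describe(), so add it separately.
summary["skew"] = df.skew()

print(summary.round(2))
```

A mean well above the median (the 50% column) together with positive skew is the usual sign of a right-skewed variable, which is the pattern worth looking for in streams.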

Correlation and Relationship Analysis

  • Calculated Pearson correlation between attributes, such as bpm and streams

  • Found that some features, such as liveness and valence, correlate only weakly with popularity (streams)

  • Explored temporal trends in attributes using groupby('released_year')

  • Created scatter plots to visualize bivariate relationships
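
The correlation and temporal steps might look something like the sketch below. The data is synthetic and its columns are generated independently, so any near-zero correlation here is by construction and not a finding; the year range 2015 to 2023 is also an assumption.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; columns are independent by construction.
rng = np.random.default_rng(1)
n = 400
df = pd.DataFrame({
    "released_year": rng.integers(2015, 2024, n),  # assumed range 2015..2023
    "bpm": rng.normal(120, 25, n),
    "valence_%": rng.integers(0, 101, n),
    "streams": rng.lognormal(18, 1.0, n),
})

# Pearson correlation of every attribute with streams:
# take the "streams" column of the full correlation matrix.
corr_with_streams = (
    df[["bpm", "valence_%", "streams"]].corr(method="pearson")["streams"]
)

# The same coefficient for a single pair of columns.
r_bpm = df["bpm"].corr(df["streams"])

# Temporal trend: average attribute value per release year.
yearly = df.groupby("released_year")[["bpm", "valence_%"]].mean()

print(corr_with_streams.round(3))
print(yearly.round(1))
```

Pearson's coefficient only measures linear association and is sensitive to outliers, so for a heavily skewed variable like streams it may be worth also checking a rank-based alternative such as `method="spearman"`.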

Artist and Year-Based Aggregations

  • Grouped data by artist and released_year to calculate average metrics

  • Identified the most streamed artists and the most popular release years

  • Generated visualizations for top-performing songs and artists using bar charts and scatter plots
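
A minimal sketch of these aggregations, using a tiny made-up table (the artists "A", "B", "C" and all numbers are hypothetical):

```python
import pandas as pd

# Hypothetical data; artist labels and values are made up.
df = pd.DataFrame({
    "artist": ["A", "A", "B", "C", "B", "C"],
    "released_year": [2021, 2022, 2022, 2022, 2023, 2023],
    "streams": [100, 300, 250, 50, 400, 75],
    "bpm": [120, 100, 90, 140, 110, 130],
})

# Most streamed artists: total streams per artist, largest first.
top_artists = (
    df.groupby("artist")["streams"].sum().sort_values(ascending=False)
)

# Most popular release years: total streams per year, largest first.
streams_by_year = (
    df.groupby("released_year")["streams"].sum().sort_values(ascending=False)
)

# Average metrics for each (artist, year) combination.
avg_by_artist_year = (
    df.groupby(["artist", "released_year"], as_index=False)[["bpm", "streams"]]
    .mean()
)

print(top_artists)
print(streams_by_year)
print(avg_by_artist_year)
```

The sorted Series can be passed straight to a bar chart, for example with `top_artists.head(10).plot.bar()` when Matplotlib is installed.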

Sampling and Estimation Analysis

  • Drew random samples (30%) to compare with full population metrics

  • Analyzed stability of means for variables like bpm, valence_%, and danceability_%

  • Plotted histograms and bivariate densities for sample subsets

  • Demonstrated how representative samples can approximate population behavior using mean estimation
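
The comparison between a 30% random sample and the full population can be sketched as below. The population is synthetic (assumed distributions, fixed seeds), and the standard-error column is an addition for illustration rather than something this project is known to have computed.

```python
import numpy as np
import pandas as pd

# Synthetic "population"; the distributions are assumptions for illustration.
rng = np.random.default_rng(42)
n = 1000
population = pd.DataFrame({
    "bpm": rng.normal(120, 25, n),
    "valence_%": rng.uniform(0, 100, n),
    "danceability_%": rng.uniform(20, 95, n),
})

# Draw a 30% simple random sample without replacement.
# random_state makes the draw reproducible.
sample = population.sample(frac=0.30, random_state=42)

# Put population and sample means side by side.
comparison = pd.DataFrame({
    "population_mean": population.mean(),
    "sample_mean": sample.mean(),
})
comparison["abs_diff"] = (
    comparison["population_mean"] - comparison["sample_mean"]
).abs()

# Estimated standard error of each sample mean: s / sqrt(n_sample).
# This ignores the finite population correction, so it is slightly conservative.
comparison["std_error"] = sample.std(ddof=1) / np.sqrt(len(sample))

print(comparison.round(3))
```

If the sample is representative, each abs_diff should typically be within a couple of standard errors; repeating the draw with different random_state values gives a feel for how stable the sample means are.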

Key Insights

  • Danceability and energy are among the most balanced features across genres

  • Streams are highly skewed and dominated by a few artists

  • BPM shows moderate variation but lacks strong correlation with popularity

  • Sampling gives good estimates of population metrics when the sample is drawn at random and is large enough

Tools and Technologies

  • Jupyter Notebook

  • Pandas, NumPy, Seaborn, Matplotlib, Plotly

Visualizations