---
license: mit
license_name: joke
license_link: LICENSE
datasets:
- ConquestAce/spotify-songs
pipeline_tag: audio-classification
tags:
- music
- spotify
- machine-learning
- music-prediction
- data-science
- regression
- classification
- popularity-analysis
---
# 🎵 Spotify Song Popularity Prediction
Predict the popularity of a song based on its audio features and estimate potential Spotify royalties.
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
---
## 📖 Project Overview
This project explores machine learning models to predict the popularity of songs using publicly available features such as danceability, energy, tempo, and valence. It also demonstrates a prototype pricing tool that estimates potential Spotify revenue based on predicted popularity.
Despite the challenges in accurately forecasting popularity due to time-evolving factors, our models show that **minimum popularity and expected revenue can be estimated** using machine learning techniques.
---
## 📊 Dataset
- **Source**:
  - Spotify Web API
  - Original dataset (~114,000 songs) expanded to **~2 million songs**
- **Features**:
  - Acoustic features (energy, danceability, valence, etc.)
  - Target variable: `popularity` (integer from 0–100)
---
## 🔬 Methods
- **Data Cleaning and Preparation**:
  - Removed zero-popularity entries, duplicates (~8% of rows), and outliers
  - Standardized genres using clustering
- **Exploratory Data Analysis (EDA)**:
  - Analyzed distributions, correlations, and cumulative trends
- **Modeling**:
  - Linear Regression, Ridge Regression
  - Decision Tree, Random Forest, AdaBoost (best recall: 86% on popular songs)
  - XGBoost (binning) and Neural Networks
- **Revenue Estimation**:
  - Quadratic regression fit between predicted popularity and play counts
  - Prototype pricing tool predicting Spotify revenue for songs
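
The revenue-estimation step can be illustrated with a minimal sketch. It fits a quadratic between popularity and play counts with `numpy.polyfit`, then converts estimated plays to revenue. The data points, the coefficients, the `PAYOUT_PER_STREAM` value, and the `estimate_revenue` helper are all hypothetical and chosen only for illustration; they are not the project's fitted values or its actual pricing tool.

```python
import numpy as np

# Hypothetical (popularity, play count) pairs generated from a known
# quadratic so the fit can be checked. These are NOT the project's data.
popularity = np.array([10, 20, 30, 40, 50, 60, 70, 80], dtype=float)
true_coeffs = (1500.0, -20000.0, 100000.0)  # a, b, c in a*p^2 + b*p + c
plays = np.polyval(true_coeffs, popularity)

# Least-squares fit of a degree-2 polynomial: plays ~ a*p^2 + b*p + c
coeffs = np.polyfit(popularity, plays, deg=2)

# Assumed flat payout per stream, for illustration only.
PAYOUT_PER_STREAM = 0.004


def estimate_revenue(predicted_popularity: float) -> float:
    """Map a predicted popularity score to an estimated revenue."""
    estimated_plays = float(np.polyval(coeffs, predicted_popularity))
    # A quadratic can dip below zero, so clamp plays at zero.
    return max(estimated_plays, 0.0) * PAYOUT_PER_STREAM


print(estimate_revenue(55.0))
```

In practice the fit would use real play counts paired with popularity scores, and the payout rate would come from whatever royalty assumption the pricing tool adopts.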
---
## 🏆 Results
| Model                   | Highlights                                              |
|-------------------------|---------------------------------------------------------|
| Linear/Ridge Regression | Poor fit due to complex, noisy data                     |
| Random Forest           | Most stable overall (recall on popular songs)           |
| AdaBoost (weighted)     | **Best performance**: 86% recall for popular songs      |
| Neural Networks         | Struggled because the `popularity` target is unstable   |
- Predicted revenue for a song with **popularity 55**: **\$357,000 CAD**.
- Pricing tool demonstrated practical viability despite prediction limitations.
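
For readers unfamiliar with the recall figures in the table, here is a minimal sketch of how recall on popular songs can be computed. The cutoff `POPULAR_THRESHOLD = 50` and the sample scores are assumptions made for illustration; the source does not state which threshold defines a "popular" song.

```python
# Hypothetical cutoff: the source does not state what counts as "popular".
POPULAR_THRESHOLD = 50


def recall_on_popular(y_true, y_pred, threshold=POPULAR_THRESHOLD):
    """Fraction of truly popular songs that the model also flags as popular."""
    true_pos = 0
    false_neg = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual >= threshold:
            if predicted >= threshold:
                true_pos += 1
            else:
                false_neg += 1
    total = true_pos + false_neg
    return true_pos / total if total else 0.0


# Made-up popularity scores, not taken from the dataset.
actual = [72, 65, 30, 55, 12, 90]
predicted = [68, 41, 35, 59, 20, 77]
recall = recall_on_popular(actual, predicted)
print(f"Recall on popular songs: {recall:.2f}")
```

High recall means few popular songs are missed, which matters more here than overall accuracy because popular songs are a small share of the data.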
---
## 📈 Example
Predicting a song’s revenue based on its feature vector:
```python
# Example (simplified)
predicted_popularity = model.predict(features)
predicted_revenue = pricing_function(predicted_popularity)
```
---
## 🚀 How to Run
```bash
# Clone this repo
git clone https://huggingface.co/username/spotify-popularity-prediction
# Install dependencies
pip install -r requirements.txt
# Train or evaluate models
python train_models.py
python evaluate_models.py
# Predict song revenue
python pricing_tool.py
```
The scripts can be adapted to different model types (AdaBoost, Random Forest, neural network).
---
## 🤔 Limitations
- Song features alone are **not sufficient** for high-accuracy predictions.
- "Popularity" is a **time-dependent** and **dynamic** metric.
- Genre diversity (>5000 unique genres) complicated modeling.
---
## 🧠 Future Work
- Predict **play count** directly instead of popularity.
- Fine-tune **XGBoost** and **deep neural networks** on larger datasets.
- Integrate **time-evolution models** for dynamic popularity changes.
- Improve genre classification with unsupervised learning (e.g., genre embeddings).
---
# `popularity_predictor.pth`
This neural network model is extremely weak. I was not good at data science when I made this.
## Iterations
**null**:
<details>
<summary><b>Trained for 500 epochs on 2.1 million songs from the Spotify database</b></summary>
```python
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# `df` (a pandas DataFrame of songs) and `numerical_features` (a list of column
# names whose last entry is 'popularity') are assumed to be defined earlier.

# Split the data into features and target variable
X = df[numerical_features[:-1]].values  # all except popularity
y = df['popularity'].values

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features (fit on the training set only to avoid leakage)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Convert to PyTorch tensors
X_train_tensor = torch.FloatTensor(X_train)
y_train_tensor = torch.FloatTensor(y_train).view(-1, 1)  # shape to (N, 1)
X_test_tensor = torch.FloatTensor(X_test)
y_test_tensor = torch.FloatTensor(y_test).view(-1, 1)


# Define the neural network model
class PopularityPredictor(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(X_train.shape[1], 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, 32)
        self.fc4 = nn.Linear(32, 1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = torch.relu(self.fc3(x))
        # Pass through fc4 so the output is (N, 1) and matches the target shape
        return self.fc4(x)


# Create an instance of the model
model = PopularityPredictor()

# Define the loss function and optimizer
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Train the model (full-batch gradient descent: one update per epoch)
num_epochs = 100
for epoch in range(num_epochs):
    model.train()
    optimizer.zero_grad()

    # Forward pass
    outputs = model(X_train_tensor)
    loss = criterion(outputs, y_train_tensor)

    # Backward pass and optimization
    loss.backward()
    optimizer.step()

    if (epoch + 1) % 10 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')

# Evaluate the model on the held-out test set
model.eval()
with torch.no_grad():
    predicted = model(X_test_tensor)
    test_loss = criterion(predicted, y_test_tensor)
print(f'Test MSE: {test_loss.item():.4f}')
```
</details>
## 📚 Citation
If you use this project, please cite:
```bibtex
@misc{bhuiyan2024spotify,
title={Spotify Song Popularity Prediction},
author={Ashiful Bhuiyan and Blanca Fernández Méndez and Nazanin Ghelichi and Pavle Curcin},
year={2024},
institution={York University},
}
```
---
## 🧑‍💻 Authors
- Ashiful Bhuiyan
- Blanca Elvira Fernández Méndez
- Nazanin Ghelichi
- Pavle Curcin
---
## 📄 License
This project is licensed under the [MIT License](LICENSE).
## 🏷 Tags
`#spotify` `#machine-learning` `#music-prediction` `#data-science` `#regression` `#classification` `#popularity-analysis`