---
license: mit
license_name: joke
license_link: LICENSE
datasets:
- ConquestAce/spotify-songs
pipeline_tag: audio-classification
tags:
- music
- spotify
- machine-learning
- music-prediction
- data-science
- regression
- classification
- popularity-analysis
---
# 🎵 Spotify Song Popularity Prediction

Predict the popularity of a song based on its audio features and estimate potential Spotify royalties.

---
## 📖 Project Overview

This project explores machine learning models that predict the popularity of songs from publicly available audio features such as danceability, energy, tempo, and valence. It also demonstrates a prototype pricing tool that estimates potential Spotify revenue from the predicted popularity.

Popularity changes over time, which makes it hard to forecast accurately. Even so, our models show that **minimum popularity and expected revenue can be estimated** using machine learning techniques.

---
## 📊 Dataset

- **Source**:
  - Spotify Web API
  - Original dataset (~114,000 songs), expanded to **~2 million songs**
- **Features**:
  - Acoustic features (energy, danceability, valence, etc.)
  - Target variable: `popularity` (integer from 0–100)

---
## 🔬 Methods

- **Data Cleaning and Preparation**:
  - Removed zero-popularity entries, duplicates (~8% of rows), and outliers
  - Standardized genres using clustering
- **Exploratory Data Analysis (EDA)**:
  - Analyzed distributions, correlations, and cumulative trends
- **Modeling**:
  - Linear Regression, Ridge Regression
  - Decision Tree, Random Forest, AdaBoost (best recall: 86% on popular songs)
  - XGBoost (with binning) and Neural Networks
- **Revenue Estimation**:
  - Quadratic regression fit between predicted popularity and play counts
  - Prototype pricing tool predicting Spotify revenue for songs

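
The cleaning steps listed above could look roughly like this in pandas. This is a minimal sketch, not the project's actual code: the `track_id` column name, the 1.5 * IQR outlier rule, and the toy data are all assumptions made for illustration.

```python
import pandas as pd


def clean_songs(df: pd.DataFrame, id_col: str = "track_id") -> pd.DataFrame:
    """Drop zero-popularity rows, duplicate tracks, and IQR outliers."""
    # 1. Remove songs whose popularity is exactly zero.
    cleaned = df[df["popularity"] > 0]

    # 2. Remove duplicates, keeping the first row seen for each track.
    cleaned = cleaned.drop_duplicates(subset=id_col, keep="first")

    # 3. Remove outliers in each numeric feature with the 1.5 * IQR rule
    #    (an assumed rule; the project does not state which one it used).
    feature_cols = cleaned.select_dtypes("number").columns.drop("popularity")
    keep = pd.Series(True, index=cleaned.index)
    for col in feature_cols:
        q1, q3 = cleaned[col].quantile([0.25, 0.75])
        spread = q3 - q1
        keep &= cleaned[col].between(q1 - 1.5 * spread, q3 + 1.5 * spread)
    return cleaned[keep].reset_index(drop=True)


# Tiny synthetic table, for illustration only.
songs = pd.DataFrame(
    {
        "track_id": ["a", "b", "b", "c", "d", "e", "f"],
        "tempo": [120.0, 118.0, 118.0, 125.0, 122.0, 119.0, 900.0],
        "popularity": [40, 55, 55, 0, 62, 48, 70],
    }
)
print(clean_songs(songs))
```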

---

## 🏆 Results

| Model                   | Highlights                                            |
|-------------------------|-------------------------------------------------------|
| Linear/Ridge Regression | Poor fit due to complex, noisy data                   |
| Random Forest           | Most stable overall (recall on popular songs)         |
| AdaBoost (weighted)     | **Best performance**: 86% recall for popular songs    |
| Neural Networks         | Struggled because the "popularity" target is unstable |

- Predicted revenue for a song with **popularity 55** ≈ **\$357,000 CAD**.
- The pricing tool demonstrated practical viability despite prediction limitations.

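
For readers unfamiliar with the metric in the table, recall on popular songs is the share of truly popular songs that the model also flags as popular. The sketch below computes it in plain Python. The cutoff of 50 is an assumption (the project does not state its threshold here), and the scores are synthetic.

```python
def recall_on_popular(y_true, y_pred, threshold=50):
    """Recall for the "popular" class: TP / (TP + FN)."""
    true_popular = [score >= threshold for score in y_true]
    pred_popular = [score >= threshold for score in y_pred]
    true_pos = sum(t and p for t, p in zip(true_popular, pred_popular))
    false_neg = sum(t and not p for t, p in zip(true_popular, pred_popular))
    if true_pos + false_neg == 0:
        raise ValueError("no popular songs in y_true")
    return true_pos / (true_pos + false_neg)


# Synthetic popularity scores, for illustration only.
actual_scores = [72, 15, 64, 55, 30, 81, 49]
predicted_scores = [68, 22, 41, 59, 52, 77, 35]
recall = recall_on_popular(actual_scores, predicted_scores)
print(f"recall on popular songs: {recall:.2f}")  # prints 0.75
```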

---

## 📈 Example

Predicting a song’s revenue from its feature vector:

```python
# Example (simplified): `model`, `features`, and `pricing_function`
# are assumed to be defined elsewhere.
predicted_popularity = model.predict(features)
predicted_revenue = pricing_function(predicted_popularity)
```
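One way a `pricing_function` like this could be built, following the quadratic fit between popularity and play counts described in Methods, is sketched below with NumPy. This is an illustration under assumptions, not the project's pricing tool: the data points are synthetic and `PAYOUT_PER_PLAY` is a hypothetical per-stream rate.

```python
import numpy as np

# Synthetic (popularity, play count) pairs, for illustration only.
popularity = np.array([10, 20, 30, 40, 50, 60, 70, 80], dtype=float)
plays = 1_000.0 * popularity**2 + 5_000.0 * popularity + 20_000.0

# Fit plays ~ a * popularity**2 + b * popularity + c by least squares.
a, b, c = np.polyfit(popularity, plays, deg=2)

# Hypothetical payout per stream; not a figure from this project.
PAYOUT_PER_PLAY = 0.004


def pricing_function(predicted_popularity):
    """Map predicted popularity to estimated revenue via the fitted quadratic."""
    p = np.asarray(predicted_popularity, dtype=float)
    estimated_plays = a * p**2 + b * p + c
    return estimated_plays * PAYOUT_PER_PLAY


print(pricing_function([55]))
```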

---

## 🚀 How to Run

```bash
# Clone this repo
git clone https://huggingface.co/username/spotify-popularity-prediction
cd spotify-popularity-prediction

# Install dependencies
pip install -r requirements.txt

# Train or evaluate models
python train_models.py
python evaluate_models.py

# Predict song revenue
python pricing_tool.py
```

(The scripts can be adapted for different model types: AdaBoost, Random Forest, Neural Net.)

---
## 🤔 Limitations

- Song features alone are **not sufficient** for high-accuracy predictions.
- "Popularity" is a **time-dependent** and **dynamic** metric.
- Genre diversity (>5,000 unique genres) complicated modeling.

---
## 🧠 Future Work

- Predict **play count** directly instead of popularity.
- Fine-tune **XGBoost** and **deep neural networks** on larger datasets.
- Integrate **time-evolution models** for dynamic popularity changes.
- Improve genre classification with unsupervised learning (e.g., genre embeddings).

---
# `popularity_predictor.pth`

This neural network model is extremely weak. I was not good at data science when I made this.

## Iterations

**null**:

<details>
<summary><b>Trained for 500 epochs on 2.1 million songs from the Spotify database</b></summary>

```python
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Assumes `df` (a pandas DataFrame of songs) and `numerical_features`
# (a list of numeric column names with 'popularity' last) already exist.

# Split the data into features and target variable
X = df[numerical_features[:-1]].values  # all except popularity
y = df['popularity'].values

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features (fit on the training split only)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Convert to PyTorch tensors
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.float32).view(-1, 1)  # shape to (N, 1)
X_test_tensor = torch.tensor(X_test, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test, dtype=torch.float32).view(-1, 1)

# Define the neural network model
class PopularityPredictor(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(X_train.shape[1], 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, 32)
        self.fc4 = nn.Linear(32, 1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = torch.relu(self.fc3(x))
        # Final layer gives one value per song, matching the (N, 1) targets
        return self.fc4(x)

# Create an instance of the model
model = PopularityPredictor()

# Define the loss function and optimizer
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Train the model (full-batch gradient descent)
num_epochs = 100  # note: the summary above mentions 500 epochs
for epoch in range(num_epochs):
    model.train()
    optimizer.zero_grad()

    # Forward pass
    outputs = model(X_train_tensor)
    loss = criterion(outputs, y_train_tensor)

    # Backward pass and optimization
    loss.backward()
    optimizer.step()

    if (epoch+1) % 10 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')

# Evaluate the model
model.eval()
with torch.no_grad():
    predicted = model(X_test_tensor)
```
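The script above ends once `predicted` holds the test-set outputs. As a rough sketch of how those predictions could be scored, the helper below computes MAE and RMSE in plain Python. It assumes the tensors have first been converted to lists (for example with `predicted.squeeze(1).tolist()`), and the numbers shown are synthetic.

```python
import math


def regression_errors(y_true, y_pred):
    """Return (MAE, RMSE) for two equal-length sequences of numbers."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and the same length")
    diffs = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(d) for d in diffs) / len(diffs)
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return mae, rmse


# Synthetic popularity values, for illustration only.
y_true_list = [40.0, 55.0, 62.0, 48.0]
y_pred_list = [43.0, 51.0, 62.0, 53.0]
mae, rmse = regression_errors(y_true_list, y_pred_list)
print(f"MAE: {mae:.2f}, RMSE: {rmse:.2f}")  # prints MAE: 3.00, RMSE: 3.54
```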
</details>

## 📚 Citation

If you use this project, please cite:

```bibtex
@misc{bhuiyan2024spotify,
  title={Spotify Song Popularity Prediction},
  author={Ashiful Bhuiyan and Blanca Fernández Méndez and Nazanin Ghelichi and Pavle Curcin},
  year={2024},
  institution={York University},
}
```

---
## 🧑‍💻 Authors

- Ashiful Bhuiyan
- Blanca Elvira Fernández Méndez
- Nazanin Ghelichi
- Pavle Curcin

---
## 📄 License

This project is licensed under the [MIT License](LICENSE).

## 🏷 Tags

`#spotify` `#machine-learning` `#music-prediction` `#data-science` `#regression` `#classification` `#popularity-analysis`