	---
	license: mit
	license_name: joke
	license_link: LICENSE
	datasets:
	- ConquestAce/spotify-songs
	pipeline_tag: audio-classification
	tags:
	- music
	- spotify
	- machine-learning
	- music-prediction
	- data-science
	- regression
	- classification
	- popularity-analysis
	---

	# 🎵 Spotify Song Popularity Prediction

	Predict the popularity of a song based on its audio features and estimate potential Spotify royalties.

	[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)

	---

	## 📖 Project Overview

	This project explores machine learning models to predict the popularity of songs using publicly available features such as danceability, energy, tempo, and valence. It also demonstrates a prototype pricing tool that estimates potential Spotify revenue based on predicted popularity.

	Despite the challenges in accurately forecasting popularity due to time-evolving factors, our models show that minimum popularity and expected revenue can be estimated using machine learning techniques.

	---

	## 📊 Dataset

	- Source:
	  - Spotify Web API
	  - Original dataset (~114,000 songs) expanded to ~2 million songs
	- Features:
	  - Acoustic features (energy, danceability, valence, etc.)
	  - Target variable: `popularity` (integer from 0–100)

	---

	## 🔬 Methods

	- Data Cleaning and Preparation:
	  - Removed zero-popularity entries, duplicates (~8% of rows), and outliers
	  - Standardized genres using clustering
	- Exploratory Data Analysis (EDA):
	  - Analyzed distributions, correlations, and cumulative trends
	- Modeling:
	  - Linear Regression, Ridge Regression
	  - Decision Tree, Random Forest, AdaBoost (best recall: 86% on popular songs)
	  - XGBoost (binning) and Neural Networks
	- Revenue Estimation:
	  - Quadratic regression fit between predicted popularity and play counts
	  - Prototype pricing tool predicting Spotify revenue for songs
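The cleaning steps listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the 1.5×IQR outlier rule, the use of exact-row duplicates, the helper name `clean_tracks`, and the toy data are all assumptions made for the example.

```python
import pandas as pd


def clean_tracks(df: pd.DataFrame, feature_cols: list[str]) -> pd.DataFrame:
    """Drop zero-popularity rows, exact duplicate rows, and IQR outliers.

    Hypothetical helper: the README does not say which outlier rule was used,
    so the common 1.5 * IQR fence is assumed here.
    """
    out = df[df["popularity"] > 0].drop_duplicates()
    for col in feature_cols:
        q1, q3 = out[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        out = out[out[col].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
    return out.reset_index(drop=True)


# Tiny made-up frame, only to exercise the function (not project data)
toy = pd.DataFrame({
    "track_name": ["a", "a", "b", "c", "d", "e", "f"],
    "tempo": [120.0, 120.0, 118.0, 125.0, 122.0, 900.0, 119.0],
    "popularity": [40, 40, 0, 55, 61, 30, 48],
})
cleaned = clean_tracks(toy, ["tempo"])
print(cleaned)
```

On the toy frame, the zero-popularity row, the repeated row, and the row with an implausible tempo are dropped.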

	---

	## 🏆 Results

	| Model                   | Highlights                                          |
	|-------------------------|-----------------------------------------------------|
	| Linear/Ridge Regression | Poor fit due to complex, noisy data                 |
	| Random Forest           | Best overall stability (recall on popular songs)    |
	| AdaBoost (weighted)     | Best performance: 86% recall for popular songs      |
	| Neural Networks         | Struggled because the `popularity` target is unstable |

	- Predicted revenue for a song with popularity 55 ≈ \$357,000 CAD.
	- Pricing tool demonstrated practical viability despite prediction limitations.
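To make the "recall for popular songs" metric concrete, here is a hedged sketch of a weighted AdaBoost classifier evaluated on the positive ("popular") class. The data is synthetic, and the popularity threshold of 65 and the inverse-frequency weighting are assumptions; the README does not state which threshold or weighting scheme the project used.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for audio features and a 0-100 popularity score
X = rng.normal(size=(2000, 5))
noise = rng.normal(scale=10, size=2000)
popularity = np.clip(50 + 12 * X[:, 0] + 8 * X[:, 1] + noise, 0, 100)

POPULAR_THRESHOLD = 65  # assumed cut-off for the "popular" class
y = (popularity >= POPULAR_THRESHOLD).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Up-weight the minority "popular" class (assumed weighting scheme)
n_pos = y_train.sum()
n_neg = len(y_train) - n_pos
sample_weight = np.where(y_train == 1, n_neg / n_pos, 1.0)

clf = AdaBoostClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train, sample_weight=sample_weight)

# Recall on the positive class: share of truly popular songs that were found
recall_popular = recall_score(y_test, clf.predict(X_test), pos_label=1)
print(f"Recall on popular songs: {recall_popular:.2f}")
```

Up-weighting the rare class trades some precision for recall, which fits a use case where missing a popular song costs more than a false alarm.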

	---

	## 📈 Example

	Predicting a song’s revenue based on its feature vector:

	```python
	# Example (simplified)
	predicted_popularity = model.predict(features)
	predicted_revenue = pricing_function(predicted_popularity)
	```
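One way `pricing_function` could be built from the quadratic fit described in Methods is sketched below. Everything numeric here is a placeholder: the (popularity, play count) pairs are invented so the fit is easy to follow, and `PAYOUT_PER_STREAM_CAD` is an assumed rate, not a figure from this project or from Spotify.

```python
import numpy as np

# Placeholder (popularity, play count) pairs, NOT project data.
# They follow plays = 1000 * popularity**2 exactly, to keep the example simple.
popularity = np.array([20, 30, 40, 50, 60, 70, 80], dtype=float)
plays = 1000.0 * popularity**2

# Quadratic regression: plays ~ a * p^2 + b * p + c
coeffs = np.polyfit(popularity, plays, deg=2)
plays_from_popularity = np.poly1d(coeffs)

PAYOUT_PER_STREAM_CAD = 0.004  # assumed payout per stream


def pricing_function(predicted_popularity):
    """Map predicted popularity to estimated revenue in CAD."""
    p = np.asarray(predicted_popularity, dtype=float)
    est_plays = np.clip(plays_from_popularity(p), 0, None)  # no negative plays
    return est_plays * PAYOUT_PER_STREAM_CAD


print(pricing_function([30, 50, 70]))
```

Because the function accepts arrays, it can be applied directly to the output of `model.predict(features)`.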

	---

	## 🚀 How to Run

	```bash
	# Clone this repo
	git clone https://huggingface.co/username/spotify-popularity-prediction

	# Install dependencies
	pip install -r requirements.txt

	# Train or evaluate models
	python train_models.py
	python evaluate_models.py

	# Predict song revenue
	python pricing_tool.py
	```

	The scripts can be adapted for different model types (AdaBoost, Random Forest, Neural Net).

	---

	## 🤔 Limitations

	- Song features alone are not sufficient for high-accuracy predictions.
	- "Popularity" is a time-dependent and dynamic metric.
	- Genre diversity (>5000 unique genres) complicated modeling.
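As an illustration of how a long tail of genre labels might be collapsed by clustering (the "Standardized genres using clustering" step in Methods), here is a small sketch. The clustering method the project actually used is not documented here, so character n-gram TF-IDF plus k-means is an assumption, as are the genre names and the cluster count.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up genre labels, only for illustration
genres = [
    "indie rock", "alt rock", "hard rock",
    "deep house", "tech house", "progressive house",
    "k-pop", "j-pop", "synth pop",
]

# Character n-grams let labels that share substrings ("rock", "house") land close together
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
genre_vectors = vectorizer.fit_transform(genres)

N_CLUSTERS = 3  # assumed; the real dataset has >5000 genres and would need far more
kmeans = KMeans(n_clusters=N_CLUSTERS, n_init=10, random_state=0)
labels = kmeans.fit_predict(genre_vectors)

# Map each raw genre to a coarse cluster id usable as a categorical feature
genre_to_cluster = dict(zip(genres, labels.tolist()))
for genre, cluster in genre_to_cluster.items():
    print(f"{cluster}: {genre}")
```

String similarity is only a rough proxy for musical similarity, which is one reason genre embeddings are listed under Future Work.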

	---

	## 🧠 Future Work

	- Predict play count directly instead of popularity.
	- Fine-tune XGBoost and deep neural networks on larger datasets.
	- Integrate time-evolution models for dynamic popularity changes.
	- Improve genre classification with unsupervised learning (e.g., genre embeddings).

	---

	# `popularity_predictor.pth`

	This neural network model is extremely weak. I was not good at data science when I made this.

	## Iterations
	null:

	<details>
	<summary><b>Trained for 500 epochs on 2.1 million songs from the Spotify database</b></summary>

	```python
	import torch
	import torch.nn as nn
	import torch.optim as optim
	from sklearn.model_selection import train_test_split
	from sklearn.preprocessing import StandardScaler
	import pandas as pd

	# NOTE: `df` (a pandas DataFrame) and `numerical_features` (a list of column
	# names whose last entry is 'popularity') are assumed to be defined earlier.

	# Split the data into features and target variable
	X = df[numerical_features[:-1]].values  # all except popularity
	y = df['popularity'].values

	# Split into training and testing sets
	X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

	# Standardize the features (fit on the training split only)
	scaler = StandardScaler()
	X_train = scaler.fit_transform(X_train)
	X_test = scaler.transform(X_test)

	# Convert to float32 PyTorch tensors
	X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
	y_train_tensor = torch.tensor(y_train, dtype=torch.float32).view(-1, 1)  # shape (N, 1)
	X_test_tensor = torch.tensor(X_test, dtype=torch.float32)
	y_test_tensor = torch.tensor(y_test, dtype=torch.float32).view(-1, 1)

	# Define the neural network model
	class PopularityPredictor(nn.Module):
	    def __init__(self):
	        super().__init__()
	        self.fc1 = nn.Linear(X_train.shape[1], 128)
	        self.fc2 = nn.Linear(128, 64)
	        self.fc3 = nn.Linear(64, 32)
	        self.fc4 = nn.Linear(32, 1)

	    def forward(self, x):
	        x = torch.relu(self.fc1(x))
	        x = torch.relu(self.fc2(x))
	        x = torch.relu(self.fc3(x))
	        # fc4 maps the 32 hidden units to one value so the output matches y's (N, 1) shape
	        return self.fc4(x)

	# Create an instance of the model
	model = PopularityPredictor()

	# Define the loss function and optimizer
	criterion = nn.MSELoss()
	optimizer = optim.Adam(model.parameters(), lr=0.001)

	# Train the model (full-batch gradient descent)
	num_epochs = 100
	for epoch in range(num_epochs):
	    model.train()
	    optimizer.zero_grad()

	    # Forward pass
	    outputs = model(X_train_tensor)
	    loss = criterion(outputs, y_train_tensor)

	    # Backward pass and optimization
	    loss.backward()
	    optimizer.step()

	    if (epoch + 1) % 10 == 0:
	        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')

	# Evaluate the model on the held-out test set
	model.eval()
	with torch.no_grad():
	    predicted = model(X_test_tensor)
	    test_loss = criterion(predicted, y_test_tensor)
	print(f'Test MSE: {test_loss.item():.4f}')
	```

	</details>
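The training script stops once `predicted` has been computed for the test set. A small, framework-free sketch of turning such predictions into summary metrics is given below. The helper name `regression_report` and the three-value example are illustrative only and do not reflect this model's actual results.

```python
import numpy as np


def regression_report(y_true, y_pred):
    """Return MAE, RMSE, and R^2 for a set of regression predictions."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    err = y_pred - y_true
    ss_res = np.sum(err**2)                          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean())**2)     # total sum of squares
    return {
        "mae": float(np.mean(np.abs(err))),
        "rmse": float(np.sqrt(np.mean(err**2))),
        "r2": float(1.0 - ss_res / ss_tot),
    }


# Made-up values; with the script's tensors one would pass
# y_test_tensor.numpy() and predicted.numpy() instead.
report = regression_report([10, 50, 90], [20, 50, 80])
print(report)
```

Since popularity is on a 0–100 scale, MAE reads directly as "points of popularity off on average", which is easier to interpret than the raw MSE loss.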

	## 📚 Citation

	If you use this project, please cite:

	```bibtex
	@misc{bhuiyan2024spotify,
	  title={Spotify Song Popularity Prediction},
	  author={Ashiful Bhuiyan and Blanca Fernández Méndez and Nazanin Ghelichi and Pavle Curcin},
	  year={2024},
	  institution={York University}
	}
	```

	---

	## 🧑‍💻 Authors

	- Ashiful Bhuiyan
	- Blanca Elvira Fernández Méndez
	- Nazanin Ghelichi
	- Pavle Curcin

	---
	## 📄 License

	This project is licensed under the [MIT License](LICENSE).

	## 🏷 Tags
	`#spotify` `#machine-learning` `#music-prediction` `#data-science` `#regression` `#classification` `#popularity-analysis`