Ajayan committed on
Commit 40784d8 · 1 Parent(s): 0a18348

Update space

Files changed (3)
  1. README.md +128 -14
  2. app.py +103 -51
  3. requirements.txt +80 -0
README.md CHANGED
@@ -1,14 +1,128 @@
- ---
- title: Book Title Recomender
- emoji: 💬
- colorFrom: yellow
- colorTo: purple
- sdk: gradio
- sdk_version: 5.0.1
- app_file: app.py
- pinned: false
- license: mit
- short_description: Suggests top 5 title from available dataset
- ---
-
- An example chatbot using [Gradio](https://gradio.app), [`huggingface_hub`](https://huggingface.co/docs/huggingface_hub/v0.22.2/en/index), and the [Hugging Face Inference API](https://huggingface.co/docs/api-inference/index).
+ # 📚 Content-Based Book Recommendation System 📖
+
+ This is a content-based book recommendation system that recommends books 📕📗 similar to an input book title based on the similarity of book summaries. The system uses **TF-IDF (📊 Term Frequency-Inverse Document Frequency)** and **Cosine Similarity 🧮** to compare books and find the most relevant recommendations. It provides a user-friendly interface built with **Gradio 💻**, where users can enter a book title and get recommendations.
+
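As a minimal sketch of this idea (illustrative only; the summaries below are invented and this is not code from the repository), TF-IDF vectors of summaries can be compared with cosine similarity like so:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical mini-corpus of book summaries (illustration only).
summaries = [
    "A young wizard attends a school of magic and battles a dark lord.",
    "A hobbit journeys across a fantasy world to destroy a powerful ring.",
    "A detective in Victorian London solves crimes by careful deduction.",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = vectorizer.fit_transform(summaries)  # shape: (n_books, n_terms)

# Cosine similarity of the first book's summary against every summary.
scores = cosine_similarity(tfidf_matrix[0], tfidf_matrix).flatten()
print(scores)  # the book itself scores 1.0; the next-highest scores are its nearest neighbours
```
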
+ ## 📂 Project Structure
+
+ ```
+ .
+ ├── app.py                        # 🚀 Main script that runs the app
+ ├── utils.py                      # 🛠️ Helper functions (data loading, model loading)
+ ├── data/
+ │   ├── books_summary.csv         # 📑 Actual dataset
+ │   ├── cleaned_books_summary.csv # 🧹 Preprocessed dataset
+ ├── model/
+ │   ├── tfidf_vectorizer.pkl      # 🤖 Pre-trained TF-IDF vectorizer
+ │   ├── tfidf_matrix.pkl          # 🗂️ Pre-calculated TF-IDF matrix
+ ├── src/                          # 📦 Source code folder
+ │   ├── data_loader.py            # 📥 Module to load and preprocess data
+ │   ├── feature_engineering.py    # 🧬 Module to create TF-IDF/embedding vectors
+ │   ├── similarity_calculator.py  # 🧮 Module to calculate similarity matrix
+ │   ├── recommender.py            # 📚 Main logic to generate recommendations
+ │   ├── utils.py                  # ⚙️ Utility functions (e.g., cleaning text)
+ ├── requirements.txt              # 📜 List of Python dependencies
+ └── README.md                     # 📝 Project overview and setup instructions
+ ```
+
+ ## 🌟 Features
+
+ - **📚 Book Recommendation:** Enter a book title, and the system will recommend the top 5 similar books based on their summaries.
+ - **🏷️ Categorization:** Each recommended book displays its categories as clickable buttons for a better user experience.
+ - **💻 Interactive UI:** Simple and clean interface using Gradio.
+ - **🔧 Modular Code:** Functions for data loading, preprocessing, model training, and similarity calculation are separated into different files.
+
+ ## 💻 Technologies Used
+
+ - **🐍 Python:** Core language used to build the system.
+ - **💻 Gradio:** For creating a web-based user interface.
+ - **📊 Scikit-learn:** For TF-IDF vectorization and cosine similarity calculation.
+ - **🗂️ Pandas:** For data manipulation and preprocessing.
+ - **🔢 NumPy:** For numerical operations.
+
+ ## ⚙️ Setup Instructions
+
+ ### 1. 🧬 Clone the Repository
+
+ ```bash
+ git clone https://github.com/ajayansaroj17/book_title_recommender.git
+ cd book_title_recommender
+ ```
+
+ ### 2. 📦 Install Dependencies
+
+ Make sure you have Python 3.7+ installed. Then create a virtual environment and install the required libraries:
+
+ ```bash
+ python -m venv venv
+ source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
+ pip install -r requirements.txt
+ ```
+
+ ### 3. 📚 Download or Prepare the Dataset
+
+ The dataset should contain columns for `book_name`, `summaries`, and `categories`. Store the preprocessed dataset as `cleaned_books_summary.csv` in the `data/` folder.
+
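A minimal sketch of what this preprocessing step might look like (the cleaning rules here are assumptions for illustration, not the contents of the project's `src/data_loader.py`):

```python
import pandas as pd

# Load the raw dataset (assumed columns: book_name, summaries, categories).
books = pd.read_csv("data/books_summary.csv")

# Illustrative cleanup: drop rows with no summary and trim stray whitespace.
books = books.dropna(subset=["summaries"])
books["summaries"] = books["summaries"].str.strip()

# Write the cleaned copy expected by the rest of the pipeline.
books.to_csv("data/cleaned_books_summary.csv", index=False)
```
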
+ ### 4. 🏋️‍♂️ Pre-train the Model
+
+ Run the following script to train the TF-IDF model and create the TF-IDF matrix:
+
+ ```bash
+ python train_tfidf_model.py
+ ```
+
+ This will:
+
+ - 🧠 Train the TF-IDF vectorizer on the summaries column.
+ - 🗂️ Create the TF-IDF matrix for all books.
+ - 💾 Save the trained model and matrix as `model/tfidf_vectorizer.pkl` and `model/tfidf_matrix.pkl`.
+
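The training script itself is not part of this change set; a minimal sketch of what it could contain, based on the three bullets above, might be:

```python
import pickle

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Load the cleaned dataset prepared in step 3.
books = pd.read_csv("data/cleaned_books_summary.csv")

# Fit the TF-IDF vectorizer on the summaries column and build the matrix.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = vectorizer.fit_transform(books["summaries"])

# Persist both artifacts so app.py can load them at startup.
with open("model/tfidf_vectorizer.pkl", "wb") as f:
    pickle.dump(vectorizer, f)
with open("model/tfidf_matrix.pkl", "wb") as f:
    pickle.dump(tfidf_matrix, f)
```
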
+ ### 5. 🚀 Run the Application
+
+ Launch the Gradio-based web interface by running:
+
+ ```bash
+ python app.py
+ ```
+
+ The application will open in your browser, allowing you to enter a book title and receive recommendations.
+
+ ## 🤔 How the System Works
+
+ 1. **👤 User Input:** The user enters a book title in the input field.
+ 2. **🔍 Recommendation Logic:**
+    - The system searches for the input book in the dataset.
+    - It calculates the TF-IDF vector of the input book's summary and compares it with the summaries of all other books using cosine similarity.
+    - The top 5 books with the highest similarity scores are returned.
+ 3. **📊 Output:** Recommendations are displayed, including book titles, summaries, and categories as clickable buttons.
+
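The "top 5, excluding the input book itself" selection reduces to an argsort over the similarity scores, as in `app.py`. A small self-contained illustration with made-up scores:

```python
import numpy as np

# Hypothetical similarity scores of a query book against 8 books;
# the query itself sits at index 1 with a perfect score of 1.0.
scores = np.array([0.12, 1.00, 0.34, 0.08, 0.51, 0.29, 0.77, 0.45])

# argsort is ascending, so the last position is the query itself;
# take the five entries just before it and reverse to rank best-first.
top5 = scores.argsort()[-6:-1][::-1]

print(top5)          # [6 4 7 2 5]
print(scores[top5])  # [0.77 0.51 0.45 0.34 0.29]
```
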
+ ## 📄 File Descriptions
+
+ - **app.py:** 🚀 Main script launching the Gradio UI and handling book recommendations.
+ - **utils.py:** 🛠️ Helper functions for loading models, data preprocessing, and utilities.
+ - **feature_engineering.py:** 🧬 Trains the TF-IDF model and creates the TF-IDF matrix.
+ - **data/cleaned_books_summary.csv:** 📚 Cleaned dataset used for training.
+ - **model/tfidf_vectorizer.pkl** and **model/tfidf_matrix.pkl:** 🤖 Pre-trained TF-IDF model and matrix.
+
+ ## 📦 Dependencies
+
+ Install the required Python packages using:
+
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ - **💻 gradio:** For the web interface.
+ - **📊 scikit-learn:** For TF-IDF and cosine similarity calculations.
+ - **🗂️ pandas:** For data manipulation.
+ - **🔢 numpy:** For numerical operations.
+
+ ## 🚀 Potential Extensions and Improvements
+
+ - **🏷️ Category-Based Filtering:** Filter recommendations by specific categories.
+ - **🤖 Advanced NLP Techniques:** Use embeddings like Word2Vec, GloVe, or transformer-based models like BERT.
+ - **👥 Personalization:** Implement a user profiling system for personalized recommendations.
+ - **⚡ Scalability:** Use Approximate Nearest Neighbors (ANN) for faster similarity calculation on large datasets.
+
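As one possible route for the embeddings idea above, `sentence-transformers` (already pinned in `requirements.txt`) can stand in for the TF-IDF vectors; the model name below is only an example and nothing in this commit wires it in:

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Example model choice; any sentence-embedding model would work here.
model = SentenceTransformer("all-MiniLM-L6-v2")

summaries = [
    "A young wizard attends a school of magic.",
    "A detective solves crimes in Victorian London.",
]
embeddings = model.encode(summaries)  # dense vectors instead of sparse TF-IDF

# Same cosine-similarity ranking as before, just over dense embeddings.
print(cosine_similarity(embeddings[:1], embeddings).flatten())
```
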
+ ## 🏁 Conclusion
+
+ This project demonstrates building a content-based book recommendation system using TF-IDF and cosine similarity. The modular design ensures easy maintenance and extension, while Gradio simplifies deployment and user interaction! 🚀
app.py CHANGED
@@ -1,64 +1,116 @@
 
 
  import gradio as gr
- from huggingface_hub import InferenceClient
-
- """
- For more information on `huggingface_hub` Inference API support, please check the docs: https://huggingface.co/docs/huggingface_hub/v0.22.2/en/guides/inference
- """
- client = InferenceClient("HuggingFaceH4/zephyr-7b-beta")
-
-
- def respond(
-     message,
-     history: list[tuple[str, str]],
-     system_message,
-     max_tokens,
-     temperature,
-     top_p,
- ):
-     messages = [{"role": "system", "content": system_message}]
-
-     for val in history:
-         if val[0]:
-             messages.append({"role": "user", "content": val[0]})
-         if val[1]:
-             messages.append({"role": "assistant", "content": val[1]})
-
-     messages.append({"role": "user", "content": message})
-
      response = ""
-
-     for message in client.chat_completion(
-         messages,
-         max_tokens=max_tokens,
-         stream=True,
-         temperature=temperature,
-         top_p=top_p,
-     ):
-         token = message.choices[0].delta.content
-
-         response += token
-         yield response
-
-
- """
- For information on how to customize the ChatInterface, peruse the gradio docs: https://www.gradio.app/docs/chatinterface
- """
- demo = gr.ChatInterface(
-     respond,
-     additional_inputs=[
-         gr.Textbox(value="You are a friendly Chatbot.", label="System message"),
-         gr.Slider(minimum=1, maximum=2048, value=512, step=1, label="Max new tokens"),
-         gr.Slider(minimum=0.1, maximum=4.0, value=0.7, step=0.1, label="Temperature"),
-         gr.Slider(
-             minimum=0.1,
-             maximum=1.0,
-             value=0.95,
-             step=0.05,
-             label="Top-p (nucleus sampling)",
-         ),
-     ],
  )


  if __name__ == "__main__":
-     demo.launch()
+ import pandas as pd
+ from sklearn.metrics.pairwise import cosine_similarity
  import gradio as gr
+ from src.utils import load_from_pickle, validate_input
+
+ VECTOR_PATH = "model/tfidf_vectorizer.pkl"
+ MATRIX_PATH = "model/tfidf_matrix.pkl"
+ DATA_PATH = "data/books_summary.csv"
+
+ # 1. Load the pre-trained models and data
+ print("Loading models and data...")
+ tfidf_vectorizer = load_from_pickle(VECTOR_PATH)
+ tfidf_matrix = load_from_pickle(MATRIX_PATH)
+ books_df = pd.read_csv(DATA_PATH)
+ print(f"Original dataset shape: {books_df.shape}")
+
+ # Group by 'book_name' and 'summaries', and aggregate 'categories' into a single cell
+ books_df = books_df.groupby(["book_name", "summaries"], as_index=False).agg(
+     {"categories": lambda tags: ", ".join(set(tags.dropna()))}
+ )  # Remove duplicates within tags
+ print(f"After aggregating categories: {books_df.shape}")
+
+ # Drop duplicates (just to be extra cautious)
+ books_df = books_df.drop_duplicates(subset=["book_name", "summaries"], keep="first")
+
+ book_titles = books_df["book_name"].tolist()
+ print("Models and data loaded successfully!")
+
+
+ # 2. Recommendation Function
+ def recommend_books(input_book_title):
+     """
+     Recommends the top 5 similar books based on the input book title.
+
+     Args:
+         input_book_title (str): The title of the book input by the user.
+
+     Returns:
+         List of recommended books with their summaries and tags.
+     """
+     # Validate input
+     if not validate_input(input_book_title, book_titles):
+         return "Book title not found in the dataset. Please try another title."
+
+     # Find index of the input book
+     book_index = books_df[books_df["book_name"] == input_book_title].index[0]
+
+     # Compute cosine similarity of the input book's summary against all books
+     cosine_similarities = cosine_similarity(
+         tfidf_matrix[book_index], tfidf_matrix
+     ).flatten()
+
+     # Sort and get top 5 similar books (excluding the input book itself)
+     similar_indices = cosine_similarities.argsort()[-6:-1][::-1]
+     recommendations = books_df.iloc[similar_indices]
+
+     # Format the recommendations for the UI
+     formatted_books = []
+     for _, row in recommendations.iterrows():
+         formatted_books.append(
+             {
+                 "title": row["book_name"],
+                 "description": row["summaries"],
+                 "categories": row["categories"].split(", "),
+             }
+         )
+
+     return formatted_books
+
+
+ def display_recommendations(book_title):
+     """
+     Wrapper function to display recommendations.
+     """
+     result = recommend_books(book_title)
+
+     if isinstance(result, str):  # If it's an error message
+         return result
+
+     # Construct formatted HTML response for book recommendations
      response = ""
+     for book in result:
+         response += f"""
+         <div style='border:1px solid #ddd; border-radius:10px; padding:10px; margin:10px; box-shadow:2px 2px 8px #ccc;'>
+             <h2 style='color:#333;'>{book['title']}</h2>
+             <p style='color:#555;'>{book['description']}</p>
+             <div>
+                 {" ".join([f"<button style='background-color:#007BFF; color:white; border:none; padding:5px 10px; margin:2px; border-radius:5px;'>{tag}</button>" for tag in book['categories']])}
+             </div>
+         </div>
+         """
+     return response
+
+
+ # 3. Gradio Interface
+ # Gradio UI definition
+ interface = gr.Interface(
+     fn=display_recommendations,
+     inputs=gr.Textbox(label="Enter Book Title", placeholder="e.g., The Great Gatsby"),
+     outputs=gr.HTML(label="Top 5 Recommendations"),
+     title="📚 Book Recommendation System",
+     description="Enter the title of a book, and we'll recommend 5 similar books.",
+     theme="compact",
  )


  if __name__ == "__main__":
+     # Run the Gradio interface when app.py is executed
+     interface.launch()
requirements.txt CHANGED
@@ -1 +1,81 @@
+ aiofiles==23.2.1
+ annotated-types==0.7.0
+ anyio==4.7.0
+ black==24.10.0
+ certifi==2024.8.30
+ charset-normalizer==3.4.0
+ click==8.1.7
+ colorama==0.4.6
+ contourpy==1.3.0
+ cycler==0.12.1
+ exceptiongroup==1.2.2
+ fastapi==0.115.6
+ ffmpy==0.4.0
+ filelock==3.16.1
+ fonttools==4.55.3
+ fsspec==2024.10.0
+ gradio==4.44.1
+ gradio_client==1.3.0
+ h11==0.14.0
+ httpcore==1.0.7
+ httpx==0.28.1
+ huggingface-hub==0.26.5
+ idna==3.10
+ importlib_resources==6.4.5
+ Jinja2==3.1.4
+ joblib==1.4.2
+ kiwisolver==1.4.7
+ markdown-it-py==3.0.0
+ MarkupSafe==2.1.5
+ matplotlib==3.9.4
+ mdurl==0.1.2
+ mpmath==1.3.0
+ mypy-extensions==1.0.0
+ networkx==3.2.1
+ numpy==2.0.2
+ orjson==3.10.12
+ packaging==24.2
+ pandas==2.2.3
+ pathspec==0.12.1
+ pillow==10.4.0
+ platformdirs==4.3.6
+ plotly==5.24.1
+ pydantic==2.10.3
+ pydantic_core==2.27.1
+ pydub==0.25.1
+ Pygments==2.18.0
+ pyparsing==3.2.0
+ python-dateutil==2.9.0.post0
+ python-multipart==0.0.19
+ pytz==2024.2
+ PyYAML==6.0.2
+ regex==2024.11.6
+ requests==2.32.3
+ rich==13.9.4
+ ruff==0.8.3
+ safetensors==0.4.5
+ scikit-learn==1.6.0
+ scipy==1.13.1
+ semantic-version==2.10.0
+ sentence-transformers==3.3.1
+ shellingham==1.5.4
+ six==1.17.0
+ sniffio==1.3.1
+ starlette==0.41.3
+ sympy==1.13.1
+ tenacity==9.0.0
+ threadpoolctl==3.5.0
+ tokenizers==0.21.0
+ tomli==2.2.1
+ tomlkit==0.12.0
+ torch==2.5.1
+ tqdm==4.67.1
+ transformers==4.47.0
+ typer==0.15.1
+ typing_extensions==4.12.2
+ tzdata==2024.2
+ urllib3==2.2.3
+ uvicorn==0.33.0
+ websockets==12.0
+ zipp==3.21.0
  huggingface_hub==0.25.2