Ajayan committed on
Commit 40784d8 · 1 Parent(s): 0a18348

Update space

Files changed (3)
  1. README.md +128 -14
  2. app.py +103 -51
  3. requirements.txt +80 -0
README.md CHANGED
@@ -1,14 +1,128 @@
- ---
- title: Book Title Recomender
- emoji: 💬
- colorFrom: yellow
- colorTo: purple
- sdk: gradio
- sdk_version: 5.0.1
- app_file: app.py
- pinned: false
- license: mit
- short_description: Suggests top 5 title from available dataset
- ---
-
- An example chatbot using [Gradio](https://gradio.app), [`huggingface_hub`](https://huggingface.co/docs/huggingface_hub/v0.22.2/en/index), and the [Hugging Face Inference API](https://huggingface.co/docs/api-inference/index).
+ # 📚 Content-Based Book Recommendation System 📖
+
+ This is a content-based book recommendation system that recommends books 📕📗 similar to an input book title based on the similarity of book summaries. The system uses **TF-IDF (📊 Term Frequency-Inverse Document Frequency)** and **Cosine Similarity 🧮** to compare books and find the most relevant recommendations. It provides a user-friendly interface built with **Gradio 💻**, where users can enter a book title and get recommendations.
+
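As a minimal sketch of this idea (illustrative only; the summaries below are invented and this is not code from the repository), TF-IDF vectors of summaries can be compared with cosine similarity like so:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical mini-corpus of book summaries (illustration only).
summaries = [
    "A young wizard attends a school of magic and battles a dark lord.",
    "A hobbit journeys across a fantasy world to destroy a powerful ring.",
    "A detective in Victorian London solves crimes by careful deduction.",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = vectorizer.fit_transform(summaries)  # shape: (n_books, n_terms)

# Cosine similarity of the first book's summary against every summary.
scores = cosine_similarity(tfidf_matrix[0], tfidf_matrix).flatten()
print(scores)  # the book itself scores 1.0; the next-highest scores are its nearest neighbours
```
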
+ ## 📂 Project Structure
+
+ ```
+ .
+ ├── app.py                        # 🚀 Main script that runs the app
+ ├── utils.py                      # 🛠️ Helper functions (data loading, model loading)
+ ├── data/
+ │   ├── books_summary.csv         # 📑 Actual dataset
+ │   ├── cleaned_books_summary.csv # 🧹 Preprocessed dataset
+ ├── model/
+ │   ├── tfidf_vectorizer.pkl      # 🤖 Pre-trained TF-IDF vectorizer
+ │   ├── tfidf_matrix.pkl          # 🗂️ Pre-calculated TF-IDF matrix
+ ├── src/                          # 📦 Source code folder
+ │   ├── data_loader.py            # 📥 Module to load and preprocess data
+ │   ├── feature_engineering.py    # 🧬 Module to create TF-IDF/embedding vectors
+ │   ├── similarity_calculator.py  # 🧮 Module to calculate similarity matrix
+ │   ├── recommender.py            # 📚 Main logic to generate recommendations
+ │   ├── utils.py                  # ⚙️ Utility functions (e.g., cleaning text)
+ ├── requirements.txt              # 📜 List of Python dependencies
+ └── README.md                     # 📝 Project overview and setup instructions
+ ```
+
+ ## 🌟 Features
+
+ - **📚 Book Recommendation:** Enter a book title, and the system will recommend the top 5 similar books based on their summaries.
+ - **🏷️ Categorization:** Each recommended book displays its categories as clickable buttons for a better user experience.
+ - **💻 Interactive UI:** Simple and clean interface using Gradio.
+ - **🔧 Modular Code:** Functions for data loading, preprocessing, model training, and similarity calculation are separated into different files.
+
+ ## 💻 Technologies Used
+
+ - **🐍 Python:** Core language used to build the system.
+ - **💻 Gradio:** For creating a web-based user interface.
+ - **📊 Scikit-learn:** For TF-IDF vectorization and cosine similarity calculation.
+ - **🗂️ Pandas:** For data manipulation and preprocessing.
+ - **🔢 NumPy:** For numerical operations.
+
+ ## ⚙️ Setup Instructions
+
+ ### 1. 🧬 Clone the Repository
+
+ ```bash
+ git clone https://github.com/ajayansaroj17/book_title_recommender.git
+ cd book_title_recommender
+ ```
+
+ ### 2. 📦 Install Dependencies
+
+ Make sure you have Python 3.7+ installed. Then create a virtual environment and install the required libraries:
+
+ ```bash
+ python -m venv venv
+ source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
+ pip install -r requirements.txt
+ ```
+
+ ### 3. 📚 Download or Prepare the Dataset
+
+ The dataset should contain columns for `book_name`, `summaries`, and `categories`. Store the preprocessed dataset as `cleaned_books_summary.csv` in the `data/` folder.
+
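A minimal sketch of what this preprocessing step might look like (the cleaning rules here are assumptions for illustration, not the contents of the project's `src/data_loader.py`):

```python
import pandas as pd

# Load the raw dataset (assumed columns: book_name, summaries, categories).
books = pd.read_csv("data/books_summary.csv")

# Illustrative cleanup: drop rows with no summary and trim stray whitespace.
books = books.dropna(subset=["summaries"])
books["summaries"] = books["summaries"].str.strip()

# Write the cleaned copy expected by the rest of the pipeline.
books.to_csv("data/cleaned_books_summary.csv", index=False)
```
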
+ ### 4. 🏋️‍♂️ Pre-train the Model
+
+ Run the following script to train the TF-IDF model and create the TF-IDF matrix:
+
+ ```bash
+ python train_tfidf_model.py
+ ```
+
+ This will:
+
+ - 🧠 Train the TF-IDF vectorizer on the summaries column.
+ - 🗂️ Create the TF-IDF matrix for all books.
+ - 💾 Save the trained model and matrix as `model/tfidf_vectorizer.pkl` and `model/tfidf_matrix.pkl`.
+
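The training script itself is not part of this change set; a minimal sketch of what it could contain, based on the three bullets above, might be:

```python
import pickle

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Load the cleaned dataset prepared in step 3.
books = pd.read_csv("data/cleaned_books_summary.csv")

# Fit the TF-IDF vectorizer on the summaries column and build the matrix.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = vectorizer.fit_transform(books["summaries"])

# Persist both artifacts so app.py can load them at startup.
with open("model/tfidf_vectorizer.pkl", "wb") as f:
    pickle.dump(vectorizer, f)
with open("model/tfidf_matrix.pkl", "wb") as f:
    pickle.dump(tfidf_matrix, f)
```
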
+ ### 5. 🚀 Run the Application
+
+ Launch the Gradio-based web interface by running:
+
+ ```bash
+ python app.py
+ ```
+
+ The application will open in your browser, allowing you to enter a book title and receive recommendations.
+
+ ## 🤔 How the System Works
+
+ 1. **👤 User Input:** The user enters a book title in the input field.
+ 2. **🔍 Recommendation Logic:**
+    - The system searches for the input book in the dataset.
+    - It calculates the TF-IDF vector of the input book's summary and compares it with the summaries of all other books using cosine similarity.
+    - The top 5 books with the highest similarity scores are returned.
+ 3. **📊 Output:** Recommendations are displayed, including book titles, summaries, and categories as clickable buttons.
+
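The "top 5, excluding the input book itself" selection reduces to an argsort over the similarity scores, as in `app.py`. A small self-contained illustration with made-up scores:

```python
import numpy as np

# Hypothetical similarity scores of a query book against 8 books;
# the query itself sits at index 1 with a perfect score of 1.0.
scores = np.array([0.12, 1.00, 0.34, 0.08, 0.51, 0.29, 0.77, 0.45])

# argsort is ascending, so the last position is the query itself;
# take the five entries just before it and reverse to rank best-first.
top5 = scores.argsort()[-6:-1][::-1]

print(top5)          # [6 4 7 2 5]
print(scores[top5])  # [0.77 0.51 0.45 0.34 0.29]
```
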
+ ## 📄 File Descriptions
+
+ - **app.py:** 🚀 Main script launching the Gradio UI and handling book recommendations.
+ - **utils.py:** 🛠️ Helper functions for loading models, data preprocessing, and utilities.
+ - **feature_engineering.py:** 🧬 Trains the TF-IDF model and creates the TF-IDF matrix.
+ - **data/cleaned_books_summary.csv:** 📚 Cleaned dataset used for training.
+ - **model/tfidf_vectorizer.pkl** and **model/tfidf_matrix.pkl:** 🤖 Pre-trained TF-IDF model and matrix.
+
+ ## 📦 Dependencies
+
+ Install the required Python packages using:
+
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ - **💻 gradio:** For the web interface.
+ - **📊 scikit-learn:** For TF-IDF and cosine similarity calculations.
+ - **🗂️ pandas:** For data manipulation.
+ - **🔢 numpy:** For numerical operations.
+
+ ## 🚀 Potential Extensions and Improvements
+
+ - **🏷️ Category-Based Filtering:** Filter recommendations by specific categories.
+ - **🤖 Advanced NLP Techniques:** Use embeddings like Word2Vec, GloVe, or transformer-based models like BERT.
+ - **👥 Personalization:** Implement a user profiling system for personalized recommendations.
+ - **⚡ Scalability:** Use Approximate Nearest Neighbors (ANN) for faster similarity calculation on large datasets.
+
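As one possible route for the embeddings idea above, `sentence-transformers` (already pinned in `requirements.txt`) can stand in for the TF-IDF vectors; the model name below is only an example and nothing in this commit wires it in:

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Example model choice; any sentence-embedding model would work here.
model = SentenceTransformer("all-MiniLM-L6-v2")

summaries = [
    "A young wizard attends a school of magic.",
    "A detective solves crimes in Victorian London.",
]
embeddings = model.encode(summaries)  # dense vectors instead of sparse TF-IDF

# Same cosine-similarity ranking as before, just over dense embeddings.
print(cosine_similarity(embeddings[:1], embeddings).flatten())
```
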
+ ## 🏁 Conclusion
+
+ This project demonstrates building a content-based book recommendation system using TF-IDF and cosine similarity. The modular design ensures easy maintenance and extension, while Gradio simplifies deployment and user interaction! 🚀
app.py CHANGED
@@ -1,64 +1,116 @@
 
 
  import gradio as gr
- from huggingface_hub import InferenceClient
-
- """
- For more information on `huggingface_hub` Inference API support, please check the docs: https://huggingface.co/docs/huggingface_hub/v0.22.2/en/guides/inference
- """
- client = InferenceClient("HuggingFaceH4/zephyr-7b-beta")
-
-
- def respond(
-     message,
-     history: list[tuple[str, str]],
-     system_message,
-     max_tokens,
-     temperature,
-     top_p,
- ):
-     messages = [{"role": "system", "content": system_message}]
-
-     for val in history:
-         if val[0]:
-             messages.append({"role": "user", "content": val[0]})
-         if val[1]:
-             messages.append({"role": "assistant", "content": val[1]})
-
-     messages.append({"role": "user", "content": message})
-
      response = ""
-
-     for message in client.chat_completion(
-         messages,
-         max_tokens=max_tokens,
-         stream=True,
-         temperature=temperature,
-         top_p=top_p,
-     ):
-         token = message.choices[0].delta.content
-
-         response += token
-         yield response
-
-
- """
- For information on how to customize the ChatInterface, peruse the gradio docs: https://www.gradio.app/docs/chatinterface
- """
- demo = gr.ChatInterface(
-     respond,
-     additional_inputs=[
-         gr.Textbox(value="You are a friendly Chatbot.", label="System message"),
-         gr.Slider(minimum=1, maximum=2048, value=512, step=1, label="Max new tokens"),
-         gr.Slider(minimum=0.1, maximum=4.0, value=0.7, step=0.1, label="Temperature"),
-         gr.Slider(
-             minimum=0.1,
-             maximum=1.0,
-             value=0.95,
-             step=0.05,
-             label="Top-p (nucleus sampling)",
-         ),
-     ],
  )


  if __name__ == "__main__":
-     demo.launch()
+ import pandas as pd
+ from sklearn.metrics.pairwise import cosine_similarity
  import gradio as gr
+ from src.utils import load_from_pickle, validate_input
+
+ VECTOR_PATH = "model/tfidf_vectorizer.pkl"
+ MATRIX_PATH = "model/tfidf_matrix.pkl"
+ DATA_PATH = "data/books_summary.csv"
+
+ # 1. Load the pre-trained models and data
+ print("Loading models and data...")
+ tfidf_vectorizer = load_from_pickle(VECTOR_PATH)
+ tfidf_matrix = load_from_pickle(MATRIX_PATH)
+ books_df = pd.read_csv(DATA_PATH)
+ print(f"Original dataset shape: {books_df.shape}")
+
+ # Group by 'book_name' and 'summaries', and aggregate 'categories' into a single cell
+ books_df = books_df.groupby(["book_name", "summaries"], as_index=False).agg(
+     {"categories": lambda tags: ", ".join(set(tags.dropna()))}
+ )  # Remove duplicates within tags
+ print(f"After aggregating categories: {books_df.shape}")
+
+ # Drop duplicates (just to be extra cautious)
+ books_df = books_df.drop_duplicates(subset=["book_name", "summaries"], keep="first")
+
+ book_titles = books_df["book_name"].tolist()
+ print("Models and data loaded successfully!")
+
+
+ # 2. Recommendation Function
+ def recommend_books(input_book_title):
+     """
+     Recommends the top 5 similar books based on the input book title.
+
+     Args:
+         input_book_title (str): The title of the book input by the user.
+
+     Returns:
+         List of recommended books with their summaries and tags.
+     """
+     # Validate input
+     if not validate_input(input_book_title, book_titles):
+         return "Book title not found in the dataset. Please try another title."
+
+     # Find index of the input book
+     book_index = books_df[books_df["book_name"] == input_book_title].index[0]
+
+     # Compute cosine similarity of the input book's summary against all books
+     cosine_similarities = cosine_similarity(
+         tfidf_matrix[book_index], tfidf_matrix
+     ).flatten()
+
+     # Sort and get top 5 similar books (excluding the input book itself)
+     similar_indices = cosine_similarities.argsort()[-6:-1][::-1]
+     recommendations = books_df.iloc[similar_indices]
+
+     # Format the recommendations for the UI
+     formatted_books = []
+     for _, row in recommendations.iterrows():
+         formatted_books.append(
+             {
+                 "title": row["book_name"],
+                 "description": row["summaries"],
+                 "categories": row["categories"].split(", "),
+             }
+         )
+
+     return formatted_books
+
+
+ def display_recommendations(book_title):
+     """
+     Wrapper function to display recommendations.
+     """
+     result = recommend_books(book_title)
+
+     if isinstance(result, str):  # If it's an error message
+         return result
+
+     # Construct formatted HTML response for book recommendations
      response = ""
+     for book in result:
+         response += f"""
+         <div style='border:1px solid #ddd; border-radius:10px; padding:10px; margin:10px; box-shadow:2px 2px 8px #ccc;'>
+             <h2 style='color:#333;'>{book['title']}</h2>
+             <p style='color:#555;'>{book['description']}</p>
+             <div>
+                 {" ".join([f"<button style='background-color:#007BFF; color:white; border:none; padding:5px 10px; margin:2px; border-radius:5px;'>{tag}</button>" for tag in book['categories']])}
+             </div>
+         </div>
+         """
+     return response
+
+
+ # 3. Gradio Interface
+ # Gradio UI definition
+ interface = gr.Interface(
+     fn=display_recommendations,
+     inputs=gr.Textbox(label="Enter Book Title", placeholder="e.g., The Great Gatsby"),
+     outputs=gr.HTML(label="Top 5 Recommendations"),
+     title="📚 Book Recommendation System",
+     description="Enter the title of a book, and we'll recommend 5 similar books.",
+     theme="compact",
  )


  if __name__ == "__main__":
+     # Run the Gradio interface when app.py is executed
+     interface.launch()
requirements.txt CHANGED
@@ -1 +1,81 @@
+ aiofiles==23.2.1
+ annotated-types==0.7.0
+ anyio==4.7.0
+ black==24.10.0
+ certifi==2024.8.30
+ charset-normalizer==3.4.0
+ click==8.1.7
+ colorama==0.4.6
+ contourpy==1.3.0
+ cycler==0.12.1
+ exceptiongroup==1.2.2
+ fastapi==0.115.6
+ ffmpy==0.4.0
+ filelock==3.16.1
+ fonttools==4.55.3
+ fsspec==2024.10.0
+ gradio==4.44.1
+ gradio_client==1.3.0
+ h11==0.14.0
+ httpcore==1.0.7
+ httpx==0.28.1
+ huggingface-hub==0.26.5
+ idna==3.10
+ importlib_resources==6.4.5
+ Jinja2==3.1.4
+ joblib==1.4.2
+ kiwisolver==1.4.7
+ markdown-it-py==3.0.0
+ MarkupSafe==2.1.5
+ matplotlib==3.9.4
+ mdurl==0.1.2
+ mpmath==1.3.0
+ mypy-extensions==1.0.0
+ networkx==3.2.1
+ numpy==2.0.2
+ orjson==3.10.12
+ packaging==24.2
+ pandas==2.2.3
+ pathspec==0.12.1
+ pillow==10.4.0
+ platformdirs==4.3.6
+ plotly==5.24.1
+ pydantic==2.10.3
+ pydantic_core==2.27.1
+ pydub==0.25.1
+ Pygments==2.18.0
+ pyparsing==3.2.0
+ python-dateutil==2.9.0.post0
+ python-multipart==0.0.19
+ pytz==2024.2
+ PyYAML==6.0.2
+ regex==2024.11.6
+ requests==2.32.3
+ rich==13.9.4
+ ruff==0.8.3
+ safetensors==0.4.5
+ scikit-learn==1.6.0
+ scipy==1.13.1
+ semantic-version==2.10.0
+ sentence-transformers==3.3.1
+ shellingham==1.5.4
+ six==1.17.0
+ sniffio==1.3.1
+ starlette==0.41.3
+ sympy==1.13.1
+ tenacity==9.0.0
+ threadpoolctl==3.5.0
+ tokenizers==0.21.0
+ tomli==2.2.1
+ tomlkit==0.12.0
+ torch==2.5.1
+ tqdm==4.67.1
+ transformers==4.47.0
+ typer==0.15.1
+ typing_extensions==4.12.2
+ tzdata==2024.2
+ urllib3==2.2.3
+ uvicorn==0.33.0
+ websockets==12.0
+ zipp==3.21.0
  huggingface_hub==0.25.2