{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 🚀 HRHUB - Bilateral Matching System\n",
    "\n",
    "## 🎯 Mathematical Framework:\n",
    "\n",
    "```\n",
    "Candidate ∈ ℝⁿ (multidimensional vector)\n",
    "Company ∈ ℝⁿ (multidimensional vector)\n",
    "\n",
    "Both live in the SAME vector space!\n",
    "\n",
    "Match Score = cosine_similarity(v_candidate, v_company)\n",
    "```\n",
    "\n",
    "## 📊 Dataset:\n",
    "- **9,544 candidates** (35 dimensions)\n",
    "- **180,000 companies** (multiple dimensions from merged data)\n",
    "\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 📦 Step 1: Install & Import"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install -q sentence-transformers plotly anthropic\n",
    "\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "from sentence_transformers import SentenceTransformer\n",
    "import plotly.express as px\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "\n",
    "print(\"✅ Ready!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 📂 Step 2: Load & Merge Company Data\n",
    "\n",
    "Building rich 180K company entities by merging multiple tables."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"📂 Loading company datasets...\\n\")\n",
    "\n",
    "# Load base companies table\n",
    "companies_base = pd.read_csv('companies/companies.csv')\n",
    "print(f\"✅ Base companies: {len(companies_base):,} rows\")\n",
    "\n",
    "# Load additional company dimensions\n",
    "company_industries = pd.read_csv('companies/company_industries.csv')\n",
    "print(f\"✅ Company industries: {len(company_industries):,} rows\")\n",
    "\n",
    "company_specialties = pd.read_csv('companies/company_specialties.csv')\n",
    "print(f\"✅ Company specialties: {len(company_specialties):,} rows\")\n",
    "\n",
    "employee_counts = pd.read_csv('companies/employee_counts.csv')\n",
    "print(f\"✅ Employee counts: {len(employee_counts):,} rows\")\n",
    "\n",
    "# Load mappings (for reference)\n",
    "industries_map = pd.read_csv('mappings/industries.csv')\n",
    "skills_map = pd.read_csv('mappings/skills.csv')\n",
    "print(\"✅ Mappings loaded\")\n",
    "\n",
    "print(f\"\\n📊 Base company columns: {companies_base.columns.tolist()}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 🔗 Step 3: Merge Company Data (Create Rich Entities)\n",
    "\n",
    "Aggregate multiple dimensions into a single company profile."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"🔗 Merging company data...\\n\")\n",
    "\n",
    "# Aggregate industries per company (many-to-many)\n",
    "company_industries_agg = company_industries.groupby('company_id')['industry_id'].apply(\n",
    "    lambda x: ', '.join(map(str, x.tolist()))\n",
    ").reset_index()\n",
    "company_industries_agg.columns = ['company_id', 'industries_list']\n",
    "\n",
    "print(f\"✅ Aggregated industries for {len(company_industries_agg):,} companies\")\n",
    "\n",
    "# Aggregate specialties per company (drop missing values so join() only sees strings)\n",
    "company_specialties_agg = company_specialties.groupby('company_id')['specialty'].apply(\n",
    "    lambda x: ' | '.join(x.dropna().astype(str))\n",
    ").reset_index()\n",
    "company_specialties_agg.columns = ['company_id', 'specialties_list']\n",
    "\n",
    "print(f\"✅ Aggregated specialties for {len(company_specialties_agg):,} companies\")\n",
    "\n",
    "# Merge everything into companies_base\n",
    "companies_full = companies_base.copy()\n",
    "\n",
    "# Merge industries\n",
    "companies_full = companies_full.merge(\n",
    "    company_industries_agg,\n",
    "    on='company_id',\n",
    "    how='left'\n",
    ")\n",
    "\n",
    "# Merge specialties\n",
    "companies_full = companies_full.merge(\n",
    "    company_specialties_agg,\n",
    "    on='company_id',\n",
    "    how='left'\n",
    ")\n",
    "\n",
    "# Merge employee counts. Assumption: the table may hold several rows per company,\n",
    "# so keep only the last one to stop the merge from duplicating companies.\n",
    "companies_full = companies_full.merge(\n",
    "    employee_counts.drop_duplicates(subset='company_id', keep='last'),\n",
    "    on='company_id',\n",
    "    how='left'\n",
    ")\n",
    "\n",
    "# Fill NaN\n",
    "companies_full = companies_full.fillna('')\n",
    "\n",
    "print(\"\\n✅ MERGED DATASET CREATED!\")\n",
    "print(f\"📊 Final companies: {len(companies_full):,} rows × {len(companies_full.columns)} columns\")\n",
    "print(f\"\\n📋 Columns: {companies_full.columns.tolist()}\")\n",
    "\n",
    "# Show sample\n",
    "print(\"\\n👀 Sample companies:\")\n",
    "companies_full.head(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 📂 Step 4: Load Candidates"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load candidates\n",
    "candidates = pd.read_csv('resume_data.csv')\n",
    "candidates = candidates.fillna('')\n",
    "\n",
    "print(f\"✅ Loaded {len(candidates):,} candidates × {len(candidates.columns)} columns\")\n",
    "print(f\"\\n📋 Candidate columns: {candidates.columns.tolist()[:10]}...\")\n",
    "candidates.head(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 📝 Step 5: Create Text Representations (ℝⁿ preparation)\n",
    "\n",
    "Transform structured data → unified text → embeddings → vectors ∈ ℝⁿ"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"📝 Creating text representations...\\n\")\n",
    "\n",
    "# Candidate text: join the non-empty resume fields into one labeled string\n",
    "def make_candidate_text(row):\n",
    "    parts = []\n",
    "\n",
    "    if row.get('skills'):\n",
    "        parts.append(f\"Skills: {row['skills']}\")\n",
    "    if row.get('career_objective'):\n",
    "        parts.append(f\"Objective: {row['career_objective']}\")\n",
    "    if row.get('educational_institution_name'):\n",
    "        parts.append(f\"Education: {row['educational_institution_name']}\")\n",
    "    if row.get('degree_names'):\n",
    "        parts.append(f\"Degree: {row['degree_names']}\")\n",
    "    if row.get('major_field_of_studies'):\n",
    "        parts.append(f\"Field: {row['major_field_of_studies']}\")\n",
    "    if row.get('positions'):\n",
    "        parts.append(f\"Experience: {row['positions']}\")\n",
    "    if row.get('responsibilities'):\n",
    "        # Truncate long free text so it does not dominate the embedding\n",
    "        parts.append(f\"Responsibilities: {str(row['responsibilities'])[:200]}\")\n",
    "\n",
    "    return ' | '.join(parts) if parts else \"No info\"\n",
    "\n",
    "# Company text (from merged data!)\n",
    "def make_company_text(row):\n",
    "    parts = []\n",
    "\n",
    "    if row.get('name'):\n",
    "        parts.append(f\"Company: {row['name']}\")\n",
    "    if row.get('description'):\n",
    "        parts.append(f\"Description: {str(row['description'])[:300]}\")\n",
    "    if row.get('industries_list'):\n",
    "        parts.append(f\"Industries: {row['industries_list']}\")\n",
    "    if row.get('specialties_list'):\n",
    "        parts.append(f\"Specialties: {row['specialties_list']}\")\n",
    "    if row.get('employee_count'):\n",
    "        parts.append(f\"Size: {row['employee_count']} employees\")\n",
    "    if row.get('follower_count'):\n",
    "        parts.append(f\"Followers: {row['follower_count']}\")\n",
    "    if row.get('city') or row.get('state') or row.get('country'):\n",
    "        # Join only the non-empty location fields to avoid output like ', , US'\n",
    "        loc = ', '.join(str(row[k]) for k in ('city', 'state', 'country') if row.get(k))\n",
    "        parts.append(f\"Location: {loc}\")\n",
    "\n",
    "    return ' | '.join(parts) if parts else \"No info\"\n",
    "\n",
    "# Apply\n",
    "candidates['text'] = candidates.apply(make_candidate_text, axis=1)\n",
    "companies_full['text'] = companies_full.apply(make_company_text, axis=1)\n",
    "\n",
    "print(\"✅ Text created!\")\n",
    "print(f\"\\n📄 Sample candidate text:\\n{candidates['text'].iloc[0][:300]}...\")\n",
    "print(f\"\\n📄 Sample company text:\\n{companies_full['text'].iloc[0][:300]}...\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 🧠 Step 6: Generate Embeddings (Transform to ℝⁿ)\n",
    "\n",
    "**CRITICAL:** Candidates and companies are encoded with the same model, so their vectors live in the SAME mathematical space!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"🧠 Loading embedding model...\")\n",
    "model = SentenceTransformer('all-MiniLM-L6-v2')  # Creates 384-dim vectors\n",
    "\n",
    "print(f\"✅ Model loaded! Embedding dimension: {model.get_sentence_embedding_dimension()}\")\n",
    "print(\"\\n🔄 Generating candidate vectors (this may take a few minutes)...\")\n",
    "cand_vectors = model.encode(candidates['text'].tolist(), show_progress_bar=True)\n",
    "\n",
    "print(\"\\n🔄 Generating company vectors (180K companies - this will take time!)...\")\n",
    "comp_vectors = model.encode(companies_full['text'].tolist(), show_progress_bar=True, batch_size=64)\n",
    "\n",
    "print(\"\\n✅ VECTORS CREATED!\")\n",
    "print(f\"📊 Candidate vectors: {cand_vectors.shape}\")\n",
    "print(f\"📊 Company vectors: {comp_vectors.shape}\")\n",
    "print(f\"\\n🎯 Both live in ℝ^{model.get_sentence_embedding_dimension()}!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 🎯 Step 7: Matching Engine (Cosine Similarity)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def cosine_similarity(a, b):\n",
    "    \"\"\"Calculate cosine similarity between two vectors.\"\"\"\n",
    "    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))\n",
    "\n",
    "# Company norms never change, so compute them once instead of once per comparison\n",
    "comp_norms = np.linalg.norm(comp_vectors, axis=1)\n",
    "\n",
    "def find_top_matches(candidate_idx, top_k=10):\n",
    "    \"\"\"\n",
    "    Find top K company matches for a candidate.\n",
    "\n",
    "    Returns: List of (company_idx, similarity_score)\n",
    "    \"\"\"\n",
    "    cand_vec = cand_vectors[candidate_idx]\n",
    "\n",
    "    # Cosine similarity against ALL companies in one vectorized pass\n",
    "    # (same formula as cosine_similarity, without a Python loop over 180K rows)\n",
    "    scores = (comp_vectors @ cand_vec) / (comp_norms * np.linalg.norm(cand_vec))\n",
    "\n",
    "    # Indices of the top_k highest scores, best first\n",
    "    top_idx = np.argsort(scores)[::-1][:top_k]\n",
    "\n",
    "    return [(int(i), float(scores[i])) for i in top_idx]\n",
    "\n",
    "print(\"✅ Matching engine ready!\")\n",
    "print(f\"📊 Ready to match {len(candidates):,} candidates with {len(companies_full):,} companies!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 🔍 Step 8: Test - Find Matches for Candidate #0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"🔍 Finding top 10 matches for Candidate #0...\\n\")\n",
    "\n",
    "matches = find_top_matches(0, top_k=10)\n",
    "\n",
    "print(\"🎯 Top 10 Company Matches:\\n\")\n",
    "print(\"=\" * 80)\n",
    "print(f\"{'Rank':<6} {'Score':<8} {'Company Name':<40} {'Industry'}\")\n",
    "print(\"=\" * 80)\n",
    "\n",
    "for rank, (comp_idx, score) in enumerate(matches, 1):\n",
    "    company = companies_full.iloc[comp_idx]\n",
    "    # str() guards against non-string values before slicing\n",
    "    company_name = str(company.get('name', 'N/A'))[:40]\n",
    "    industry = str(company.get('industries_list', 'N/A'))[:30]\n",
    "    print(f\"{rank:<6} {score:<8.4f} {company_name:<40} {industry}\")\n",
    "\n",
    "print(\"=\" * 80)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 📊 Step 9: Visualize Match Distribution"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Collect top-10 scores for a sample of candidates\n",
    "all_scores = []\n",
    "sample_size = min(100, len(candidates))\n",
    "\n",
    "print(f\"📊 Computing match scores for {sample_size} candidates...\")\n",
    "\n",
    "for i in range(sample_size):\n",
    "    if i % 20 == 0:\n",
    "        print(f\"   Progress: {i}/{sample_size}\")\n",
    "    matches = find_top_matches(i, top_k=10)\n",
    "    for comp_idx, score in matches:\n",
    "        all_scores.append(score)\n",
    "\n",
    "# Plot (the title reports the sample actually scored, not the full candidate set)\n",
    "fig = px.histogram(\n",
    "    x=all_scores,\n",
    "    nbins=50,\n",
    "    title=f'Distribution of Top-10 Match Scores ({sample_size} candidates × {len(companies_full):,} companies)',\n",
    "    labels={'x': 'Cosine Similarity Score'}\n",
    ")\n",
    "fig.show()\n",
    "\n",
    "print(\"\\n📊 Statistics:\")\n",
    "print(f\"   Mean: {np.mean(all_scores):.4f}\")\n",
    "print(f\"   Median: {np.median(all_scores):.4f}\")\n",
    "print(f\"   Std: {np.std(all_scores):.4f}\")\n",
    "print(f\"   Max: {np.max(all_scores):.4f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 💾 Step 10: Export Results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Generate matches for sample\n",
    "results = []\n",
    "export_sample = min(500, len(candidates))  # Export matches for up to 500 candidates\n",
    "\n",
    "print(f\"💾 Generating matches for {export_sample} candidates...\\n\")\n",
    "\n",
    "for i in range(export_sample):\n",
    "    if i % 50 == 0:\n",
    "        print(f\"   Progress: {i}/{export_sample}\")\n",
    "\n",
    "    matches = find_top_matches(i, top_k=10)\n",
    "\n",
    "    for rank, (comp_idx, score) in enumerate(matches, 1):\n",
    "        company = companies_full.iloc[comp_idx]\n",
    "        results.append({\n",
    "            'candidate_id': i,\n",
    "            'company_id': company.get('company_id'),\n",
    "            'company_name': company.get('name', 'N/A'),\n",
    "            'rank': rank,\n",
    "            'similarity_score': float(score),\n",
    "            'industry': str(company.get('industries_list', 'N/A'))[:50]\n",
    "        })\n",
    "\n",
    "# Create DataFrame\n",
    "results_df = pd.DataFrame(results)\n",
    "results_df.to_csv('hrhub_matches.csv', index=False)\n",
    "\n",
    "print(f\"\\n✅ Exported {len(results_df):,} matches to hrhub_matches.csv\")\n",
    "print(\"\\n👀 Preview:\")\n",
    "results_df.head(20)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 🎉 DONE!\n",
    "\n",
    "### ✅ What you have:\n",
    "- **9,544 candidates** as vectors ∈ ℝ³⁸⁴\n",
    "- **180,000 companies** as vectors ∈ ℝ³⁸⁴\n",
    "- Both in the SAME mathematical space!\n",
    "- Cosine similarity matching\n",
    "- Top-10 matches for the first 500 candidates, exported to `hrhub_matches.csv`\n",
    "\n",
    "### 🚀 Next steps:\n",
    "1. Add LLM explanations (optional - needs API key)\n",
    "2. Implement user weights for dimensions\n",
    "3. Build UI/API on top"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.8.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}