{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 🚀 HRHUB - Bilateral Matching System\n",
    "\n",
    "## 🎯 Mathematical Framework:\n",
    "\n",
    "```\n",
    "Candidate ∈ ℝⁿ (multidimensional vector)\n",
    "Company ∈ ℝⁿ   (multidimensional vector)\n",
    "\n",
    "Both live in the SAME vector space!\n",
    "\n",
    "Match Score = cosine_similarity(v_candidate, v_company)\n",
    "```\n",
    "\n",
    "## 📊 Dataset:\n",
    "- **9,544 candidates** (35 dimensions)\n",
    "- **180,000 companies** (multiple dimensions from merged data)\n",
    "\n",
    "---"
   ]
  },
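  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of the match score defined above, using made-up 3-dimensional toy vectors (the real vectors come from the embedding model in Step 6). The values are invented for illustration only.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "# Toy vectors, invented for illustration only\n",
    "v_candidate = np.array([1.0, 2.0, 2.0])\n",
    "v_company = np.array([2.0, 4.0, 4.0])  # same direction, twice the length\n",
    "\n",
    "# Cosine similarity = dot product divided by the product of the norms\n",
    "score = np.dot(v_candidate, v_company) / (\n",
    "    np.linalg.norm(v_candidate) * np.linalg.norm(v_company)\n",
    ")\n",
    "print(round(float(score), 4))  # 1.0, since the vectors point the same way\n",
    "```\n",
    "\n",
    "Cosine similarity compares direction and ignores vector length, so the score always falls between -1 and 1."
   ]
  },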
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 📦 Step 1: Install & Import"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install -q sentence-transformers plotly anthropic\n",
    "\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "from sentence_transformers import SentenceTransformer\n",
    "import plotly.express as px\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "\n",
    "print(\"✅ Ready!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 📂 Step 2: Load & Merge Company Data\n",
    "\n",
    "Building rich 180K company entities by merging multiple tables."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"📂 Loading company datasets...\\n\")\n",
    "\n",
    "# Load base companies table\n",
    "companies_base = pd.read_csv('companies/companies.csv')\n",
    "print(f\"✅ Base companies: {len(companies_base):,} rows\")\n",
    "\n",
    "# Load additional company dimensions\n",
    "company_industries = pd.read_csv('companies/company_industries.csv')\n",
    "print(f\"✅ Company industries: {len(company_industries):,} rows\")\n",
    "\n",
    "company_specialties = pd.read_csv('companies/company_specialties.csv')\n",
    "print(f\"✅ Company specialties: {len(company_specialties):,} rows\")\n",
    "\n",
    "employee_counts = pd.read_csv('companies/employee_counts.csv')\n",
    "print(f\"✅ Employee counts: {len(employee_counts):,} rows\")\n",
    "\n",
    "# Load mappings (for reference)\n",
    "industries_map = pd.read_csv('mappings/industries.csv')\n",
    "skills_map = pd.read_csv('mappings/skills.csv')\n",
    "print(f\"✅ Mappings loaded\")\n",
    "\n",
    "print(f\"\\n📊 Base company columns: {companies_base.columns.tolist()}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 🔗 Step 3: Merge Company Data (Create Rich Entities)\n",
    "\n",
    "Aggregate multiple dimensions into single company profile."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"🔗 Merging company data...\\n\")\n",
    "\n",
    "# Aggregate industries per company (many-to-many)\n",
    "company_industries_agg = company_industries.groupby('company_id')['industry_id'].apply(\n",
    "    lambda x: ', '.join(map(str, x.tolist()))\n",
    ").reset_index()\n",
    "company_industries_agg.columns = ['company_id', 'industries_list']\n",
    "\n",
    "print(f\"✅ Aggregated industries for {len(company_industries_agg):,} companies\")\n",
    "\n",
    "# Aggregate specialties per company\n",
    "company_specialties_agg = company_specialties.groupby('company_id')['specialty'].apply(\n",
    "    lambda x: ' | '.join(x.tolist())\n",
    ").reset_index()\n",
    "company_specialties_agg.columns = ['company_id', 'specialties_list']\n",
    "\n",
    "print(f\"✅ Aggregated specialties for {len(company_specialties_agg):,} companies\")\n",
    "\n",
    "# Merge everything into companies_base\n",
    "companies_full = companies_base.copy()\n",
    "\n",
    "# Merge industries\n",
    "companies_full = companies_full.merge(\n",
    "    company_industries_agg, \n",
    "    on='company_id', \n",
    "    how='left'\n",
    ")\n",
    "\n",
    "# Merge specialties\n",
    "companies_full = companies_full.merge(\n",
    "    company_specialties_agg, \n",
    "    on='company_id', \n",
    "    how='left'\n",
    ")\n",
    "\n",
    "# Merge employee counts\n",
    "companies_full = companies_full.merge(\n",
    "    employee_counts, \n",
    "    on='company_id', \n",
    "    how='left'\n",
    ")\n",
    "\n",
    "# Fill NaN\n",
    "companies_full = companies_full.fillna('')\n",
    "\n",
    "print(f\"\\n✅ MERGED DATASET CREATED!\")\n",
    "print(f\"📊 Final companies: {len(companies_full):,} rows × {len(companies_full.columns)} columns\")\n",
    "print(f\"\\n📋 Columns: {companies_full.columns.tolist()}\")\n",
    "\n",
    "# Show sample\n",
    "print(f\"\\n👀 Sample company:\")\n",
    "companies_full.head(3)"
   ]
  },
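  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A left merge repeats a company row once per matching row in the right-hand table. If a table such as `employee_counts` holds several rows per company (an assumption about this dataset, not confirmed here), the merged frame grows beyond the base table. A minimal sketch with toy data, invented for illustration only:\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Toy tables, invented for illustration only\n",
    "base = pd.DataFrame({'company_id': [1, 2], 'name': ['A', 'B']})\n",
    "counts = pd.DataFrame({\n",
    "    'company_id': [1, 1, 2],\n",
    "    'employee_count': [10, 12, 50],\n",
    "})\n",
    "\n",
    "merged = base.merge(counts, on='company_id', how='left')\n",
    "print(len(base), len(merged))  # 2 3: company 1 now appears twice\n",
    "\n",
    "# Keeping one row per company before merging preserves the row count;\n",
    "# validate='one_to_one' makes pandas raise if duplicate keys remain.\n",
    "counts_last = counts.drop_duplicates(subset='company_id', keep='last')\n",
    "merged = base.merge(counts_last, on='company_id', how='left', validate='one_to_one')\n",
    "print(len(merged))  # 2\n",
    "```\n",
    "\n",
    "Comparing `len(companies_full)` with `len(companies_base)` after the merges is a quick way to spot this kind of duplication."
   ]
  },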
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 📂 Step 4: Load Candidates"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load candidates\n",
    "candidates = pd.read_csv('resume_data.csv')\n",
    "candidates = candidates.fillna('')\n",
    "\n",
    "print(f\"✅ Loaded {len(candidates):,} candidates × {len(candidates.columns)} columns\")\n",
    "print(f\"\\n📋 Candidate columns: {candidates.columns.tolist()[:10]}...\")\n",
    "candidates.head(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 📝 Step 5: Create Text Representations (ℝⁿ preparation)\n",
    "\n",
    "Transform structured data → unified text → embeddings → vectors ∈ ℝⁿ"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"📝 Creating text representations...\\n\")\n",
    "\n",
    "# Candidate text\n",
    "def make_candidate_text(row):\n",
    "    parts = []\n",
    "    \n",
    "    if row.get('skills'): \n",
    "        parts.append(f\"Skills: {row['skills']}\")\n",
    "    if row.get('career_objective'): \n",
    "        parts.append(f\"Objective: {row['career_objective']}\")\n",
    "    if row.get('educational_institution_name'): \n",
    "        parts.append(f\"Education: {row['educational_institution_name']}\")\n",
    "    if row.get('degree_names'): \n",
    "        parts.append(f\"Degree: {row['degree_names']}\")\n",
    "    if row.get('major_field_of_studies'): \n",
    "        parts.append(f\"Field: {row['major_field_of_studies']}\")\n",
    "    if row.get('positions'): \n",
    "        parts.append(f\"Experience: {row['positions']}\")\n",
    "    if row.get('responsibilities'): \n",
    "        parts.append(f\"Responsibilities: {str(row['responsibilities'])[:200]}\")\n",
    "    \n",
    "    return ' | '.join(parts) if parts else \"No info\"\n",
    "\n",
    "# Company text (from merged data!)\n",
    "def make_company_text(row):\n",
    "    parts = []\n",
    "    \n",
    "    if row.get('name'): \n",
    "        parts.append(f\"Company: {row['name']}\")\n",
    "    if row.get('description'): \n",
    "        parts.append(f\"Description: {str(row['description'])[:300]}\")\n",
    "    if row.get('industries_list'): \n",
    "        parts.append(f\"Industries: {row['industries_list']}\")\n",
    "    if row.get('specialties_list'): \n",
    "        parts.append(f\"Specialties: {row['specialties_list']}\")\n",
    "    if row.get('employee_count'): \n",
    "        parts.append(f\"Size: {row['employee_count']} employees\")\n",
    "    if row.get('follower_count'): \n",
    "        parts.append(f\"Followers: {row['follower_count']}\")\n",
    "    if row.get('city') or row.get('state') or row.get('country'): \n",
    "        loc = f\"{row.get('city', '')}, {row.get('state', '')}, {row.get('country', '')}\"\n",
    "        parts.append(f\"Location: {loc}\")\n",
    "    \n",
    "    return ' | '.join(parts) if parts else \"No info\"\n",
    "\n",
    "# Apply\n",
    "candidates['text'] = candidates.apply(make_candidate_text, axis=1)\n",
    "companies_full['text'] = companies_full.apply(make_company_text, axis=1)\n",
    "\n",
    "print(\"✅ Text created!\")\n",
    "print(f\"\\n📄 Sample candidate text:\\n{candidates['text'].iloc[0][:300]}...\")\n",
    "print(f\"\\n📄 Sample company text:\\n{companies_full['text'].iloc[0][:300]}...\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 🧠 Step 6: Generate Embeddings (Transform to ℝⁿ)\n",
    "\n",
    "**CRITICAL:** This creates vectors in the SAME mathematical space!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"🧠 Loading embedding model...\")\n",
    "model = SentenceTransformer('all-MiniLM-L6-v2')  # Creates 384-dim vectors\n",
    "\n",
    "print(f\"✅ Model loaded! Embedding dimension: {model.get_sentence_embedding_dimension()}\")\n",
    "print(f\"\\n🔄 Generating candidate vectors (this may take a few minutes)...\")\n",
    "cand_vectors = model.encode(candidates['text'].tolist(), show_progress_bar=True)\n",
    "\n",
    "print(f\"\\n🔄 Generating company vectors (180K companies - this will take time!)...\")\n",
    "comp_vectors = model.encode(companies_full['text'].tolist(), show_progress_bar=True, batch_size=64)\n",
    "\n",
    "print(f\"\\n✅ VECTORS CREATED!\")\n",
    "print(f\"📊 Candidate vectors: {cand_vectors.shape}\")\n",
    "print(f\"📊 Company vectors: {comp_vectors.shape}\")\n",
    "print(f\"\\n🎯 Both live in ℝ^{model.get_sentence_embedding_dimension()} !\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 🎯 Step 7: Matching Engine (Cosine Similarity)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def cosine_similarity(a, b):\n",
    "    \"\"\"Calculate cosine similarity between two vectors.\"\"\"\n",
    "    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))\n",
    "\n",
    "def find_top_matches(candidate_idx, top_k=10):\n",
    "    \"\"\"\n",
    "    Find top K company matches for a candidate.\n",
    "    \n",
    "    Returns: List of (company_idx, similarity_score)\n",
    "    \"\"\"\n",
    "    cand_vec = cand_vectors[candidate_idx]\n",
    "    \n",
    "    # Calculate similarities with ALL 180K companies\n",
    "    scores = []\n",
    "    for i, comp_vec in enumerate(comp_vectors):\n",
    "        score = cosine_similarity(cand_vec, comp_vec)\n",
    "        scores.append((i, score))\n",
    "    \n",
    "    # Sort by score (descending)\n",
    "    scores.sort(key=lambda x: x[1], reverse=True)\n",
    "    \n",
    "    return scores[:top_k]\n",
    "\n",
    "print(\"✅ Matching engine ready!\")\n",
    "print(f\"📊 Ready to match {len(candidates):,} candidates with {len(companies_full):,} companies!\")"
   ]
  },
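  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The engine above ranks companies for a candidate. Because both sides share one vector space, the reverse direction (ranking candidates for a company) works the same way. A minimal sketch on toy data follows; `top_candidates_for_company` is a hypothetical helper, not part of this notebook, and it assumes no vector has zero norm.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "def top_candidates_for_company(company_vec, candidate_matrix, top_k=10):\n",
    "    # Hypothetical helper: rank candidate rows by cosine similarity to one company\n",
    "    cand_unit = candidate_matrix / np.linalg.norm(candidate_matrix, axis=1, keepdims=True)\n",
    "    comp_unit = company_vec / np.linalg.norm(company_vec)\n",
    "    scores = cand_unit @ comp_unit  # dot product of unit vectors = cosine similarity\n",
    "    order = np.argsort(-scores)[:top_k]\n",
    "    return [(int(i), float(scores[i])) for i in order]\n",
    "\n",
    "# Toy data, invented for illustration only\n",
    "toy_candidates = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])\n",
    "toy_company = np.array([1.0, 0.0])\n",
    "\n",
    "# Candidate 0 points the same way as the company and ranks first, then candidate 2\n",
    "print(top_candidates_for_company(toy_company, toy_candidates, top_k=2))\n",
    "```\n",
    "\n",
    "In this notebook the call would presumably take `comp_vectors[company_idx]` and `cand_vectors` as its arguments."
   ]
  },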
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 🔍 Step 8: Test - Find Matches for Candidate #0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"🔍 Finding top 10 matches for Candidate #0...\\n\")\n",
    "\n",
    "matches = find_top_matches(0, top_k=10)\n",
    "\n",
    "print(\"🎯 Top 10 Company Matches:\\n\")\n",
    "print(\"=\" * 80)\n",
    "print(f\"{'Rank':<6} {'Score':<8} {'Company Name':<40} {'Industry'}\")\n",
    "print(\"=\" * 80)\n",
    "\n",
    "for rank, (comp_idx, score) in enumerate(matches, 1):\n",
    "    company_name = companies_full.iloc[comp_idx].get('name', 'N/A')[:40]\n",
    "    industry = companies_full.iloc[comp_idx].get('industries_list', 'N/A')[:30]\n",
    "    print(f\"{rank:<6} {score:.4f}   {company_name:<40} {industry}\")\n",
    "\n",
    "print(\"=\" * 80)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 📊 Step 9: Visualize Match Distribution"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get scores for sample\n",
    "all_scores = []\n",
    "sample_size = min(100, len(candidates))\n",
    "\n",
    "print(f\"📊 Computing match scores for {sample_size} candidates...\")\n",
    "\n",
    "for i in range(sample_size):\n",
    "    if i % 20 == 0:\n",
    "        print(f\"   Progress: {i}/{sample_size}\")\n",
    "    matches = find_top_matches(i, top_k=10)\n",
    "    for comp_idx, score in matches:\n",
    "        all_scores.append(score)\n",
    "\n",
    "# Plot\n",
    "fig = px.histogram(\n",
    "    x=all_scores,\n",
    "    nbins=50,\n",
    "    title=f'Distribution of Match Scores ({len(candidates):,} candidates × {len(companies_full):,} companies)',\n",
    "    labels={'x': 'Cosine Similarity Score'}\n",
    ")\n",
    "fig.show()\n",
    "\n",
    "print(f\"\\n📊 Statistics:\")\n",
    "print(f\"   Mean: {np.mean(all_scores):.4f}\")\n",
    "print(f\"   Median: {np.median(all_scores):.4f}\")\n",
    "print(f\"   Std: {np.std(all_scores):.4f}\")\n",
    "print(f\"   Max: {np.max(all_scores):.4f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 💾 Step 10: Export Results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Generate matches for sample\n",
    "results = []\n",
    "export_sample = min(500, len(candidates))  # Export matches for 500 candidates\n",
    "\n",
    "print(f\"💾 Generating matches for {export_sample} candidates...\\n\")\n",
    "\n",
    "for i in range(export_sample):\n",
    "    if i % 50 == 0:\n",
    "        print(f\"   Progress: {i}/{export_sample}\")\n",
    "    \n",
    "    matches = find_top_matches(i, top_k=10)\n",
    "    \n",
    "    for rank, (comp_idx, score) in enumerate(matches, 1):\n",
    "        results.append({\n",
    "            'candidate_id': i,\n",
    "            'company_id': companies_full.iloc[comp_idx].get('company_id'),\n",
    "            'company_name': companies_full.iloc[comp_idx].get('name', 'N/A'),\n",
    "            'rank': rank,\n",
    "            'similarity_score': float(score),\n",
    "            'industry': companies_full.iloc[comp_idx].get('industries_list', 'N/A')[:50]\n",
    "        })\n",
    "\n",
    "# Create DataFrame\n",
    "results_df = pd.DataFrame(results)\n",
    "results_df.to_csv('hrhub_matches.csv', index=False)\n",
    "\n",
    "print(f\"\\n✅ Exported {len(results_df):,} matches to hrhub_matches.csv\")\n",
    "print(f\"\\n👀 Preview:\")\n",
    "results_df.head(20)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 🎉 DONE!\n",
    "\n",
    "### ✅ What you have:\n",
    "- **9,544 candidates** as vectors ∈ ℝ³⁸⁴\n",
    "- **180,000 companies** as vectors ∈ ℝ³⁸⁴\n",
    "- Both in the SAME mathematical space!\n",
    "- Cosine similarity matching\n",
    "- Exported results\n",
    "\n",
    "### 🚀 Next steps:\n",
    "1. Add LLM explanations (optional - needs API key)\n",
    "2. Implement user weights for dimensions\n",
    "3. Build UI/API on top"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.8.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}