# 🧠 HRHUB v2.1 - Enhanced with LLM (FREE VERSION)

## 📘 Project Overview

**Bilateral HR Matching System with LLM-Powered Intelligence**

### What's New in v2.1:
- ✅ **FREE LLM**: Using Hugging Face Inference API (no cost)
- ✅ **Job Level Classification**: Zero-shot & few-shot learning
- ✅ **Structured Skills Extraction**: Pydantic schemas
- ✅ **Match Explainability**: LLM-generated reasoning
- ✅ **Flexible Data Loading**: Upload OR Google Drive

### Tech Stack:
```
Embeddings: sentence-transformers (local, free)
LLM: Hugging Face Inference API (free tier)
Schemas: Pydantic
Platform: Google Colab → VS Code
```

---

**Master's Thesis - Aalborg University**  
*Business Data Science Program*  
*December 2025*

---
## 📊 Step 1: Install Dependencies

In [1]:
# Install required packages (uncomment on first run).
# python-dotenv is included because Step 2 imports load_dotenv.
#!pip install -q sentence-transformers huggingface-hub pydantic plotly pyvis nbformat scikit-learn pandas numpy python-dotenv

print("✅ All packages installed!")

✅ All packages installed!


---
## 📊 Step 2: Import Libraries

In [2]:
import pandas as pd
import numpy as np
import json
import os
from typing import List, Dict, Optional, Literal
import warnings
warnings.filterwarnings('ignore')

# ML & NLP
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# LLM Integration (FREE)
from huggingface_hub import InferenceClient
from pydantic import BaseModel, Field

# Visualization
import plotly.graph_objects as go
from IPython.display import HTML, display

# Configuration Settings
from dotenv import load_dotenv

# Load variables from .env
load_dotenv()
print("✅ Environment variables loaded from .env")

print("✅ All libraries imported!")

✅ Environment variables loaded from .env
✅ All libraries imported!


---
## 📊 Step 3: Configuration

In [3]:
class Config:
    """Centralized configuration for VS Code"""
    
    # Paths - VS Code structure
    CSV_PATH = '../csv_files/'
    PROCESSED_PATH = '../processed/'
    RESULTS_PATH = '../results/'
    
    # Embedding Model
    EMBEDDING_MODEL = 'all-MiniLM-L6-v2'
    
    # LLM Settings (FREE - Hugging Face)
    HF_TOKEN = os.getenv('HF_TOKEN', '')  # ✅ Read from .env
    LLM_MODEL = 'meta-llama/Llama-3.2-3B-Instruct'
    
    LLM_MAX_TOKENS = 1000
    
    # Matching Parameters
    TOP_K_MATCHES = 10
    SIMILARITY_THRESHOLD = 0.5
    RANDOM_SEED = 42

np.random.seed(Config.RANDOM_SEED)

print("✅ Configuration loaded!")
print(f"🧠 Embedding model: {Config.EMBEDDING_MODEL}")
print(f"🤖 LLM model: {Config.LLM_MODEL}")
print(f"🔑 HF Token configured: {'Yes ✅' if Config.HF_TOKEN else 'No ⚠️'}")
print(f"📂 Data path: {Config.CSV_PATH}")

✅ Configuration loaded!
🧠 Embedding model: all-MiniLM-L6-v2
🤖 LLM model: meta-llama/Llama-3.2-3B-Instruct
🔑 HF Token configured: Yes ✅
📂 Data path: ../csv_files/


---
## 📊 Step 4: Load All Datasets

In [4]:
print("📂 Loading all datasets...\n")
print("=" * 70)

# Load main datasets
candidates = pd.read_csv(f'{Config.CSV_PATH}resume_data.csv')
print(f"✅ Candidates: {len(candidates):,} rows × {len(candidates.columns)} columns")

companies_base = pd.read_csv(f'{Config.CSV_PATH}companies.csv')
print(f"✅ Companies (base): {len(companies_base):,} rows")

company_industries = pd.read_csv(f'{Config.CSV_PATH}company_industries.csv')
print(f"✅ Company industries: {len(company_industries):,} rows")

company_specialties = pd.read_csv(f'{Config.CSV_PATH}company_specialities.csv')
print(f"✅ Company specialties: {len(company_specialties):,} rows")

employee_counts = pd.read_csv(f'{Config.CSV_PATH}employee_counts.csv')
print(f"✅ Employee counts: {len(employee_counts):,} rows")

# on_bad_lines='skip' silently drops malformed rows from postings.csv
postings = pd.read_csv(f'{Config.CSV_PATH}postings.csv', on_bad_lines='skip', engine='python')
print(f"✅ Postings: {len(postings):,} rows × {len(postings.columns)} columns")

# Optional datasets: only a missing file is tolerated; other errors still surface
try:
    job_skills = pd.read_csv(f'{Config.CSV_PATH}job_skills.csv')
    print(f"✅ Job skills: {len(job_skills):,} rows")
except FileNotFoundError:
    job_skills = None
    print("⚠️  Job skills not found (optional)")

try:
    job_industries = pd.read_csv(f'{Config.CSV_PATH}job_industries.csv')
    print(f"✅ Job industries: {len(job_industries):,} rows")
except FileNotFoundError:
    job_industries = None
    print("⚠️  Job industries not found (optional)")

print("\n" + "=" * 70)
print("✅ All datasets loaded successfully!\n")

📂 Loading all datasets...

✅ Candidates: 9,544 rows × 35 columns
✅ Companies (base): 24,473 rows
✅ Company industries: 24,375 rows
✅ Company specialties: 169,387 rows
✅ Employee counts: 35,787 rows
✅ Postings: 123,849 rows × 31 columns
✅ Job skills: 213,768 rows
✅ Job industries: 164,808 rows

✅ All datasets loaded successfully!
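
The two optional loads above follow the same pattern, so they could share one helper. The sketch below is only an illustration: `load_optional_csv` and its `label` parameter are hypothetical names, not part of the notebook.

```python
from pathlib import Path
from typing import Optional

import pandas as pd


def load_optional_csv(path: str, label: str) -> Optional[pd.DataFrame]:
    """Read an optional CSV; return None (with a notice) only when the file is absent."""
    if not Path(path).is_file():
        print(f"{label} not found (optional)")
        return None
    # Parsing problems are deliberately not caught here, so they stay visible.
    frame = pd.read_csv(path)
    print(f"{label}: {len(frame):,} rows")
    return frame
```

With this helper, a call such as `job_skills = load_optional_csv(f'{Config.CSV_PATH}job_skills.csv', 'Job skills')` would replace one `try`/`except` block.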



---
## 📊 Step 5: Merge & Enrich Company Data

In [5]:
print("🔗 Merging company data...\n")

# Aggregate industries
company_industries_agg = company_industries.groupby('company_id')['industry'].apply(
    lambda x: ', '.join(map(str, x.tolist()))
).reset_index()
company_industries_agg.columns = ['company_id', 'industries_list']
print(f"✅ Aggregated industries for {len(company_industries_agg):,} companies")

# Aggregate specialties
company_specialties_agg = company_specialties.groupby('company_id')['speciality'].apply(
    lambda x: ' | '.join(x.astype(str).tolist())
).reset_index()
company_specialties_agg.columns = ['company_id', 'specialties_list']
print(f"✅ Aggregated specialties for {len(company_specialties_agg):,} companies")

# Merge all company data
companies_merged = companies_base.copy()
companies_merged = companies_merged.merge(company_industries_agg, on='company_id', how='left')
companies_merged = companies_merged.merge(company_specialties_agg, on='company_id', how='left')
# Note: employee_counts has several time-stamped rows per company, so this merge
# yields one row per snapshot (35,787 rows) rather than one per company (24,473).
companies_merged = companies_merged.merge(employee_counts, on='company_id', how='left')

print(f"\n✅ Base company merge complete: {len(companies_merged):,} companies\n")

🔗 Merging company data...

✅ Aggregated industries for 24,365 companies
✅ Aggregated specialties for 17,780 companies

✅ Base company merge complete: 35,787 companies
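
The merged frame has 35,787 rows for 24,473 companies because `employee_counts` holds several snapshots per company. One way to avoid the row multiplication is to reduce `employee_counts` to the latest snapshot before merging. The sketch below is an illustration on toy values; `latest_snapshot` is a hypothetical helper, and it assumes a larger `time_recorded` means a more recent record.

```python
import pandas as pd


def latest_snapshot(counts: pd.DataFrame) -> pd.DataFrame:
    """Keep only the most recent row per company_id, judged by time_recorded."""
    return (
        counts.sort_values("time_recorded")
        .drop_duplicates(subset="company_id", keep="last")
        .reset_index(drop=True)
    )


# Toy data shaped like employee_counts.csv (values are illustrative only)
toy_counts = pd.DataFrame({
    "company_id": [1009, 1009, 1016],
    "employee_count": [314102, 311223, 56873],
    "time_recorded": [1712378162, 1713501255, 1712382540],
})
toy_latest = latest_snapshot(toy_counts)
print(toy_latest)
```

Merging the reduced frame keeps one row per company, so later steps would not need to deduplicate `companies_full`.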



---
## 📊 Step 6: Enrich with Job Postings

In [6]:
print("🌉 Enriching companies with job posting data...\n")
print("=" * 70)
print("KEY INSIGHT: Postings = 'Requirements Language Bridge'")
print("=" * 70 + "\n")

postings = postings.fillna('')
postings['company_id'] = postings['company_id'].astype(str)

# Aggregate postings per company
postings_agg = postings.groupby('company_id').agg({
    'title': lambda x: ' | '.join(x.astype(str).tolist()[:10]),
    'description': lambda x: ' '.join(x.astype(str).tolist()[:5]),
    'skills_desc': lambda x: ' | '.join(x.dropna().astype(str).tolist()),
    'formatted_experience_level': lambda x: ' | '.join(x.dropna().unique().astype(str)),
}).reset_index()

postings_agg.columns = ['company_id', 'posted_job_titles', 'posted_descriptions', 'required_skills', 'experience_levels']

companies_merged['company_id'] = companies_merged['company_id'].astype(str)
companies_full = companies_merged.merge(postings_agg, on='company_id', how='left').fillna('')

print(f"✅ Enriched {len(companies_full):,} companies with posting data\n")

🌉 Enriching companies with job posting data...

KEY INSIGHT: Postings = 'Requirements Language Bridge'

✅ Enriched 35,787 companies with posting data



In [7]:
companies_full.head()

Unnamed: 0,company_id,name,description,company_size,state,country,city,zip_code,address,url,industries_list,specialties_list,employee_count,follower_count,time_recorded,posted_job_titles,posted_descriptions,required_skills,experience_levels
0,1009,IBM,"At IBM, we do more than work. We create. We cr...",7.0,NY,US,"Armonk, New York",10504,International Business Machines Corp.,https://www.linkedin.com/company/ibm,IT Services and IT Consulting,Cloud | Mobile | Cognitive | Security | Resear...,314102,16253625,1712378162,,,,
1,1009,IBM,"At IBM, we do more than work. We create. We cr...",7.0,NY,US,"Armonk, New York",10504,International Business Machines Corp.,https://www.linkedin.com/company/ibm,IT Services and IT Consulting,Cloud | Mobile | Cognitive | Security | Resear...,313142,16309464,1713392385,,,,
2,1009,IBM,"At IBM, we do more than work. We create. We cr...",7.0,NY,US,"Armonk, New York",10504,International Business Machines Corp.,https://www.linkedin.com/company/ibm,IT Services and IT Consulting,Cloud | Mobile | Cognitive | Security | Resear...,313147,16309985,1713402495,,,,
3,1009,IBM,"At IBM, we do more than work. We create. We cr...",7.0,NY,US,"Armonk, New York",10504,International Business Machines Corp.,https://www.linkedin.com/company/ibm,IT Services and IT Consulting,Cloud | Mobile | Cognitive | Security | Resear...,311223,16314846,1713501255,,,,
4,1016,GE HealthCare,Every day millions of people feel the impact o...,7.0,0,US,Chicago,0,-,https://www.linkedin.com/company/gehealthcare,Hospitals and Health Care,Healthcare | Biotechnology,56873,2185368,1712382540,,,,
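
All four posting columns in the preview above are empty, including the IBM rows. One possible cause, which is an assumption and not verified against the data, is a key-format mismatch: if pandas reads `company_id` in postings.csv as a float because of missing values, `astype(str)` gives `'1009.0'`, which never equals the `'1009'` produced from the integer IDs on the company side. The sketch below shows one way to normalize both keys before merging. `normalize_id` and the toy frames are hypothetical.

```python
import pandas as pd


def normalize_id(series: pd.Series) -> pd.Series:
    """Map 1009, 1009.0 or '1009.0' to the string '1009'; missing values become ''."""
    numeric = pd.to_numeric(series, errors="coerce")
    return numeric.astype("Int64").astype("string").fillna("")


# Toy frames: integer IDs on one side, float IDs with a gap on the other
toy_companies = pd.DataFrame({"company_id": [1009, 1016], "name": ["IBM", "GE HealthCare"]})
toy_postings = pd.DataFrame({"company_id": [1009.0, None], "title": ["Data Engineer", "Unlisted"]})

toy_companies["company_id"] = normalize_id(toy_companies["company_id"])
toy_postings["company_id"] = normalize_id(toy_postings["company_id"])

# Drop postings without a company before joining so '' never acts as a key
toy_merged = toy_companies.merge(
    toy_postings[toy_postings["company_id"] != ""], on="company_id", how="left"
)
print(toy_merged)
```

Checking `companies_full['posted_job_titles'].ne('').sum()` after the real merge would show how many companies actually received posting data.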


In [8]:
## 🔍 Data Quality Check - Duplicate Detection

"""
Checking for duplicates in all datasets based on primary keys.
This cell only REPORTS duplicates, does not modify data.
"""

print("=" * 80)
print("🔍 DUPLICATE DETECTION REPORT")
print("=" * 80)
print()

# One (name, total, unique, duplicates) tuple per dataset
duplicate_report = []

def report_duplicates(name, title, key_label, total, unique, details=None):
    """Print the duplicate summary for one dataset and record it in duplicate_report."""
    dups = total - unique
    print(f"┌─ 📊 {title}")
    print(f"│  Primary Key: {key_label}")
    print(f"│  Total rows:     {total:,}")
    print(f"│  Unique rows:    {unique:,}")
    print(f"│  Duplicates:     {dups:,}")
    print(f"│  Status:         {'✅ CLEAN' if dups == 0 else '🔴 HAS DUPLICATES'}")
    if dups > 0 and details is not None:
        details()  # optional listing of the most frequent duplicate keys
    print("└─\n")
    duplicate_report.append((name, total, unique, dups))

def top_duplicate_ids(df, n):
    """Return the n most frequent duplicated company_id values with their counts."""
    return df[df.duplicated('company_id', keep=False)]['company_id'].value_counts().head(n)

def show_base_duplicates():
    print("│  Top duplicates:")
    for cid, count in top_duplicate_ids(companies_base, 3).items():
        print(f"│    - company_id={cid}: {count} times")

def show_full_duplicates():
    print("│")
    print("│  Top duplicate company_ids:")
    for cid, count in top_duplicate_ids(companies_full, 5).items():
        comp_name = companies_full[companies_full['company_id'] == cid]['name'].iloc[0]
        print(f"│    - {cid} ({comp_name}): {count} times")

# 1. Candidates
cand_unique = candidates['Resume_ID'].nunique() if 'Resume_ID' in candidates.columns else len(candidates)
report_duplicates('Candidates', 'resume_data.csv (Candidates)', 'Resume_ID',
                  len(candidates), cand_unique)

# 2. Companies Base
report_duplicates('Companies Base', 'companies.csv (Companies Base)', 'company_id',
                  len(companies_base), companies_base['company_id'].nunique(),
                  show_base_duplicates)

# 3. Company Industries
report_duplicates('Company Industries', 'company_industries.csv', 'company_id + industry',
                  len(company_industries),
                  len(company_industries.drop_duplicates(subset=['company_id', 'industry'])))

# 4. Company Specialties
report_duplicates('Company Specialties', 'company_specialities.csv', 'company_id + speciality',
                  len(company_specialties),
                  len(company_specialties.drop_duplicates(subset=['company_id', 'speciality'])))

# 5. Employee Counts
report_duplicates('Employee Counts', 'employee_counts.csv', 'company_id',
                  len(employee_counts), employee_counts['company_id'].nunique())

# 6. Postings (fall back to whole-row comparison when job_id is absent)
post_unique = postings['job_id'].nunique() if 'job_id' in postings.columns else len(postings.drop_duplicates())
report_duplicates('Postings', 'postings.csv (Job Postings)', 'job_id',
                  len(postings), post_unique)

# 7. Companies Full (After Merge)
report_duplicates('Companies Full', 'companies_full (After Enrichment)', 'company_id',
                  len(companies_full), companies_full['company_id'].nunique(),
                  show_full_duplicates)

# Summary
print("=" * 80)
print("📊 SUMMARY")
print("=" * 80)
print()

total_dups = sum(r[3] for r in duplicate_report)
clean_datasets = sum(1 for r in duplicate_report if r[3] == 0)
dirty_datasets = len(duplicate_report) - clean_datasets

print(f"✅ Clean datasets:          {clean_datasets}/{len(duplicate_report)}")
print(f"🔴 Datasets with duplicates: {dirty_datasets}/{len(duplicate_report)}")
print(f"🗑️  Total duplicates found:  {total_dups:,} rows")
print()

if dirty_datasets > 0:
    print("⚠️  DUPLICATES DETECTED!")
else:
    print("✅ All datasets are clean! No duplicates found.")

print("=" * 80)

🔍 DUPLICATE DETECTION REPORT

┌─ 📊 resume_data.csv (Candidates)
│  Primary Key: Resume_ID
│  Total rows:     9,544
│  Unique rows:    9,544
│  Duplicates:     0
│  Status:         ✅ CLEAN
└─

┌─ 📊 companies.csv (Companies Base)
│  Primary Key: company_id
│  Total rows:     24,473
│  Unique rows:    24,473
│  Duplicates:     0
│  Status:         ✅ CLEAN
└─

┌─ 📊 company_industries.csv
│  Primary Key: company_id + industry
│  Total rows:     24,375
│  Unique rows:    24,375
│  Duplicates:     0
│  Status:         ✅ CLEAN
└─

┌─ 📊 company_specialities.csv
│  Primary Key: company_id + speciality
│  Total rows:     169,387
│  Unique rows:    169,387
│  Duplicates:     0
│  Status:         ✅ CLEAN
└─

┌─ 📊 employee_counts.csv
│  Primary Key: company_id
│  Total rows:     35,787
│  Unique rows:    24,473
│  Duplicates:     11,314
│  Status:         🔴 HAS DUPLICATES
└─

┌─

In [9]:
"""
## 🧹 Data Cleaning - Remove Duplicates

Based on the report above, removing duplicates from datasets.
"""

print("🧹 CLEANING DUPLICATES...\n")
print("=" * 80)

# name -> (rows before, rows after) for every dataset that had duplicates
cleaning_log = {}

def remove_duplicates(name, df, keys):
    """Drop rows that repeat the key columns (first occurrence kept) and log the change."""
    cleaned = df.drop_duplicates(subset=keys, keep='first')
    removed = len(df) - len(cleaned)
    if removed > 0:
        cleaning_log[name] = (len(df), len(cleaned))
        print(f"✅ {name}:")
        print(f"   Removed {removed:,} duplicates")
        print(f"   {len(df):,} → {len(cleaned):,} rows\n")
    else:
        print(f"✅ {name}: Already clean\n")
    return cleaned

# 1-4. Company tables
companies_base = remove_duplicates('companies_base', companies_base, ['company_id'])
company_industries = remove_duplicates('company_industries', company_industries, ['company_id', 'industry'])
company_specialties = remove_duplicates('company_specialties', company_specialties, ['company_id', 'speciality'])
employee_counts = remove_duplicates('employee_counts', employee_counts, ['company_id'])

# 5. Postings (only when the key column exists)
if 'job_id' in postings.columns:
    postings = remove_duplicates('postings', postings, ['job_id'])

# 6. Companies Full (index labels are kept, so positions no longer equal labels)
companies_full = remove_duplicates('companies_full', companies_full, ['company_id'])

print("=" * 80)
print("✅ DATA CLEANING COMPLETE!")
print("=" * 80)
print()

# Summary
if cleaning_log:
    total_removed = sum(before - after for before, after in cleaning_log.values())
    print(f"📊 Total duplicates removed: {total_removed:,} rows")
    print()
    print("Cleaned datasets:")
    for dataset, (before, after) in cleaning_log.items():
        print(f"  - {dataset}: {before:,} → {after:,}")
else:
    print("✅ No duplicates found - all datasets were already clean!")

🧹 CLEANING DUPLICATES...

✅ companies_base: Already clean

✅ company_industries: Already clean

✅ company_specialties: Already clean

✅ employee_counts:
   Removed 11,314 duplicates
   35,787 → 24,473 rows

✅ postings: Already clean

✅ companies_full:
   Removed 11,314 duplicates
   35,787 → 24,473 rows

✅ DATA CLEANING COMPLETE!

📊 Total duplicates removed: 22,628 rows

Cleaned datasets:
  - employee_counts: 35,787 → 24,473
  - companies_full: 35,787 → 24,473


---
## 📊 Step 7: Load Embedding Model & Pre-computed Vectors

In [10]:
print("🧠 Loading embedding model...\n")
model = SentenceTransformer(Config.EMBEDDING_MODEL)
embedding_dim = model.get_sentence_embedding_dimension()
print(f"✅ Model loaded: {Config.EMBEDDING_MODEL}")
print(f"📐 Embedding dimension: ℝ^{embedding_dim}\n")

print("📂 Loading pre-computed embeddings...")

try:
    # Try to load from processed folder
    cand_vectors = np.load(f'{Config.PROCESSED_PATH}candidate_embeddings.npy')
    comp_vectors = np.load(f'{Config.PROCESSED_PATH}company_embeddings.npy')
    
    print(f"✅ Loaded from {Config.PROCESSED_PATH}")
    print(f"📊 Candidate vectors: {cand_vectors.shape}")
    print(f"📊 Company vectors: {comp_vectors.shape}\n")
    
except FileNotFoundError:
    print("⚠️  Pre-computed embeddings not found!")
    print("   Embeddings will need to be generated (takes ~5-10 minutes)")
    print("   This is normal if running for the first time.\n")
    
    # You can add embedding generation code here if needed
    # For now, we'll skip to keep notebook clean
    cand_vectors = None
    comp_vectors = None

🧠 Loading embedding model...

✅ Model loaded: all-MiniLM-L6-v2
📐 Embedding dimension: ℝ^384

📂 Loading pre-computed embeddings...
✅ Loaded from ../processed/
📊 Candidate vectors: (9544, 384)
📊 Company vectors: (35787, 384)
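
The `FileNotFoundError` branch above leaves the vectors as `None`. A generate-and-cache step could fill that gap. The sketch below is a minimal illustration, not the thesis pipeline: `embed_and_cache` is a hypothetical helper that takes any encoder callable, so it can be tried without downloading a model, and it assumes `texts` is non-empty and `cache_path` ends in `.npy`.

```python
from pathlib import Path
from typing import Callable, Sequence

import numpy as np


def embed_and_cache(
    texts: Sequence[str],
    encode_fn: Callable[[list], np.ndarray],
    cache_path: str,
    batch_size: int = 64,
) -> np.ndarray:
    """Return cached embeddings when they match len(texts); otherwise encode and save."""
    cache = Path(cache_path)
    if cache.is_file():
        vectors = np.load(cache)
        if len(vectors) == len(texts):
            return vectors  # cache is usable: one vector per text
    # Encode in batches and stack into one (n_texts, dim) array
    batches = [
        np.asarray(encode_fn(list(texts[i:i + batch_size])))
        for i in range(0, len(texts), batch_size)
    ]
    vectors = np.vstack(batches)
    cache.parent.mkdir(parents=True, exist_ok=True)
    np.save(cache, vectors)
    return vectors
```

In the notebook, `encode_fn` could be `model.encode`. The length check also rejects a cache that no longer has one row per text, for example after deduplication.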



---
## 📊 Step 8: Core Matching Function

In [11]:
# ============================================================================
# CORE MATCHING FUNCTION (SAFE VERSION)
# ============================================================================

def find_top_matches(candidate_idx: int, top_k: int = 10) -> list:
    """
    Find top K company matches for a candidate.
    
    SAFE VERSION: never returns a company index beyond the end of companies_full,
    even when there are more company embeddings than dataset rows.
    Truncation only bounds the indices. It does not re-pair each embedding with
    its row if rows were removed after the embeddings were computed.
    
    Args:
        candidate_idx: Index of candidate in candidates DataFrame
        top_k: Number of top matches to return
    
    Returns:
        List of tuples: [(company_idx, similarity_score), ...]
    """
    
    # Validate candidate index (negative indices are rejected as well)
    if not 0 <= candidate_idx < len(cand_vectors):
        print(f"❌ Candidate index {candidate_idx} out of range")
        return []
    
    # Get candidate vector
    cand_vec = cand_vectors[candidate_idx].reshape(1, -1)
    
    # Calculate similarities with all company vectors
    similarities = cosine_similarity(cand_vec, comp_vectors)[0]
    
    # Only use indices that exist in companies_full
    max_valid_idx = len(companies_full) - 1
    
    # Truncate similarities to valid range
    valid_similarities = similarities[:max_valid_idx + 1]
    
    # Get top K indices from valid range, highest similarity first
    top_indices = np.argsort(valid_similarities)[::-1][:top_k]
    
    # Return (index, score) tuples
    results = [(int(idx), float(valid_similarities[idx])) for idx in top_indices]
    
    return results

# Test function and show diagnostics
print("✅ Safe matching function loaded!")
print(f"\n📊 DIAGNOSTICS:")
print(f"   Candidate vectors: {len(cand_vectors):,}")
print(f"   Company vectors: {len(comp_vectors):,}")
print(f"   Companies dataset: {len(companies_full):,}")

if len(comp_vectors) > len(companies_full):
    print(f"\n⚠️  INDEX MISMATCH DETECTED!")
    print(f"   Embeddings: {len(comp_vectors):,}")
    print(f"   Dataset: {len(companies_full):,}")
    print(f"   Missing rows: {len(comp_vectors) - len(companies_full):,}")
    print(f"\n💡 CAUSE: Embeddings generated BEFORE deduplication")
    print(f"\n🎯 SOLUTIONS:")
    print(f"   A. Safe functions active (current) ✅")
    print(f"   B. Regenerate embeddings after dedup")
    print(f"   C. Run collaborative filtering step")
else:
    print(f"\n✅ Embeddings and dataset are aligned!")

✅ Safe matching function loaded!

📊 DIAGNOSTICS:
   Candidate vectors: 9,544
   Company vectors: 35,787
   Companies dataset: 24,473

⚠️  INDEX MISMATCH DETECTED!
   Embeddings: 35,787
   Dataset: 24,473
   Missing rows: 11,314

💡 CAUSE: Embeddings generated BEFORE deduplication

🎯 SOLUTIONS:
   A. Safe functions active (current) ✅
   B. Regenerate embeddings after dedup
   C. Run collaborative filtering step
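
Truncating the similarity array keeps every index in range, but it may pair companies with the wrong vectors. If embedding row `i` was computed from row `i` of the 35,787-row frame, then position `i` of the deduplicated frame can be a different company. `drop_duplicates` keeps the original index labels, so those labels can select the matching embedding rows. The sketch below rests on two assumptions: the embeddings follow the pre-deduplication row order, and the index has not been reset. `align_vectors` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def align_vectors(vectors: np.ndarray, deduped: pd.DataFrame) -> np.ndarray:
    """Select the embedding rows that belong to the rows surviving deduplication.

    Assumes vectors[i] was computed from index label i of the frame before
    drop_duplicates, and that `deduped` still carries those original labels.
    """
    labels = deduped.index.to_numpy()
    if len(labels) and labels.max() >= len(vectors):
        raise ValueError("index labels exceed the number of embedding rows")
    return vectors[labels]


# Toy example: 5 rows and 5 one-dimensional "embeddings" (row i -> value i)
toy_frame = pd.DataFrame({"company_id": [1, 1, 2, 2, 3]})
toy_vectors = np.arange(5, dtype=float).reshape(-1, 1)

toy_deduped = toy_frame.drop_duplicates("company_id")   # keeps labels 0, 2, 4
toy_aligned = align_vectors(toy_vectors, toy_deduped)   # rows 0, 2, 4
toy_truncated = toy_vectors[: len(toy_deduped)]         # rows 0, 1, 2
```

In the toy data, truncation hands company 2 the vector of a duplicate row of company 1, while the label-based selection keeps each company with its own vector. If either assumption does not hold, regenerating the embeddings after deduplication (option B) is the safer route.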


---
## 📊 Step 9: Initialize FREE LLM (Hugging Face)

### Get your FREE token: https://huggingface.co/settings/tokens

In [12]:
# Initialize Hugging Face Inference Client (FREE)
if Config.HF_TOKEN:
    try:
        hf_client = InferenceClient(token=Config.HF_TOKEN)
        print("✅ Hugging Face client initialized (FREE)")
        print(f"🤖 Model: {Config.LLM_MODEL}")
        print("💰 Cost: $0.00 (completely free!)\n")
        LLM_AVAILABLE = True
    except Exception as e:
        print(f"⚠️  Failed to initialize HF client: {e}")
        LLM_AVAILABLE = False
        hf_client = None
else:
    print("⚠️  No Hugging Face token configured")
    print("   LLM features will be disabled")
    print("\n📝 To enable:")
    print("   1. Go to: https://huggingface.co/settings/tokens")
    print("   2. Create a token (free)")
    print("   3. Add HF_TOKEN=your-token-here to your .env file and re-run\n")
    LLM_AVAILABLE = False
    hf_client = None

def call_llm(prompt: str, max_tokens: int = Config.LLM_MAX_TOKENS) -> str:
    """
    Generic LLM call using Hugging Face Inference API (FREE).
    
    Returns the model's reply, or a bracketed message such as "[Error: ...]"
    when the call fails or no token is configured.
    """
    if not LLM_AVAILABLE:
        return "[LLM not available - check .env file for HF_TOKEN]"
    
    try:
        response = hf_client.chat_completion(  # ✅ chat_completion
            messages=[{"role": "user", "content": prompt}],
            model=Config.LLM_MODEL,
            max_tokens=max_tokens,
            temperature=0.7
        )
        return response.choices[0].message.content  # ✅ Extract content
    except Exception as e:
        return f"[Error: {str(e)}]"

print("✅ LLM helper functions ready")

✅ Hugging Face client initialized (FREE)
🤖 Model: meta-llama/Llama-3.2-3B-Instruct
💰 Cost: $0.00 (completely free!)

✅ LLM helper functions ready
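
Because `call_llm` reports failures as a string starting with `[Error:` and does not raise, a caller can retry transient failures such as rate limits by checking that prefix. The sketch below is a minimal illustration. `call_with_retry`, its defaults, and the injectable `sleep` parameter are assumptions, not part of the notebook.

```python
import time
from typing import Callable


def call_with_retry(
    fn: Callable[[], str],
    retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fn until it returns a non-error string or the attempts run out.

    Waits base_delay, 2*base_delay, 4*base_delay, ... between attempts.
    """
    result = ""
    for attempt in range(retries):
        result = fn()
        if not result.startswith("[Error:"):
            return result
        if attempt < retries - 1:
            sleep(base_delay * 2 ** attempt)  # exponential backoff
    return result  # last error message, so the caller can still inspect it
```

A call would look like `call_with_retry(lambda: call_llm(prompt))`. Passing a stub for `sleep` keeps tests fast.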


---
## 📊 Step 10: Pydantic Schemas for Structured Output

In [13]:
class JobLevelClassification(BaseModel):
    """Job level classification result"""
    level: Literal['Entry', 'Mid', 'Senior', 'Executive']
    confidence: float = Field(ge=0.0, le=1.0)
    reasoning: str

class SkillsTaxonomy(BaseModel):
    """Structured skills extraction"""
    technical_skills: List[str] = Field(default_factory=list)
    soft_skills: List[str] = Field(default_factory=list)
    certifications: List[str] = Field(default_factory=list)
    languages: List[str] = Field(default_factory=list)

class MatchExplanation(BaseModel):
    """Match reasoning"""
    overall_score: float = Field(ge=0.0, le=1.0)
    match_strengths: List[str]
    skill_gaps: List[str]
    recommendation: str
    fit_summary: str = Field(max_length=200)

print("✅ Pydantic schemas defined")

✅ Pydantic schemas defined
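
The schemas only help if parsed LLM output is actually passed through them. The sketch below shows how a parsed dictionary could be checked against `JobLevelClassification`. The class is repeated so the block runs on its own, `validate_classification` is a hypothetical helper, and Pydantic v2 is assumed (`model_validate`).

```python
from typing import Literal, Optional

from pydantic import BaseModel, Field, ValidationError


class JobLevelClassification(BaseModel):
    """Job level classification result (same fields as the schema above)."""
    level: Literal['Entry', 'Mid', 'Senior', 'Executive']
    confidence: float = Field(ge=0.0, le=1.0)
    reasoning: str


def validate_classification(data: dict) -> Optional[JobLevelClassification]:
    """Return a validated model, or None when the data violates the schema."""
    try:
        return JobLevelClassification.model_validate(data)
    except ValidationError:
        return None


ok = validate_classification({"level": "Entry", "confidence": 0.85, "reasoning": "Junior role"})
bad_level = validate_classification({"level": "Unknown", "confidence": 0.0, "reasoning": "n/a"})
bad_conf = validate_classification({"level": "Mid", "confidence": 1.5, "reasoning": "n/a"})
```

A level of `"Unknown"` is rejected because it is not one of the four allowed literals, so parse failures need either their own representation or an extra literal in the schema.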


---
## 📊 Step 11: Job Level Classification (Zero-Shot)

In [14]:
def classify_job_level_zero_shot(job_description: str) -> Dict:
    """
    Zero-shot job level classification.
    
    Returns classification as: Entry, Mid, Senior, or Executive
    """
    
    prompt = f"""Classify this job posting into ONE seniority level.

Levels:
- Entry: 0-2 years experience, junior roles
- Mid: 3-5 years experience, independent work
- Senior: 6-10 years experience, technical leadership
- Executive: 10+ years, strategic leadership, C-level

Job Posting:
{job_description[:500]}

Return ONLY valid JSON:
{{
    "level": "Entry|Mid|Senior|Executive",
    "confidence": 0.85,
    "reasoning": "Brief explanation"
}}
"""
    
    response = call_llm(prompt)
    
    try:
        # Extract JSON
        json_str = response.strip()
        if '```json' in json_str:
            json_str = json_str.split('```json')[1].split('```')[0].strip()
        elif '```' in json_str:
            json_str = json_str.split('```')[1].split('```')[0].strip()
        
        # Find JSON in response
        if '{' in json_str and '}' in json_str:
            start = json_str.index('{')
            end = json_str.rindex('}') + 1
            json_str = json_str[start:end]
        
        result = json.loads(json_str)
        return result
    except Exception:  # a bare except would also swallow KeyboardInterrupt
        return {
            "level": "Unknown",
            "confidence": 0.0,
            "reasoning": "Failed to parse response"
        }

# Test if LLM available and data loaded
if LLM_AVAILABLE and len(postings) > 0:
    print("🧪 Testing zero-shot classification...\n")
    sample = postings.iloc[0]['description']
    result = classify_job_level_zero_shot(sample)
    
    print("📊 Classification Result:")
    print(json.dumps(result, indent=2))
else:
    print("⚠️  Skipped - LLM not available or no data")

🧪 Testing zero-shot classification...

📊 Classification Result:
{
  "level": "Entry",
  "confidence": 0.85,
  "reasoning": "The job posting requires a Marketing Coordinator with some experience in graphic design, indicating a junior role with limited technical leadership responsibilities."
}
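
The classification and extraction functions in this notebook each repeat the same steps to pull a JSON object out of the model's reply. One shared helper could cover them. The sketch below uses only the standard library. `extract_json_object` is a hypothetical name, and it assumes the reply contains at most one top-level JSON object.

```python
import json
from typing import Optional


def extract_json_object(text: str) -> Optional[dict]:
    """Parse the span from the first '{' to the last '}' as a JSON object.

    Slicing between the outer braces also skips Markdown code fences and any
    prose around the object, so no separate fence handling is needed.
    """
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None
```

Returning `None` leaves the choice of fallback, such as the `"Unknown"` dictionary used above, to the caller.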


---
## 📊 Step 12: Few-Shot Learning

In [15]:
def classify_job_level_few_shot(job_description: str) -> Dict:
    """
    Few-shot classification with examples.
    """
    
    prompt = f"""Classify this job posting using examples.

EXAMPLES:

Example 1 (Entry):
"Recent graduate wanted. Python basics. Mentorship provided."
→ Entry level (learning focus, 0-2 years)

Example 2 (Senior):
"5+ years backend. Lead team of 3. System architecture."
→ Senior level (technical leadership, 6-10 years)

Example 3 (Executive):
"CTO position. 15+ years. Define technical strategy."
→ Executive level (C-level, strategic)

NOW CLASSIFY:
{job_description[:500]}

Return JSON:
{{
    "level": "Entry|Mid|Senior|Executive",
    "confidence": <number from 0.0 to 1.0>,
    "reasoning": "Explain"
}}
"""
    
    response = call_llm(prompt)
    
    try:
        json_str = response.strip()
        if '```json' in json_str:
            json_str = json_str.split('```json')[1].split('```')[0].strip()
        
        if '{' in json_str and '}' in json_str:
            start = json_str.index('{')
            end = json_str.rindex('}') + 1
            json_str = json_str[start:end]
        
        result = json.loads(json_str)
        return result
    except Exception:  # a bare except would also swallow KeyboardInterrupt
        return {"level": "Unknown", "confidence": 0.0, "reasoning": "Parse error"}

# Compare zero-shot vs few-shot
if LLM_AVAILABLE and len(postings) > 0:
    print("🧪 Comparing Zero-Shot vs Few-Shot...\n")
    sample = postings.iloc[0]['description']
    
    zero = classify_job_level_zero_shot(sample)
    few = classify_job_level_few_shot(sample)
    
    print("📊 Comparison:")
    print(f"Zero-shot: {zero['level']} (confidence: {zero['confidence']:.2f})")
    print(f"Few-shot:  {few['level']} (confidence: {few['confidence']:.2f})")
else:
    print("⚠️  Skipped")

🧪 Comparing Zero-Shot vs Few-Shot...



KeyboardInterrupt: 
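
The run above was interrupted before printing a result, and a single posting would say little about how the two prompting styles differ anyway. A comparison over a sample of postings could be summarized with an agreement rate and a count of disagreeing label pairs. The sketch below uses hypothetical helper names and made-up labels purely to show the calculation.

```python
from collections import Counter
from typing import Sequence


def agreement_rate(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Share of positions where both label sequences agree (0.0 for empty input)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have the same length")
    if not labels_a:
        return 0.0
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)


def disagreements(labels_a: Sequence[str], labels_b: Sequence[str]) -> Counter:
    """Count each (label_a, label_b) pair where the two sequences differ."""
    return Counter((a, b) for a, b in zip(labels_a, labels_b) if a != b)


# Made-up labels for four postings (not real results)
toy_zero = ["Entry", "Mid", "Senior", "Mid"]
toy_few = ["Entry", "Senior", "Senior", "Mid"]

print(agreement_rate(toy_zero, toy_few))
print(disagreements(toy_zero, toy_few))
```

The two label lists would come from calling `classify_job_level_zero_shot` and `classify_job_level_few_shot` on the same sample of descriptions.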

---
## 📊 Step 13: Structured Skills Extraction

In [None]:
def extract_skills_taxonomy(job_description: str) -> Dict:
    """
    Extract structured skills using LLM + Pydantic validation.
    """
    
    prompt = f"""Extract skills from this job posting.

Job Posting:
{job_description[:800]}

Return ONLY valid JSON:
{{
    "technical_skills": ["Python", "Docker", "AWS"],
    "soft_skills": ["Communication", "Leadership"],
    "certifications": ["AWS Certified"],
    "languages": ["English", "Danish"]
}}
"""
    
    response = call_llm(prompt, max_tokens=800)
    
    try:
        json_str = response.strip()
        if '```json' in json_str:
            json_str = json_str.split('```json')[1].split('```')[0].strip()
        
        if '{' in json_str and '}' in json_str:
            start = json_str.index('{')
            end = json_str.rindex('}') + 1
            json_str = json_str[start:end]
        
        data = json.loads(json_str)
        # Validate with Pydantic
        validated = SkillsTaxonomy(**data)
        return validated.model_dump()
    except Exception:  # invalid JSON or failed Pydantic validation
        return {
            "technical_skills": [],
            "soft_skills": [],
            "certifications": [],
            "languages": []
        }

# Test extraction
if LLM_AVAILABLE and len(postings) > 0:
    print("üîç Testing skills extraction...\n")
    sample = postings.iloc[0]['description']
    skills = extract_skills_taxonomy(sample)
    
    print("üìä Extracted Skills:")
    print(json.dumps(skills, indent=2))
else:
    print("‚ö†Ô∏è  Skipped")

üîç Testing skills extraction...

üìä Extracted Skills:
{
  "technical_skills": [
    "Adobe Creative Cloud",
    "Microsoft Office Suite"
  ],
  "soft_skills": [
    "Communication",
    "Leadership",
    "Organization",
    "Problem-solving",
    "Time management",
    "Positive attitude",
    "Respect",
    "Responsibility",
    "Proactivity"
  ],
  "certifications": [
    "AWS Certified"
  ],
  "languages": [
    "English"
  ]
}
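
In the sample output above, `AWS Certified` is identical to the example value in the prompt template, so it may have been copied from the prompt rather than extracted from the posting. One possible guard, sketched below under that assumption, keeps an item only if it literally appears in the source text. The function names and the choice of which fields to check strictly are hypothetical.

```python
from typing import Dict, List, Sequence


def keep_grounded(items: List[str], source_text: str) -> List[str]:
    """Keep only items that literally appear in the source text (case-insensitive)."""
    haystack = source_text.lower()
    return [item for item in items if item.strip() and item.strip().lower() in haystack]


def ground_taxonomy(
    skills: Dict[str, List[str]],
    source_text: str,
    strict_fields: Sequence[str] = ("certifications", "languages"),
) -> Dict[str, List[str]]:
    """Apply the literal-match filter to `strict_fields`; copy other fields unchanged."""
    return {
        field: keep_grounded(values, source_text) if field in strict_fields else list(values)
        for field, values in skills.items()
    }
```

Literal matching is deliberately strict, which is why the sketch leaves soft skills and technical skills alone: a model may legitimately paraphrase those.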


---
## 📊 Step 14: Match Explainability

In [None]:
def explain_match(candidate_idx: int, company_idx: int, similarity_score: float) -> Dict:
    """
    Generate LLM explanation for why candidate matches company.
    """
    
    cand = candidates.iloc[candidate_idx]
    comp = companies_full.iloc[company_idx]
    
    cand_skills = str(cand.get('skills', 'N/A'))[:300]
    cand_exp = str(cand.get('positions', 'N/A'))[:300]
    comp_req = str(comp.get('required_skills', 'N/A'))[:300]
    comp_name = comp.get('name', 'Unknown')
    
    prompt = f"""Explain why this candidate matches this company.

Candidate:
Skills: {cand_skills}
Experience: {cand_exp}

Company: {comp_name}
Requirements: {comp_req}

Similarity Score: {similarity_score:.2f}

Return JSON in the shape below, replacing each description with your own content:
{{
    "overall_score": {similarity_score},
    "match_strengths": ["Top 3-5 matching factors"],
    "skill_gaps": ["Missing skills"],
    "recommendation": "What candidate should do",
    "fit_summary": "One sentence summary"
}}
"""
    
    response = call_llm(prompt, max_tokens=1000)
    
    try:
        json_str = response.strip()
        if '```json' in json_str:
            json_str = json_str.split('```json')[1].split('```')[0].strip()
        
        if '{' in json_str and '}' in json_str:
            start = json_str.index('{')
            end = json_str.rindex('}') + 1
            json_str = json_str[start:end]
        
        data = json.loads(json_str)
        return data
    except Exception:  # invalid JSON; fall back to a score-only explanation
        return {
            "overall_score": similarity_score,
            "match_strengths": ["Unable to generate"],
            "skill_gaps": [],
            "recommendation": "Review manually",
            "fit_summary": f"Match score: {similarity_score:.2f}"
        }

# Test explainability
if LLM_AVAILABLE and cand_vectors is not None and len(candidates) > 0:
    print("üí° Testing match explainability...\n")
    matches = find_top_matches(0, top_k=1)
    if matches:
        comp_idx, score = matches[0]
        explanation = explain_match(0, comp_idx, score)
        
        print("üìä Match Explanation:")
        print(json.dumps(explanation, indent=2))
else:
    print("‚ö†Ô∏è  Skipped - requirements not met")

üí° Testing match explainability...

üìä Match Explanation:
{
  "overall_score": 0.7028058171272278,
  "match_strengths": [
    "Top 3-5 matching factors"
  ],
  "skill_gaps": [
    "Missing skills"
  ],
  "recommendation": "What candidate should do",
  "fit_summary": "This candidate has a strong technical background, with skills relevant to big data and analytics, but may need to improve their skills to align with TeachTown's specific needs."
}
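
In the output above, three fields (`match_strengths`, `skill_gaps`, `recommendation`) contain the prompt's placeholder text verbatim, so only `fit_summary` carries model-written content. A minimal sketch of a check for this failure mode follows. The name `echoed_fields` is hypothetical, and the placeholder set is copied from the prompt template in `explain_match`.

```python
from typing import Any, Dict, List

# Placeholder strings copied from the prompt template in `explain_match`.
TEMPLATE_PLACEHOLDERS = {
    "Top 3-5 matching factors",
    "Missing skills",
    "What candidate should do",
    "One sentence summary",
}


def echoed_fields(explanation: Dict[str, Any]) -> List[str]:
    """Return the keys whose string content consists only of template placeholders."""
    echoed = []
    for key, value in explanation.items():
        values = value if isinstance(value, list) else [value]
        texts = [v.strip() for v in values if isinstance(v, str)]
        if texts and all(t in TEMPLATE_PLACEHOLDERS for t in texts):
            echoed.append(key)
    return echoed
```

A non-empty result could trigger a retry or the same fallback dictionary used for parse errors.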


---
## 📊 Step 16: Detailed Match Visualization

In [None]:
# ============================================================================
# üîç DETAILED MATCH EXAMPLE
# ============================================================================

def show_detailed_match_example(candidate_idx=0, top_k=5):
    print("üîç DETAILED MATCH ANALYSIS")
    print("=" * 100)
    
    if candidate_idx >= len(candidates):
        print(f"‚ùå ERROR: Candidate {candidate_idx} out of range")
        return None
    
    cand = candidates.iloc[candidate_idx]
    
    print(f"\nüéØ CANDIDATE #{candidate_idx}")
    print(f"Resume ID: {cand.get('Resume_ID', 'N/A')}")
    print(f"Category: {cand.get('Category', 'N/A')}")
    print(f"Skills: {str(cand.get('skills', 'N/A'))[:150]}...\n")
    
    matches = find_top_matches(candidate_idx, top_k=top_k)
    
    print(f"üîó TOP {len(matches)} MATCHES:\n")
    
    for rank, (comp_idx, score) in enumerate(matches, 1):
        if comp_idx >= len(companies_full):
            continue
        
        company = companies_full.iloc[comp_idx]
        print(f"#{rank}. {company.get('name', 'N/A')} (Score: {score:.4f})")
        print(f"    Industries: {str(company.get('industries_list', 'N/A'))[:60]}...")
    
    print("\n" + "=" * 100)
    return matches

# Test
show_detailed_match_example(candidate_idx=0, top_k=5)

üîç DETAILED MATCH ANALYSIS

üéØ CANDIDATE #0
Resume ID: N/A
Category: N/A
Skills: ['Big Data', 'Hadoop', 'Hive', 'Python', 'Mapreduce', 'Spark', 'Java', 'Machine Learning', 'Cloud', 'Hdfs', 'YARN', 'Core Java', 'Data Science', 'C++'...

üîó TOP 5 MATCHES:

#1. TeachTown (Score: 0.7028)
    Industries: E-Learning Providers...
#2. Wolverine Power Systems (Score: 0.7026)
    Industries: Renewable Energy Semiconductor Manufacturing...
#3. Mariner (Score: 0.7010)
    Industries: Financial Services...
#4. Primavera School (Score: 0.6827)
    Industries: Education Administration Programs...
#5. OM1, Inc. (Score: 0.6776)
    Industries: Pharmaceutical Manufacturing...



[(9418, 0.7028058171272278),
 (9417, 0.7025721669197083),
 (9416, 0.7010321021080017),
 (13786, 0.6826589107513428),
 (16864, 0.6776158213615417)]
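
`find_top_matches` is defined earlier in the notebook. Assuming it ranks companies by cosine similarity to the candidate embedding (Step 2 imports `cosine_similarity`), the ranking step can be sketched in plain Python as below. The function names are hypothetical, and a real run would use the vectorized NumPy or scikit-learn equivalent.

```python
import heapq
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_cosine(
    query: Sequence[float], rows: Sequence[Sequence[float]], k: int = 5
) -> List[Tuple[int, float]]:
    """Return (row_index, similarity) pairs for the k most similar rows, best first."""
    scored = ((idx, cosine(query, row)) for idx, row in enumerate(rows))
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])
```

The return shape matches the list of `(index, score)` tuples printed above.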

---
## 📊 Step 17: Bridging Concept Analysis

In [None]:
# ============================================================================
# üåâ BRIDGING CONCEPT ANALYSIS
# ============================================================================

def show_bridging_concept_analysis():
    print("üåâ THE BRIDGING CONCEPT")
    print("=" * 90)
    
    companies_with = companies_full[companies_full['required_skills'] != '']
    companies_without = companies_full[companies_full['required_skills'] == '']
    
    print(f"\nüìä DATA REALITY:")
    print(f"   Total companies: {len(companies_full):,}")
    print(f"   WITH postings: {len(companies_with):,} ({len(companies_with)/len(companies_full)*100:.1f}%)")
    print(f"   WITHOUT postings: {len(companies_without):,}\n")
    
    print("üéØ THE PROBLEM:")
    print("   Companies: 'We are in TECH INDUSTRY'")
    print("   Candidates: 'I know PYTHON, AWS'")
    print("   ‚Üí Different languages! üö´\n")
    
    print("üåâ THE SOLUTION (BRIDGING):")
    print("   1. Extract from postings: 'Need PYTHON developers'")
    print("   2. Enrich company profile with skills")
    print("   3. Now both speak SKILLS LANGUAGE! ‚úÖ\n")
    
    print("=" * 90)
    return companies_with, companies_without

# Test
show_bridging_concept_analysis()

üåâ THE BRIDGING CONCEPT

üìä DATA REALITY:
   Total companies: 24,473
   WITH postings: 0 (0.0%)
   WITHOUT postings: 24,473

üéØ THE PROBLEM:
   Companies: 'We are in TECH INDUSTRY'
   Candidates: 'I know PYTHON, AWS'
   ‚Üí Different languages! üö´

üåâ THE SOLUTION (BRIDGING):
   1. Extract from postings: 'Need PYTHON developers'
   2. Enrich company profile with skills
   3. Now both speak SKILLS LANGUAGE! ‚úÖ



(Empty DataFrame
 Columns: [company_id, name, description, company_size, state, country, city, zip_code, address, url, industries_list, specialties_list, employee_count, follower_count, time_recorded, posted_job_titles, posted_descriptions, required_skills, experience_levels]
 Index: [],
       company_id                               name  \
 0           1009                                IBM   
 4           1016                      GE HealthCare   
 14          1025         Hewlett Packard Enterprise   
 18          1028                             Oracle   
 23          1033                          Accenture   
 ...          ...                                ...   
 35782  103463217                       JRC Services   
 35783  103466352             Centent Consulting LLC   
 35784  103467540  Kings and Queens Productions, LLC   
 35785  103468936                           WebUnite   
 35786  103472979                            BlackVe   
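
The `!= ''` comparison used in the function above counts a missing value (`NaN` or `None`) as "with postings", because a missing value is not equal to the empty string. It also counts whitespace-only strings. A stricter emptiness test could look like the sketch below. The name `has_required_skills` is hypothetical.

```python
import math
from typing import Any


def has_required_skills(value: Any) -> bool:
    """True only for a non-blank value; None, NaN and whitespace count as empty."""
    if value is None:
        return False
    if isinstance(value, float) and math.isnan(value):
        return False
    return bool(str(value).strip())
```

Applied with `companies_full['required_skills'].apply(has_required_skills)`, this yields a boolean mask that can replace both comparisons.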
 
                                     

---
## 📊 Step 18: Export Results to CSV

In [None]:
# ============================================================================
# üíæ EXPORT MATCHES TO CSV
# ============================================================================

def export_matches_to_csv(num_candidates=100, top_k=10):
    print(f"üíæ Exporting {num_candidates} candidates (top {top_k} each)...\n")
    
    results = []
    
    # Report progress against the number of candidates actually processed
    total = min(num_candidates, len(candidates))
    for i in range(total):
        if i % 50 == 0:
            print(f"   Processing {i+1}/{total}...")
        
        matches = find_top_matches(i, top_k=top_k)
        cand = candidates.iloc[i]
        
        for rank, (comp_idx, score) in enumerate(matches, 1):
            if comp_idx >= len(companies_full):
                continue
            
            company = companies_full.iloc[comp_idx]
            
            results.append({
                'candidate_id': i,
                'candidate_category': cand.get('Category', 'N/A'),
                'company_id': company.get('company_id', 'N/A'),
                'company_name': company.get('name', 'N/A'),
                'match_rank': rank,
                'similarity_score': round(float(score), 4)
            })
    
    results_df = pd.DataFrame(results)
    os.makedirs(Config.RESULTS_PATH, exist_ok=True)  # create the output folder if it is missing
    output_file = f'{Config.RESULTS_PATH}hrhub_matches.csv'
    results_df.to_csv(output_file, index=False)
    
    print(f"\n‚úÖ Exported {len(results_df):,} matches")
    print(f"üìÑ File: {output_file}\n")
    
    return results_df

# Export sample
matches_df = export_matches_to_csv(num_candidates=50, top_k=5)

üíæ Exporting 50 candidates (top 5 each)...

   Processing 1/50...

‚úÖ Exported 250 matches
üìÑ File: ../results/hrhub_matches.csv



---
## 📊 Interactive Visualization 1: t-SNE Vector Space

Project the 384-dimensional embeddings to 2D with t-SNE (ℝ³⁸⁴ → ℝ²) to visualize candidates and companies

In [None]:
# ============================================================================
# üé® T-SNE VECTOR SPACE VISUALIZATION
# ============================================================================

from sklearn.manifold import TSNE

print("üé® VECTOR SPACE VISUALIZATION\n")
print("=" * 70)

# Sample for visualization
n_cand_viz = min(500, len(candidates))
n_comp_viz = min(2000, len(companies_full))

print(f"üìä Visualizing:")
print(f"   ‚Ä¢ {n_cand_viz} candidates")
print(f"   ‚Ä¢ {n_comp_viz} companies")
print(f"   ‚Ä¢ From ‚Ñù^384 ‚Üí ‚Ñù¬≤ (t-SNE)\n")

# Sample vectors
cand_sample = cand_vectors[:n_cand_viz]
comp_sample = comp_vectors[:n_comp_viz]
all_vectors = np.vstack([cand_sample, comp_sample])

print("üîÑ Running t-SNE (2-3 minutes)...")
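# The iteration count is left at its default (1000): the keyword was renamed
# from n_iter to max_iter in scikit-learn 1.5, so omitting it works on both sides.
tsne = TSNE(
    n_components=2,
    perplexity=30,
    random_state=42
)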

vectors_2d = tsne.fit_transform(all_vectors)
cand_2d = vectors_2d[:n_cand_viz]
comp_2d = vectors_2d[n_cand_viz:]

print("\n‚úÖ t-SNE complete!")

üé® VECTOR SPACE VISUALIZATION

üìä Visualizing:
   ‚Ä¢ 500 candidates
   ‚Ä¢ 2000 companies
   ‚Ä¢ From ‚Ñù^384 ‚Üí ‚Ñù¬≤ (t-SNE)

üîÑ Running t-SNE (2-3 minutes)...

‚úÖ t-SNE complete!
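
scikit-learn's t-SNE requires the perplexity to be smaller than the number of samples, so the fixed `perplexity=30` above would fail on a very small test dataset. A minimal sketch of a guard follows. The helper name `safe_perplexity` is hypothetical.

```python
def safe_perplexity(n_samples: int, preferred: float = 30.0) -> float:
    """Cap the preferred perplexity so it stays below the sample count."""
    if n_samples < 2:
        raise ValueError("t-SNE needs at least 2 samples")
    return float(min(preferred, n_samples - 1))
```

In the cell above this would be called as `safe_perplexity(len(all_vectors))`. With 2,500 vectors it simply returns 30.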


In [None]:
# Create interactive plot
fig = go.Figure()

# Companies (red)
fig.add_trace(go.Scatter(
    x=comp_2d[:, 0],
    y=comp_2d[:, 1],
    mode='markers',
    name='Companies',
    marker=dict(size=6, color='#ff6b6b', opacity=0.6),
    # str() guards against missing (NaN) company names, which cannot be sliced
    text=[f"Company: {str(companies_full.iloc[i].get('name', 'N/A'))[:30]}"
          for i in range(n_comp_viz)],
    hovertemplate='<b>%{text}</b><extra></extra>'
))

# Candidates (green)
fig.add_trace(go.Scatter(
    x=cand_2d[:, 0],
    y=cand_2d[:, 1],
    mode='markers',
    name='Candidates',
    marker=dict(
        size=10,
        color='#00ff00',
        opacity=0.8,
        line=dict(width=1, color='white')
    ),
    text=[f"Candidate {i}" for i in range(n_cand_viz)],
    hovertemplate='<b>%{text}</b><extra></extra>'
))

fig.update_layout(
    title='Vector Space: Candidates & Companies (Enriched with Postings)',
    xaxis_title='Dimension 1',
    yaxis_title='Dimension 2',
    width=1200,
    height=800,
    plot_bgcolor='#1a1a1a',
    paper_bgcolor='#0d0d0d',
    font=dict(color='white')
)

fig.show()

print("\n‚úÖ Visualization complete!")
print("üí° If green & red OVERLAP ‚Üí Alignment worked!")


‚úÖ Visualization complete!
üí° If green & red OVERLAP ‚Üí Alignment worked!


---
## 📊 Interactive Visualization 2: Highlighted Match Network

Show candidate and their top matches with connection lines

In [None]:
# ============================================================================
# üîç HIGHLIGHTED MATCH NETWORK
# ============================================================================

target_candidate = 0

print(f"üîç Analyzing Candidate #{target_candidate}...\n")

matches = find_top_matches(target_candidate, top_k=10)
match_indices = [comp_idx for comp_idx, score in matches if comp_idx < n_comp_viz]

# Create highlighted plot
fig2 = go.Figure()

# All companies (background)
fig2.add_trace(go.Scatter(
    x=comp_2d[:, 0],
    y=comp_2d[:, 1],
    mode='markers',
    name='All Companies',
    marker=dict(size=4, color='#ff6b6b', opacity=0.3),
    showlegend=True
))

# Top matches (highlighted)
if match_indices:
    match_positions = comp_2d[match_indices]
    fig2.add_trace(go.Scatter(
        x=match_positions[:, 0],
        y=match_positions[:, 1],
        mode='markers',
        name='Top Matches',
        marker=dict(
            size=15,
            color='#ff0000',
            line=dict(width=2, color='white')
        ),
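        # Take rank and score from the same (filtered) match, so labels stay
        # aligned with match_indices even when some matches were filtered out
        text=[f"Match #{rank}: {str(companies_full.iloc[idx].get('name', 'N/A'))[:30]}<br>Score: {score:.3f}"
              for rank, (idx, score) in enumerate(matches, 1) if idx < n_comp_viz],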
        hovertemplate='<b>%{text}</b><extra></extra>'
    ))

# Target candidate (star)
fig2.add_trace(go.Scatter(
    x=[cand_2d[target_candidate, 0]],
    y=[cand_2d[target_candidate, 1]],
    mode='markers',
    name=f'Candidate #{target_candidate}',
    marker=dict(
        size=25,
        color='#00ff00',
        symbol='star',
        line=dict(width=3, color='white')
    )
))

# Connection lines (top 5)
for i, match_idx in enumerate(match_indices[:5]):
    fig2.add_trace(go.Scatter(
        x=[cand_2d[target_candidate, 0], comp_2d[match_idx, 0]],
        y=[cand_2d[target_candidate, 1], comp_2d[match_idx, 1]],
        mode='lines',
        line=dict(color='yellow', width=1, dash='dot'),
        opacity=0.5,
        showlegend=False
    ))

fig2.update_layout(
    title=f'Candidate #{target_candidate} and Top Matches',
    xaxis_title='Dimension 1',
    yaxis_title='Dimension 2',
    width=1200,
    height=800,
    plot_bgcolor='#1a1a1a',
    paper_bgcolor='#0d0d0d',
    font=dict(color='white')
)

fig2.show()

print("\n‚úÖ Highlighted visualization created!")
print(f"   ‚≠ê Green star = Candidate #{target_candidate}")
print(f"   üî¥ Red dots = Top matches")
print(f"   üíõ Yellow lines = Connections")

üîç Analyzing Candidate #0...




‚úÖ Highlighted visualization created!
   ‚≠ê Green star = Candidate #0
   üî¥ Red dots = Top matches
   üíõ Yellow lines = Connections
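
The `comp_idx < n_comp_viz` filter keeps only matches that fall inside the first 2,000 companies used for t-SNE. The top five matches for candidate #0 reported in Step 16 have indices of 9416 and above, so they are dropped and cannot be highlighted. One way around this, sketched below with hypothetical names, is to append the matched companies to the sample before running t-SNE and keep a lookup from company index to plot row.

```python
from typing import Dict, Iterable, List, Tuple


def viz_company_indices(
    n_companies: int, n_sample: int, match_indices: Iterable[int]
) -> Tuple[List[int], Dict[int, int]]:
    """Return the company rows to embed and a company-index -> plot-row lookup.

    The first `n_sample` companies form the background; any in-range match
    outside that slice is appended so it can still be highlighted.
    """
    selected = list(range(min(n_sample, n_companies)))
    seen = set(selected)
    for idx in match_indices:
        if 0 <= idx < n_companies and idx not in seen:
            selected.append(idx)
            seen.add(idx)
    row_of = {company_idx: row for row, company_idx in enumerate(selected)}
    return selected, row_of
```

t-SNE would then run on `comp_vectors[selected]`, and a match with company index `c` would be plotted at row `row_of[c]` of the projected company array.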


---
## 🌐 Interactive Visualization 3: Network Graph (PyVis)

Interactive network showing candidate-company connections with nodes & edges

In [None]:
# ============================================================================
# üåê NETWORK GRAPH WITH PYVIS
# ============================================================================

from pyvis.network import Network

print("üåê Creating interactive network graph...\n")

target_candidate = 0
top_k_network = 10

# Get matches
matches = find_top_matches(target_candidate, top_k=top_k_network)

# Create network
net = Network(
    height='800px',
    width='100%',
    bgcolor='#1a1a1a',
    font_color='white',
    directed=False
)

# Configure physics
net.barnes_hut(
    gravity=-5000,
    central_gravity=0.3,
    spring_length=100,
    spring_strength=0.01
)

# Add candidate node (center)
cand = candidates.iloc[target_candidate]
cand_label = f"Candidate #{target_candidate}"
net.add_node(
    f'cand_{target_candidate}',
    label=cand_label,
    title=f"{cand.get('Category', 'N/A')}<br>Skills: {str(cand.get('skills', 'N/A'))[:100]}",
    color='#00ff00',
    size=40,
    shape='star'
)

# Add company nodes + edges
for rank, (comp_idx, score) in enumerate(matches, 1):
    if comp_idx >= len(companies_full):
        continue
    
    company = companies_full.iloc[comp_idx]
    comp_name = str(company.get('name', f'Company {comp_idx}'))[:30]  # str() guards against NaN names
    
    # Color by score
    if score > 0.7:
        color = '#ff0000'  # Red (strong match)
    elif score > 0.5:
        color = '#ff6b6b'  # Light red (good match)
    else:
        color = '#ffaaaa'  # Pink (weak match)
    
    # Add company node
    net.add_node(
        f'comp_{comp_idx}',
        label=f"#{rank}. {comp_name}",
        title=f"Score: {score:.3f}<br>Industries: {str(company.get('industries_list', 'N/A'))[:50]}<br>Required: {str(company.get('required_skills', 'N/A'))[:100]}",
        color=color,
        size=20 + (score * 20)  # Size by score
    )
    
    # Add edge
    net.add_edge(
        f'cand_{target_candidate}',
        f'comp_{comp_idx}',
        value=float(score),
        title=f"Similarity: {score:.3f}",
        color='yellow'
    )

# Save
output_file = f'{Config.RESULTS_PATH}network_graph.html'
net.save_graph(output_file)

print(f"‚úÖ Network graph created!")
print(f"üìÑ Saved: {output_file}")
print(f"\nüí° LEGEND:")
print(f"   ‚≠ê Green star = Candidate #{target_candidate}")
print(f"   üî¥ Red nodes = Companies (size = match score)")
print(f"   üíõ Yellow edges = Connections")
print(f"\n‚ÑπÔ∏è  Hover over nodes to see details")
print(f"   Drag nodes to rearrange")
print(f"   Zoom with mouse wheel\n")

# Display in notebook
from IPython.display import IFrame
IFrame(output_file, width=1000, height=800)

üåê Creating interactive network graph...

‚úÖ Network graph created!
üìÑ Saved: ../results/network_graph.html

üí° LEGEND:
   ‚≠ê Green star = Candidate #0
   üî¥ Red nodes = Companies (size = match score)
   üíõ Yellow edges = Connections

‚ÑπÔ∏è  Hover over nodes to see details
   Drag nodes to rearrange
   Zoom with mouse wheel



### 📊 Network Node Data

Detailed information about nodes and connections

In [None]:
# ============================================================================
# DISPLAY NODE DATA
# ============================================================================

print("üìä NETWORK DATA SUMMARY")
print("=" * 80)
print(f"\nTotal nodes: {1 + len(matches)}")
print(f"   - 1 candidate node (green star)")
print(f"   - {len(matches)} company nodes (red circles)")
print(f"\nTotal edges: {len(matches)}")
print(f"\n" + "=" * 80)

# Show node details
print(f"\nüéØ CANDIDATE NODE:")
print(f"   ID: cand_{target_candidate}")
print(f"   Category: {cand.get('Category', 'N/A')}")
print(f"   Skills: {str(cand.get('skills', 'N/A'))[:100]}...")

print(f"\nüè¢ COMPANY NODES (Top 5):")
for rank, (comp_idx, score) in enumerate(matches[:5], 1):
    if comp_idx < len(companies_full):
        company = companies_full.iloc[comp_idx]
        print(f"\n   #{rank}. {str(company.get('name', 'N/A'))[:40]}")
        print(f"       ID: comp_{comp_idx}")
        print(f"       Score: {score:.4f}")
        print(f"       Industries: {str(company.get('industries_list', 'N/A'))[:60]}...")

print(f"\n" + "=" * 80)

üìä NETWORK DATA SUMMARY

Total nodes: 11
   - 1 candidate node (green star)
   - 10 company nodes (red circles)

Total edges: 10


üéØ CANDIDATE NODE:
   ID: cand_0
   Category: N/A
   Skills: ['Big Data', 'Hadoop', 'Hive', 'Python', 'Mapreduce', 'Spark', 'Java', 'Machine Learning', 'Cloud', ...

üè¢ COMPANY NODES (Top 5):

   #1. TeachTown
       ID: comp_9418
       Score: 0.7028
       Industries: E-Learning Providers...

   #2. Wolverine Power Systems
       ID: comp_9417
       Score: 0.7026
       Industries: Renewable Energy Semiconductor Manufacturing...

   #3. Mariner
       ID: comp_9416
       Score: 0.7010
       Industries: Financial Services...

   #4. Primavera School
       ID: comp_13786
       Score: 0.6827
       Industries: Education Administration Programs...

   #5. OM1, Inc.
       ID: comp_16864
       Score: 0.6776
       Industries: Pharmaceutical Manufacturing...



---
## 🔍 Visualization 4: Display Node Data

Inspect detailed information about candidates and companies

In [None]:
# ============================================================================
# DISPLAY NODE DATA - See what's behind the graph
# ============================================================================

def display_node_data(node_id):
    print("=" * 80)
    
    if node_id.startswith('C'):
        # CANDIDATE
        cand_idx = int(node_id[1:])
        
        if cand_idx >= len(candidates):
            print(f"‚ùå Candidate {cand_idx} not found!")
            return
        
        candidate = candidates.iloc[cand_idx]
        
        print(f"üü¢ CANDIDATE #{cand_idx}")
        print("=" * 80)
        print(f"\nüìä KEY INFORMATION:\n")
        print(f"Resume ID: {candidate.get('Resume_ID', 'N/A')}")
        print(f"Category: {candidate.get('Category', 'N/A')}")
        print(f"Skills: {str(candidate.get('skills', 'N/A'))[:200]}")
        print(f"Career Objective: {str(candidate.get('career_objective', 'N/A'))[:200]}")
        
    elif node_id.startswith('J'):
        # COMPANY
        comp_idx = int(node_id[1:])
        
        if comp_idx >= len(companies_full):
            print(f"‚ùå Company {comp_idx} not found!")
            return
        
        company = companies_full.iloc[comp_idx]
        
        print(f"üî¥ COMPANY #{comp_idx}")
        print("=" * 80)
        print(f"\nüìä COMPANY INFORMATION:\n")
        print(f"Name: {company.get('name', 'N/A')}")
        print(f"Industries: {str(company.get('industries_list', 'N/A'))[:200]}")
        print(f"Required Skills: {str(company.get('required_skills', 'N/A'))[:200]}")
        print(f"Posted Jobs: {str(company.get('posted_job_titles', 'N/A'))[:200]}")
    
    print("\n" + "=" * 80 + "\n")

def display_node_with_connections(node_id, top_k=10):
    display_node_data(node_id)
    
    if node_id.startswith('C'):
        cand_idx = int(node_id[1:])
        
        print(f"üéØ TOP {top_k} MATCHES:")
        print("=" * 80)
        
        matches = find_top_matches(cand_idx, top_k=top_k)
        
        # FIXED: Validate indices before accessing
        valid_matches = 0
        for rank, (comp_idx, score) in enumerate(matches, 1):
            # Check if index is valid
            if comp_idx >= len(companies_full):
                print(f"‚ö†Ô∏è  Match #{rank}: Index {comp_idx} out of range (skipping)")
                continue
            
            company = companies_full.iloc[comp_idx]
            print(f"#{rank}. {str(company.get('name', 'N/A'))[:40]} (Score: {score:.4f})")
            valid_matches += 1
        
        if valid_matches == 0:
            print("‚ö†Ô∏è  No valid matches found (all indices out of bounds)")
            print("\nüí° SOLUTION: Regenerate embeddings after deduplication!")
        
        print("\n" + "=" * 80)

# Example usage
display_node_with_connections('C0', top_k=5)

üü¢ CANDIDATE #0

üìä KEY INFORMATION:

Resume ID: N/A
Category: N/A
Skills: ['Big Data', 'Hadoop', 'Hive', 'Python', 'Mapreduce', 'Spark', 'Java', 'Machine Learning', 'Cloud', 'Hdfs', 'YARN', 'Core Java', 'Data Science', 'C++', 'Data Structures', 'DBMS', 'RDBMS', 'Informatica
Career Objective: Big data analytics working and database warehouse manager with robust experience in handling all kinds of data. I have also used multiple cloud infrastructure services and am well acquainted with them


üéØ TOP 5 MATCHES:
#1. TeachTown (Score: 0.7028)
#2. Wolverine Power Systems (Score: 0.7026)
#3. Mariner (Score: 0.7010)
#4. Primavera School (Score: 0.6827)
#5. OM1, Inc. (Score: 0.6776)
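
`display_node_data` prints nothing useful for an ID that starts with neither `C` nor `J`, and `int(node_id[1:])` raises on input such as `'C'` or `'Cx'`. A minimal sketch of a parser that validates the ID first is shown below. The names `parse_node_id` and `NODE_KINDS` are hypothetical.

```python
from typing import Tuple

# Prefixes used by the node IDs in this notebook: C = candidate, J = company.
NODE_KINDS = {"C": "candidate", "J": "company"}


def parse_node_id(node_id: str) -> Tuple[str, int]:
    """Split an ID such as 'C0' or 'J9418' into (kind, index).

    Raises ValueError for an unknown prefix or a non-numeric index.
    """
    prefix, digits = node_id[:1], node_id[1:]
    if prefix not in NODE_KINDS or not (digits.isascii() and digits.isdigit()):
        raise ValueError(f"Unrecognized node ID: {node_id!r}")
    return NODE_KINDS[prefix], int(digits)
```

For example, `parse_node_id('C0')` returns `('candidate', 0)`, and the caller can branch on the first element instead of repeating `startswith` checks.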



---
## 🕸️ Visualization 5: NetworkX Graph

Network graph using NetworkX + Plotly with force-directed layout

In [None]:
# ============================================================================
# NETWORK GRAPH WITH NETWORKX + PLOTLY
# ============================================================================

import networkx as nx

print("üï∏Ô∏è  Creating NETWORK GRAPH...\n")

# Create graph
G = nx.Graph()

# Sample
n_cand_sample = min(20, len(candidates))
top_k_per_cand = 5

print(f"üìä Network size:")
print(f"   ‚Ä¢ {n_cand_sample} candidates")
print(f"   ‚Ä¢ {top_k_per_cand} companies per candidate\n")

# Add nodes + edges
companies_in_graph = set()

for i in range(n_cand_sample):
    G.add_node(f"C{i}", node_type='candidate', label=f"C{i}")
    
    matches = find_top_matches(i, top_k=top_k_per_cand)
    
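    for comp_idx, score in matches:
        # Skip indices outside companies_full, as the other cells do
        if comp_idx >= len(companies_full):
            continue
        
        comp_id = f"J{comp_idx}"
        
        if comp_id not in companies_in_graph:
            # str() guards against missing (NaN) company names
            company_name = str(companies_full.iloc[comp_idx].get('name', 'N/A'))[:20]
            G.add_node(comp_id, node_type='company', label=company_name)
            companies_in_graph.add(comp_id)
        
        G.add_edge(f"C{i}", comp_id, weight=float(score))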

print(f"‚úÖ Network created!")
print(f"   Nodes: {G.number_of_nodes()}")
print(f"   Edges: {G.number_of_edges()}\n")

# Calculate layout
print("üîÑ Calculating layout...")
pos = nx.spring_layout(G, k=2, iterations=50, seed=42)
print("‚úÖ Layout done!\n")

# Create edge traces
edge_trace = []
for edge in G.edges(data=True):
    x0, y0 = pos[edge[0]]
    x1, y1 = pos[edge[1]]
    weight = edge[2]['weight']
    
    edge_trace.append(go.Scatter(
        x=[x0, x1, None],
        y=[y0, y1, None],
        mode='lines',
        line=dict(width=weight*3, color='rgba(255,255,255,0.3)'),
        hoverinfo='none',
        showlegend=False
    ))

# Candidate nodes
cand_nodes = [n for n, d in G.nodes(data=True) if d['node_type']=='candidate']
cand_x = [pos[n][0] for n in cand_nodes]
cand_y = [pos[n][1] for n in cand_nodes]
cand_labels = [G.nodes[n]['label'] for n in cand_nodes]

candidate_trace = go.Scatter(
    x=cand_x, y=cand_y,
    mode='markers+text',
    name='Candidates',
    marker=dict(size=25, color='#00ff00', line=dict(width=2, color='white')),
    text=cand_labels,
    textposition='top center',
    hovertemplate='<b>%{text}</b><extra></extra>'
)

# Company nodes
comp_nodes = [n for n, d in G.nodes(data=True) if d['node_type']=='company']
comp_x = [pos[n][0] for n in comp_nodes]
comp_y = [pos[n][1] for n in comp_nodes]
comp_labels = [G.nodes[n]['label'] for n in comp_nodes]

company_trace = go.Scatter(
    x=comp_x, y=comp_y,
    mode='markers+text',
    name='Companies',
    marker=dict(size=15, color='#ff6b6b', symbol='square'),
    text=comp_labels,
    textposition='top center',
    hovertemplate='<b>%{text}</b><extra></extra>'
)

# Create figure
fig = go.Figure(data=edge_trace + [candidate_trace, company_trace])

fig.update_layout(
    title='Network Graph: Candidates ‚Üî Companies',
    showlegend=True,
    width=1400, height=900,
    plot_bgcolor='#1a1a1a',
    paper_bgcolor='#0d0d0d',
    font=dict(color='white'),
    xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
    yaxis=dict(showgrid=False, zeroline=False, showticklabels=False)
)

fig.show()

print("‚úÖ NetworkX graph created!")
print("   üü¢ Green = Candidates")
print("   üî¥ Red = Companies")
print("   Lines = Connections (thicker = stronger)\n")

üï∏Ô∏è  Creating NETWORK GRAPH...

üìä Network size:
   ‚Ä¢ 20 candidates
   ‚Ä¢ 5 companies per candidate

‚úÖ Network created!
   Nodes: 74
   Edges: 100

üîÑ Calculating layout...
‚úÖ Layout done!



‚úÖ NetworkX graph created!
   üü¢ Green = Candidates
   üî¥ Red = Companies
   Lines = Connections (thicker = stronger)
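
The counts above imply some overlap between candidates: 74 nodes minus 20 candidates leaves 54 distinct companies for 100 edges, so several companies are matched by more than one candidate. A minimal sketch for listing those shared companies from a list of `(candidate_id, company_id)` pairs follows. The function name and the edge-list input are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def shared_companies(edges: List[Tuple[str, str]], min_candidates: int = 2) -> Dict[str, int]:
    """Count how many candidates link to each company and keep the shared ones."""
    degree = Counter(company for _candidate, company in edges)
    return {company: n for company, n in degree.items() if n >= min_candidates}
```

The pairs can be collected in the same loop that calls `G.add_edge`, which avoids relying on the node order of edges in an undirected graph.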



---
## 🐛 DEBUG: Why aren't candidates & companies overlapping?

Investigating the embedding space alignment

In [None]:
# ============================================================================
# DEBUG: CHECK EMBEDDING ALIGNMENT
# ============================================================================

print("üêõ DEBUGGING EMBEDDING SPACE")
print("=" * 80)

# 1. Check if vectors loaded correctly
print(f"\n1Ô∏è‚É£ VECTOR SHAPES:")
print(f"   Candidates: {cand_vectors.shape}")
print(f"   Companies: {comp_vectors.shape}")

# 2. Check vector norms
print(f"\n2Ô∏è‚É£ VECTOR NORMS (should be ~1.0 if normalized):")
cand_norms = np.linalg.norm(cand_vectors, axis=1)
comp_norms = np.linalg.norm(comp_vectors, axis=1)
print(f"   Candidates: mean={cand_norms.mean():.4f}, min={cand_norms.min():.4f}, max={cand_norms.max():.4f}")
print(f"   Companies: mean={comp_norms.mean():.4f}, min={comp_norms.min():.4f}, max={comp_norms.max():.4f}")

# 3. Sample similarity
print(f"\n3Ô∏è‚É£ SAMPLE SIMILARITIES:")
sample_cand = 0
matches = find_top_matches(sample_cand, top_k=5)
print(f"   Candidate #{sample_cand} top 5 matches:")
for rank, (comp_idx, score) in enumerate(matches, 1):
    print(f"      #{rank}. Company {comp_idx}: {score:.4f}")

# 4. Check text representations
print(f"\n4Ô∏è‚É£ TEXT REPRESENTATION SAMPLES:")
print(f"\n   üìã CANDIDATE #{sample_cand}:")
cand = candidates.iloc[sample_cand]
print(f"      Skills: {str(cand.get('skills', 'N/A'))[:100]}")
print(f"      Category: {cand.get('Category', 'N/A')}")

top_company_idx = matches[0][0]
print(f"\n   üè¢ TOP MATCH COMPANY #{top_company_idx}:")
company = companies_full.iloc[top_company_idx]
print(f"      Name: {company.get('name', 'N/A')}")
print(f"      Required Skills: {str(company.get('required_skills', 'N/A'))[:100]}")
print(f"      Industries: {str(company.get('industries_list', 'N/A'))[:100]}")

# 5. Check if postings enrichment worked
print(f"\n5Ô∏è‚É£ POSTINGS ENRICHMENT CHECK:")
companies_with_postings = companies_full[companies_full['required_skills'] != ''].shape[0]
companies_without = companies_full[companies_full['required_skills'] == ''].shape[0]
print(f"   WITH postings: {companies_with_postings:,} ({companies_with_postings/len(companies_full)*100:.1f}%)")
print(f"   WITHOUT postings: {companies_without:,}")

# 6. HYPOTHESIS
print(f"\n‚ùì HYPOTHESIS:")
if companies_without > companies_with_postings:
    print(f"   ‚ö†Ô∏è  Most companies DON'T have postings!")
    print(f"   ‚ö†Ô∏è  They only have: industries, specialties, description")
    print(f"   ‚ö†Ô∏è  This creates DIFFERENT language than candidates")
    print(f"\n   üí° SOLUTION:")
    print(f"      Option A: Filter to only companies WITH postings")
    print(f"      Option B: Use LLM to translate industries ‚Üí skills")
else:
    print(f"   ‚úÖ Most companies have postings")
    print(f"   ‚ùì Need to check if embeddings were generated AFTER enrichment")

print(f"\n" + "=" * 80)

🐛 DEBUGGING EMBEDDING SPACE

1️⃣ VECTOR SHAPES:
   Candidates: (9544, 384)
   Companies: (35787, 384)

2️⃣ VECTOR NORMS (should be ~1.0 if normalized):
   Candidates: mean=1.0000, min=1.0000, max=1.0000
   Companies: mean=1.0000, min=1.0000, max=1.0000

3️⃣ SAMPLE SIMILARITIES:
   Candidate #0 top 5 matches:
      #1. Company 9418: 0.7028
      #2. Company 9417: 0.7026
      #3. Company 9416: 0.7010
      #4. Company 13786: 0.6827
      #5. Company 16864: 0.6776

4️⃣ TEXT REPRESENTATION SAMPLES:

   📋 CANDIDATE #0:
      Skills: ['Big Data', 'Hadoop', 'Hive', 'Python', 'Mapreduce', 'Spark', 'Java', 'Machine Learning', 'Cloud', 
      Category: N/A

   🏢 TOP MATCH COMPANY #9418:
      Name: TeachTown
      Required Skills: 
      Industries: E-Learning Providers

5️⃣ POSTINGS ENRICHMENT CHECK:
   WITH postings: 0 (0.0%)
   WITHOUT postings: 24,473

❓ HYPOTHESIS:
   ⚠️  Most companies DON'T have postings!
   ⚠️  They only have: industries, specialti

---
## 📊 Step 19: Summary

### What We Built

In [None]:
print("="*70)
print("🎯 HRHUB v2.1 - SUMMARY")
print("="*70)
print("")
print("✅ IMPLEMENTED:")
print("  1. Zero-Shot Job Classification (Entry/Mid/Senior/Executive)")
print("  2. Few-Shot Learning with Examples")
print("  3. Structured Skills Extraction (Pydantic schemas)")
print("  4. Match Explainability (LLM-generated reasoning)")
print("  5. FREE LLM Integration (Hugging Face)")
print("  6. Flexible Data Loading (Upload OR Google Drive)")
print("")
print("💰 COST: $0.00 (completely free!)")
print("")
print("📈 COURSE ALIGNMENT:")
print("  ✅ LLMs for structured output")
print("  ✅ Pydantic schemas")
print("  ✅ Classification pipelines")
print("  ✅ Zero-shot & few-shot learning")
print("  ✅ JSON extraction")
print("  ✅ Transformer architecture (embeddings)")
print("  ✅ API deployment strategies")
print("")
print("="*70)
print("🚀 READY TO MOVE TO VS CODE!")
print("="*70)

🎯 HRHUB v2.1 - SUMMARY

✅ IMPLEMENTED:
  1. Zero-Shot Job Classification (Entry/Mid/Senior/Executive)
  2. Few-Shot Learning with Examples
  3. Structured Skills Extraction (Pydantic schemas)
  4. Match Explainability (LLM-generated reasoning)
  5. FREE LLM Integration (Hugging Face)
  6. Flexible Data Loading (Upload OR Google Drive)

💰 COST: $0.00 (completely free!)

📈 COURSE ALIGNMENT:
  ✅ LLMs for structured output
  ✅ Pydantic schemas
  ✅ Classification pipelines
  ✅ Zero-shot & few-shot learning
  ✅ JSON extraction
  ✅ Transformer architecture (embeddings)
  ✅ API deployment strategies

🚀 READY TO MOVE TO VS CODE!


---
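## 🎯 Step 7.5: Collaborative Filtering for Companies

**The idea: skill inheritance from similar companies**

Companies WITHOUT postings can inherit skills from similar companies WITH postings. Strictly speaking, this is nearest-neighbour imputation on profile embeddings rather than collaborative filtering on interaction data, but the intuition is the same.

Like Netflix recommendations:
- Company A (no postings) is similar to Company B (has postings)
- → Company A inherits Company B's required skills

The approach only works if at least one company WITH postings exists to inherit from.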
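The inheritance idea can be sketched on toy data. This is a minimal illustration, not the notebook's implementation: `inherit_skills` and the 2-D vectors are hypothetical, and the rows are assumed to be unit length so that a dot product equals cosine similarity.

```python
import numpy as np

def inherit_skills(query_vecs, ref_vecs, ref_skills, top_k=2):
    """For each query row, join the skills of its top_k most similar reference rows.

    Assumes both matrices are L2-normalised (dot product == cosine similarity).
    """
    if len(ref_skills) == 0:
        # Nothing to inherit from: every query gets an empty string.
        return [''] * len(query_vecs)
    sims = query_vecs @ ref_vecs.T                  # shape (n_query, n_ref)
    k = min(top_k, len(ref_skills))
    top_idx = np.argsort(-sims, axis=1)[:, :k]      # most similar first
    return [' | '.join(ref_skills[j] for j in row) for row in top_idx]

# Toy 2-D "embeddings" (hypothetical values, already unit length)
ref_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
ref_skills = ['python, sql', 'nursing', 'teaching, python']
query_vecs = np.array([[0.8, 0.6]])

inherited = inherit_skills(query_vecs, ref_vecs, ref_skills, top_k=2)
print(inherited)   # ['teaching, python | python, sql']
```

The early return for an empty reference set matters here: with no companies WITH postings there is nothing to rank, so the function returns empty strings instead of failing.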

In [None]:
# ============================================================================
# COLLABORATIVE FILTERING: Companies without postings
# ============================================================================

print("🎯 COLLABORATIVE FILTERING FOR COMPANIES")
print("=" * 80)
print("\nLike Netflix: Similar companies → Similar skills needed!\n")

# Step 1: Separate companies (NaN and blank strings both count as "no postings")
has_postings = companies_full['required_skills'].fillna('').astype(str).str.strip() != ''
companies_with_postings = companies_full[has_postings].copy()
companies_without_postings = companies_full[~has_postings].copy()

print("📊 DATA SPLIT:")
print(f"   WITH postings: {len(companies_with_postings):,} companies")
print(f"   WITHOUT postings: {len(companies_without_postings):,} companies")
print(f"\n💡 Goal: Infer skills for {len(companies_without_postings):,} companies\n")

# Step 2: Build company profile vectors (BEFORE postings)
# Using ONLY: name, description, industries, specialties, employee_count
print("🔧 Building company profile vectors...")

def build_company_profile_text(row):
    """Build text representation WITHOUT postings data"""
    parts = []
    
    if row.get('name'):
        parts.append(f"Company: {row['name']}")
    
    if row.get('description'):
        parts.append(f"Description: {row['description']}")
    
    if row.get('industries_list'):
        parts.append(f"Industries: {row['industries_list']}")
    
    if row.get('specialties_list'):
        parts.append(f"Specialties: {row['specialties_list']}")
    
    if row.get('employee_count'):
        parts.append(f"Size: {row['employee_count']} employees")
    
    return ' '.join(parts)

# Generate profile embeddings
with_postings_profiles = companies_with_postings.apply(build_company_profile_text, axis=1).tolist()
without_postings_profiles = companies_without_postings.apply(build_company_profile_text, axis=1).tolist()

print(f"   Encoding {len(with_postings_profiles):,} companies WITH postings...")
with_postings_embeddings = model.encode(
    with_postings_profiles,
    show_progress_bar=True,
    batch_size=32,
    normalize_embeddings=True
)

print(f"   Encoding {len(without_postings_profiles):,} companies WITHOUT postings...")
without_postings_embeddings = model.encode(
    without_postings_profiles,
    show_progress_bar=True,
    batch_size=32,
    normalize_embeddings=True
)

print(f"\n✅ Profile embeddings created!")
print(f"   Shape WITH: {with_postings_embeddings.shape}")
print(f"   Shape WITHOUT: {without_postings_embeddings.shape}\n")

🎯 COLLABORATIVE FILTERING FOR COMPANIES

Like Netflix: Similar companies → Similar skills needed!

📊 DATA SPLIT:
   WITH postings: 0 companies
   WITHOUT postings: 24,473 companies

💡 Goal: Infer skills for 24,473 companies

🔧 Building company profile vectors...
   Encoding 0 companies WITH postings...

Batches: 0it [00:00, ?it/s]

   Encoding 24,473 companies WITHOUT postings...

Batches: 100%|██████████| 765/765 [12:15<00:00,  1.04it/s]

✅ Profile embeddings created!
   Shape WITH: (0,)
   Shape WITHOUT: (24473, 384)
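One caveat about the text builders: a check such as `if row.get('description')` treats `NaN` as truthy, so a field that pandas loaded as missing may end up in the profile text as `Description: nan`. Below is a minimal sketch of a NaN-safe variant. `has_value` and `build_profile_text` are hypothetical helpers, and a plain dict stands in for a pandas row, since both support `.get`.

```python
import math

def has_value(value):
    """True for usable field values; None, NaN and empty strings/lists count as missing."""
    if value is None:
        return False
    if isinstance(value, float) and math.isnan(value):
        return False
    if hasattr(value, '__len__'):
        return len(value) > 0
    return bool(value)

def build_profile_text(row, fields):
    """Join 'Label: value' parts for the fields that are actually present."""
    return ' '.join(
        f"{label}: {row[key]}" for key, label in fields if has_value(row.get(key))
    )

# Hypothetical row with one NaN field and one empty field
row = {
    'name': 'Acme',
    'description': float('nan'),
    'industries_list': ['Retail'],
    'specialties_list': '',
}
fields = [
    ('name', 'Company'),
    ('description', 'Description'),
    ('industries_list', 'Industries'),
    ('specialties_list', 'Specialties'),
]

text = build_profile_text(row, fields)
print(text)   # Company: Acme Industries: ['Retail']
```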






In [None]:
# ============================================================================
# STEP 3: Find Similar Companies & Inherit Skills
# ============================================================================

print("🔍 Finding similar companies for skill inheritance...\n")

# For each company WITHOUT postings, find top-K similar WITH postings
TOP_K_SIMILAR = 5  # Use top 5 similar companies

print(f"📊 Method: Top-{TOP_K_SIMILAR} Collaborative Filtering\n")

inferred_skills = []
inferred_titles = []
inferred_levels = []

# Calculate similarities (batch processing)
print("⚙️  Calculating company-to-company similarities...")
similarities = cosine_similarity(without_postings_embeddings, with_postings_embeddings)

print(f"✅ Similarity matrix: {similarities.shape}\n")
print(f"🔄 Inferring skills for {len(companies_without_postings):,} companies...\n")

for i in range(len(companies_without_postings)):
    if i % 10000 == 0:
        print(f"   Progress: {i:,}/{len(companies_without_postings):,}")
    
    # Get top-K similar companies WITH postings
    top_k_indices = np.argsort(similarities[i])[::-1][:TOP_K_SIMILAR]
    
    # Collect skills from similar companies
    similar_skills = []
    similar_titles = []
    similar_levels = []
    
    for idx in top_k_indices:
        similar_company = companies_with_postings.iloc[idx]
        
        if similar_company.get('required_skills'):
            similar_skills.append(str(similar_company['required_skills']))
        
        if similar_company.get('posted_job_titles'):
            similar_titles.append(str(similar_company['posted_job_titles']))
        
        if similar_company.get('experience_levels'):
            similar_levels.append(str(similar_company['experience_levels']))
    
    # Aggregate (simple concatenation)
    inferred_skills.append(' | '.join(similar_skills) if similar_skills else '')
    inferred_titles.append(' | '.join(similar_titles) if similar_titles else '')
    inferred_levels.append(' | '.join(similar_levels) if similar_levels else '')

print(f"\n✅ Skill inference complete!\n")

# Add to companies_without_postings
companies_without_postings['required_skills'] = inferred_skills
companies_without_postings['posted_job_titles'] = inferred_titles
companies_without_postings['experience_levels'] = inferred_levels

# Mark as inferred
companies_without_postings['skills_source'] = 'inferred_cf'
companies_with_postings['skills_source'] = 'actual_postings'

print(f"📊 RESULTS:")
non_empty = sum(1 for s in inferred_skills if s != '')
print(f"   Successfully inferred skills: {non_empty:,}/{len(inferred_skills):,} ({non_empty/len(inferred_skills)*100:.1f}%)\n")

🔍 Finding similar companies for skill inheritance...

📊 Method: Top-5 Collaborative Filtering

⚙️  Calculating company-to-company similarities...


ValueError: Expected 2D array, got 1D array instead:
array=[].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
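The `ValueError` is consistent with the data split reported earlier: with 0 companies WITH postings, the encoder output has shape `(0,)`, which is 1-D, so there is no valid reference matrix to compare against. A minimal sketch of a guard follows, assuming unit-normalised embeddings (so a dot product equals cosine similarity). `safe_similarity` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def safe_similarity(query_vecs, ref_vecs, dim):
    """Cosine similarity for unit-length rows that tolerates an empty reference set."""
    query_vecs = np.asarray(query_vecs, dtype=float).reshape(-1, dim)
    ref_vecs = np.asarray(ref_vecs, dtype=float).reshape(-1, dim)   # (0,) -> (0, dim)
    return query_vecs @ ref_vecs.T                                   # (n_query, n_ref)

dim = 4
queries = np.eye(dim)[:3]      # three unit vectors as stand-in embeddings
empty_refs = np.array([])      # shape (0,), as in the output above

sims = safe_similarity(queries, empty_refs, dim)
print(sims.shape)              # (3, 0)

if sims.shape[1] == 0:
    print("No companies WITH postings: nothing to inherit from, skipping inference.")
```

The guard only avoids the crash. The underlying issue, that no company has a non-empty `required_skills` value, still has to be resolved upstream before skill inheritance can produce anything.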

In [None]:
# ============================================================================
# STEP 4: Rebuild companies_full with INFERRED skills
# ============================================================================

print("🔄 Rebuilding companies_full with inferred skills...\n")

# Combine
companies_full_enhanced = pd.concat([
    companies_with_postings,
    companies_without_postings
], ignore_index=False).sort_index()

print(f"✅ Enhanced dataset created!")
print(f"   Total companies: {len(companies_full_enhanced):,}")
print(f"   With actual postings: {len(companies_with_postings):,}")
print(f"   With inferred skills: {len(companies_without_postings):,}")

# Verify
total_with_skills = companies_full_enhanced[companies_full_enhanced['required_skills'] != ''].shape[0]
print(f"\n📈 IMPROVEMENT:")
print(f"   BEFORE: {len(companies_with_postings):,} companies with skills ({len(companies_with_postings)/len(companies_full)*100:.1f}%)")
print(f"   AFTER: {total_with_skills:,} companies with skills ({total_with_skills/len(companies_full_enhanced)*100:.1f}%)")
print(f"   📊 Increase: +{total_with_skills - len(companies_with_postings):,} companies!\n")

# Replace companies_full
companies_full = companies_full_enhanced

print(f"✅ companies_full updated with collaborative filtering!\n")

In [None]:
# ============================================================================
# STEP 5: Regenerate Company Embeddings with INFERRED skills
# ============================================================================

print("🔄 Regenerating company embeddings with inferred skills...\n")

def build_company_text_enhanced(row):
    """Build company text WITH inferred/actual skills"""
    parts = []
    
    if row.get('name'):
        parts.append(f"Company: {row['name']}")
    
    if row.get('description'):
        parts.append(f"Description: {row['description']}")
    
    if row.get('industries_list'):
        parts.append(f"Industries: {row['industries_list']}")
    
    if row.get('specialties_list'):
        parts.append(f"Specialties: {row['specialties_list']}")
    
    # NOW INCLUDES INFERRED SKILLS!
    if row.get('required_skills'):
        parts.append(f"Required Skills: {row['required_skills']}")
    
    if row.get('posted_job_titles'):
        parts.append(f"Job Titles: {row['posted_job_titles']}")
    
    if row.get('experience_levels'):
        parts.append(f"Experience: {row['experience_levels']}")
    
    return ' '.join(parts)

# Build texts
company_texts_enhanced = companies_full.apply(build_company_text_enhanced, axis=1).tolist()

print(f"📝 Encoding {len(company_texts_enhanced):,} enhanced company profiles...\n")

comp_vectors_enhanced = model.encode(
    company_texts_enhanced,
    show_progress_bar=True,
    batch_size=32,
    normalize_embeddings=True
)

print(f"\n✅ Enhanced embeddings created!")
print(f"   Shape: {comp_vectors_enhanced.shape}")

# Replace global comp_vectors
comp_vectors = comp_vectors_enhanced

print(f"\n🎯 NOW candidates & companies speak the SAME LANGUAGE!")
print(f"   All companies have skill information (actual or inferred)")
print(f"   Ready for matching!\n")

# Save
np.save(f'{Config.PROCESSED_PATH}company_embeddings_cf_enhanced.npy', comp_vectors)
print(f"💾 Saved: company_embeddings_cf_enhanced.npy\n")

### 🔍 Example: Check Inferred Skills

In [None]:
# ============================================================================
# EXAMPLE: See skill inference in action
# ============================================================================

print("🔍 COLLABORATIVE FILTERING EXAMPLE")
print("=" * 80)

# Find a company that got inferred skills
inferred_companies = companies_full[companies_full['skills_source'] == 'inferred_cf']

if len(inferred_companies) > 0:
    example = inferred_companies.iloc[0]
    
    print(f"\n📋 COMPANY (skills were INFERRED):")
    print(f"   Name: {example.get('name', 'N/A')}")
    print(f"   Industries: {str(example.get('industries_list', 'N/A'))[:100]}")
    print(f"   Specialties: {str(example.get('specialties_list', 'N/A'))[:100]}")
    print(f"\n   🎯 INFERRED Required Skills:")
    print(f"      {str(example.get('required_skills', 'N/A'))[:200]}")
    print(f"\n   💼 INFERRED Job Titles:")
    print(f"      {str(example.get('posted_job_titles', 'N/A'))[:200]}")
    
    print(f"\n💡 These skills were inherited from similar companies!")
else:
    print("\n⚠️  No inferred companies found")

print("\n" + "=" * 80)

---
## 🧠 Step 8: Generate OR Load Embeddings

**Smart pipeline:**
- First run: Generate embeddings (slow ~5 min)
- Subsequent runs: Load from file (fast <5 sec)

**CRITICAL:** Embeddings generated AFTER deduplication for perfect alignment!
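The generate-or-load pattern can be condensed into one helper. The sketch below is only an illustration under stated assumptions: `load_or_generate` and `fake_encode` are hypothetical names, and random unit vectors stand in for the output of `model.encode(...)`.

```python
import os
import tempfile
import numpy as np

def load_or_generate(path, n_rows, generate_fn):
    """Return cached embeddings if the file exists and has n_rows rows; else regenerate and save."""
    if os.path.exists(path):
        vectors = np.load(path)
        if len(vectors) == n_rows:
            return vectors, 'loaded'
    vectors = generate_fn()
    np.save(path, vectors)
    return vectors, 'generated'

def fake_encode(n_rows=5, dim=8):
    """Stand-in for model.encode(...): random unit vectors, for illustration only."""
    rng = np.random.default_rng(0)
    v = rng.normal(size=(n_rows, dim))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, 'candidate_embeddings.npy')
    _, first = load_or_generate(cache, 5, fake_encode)                     # no file yet
    _, second = load_or_generate(cache, 5, fake_encode)                    # cache hit
    _, third = load_or_generate(cache, 7, lambda: fake_encode(n_rows=7))   # row count changed
    print(first, second, third)   # generated loaded generated
```

A row-count mismatch triggers regeneration, mirroring the size check in the cells that follow.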

In [None]:
# ============================================================================
# EMBEDDING GENERATION + SAVE/LOAD PIPELINE
# ============================================================================

import os
from pathlib import Path

print("🧠 EMBEDDING PIPELINE")
print("=" * 80)
print()

# Ensure processed directory exists
Path(Config.PROCESSED_PATH).mkdir(parents=True, exist_ok=True)

# Define file paths
CAND_EMBEDDINGS_FILE = f'{Config.PROCESSED_PATH}candidate_embeddings.npy'
COMP_EMBEDDINGS_FILE = f'{Config.PROCESSED_PATH}company_embeddings.npy'
CAND_METADATA_FILE = f'{Config.PROCESSED_PATH}candidates_metadata.pkl'
COMP_METADATA_FILE = f'{Config.PROCESSED_PATH}companies_metadata.pkl'

# Check if embeddings already exist
cand_exists = os.path.exists(CAND_EMBEDDINGS_FILE)
comp_exists = os.path.exists(COMP_EMBEDDINGS_FILE)

print(f"📁 Checking for existing embeddings...")
print(f"   Candidates: {'✅ Found' if cand_exists else '❌ Not found'}")
print(f"   Companies: {'✅ Found' if comp_exists else '❌ Not found'}")
print()

# Load model
print(f"🔧 Loading embedding model: {Config.EMBEDDING_MODEL}")
model = SentenceTransformer(Config.EMBEDDING_MODEL)
embedding_dim = model.get_sentence_embedding_dimension()
print(f"✅ Model loaded! Dimension: {embedding_dim}\n")

In [None]:
# ============================================================================
# CANDIDATE EMBEDDINGS - Generate or Load
# ============================================================================

if cand_exists:
    print("📥 LOADING candidate embeddings from file...")
    cand_vectors = np.load(CAND_EMBEDDINGS_FILE)
    print(f"✅ Loaded: {cand_vectors.shape}")
    
    # Verify alignment
    if len(cand_vectors) != len(candidates):
        print(f"\n⚠️  WARNING: Size mismatch!")
        print(f"   Embeddings: {len(cand_vectors):,}")
        print(f"   Dataset: {len(candidates):,}")
        print(f"\n🔄 Regenerating...")
        cand_exists = False

if not cand_exists:
    print("🔄 GENERATING candidate embeddings...")
    print(f"   Processing {len(candidates):,} candidates...\n")
    
    # Build text representations
    def build_candidate_text(row):
        parts = []
        
        if row.get('Category'):
            parts.append(f"Job Category: {row['Category']}")
        
        if row.get('skills'):
            parts.append(f"Skills: {row['skills']}")
        
        if row.get('career_objective'):
            parts.append(f"Objective: {row['career_objective']}")
        
        if row.get('degree_names'):
            parts.append(f"Education: {row['degree_names']}")
        
        if row.get('positions'):
            parts.append(f"Experience: {row['positions']}")
        
        return ' '.join(parts)
    
    candidate_texts = candidates.apply(build_candidate_text, axis=1).tolist()
    
    # Generate embeddings
    cand_vectors = model.encode(
        candidate_texts,
        show_progress_bar=True,
        batch_size=32,
        normalize_embeddings=True,
        convert_to_numpy=True
    )
    
    # Save
    np.save(CAND_EMBEDDINGS_FILE, cand_vectors)
    candidates.to_pickle(CAND_METADATA_FILE)
    
    print(f"\n💾 Saved:")
    print(f"   {CAND_EMBEDDINGS_FILE}")
    print(f"   {CAND_METADATA_FILE}")

print(f"\n✅ CANDIDATE EMBEDDINGS READY")
print(f"   Shape: {cand_vectors.shape}")
print(f"   Dataset size: {len(candidates):,}")
print(f"   Alignment: {'✅ PERFECT' if len(cand_vectors) == len(candidates) else '❌ MISMATCH'}\n")

In [None]:
# ============================================================================
# COMPANY EMBEDDINGS - Generate or Load
# ============================================================================

if comp_exists:
    print("📥 LOADING company embeddings from file...")
    comp_vectors = np.load(COMP_EMBEDDINGS_FILE)
    print(f"✅ Loaded: {comp_vectors.shape}")
    
    # Verify alignment
    if len(comp_vectors) != len(companies_full):
        print(f"\n⚠️  WARNING: Size mismatch!")
        print(f"   Embeddings: {len(comp_vectors):,}")
        print(f"   Dataset: {len(companies_full):,}")
        print(f"\n🔄 Regenerating...")
        comp_exists = False

if not comp_exists:
    print("🔄 GENERATING company embeddings...")
    print(f"   Processing {len(companies_full):,} companies...")
    print(f"   IMPORTANT: Generated AFTER deduplication for alignment!\n")
    
    # Build text representations
    def build_company_text(row):
        parts = []
        
        if row.get('name'):
            parts.append(f"Company: {row['name']}")
        
        if row.get('description'):
            parts.append(f"Description: {row['description']}")
        
        if row.get('industries_list'):
            parts.append(f"Industries: {row['industries_list']}")
        
        if row.get('specialties_list'):
            parts.append(f"Specialties: {row['specialties_list']}")
        
        # Include job postings data (THE BRIDGE!)
        if row.get('required_skills'):
            parts.append(f"Required Skills: {row['required_skills']}")
        
        if row.get('posted_job_titles'):
            parts.append(f"Job Titles: {row['posted_job_titles']}")
        
        if row.get('experience_levels'):
            parts.append(f"Experience Levels: {row['experience_levels']}")
        
        return ' '.join(parts)
    
    company_texts = companies_full.apply(build_company_text, axis=1).tolist()
    
    # Generate embeddings
    comp_vectors = model.encode(
        company_texts,
        show_progress_bar=True,
        batch_size=32,
        normalize_embeddings=True,
        convert_to_numpy=True
    )
    
    # Save
    np.save(COMP_EMBEDDINGS_FILE, comp_vectors)
    companies_full.to_pickle(COMP_METADATA_FILE)
    
    print(f"\n💾 Saved:")
    print(f"   {COMP_EMBEDDINGS_FILE}")
    print(f"   {COMP_METADATA_FILE}")

print(f"\n✅ COMPANY EMBEDDINGS READY")
print(f"   Shape: {comp_vectors.shape}")
print(f"   Dataset size: {len(companies_full):,}")
print(f"   Alignment: {'✅ PERFECT' if len(comp_vectors) == len(companies_full) else '❌ MISMATCH'}\n")

In [None]:
# ============================================================================
# FINAL VERIFICATION
# ============================================================================

print("🔍 FINAL ALIGNMENT CHECK")
print("=" * 80)
print()

print(f"📊 CANDIDATES:")
print(f"   Dataset rows: {len(candidates):,}")
print(f"   Embedding vectors: {len(cand_vectors):,}")
print(f"   Status: {'✅ ALIGNED' if len(candidates) == len(cand_vectors) else '❌ MISALIGNED'}")
print()

print(f"📊 COMPANIES:")
print(f"   Dataset rows: {len(companies_full):,}")
print(f"   Embedding vectors: {len(comp_vectors):,}")
print(f"   Status: {'✅ ALIGNED' if len(companies_full) == len(comp_vectors) else '❌ MISALIGNED'}")
print()

if len(candidates) == len(cand_vectors) and len(companies_full) == len(comp_vectors):
    print("🎯 PERFECT ALIGNMENT! Ready for matching!")
    print("\n💡 Next runs will LOAD embeddings (fast!)")
else:
    print("⚠️  ALIGNMENT ISSUE DETECTED")
    print("   Delete .npy files and regenerate")

print("\n" + "=" * 80)
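A row-count comparison cannot detect cached embeddings that were built from different text with the same number of rows, for example after the company text changes through enrichment or the rows are reordered. One possible complement, sketched here as an assumption rather than as part of the pipeline, is to store a fingerprint of the embedded texts next to the `.npy` file and compare it on load. `text_fingerprint` is a hypothetical helper.

```python
import hashlib

def text_fingerprint(texts):
    """Order-sensitive SHA-256 digest of the texts that were embedded."""
    digest = hashlib.sha256()
    for text in texts:
        digest.update(text.encode('utf-8'))
        digest.update(b'\x00')   # separator so ['ab', 'c'] differs from ['a', 'bc']
    return digest.hexdigest()

saved = text_fingerprint(['Company: A', 'Company: B'])       # stored alongside the embeddings
same = text_fingerprint(['Company: A', 'Company: B'])        # unchanged dataset
reordered = text_fingerprint(['Company: B', 'Company: A'])   # same length, different order

print(saved == same)        # True
print(saved == reordered)   # False
```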