ArXivCode: Bridging Theory and Implementation in AI Research

Published: October 15, 2025

Overview

ArXivCode is a semantic code search engine that bridges the gap between academic research and practical implementation. Users can search for theoretical concepts from arXiv papers and retrieve the corresponding code implementations with contextual explanations, shortening the path from paper to practice.

Developed for COMS 6998: LLM-Based Generative AI Systems (Fall 2025) at Columbia University.

Technical Approach

CodeBERT Embeddings: 768-dimensional semantic vectors for deep code understanding
Hybrid Retrieval: Optimized 60/40 weighting of semantic similarity and keyword matching
Dataset: Curated corpus of 2,490 code snippets extracted from 196 ML/AI research papers
Pipeline: Custom PDF parsing → code extraction → embedding generation → retrieval ranking
Stack: Python, Streamlit, CodeBERT (microsoft/codebert-base), Google Cloud Run
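The hybrid retrieval step above can be illustrated with a minimal sketch. It assumes the 60/40 split means 60% semantic similarity and 40% keyword matching, that semantic similarity is cosine similarity between embedding vectors, and that keyword matching is the fraction of query terms found in a snippet. The function names (`cosine_similarity`, `keyword_score`, `hybrid_score`, `rank`) are hypothetical and not taken from the project's code; short toy vectors stand in for the 768-dimensional CodeBERT embeddings.

```python
import math
import re

# Assumed reading of the 60/40 weighting: 60% semantic, 40% keyword.
SEMANTIC_WEIGHT = 0.6
KEYWORD_WEIGHT = 0.4


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def tokenize(text):
    """Lowercase the text and split it into a set of alphanumeric/underscore tokens."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def keyword_score(query, document):
    """Fraction of distinct query tokens that also appear in the document."""
    query_tokens = tokenize(query)
    if not query_tokens:
        return 0.0
    return len(query_tokens & tokenize(document)) / len(query_tokens)


def hybrid_score(query, query_vec, doc_text, doc_vec):
    """Weighted sum of the semantic and keyword scores."""
    semantic = cosine_similarity(query_vec, doc_vec)
    keyword = keyword_score(query, doc_text)
    return SEMANTIC_WEIGHT * semantic + KEYWORD_WEIGHT * keyword


def rank(query, query_vec, corpus):
    """Rank (doc_text, doc_vec) pairs by hybrid score, best first."""
    scored = [
        (hybrid_score(query, query_vec, text, vec), text) for text, vec in corpus
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Calling `rank(query, query_vec, corpus)` with a list of `(snippet_text, snippet_vector)` pairs returns `(score, snippet_text)` tuples sorted from most to least relevant. A weighted sum like this lets exact identifier matches lift a result even when its embedding is only moderately close to the query.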

Key Features

Semantic Understanding: Finds implementations even when query terms don’t match code literally
Context-Aware Results: Returns code snippets with paper metadata and explanations
Cross-Paper Discovery: Identifies similar implementations across different research works
Scalable Architecture: Cloud-deployed for reliable access and future dataset expansion
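Cross-paper discovery can be sketched as a nearest-neighbor lookup that excludes snippets from the query snippet's own paper. This is an illustrative assumption about how such a feature could work, not the project's actual code; the `Snippet` class, its fields, and `similar_across_papers` are hypothetical names, and the vectors stand in for CodeBERT embeddings.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Snippet:
    paper_id: str        # hypothetical identifier for the source paper
    code: str            # the extracted code text
    vector: List[float]  # embedding of the snippet


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similar_across_papers(target, snippets, top_k=3):
    """Return the top_k snippets most similar to target that come from other papers."""
    candidates = [s for s in snippets if s.paper_id != target.paper_id]
    candidates.sort(key=lambda s: cosine(target.vector, s.vector), reverse=True)
    return candidates[:top_k]
```

Filtering by `paper_id` before ranking keeps a paper's own near-duplicate snippets from crowding out related implementations in other works.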

Results & Impact

Deployed a publicly accessible production system on Google Cloud Run for the research community
Achieved effective retrieval across diverse ML domains (transformers, CNNs, RL, optimization)
Hybrid approach outperformed both the semantic-only and keyword-only baselines
Demonstrates practical application of transformer-based code understanding at scale

What I Learned

Implementing semantic search using pre-trained language models specialized for code
Balancing semantic embeddings with traditional keyword matching for robust retrieval
Architecting end-to-end ML systems from data ingestion through production deployment
Extracting and processing code from academic PDFs while preserving context
Optimizing retrieval systems for both relevance and interpretability

Demo

Live at arxivcode-frontend-215017069058.us-central1.run.app

Status: Completed

Timeline: Sep–Dec 2025


Pranati Modumudi