GPT-4o · Pinecone · tree-sitter

Ask your codebase anything.

Semantic search over yt-dlp's 120,000-line source. Ask in plain English. DevLens retrieves the exact functions and returns cited, grounded answers.

Under the hood

How DevLens works.

A three-stage pipeline from raw source code to cited, grounded answers, built on production-grade components.

Step 01

Ingest & Parse

Every Python file is parsed with tree-sitter to build a concrete syntax tree. Top-level functions and classes are extracted as self-contained semantic chunks, each with its full source and exact line range.
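
The chunking step can be sketched in a few lines. This is a minimal illustration, not the DevLens implementation: it swaps tree-sitter for Python's standard-library `ast` module so it runs with no dependencies, and the `Chunk` type, `chunk_top_level` function, and sample code are all hypothetical names chosen for the example.

```python
import ast
from dataclasses import dataclass


@dataclass
class Chunk:
    name: str        # function or class name
    kind: str        # "function" or "class"
    start_line: int  # 1-indexed, inclusive
    end_line: int    # 1-indexed, inclusive
    source: str      # full source text of the definition


def chunk_top_level(source: str) -> list[Chunk]:
    """Extract top-level functions and classes as self-contained chunks."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:  # top-level statements only
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            kind = "function"
        elif isinstance(node, ast.ClassDef):
            kind = "class"
        else:
            continue
        # Start at the first decorator, if any, so the chunk is complete.
        start = min([node.lineno] + [d.lineno for d in node.decorator_list])
        end = node.end_lineno
        text = "\n".join(lines[start - 1:end])
        chunks.append(Chunk(node.name, kind, start, end, text))
    return chunks


# Hypothetical sample file, used only for illustration.
SAMPLE = '''def sanitize(name):
    """Strip surrounding whitespace from a filename."""
    return name.strip()


class Downloader:
    def fetch(self, url):
        return url
'''

for c in chunk_top_level(SAMPLE):
    print(c.kind, c.name, c.start_line, c.end_line)
```

Each chunk carries its own line range, which is what later makes line-number citations possible.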

tree-sitter Python AST Chunking

Step 02

Embed & Index

Each chunk is encoded into a 1,536-dimensional vector using OpenAI's text-embedding-3-small. Vectors are stored in Pinecone for low-latency nearest-neighbour retrieval, which matches by semantic similarity rather than by keyword.
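
To show what nearest-neighbour retrieval by similarity means, here is a small in-memory sketch. It is an assumption-laden stand-in: `ToyIndex` is a hypothetical class, not Pinecone's client API, the search is brute force, and the 3-dimensional vectors are made up in place of real 1,536-dimensional embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ToyIndex:
    """In-memory stand-in for a vector store (hypothetical, for illustration)."""

    def __init__(self):
        self._items = {}  # chunk id -> (vector, metadata)

    def upsert(self, chunk_id, vector, metadata):
        self._items[chunk_id] = (vector, metadata)

    def query(self, vector, top_k):
        # Score every stored vector, then keep the top_k most similar.
        scored = [
            (cosine(vector, v), chunk_id, meta)
            for chunk_id, (v, meta) in self._items.items()
        ]
        scored.sort(key=lambda item: item[0], reverse=True)
        return scored[:top_k]


index = ToyIndex()
# Made-up toy vectors; real ones would come from the embedding model.
index.upsert("sanitize", [0.9, 0.1, 0.0], {"path": "example/utils.py", "lines": (1, 3)})
index.upsert("Downloader.fetch", [0.1, 0.9, 0.2], {"path": "example/dl.py", "lines": (7, 9)})
index.upsert("Downloader.retry", [0.0, 0.8, 0.6], {"path": "example/dl.py", "lines": (11, 15)})

hits = index.query([0.2, 0.9, 0.1], top_k=2)
for score, chunk_id, meta in hits:
    print(round(score, 3), chunk_id, meta["path"], meta["lines"])
```

Storing the file path and line range as metadata next to each vector means a hit can be traced back to its source without a second lookup.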

OpenAI Embeddings Pinecone 1536-dim

Step 03

Retrieve & Answer

Your question is embedded in real time. The top‑8 most similar chunks are passed to GPT-4o as grounding context. GPT-4o then synthesises an answer backed exclusively by the retrieved source code, with file path and line number citations.
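
One way to picture the grounding step is as prompt assembly. The sketch below only builds the text that would be sent to the model. The `build_prompt` function, the dictionary keys, the instruction wording, and the sample chunk are all assumptions for illustration, not the actual DevLens prompt.

```python
def build_prompt(question, retrieved):
    """Format retrieved chunks as numbered, citable context for the model."""
    blocks = []
    for i, chunk in enumerate(retrieved, start=1):
        # The header gives the model the exact citation to reuse.
        header = (
            f"[{i}] {chunk['path']}:{chunk['start_line']}-{chunk['end_line']} "
            f"({chunk['name']})"
        )
        blocks.append(f"{header}\n{chunk['source']}")
    context = "\n\n".join(blocks)
    return (
        "Answer the question using only the source code below. "
        "Cite every claim as path:start-end. "
        "If the code does not contain the answer, say so.\n\n"
        f"{context}\n\nQuestion: {question}"
    )


# Hypothetical retrieved chunk, used only for illustration.
retrieved = [
    {
        "path": "example/utils.py",
        "name": "sanitize",
        "start_line": 1,
        "end_line": 3,
        "source": "def sanitize(name):\n    return name.strip()",
    }
]

prompt = build_prompt("How are filenames cleaned?", retrieved)
print(prompt)
```

Because every chunk arrives with its path and line range in the header, the model can cite locations it was given rather than guess them.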

GPT-4o RAG Top-K Retrieval

Pipeline diagram: Source Files (.py, 120k lines) → tree-sitter (AST chunks) → OpenAI API (embed-3-small) → Pinecone (vector store) → GPT-4o (answer + cite)

FYP · NUCES

The research.

The academic context behind DevLens — the problem it solves, the research gap it addresses, and the scope of what was built and evaluated.

Problem Statement

The Cost of Comprehension

58–70% of developer time is spent reading existing code, not writing new code

Developers spend the majority of their working time not writing code, but trying to understand it. When a new developer joins a team, weeks can pass before they are productive. Today’s AI coding tools compound this with a new cost: every query to a cloud AI assistant burns paid API tokens. Those costs multiply across every developer, every onboarding cycle, every codebase. DevLens addresses this by ingesting a codebase once into a vector database and answering questions against it, so the cost of indexing the code is paid once and each query sends only the retrieved chunks to the model.

Token Cost Onboarding RAG

Gap Reduction

Beyond Prior Work

Prior research on RAG for code — most recently Zhang et al. (EMNLP 2025) — evaluates only automated code generation. No prior work applies AST-based chunking to the developer comprehension problem. DevLens also preserves two pieces of information prior work discards: the class a function belongs to, and the plain-English docstring that describes what it does.

Use Case: Developer Q&A, not automated code generation
Parent Context: Dot-qualified naming preserves class membership per chunk
NL in Embeddings: Docstrings included within AST chunk boundaries
cAST EMNLP 2025 Docstrings
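
The two preserved signals can be illustrated as follows. This is a sketch under stated assumptions: it uses the standard-library `ast` module rather than tree-sitter, `chunk_texts` is a hypothetical helper, and putting the qualified name on the first line of the chunk text is one possible way to attach it, not necessarily the one DevLens uses.

```python
import ast

FUNC_TYPES = (ast.FunctionDef, ast.AsyncFunctionDef)


def chunk_texts(source):
    """Return (qualified_name, text) pairs for functions and class methods."""
    results = []
    for node in ast.parse(source).body:
        if isinstance(node, FUNC_TYPES):
            results.append((node.name, node))
        elif isinstance(node, ast.ClassDef):
            for item in node.body:
                if isinstance(item, FUNC_TYPES):
                    # Prefix the method with its class so membership survives chunking.
                    results.append((f"{node.name}.{item.name}", item))
    # The docstring sits inside the definition, so it travels with the chunk.
    return [
        (name, f"{name}\n{ast.get_source_segment(source, node)}")
        for name, node in results
    ]


# Hypothetical sample file, used only for illustration.
SAMPLE = '''def sanitize(name):
    """Strip surrounding whitespace from a filename."""
    return name.strip()


class Downloader:
    def fetch(self, url):
        """Download the resource at url."""
        return url
'''

for name, text in chunk_texts(SAMPLE):
    print(name)
    print(text)
    print("---")
```

A bare `fetch` chunk would be ambiguous across classes, while `Downloader.fetch` with its docstring gives the embedding both the owner and a plain-English description.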

Project Scope

What Was Built

DevLens ingests Python codebases using Abstract Syntax Tree parsing, stores semantically complete code chunks in a vector database, and answers developer questions with source references. The project includes a formal evaluation study comparing AST-based chunking against two naive baselines using LLM-as-judge methodology on a benchmark of natural language questions about a real-world codebase.
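
The page does not say which two naive baselines were used. As an assumed example of one, fixed-size line windows are a common choice, and the sketch below shows why they can hurt: a definition that straddles a window boundary ends up split across chunks. The function names and sample code are hypothetical.

```python
import ast


def fixed_windows(source, size):
    """Naive baseline: cut the file into consecutive windows of `size` lines."""
    line_count = len(source.splitlines())
    return [
        (start + 1, min(start + size, line_count))
        for start in range(0, line_count, size)
    ]


def split_definitions(source, size):
    """Names of top-level definitions that do not fit inside a single window."""
    windows = fixed_windows(source, size)
    split = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            fits = any(s <= node.lineno and node.end_lineno <= e for s, e in windows)
            if not fits:
                split.append(node.name)
    return split


# Hypothetical sample file, used only for illustration.
SAMPLE = '''def sanitize(name):
    """Strip surrounding whitespace from a filename."""
    return name.strip()


class Downloader:
    def fetch(self, url):
        return url
'''

print(fixed_windows(SAMPLE, 3))
print(split_definitions(SAMPLE, 3))
```

AST-based chunks avoid this by construction, since each chunk boundary is a definition boundary.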

In Scope

  • AST ingestion pipeline
  • ChromaDB vector store
  • FastAPI query backend
  • Evaluation study (20–30 natural-language questions)
  • Web interface

Future Work

  • Multi-language support
  • IDE integration
  • Cloud deployment
ChromaDB FastAPI LLM-as-judge

Research Positioning

Where DevLens sits relative to prior work on two key axes
Figure: positioning chart with two axes, code-specificity (low to high) and developer Q&A focus (low to high). Plotted works: Lewis 2020 (foundational RAG, NLP); Gao 2023 (RAG survey); Zhang cAST 2025 (code generation); DevLens (FYP, 2025), placed in the target zone.