PageIndex and the case against chunking
The most interesting thing about PageIndex is not the 98.7% accuracy claim. It is the architectural premise underneath it: that retrieval should be an act of reasoning, not similarity matching. That a language model navigating a document’s structure — section by section, like a human analyst flipping through a report — will find better answers than nearest-neighbour search over chunk embeddings ranked by cosine similarity.
This premise deserves serious examination. Not because PageIndex has “killed” vector RAG, as some corners of the internet have enthusiastically declared, but because it exposes a structural weakness in how the field has been doing retrieval — and because the evidence for its own alternative is more incomplete than the headline suggests.
What PageIndex actually does
The system, built by VectifyAI, works in two phases. First, it transforms a document into a hierarchical tree index — essentially a machine-generated table of contents where each node carries a title, page range, summary, and child nodes. The structure mirrors the document’s own organisation: sections, subsections, numbered items, appendices.
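In code, one node of such a tree might look like the sketch below. The field names are illustrative, chosen to match the description above rather than PageIndex’s exact schema:

```python
from dataclasses import dataclass, field

# Illustrative shape of one node in a hierarchical tree index.
# Field names are assumptions for exposition, not PageIndex's actual schema.
@dataclass
class TreeNode:
    title: str                       # section heading, e.g. "Item 7. MD&A"
    page_range: tuple[int, int]      # first and last page the section spans
    summary: str                     # short generated summary of the section
    children: list["TreeNode"] = field(default_factory=list)
```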
Second, at query time, an LLM traverses this tree layer by layer. Rather than computing similarity scores against embedded chunks, the model reasons about which branch of the tree is most likely to contain the answer, descends into it, evaluates what it finds, and continues until it reaches the relevant pages. Every step is traceable — which nodes were considered, which were selected, why.
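A minimal sketch of that traversal, reusing the TreeNode above. The choose_child helper is hypothetical: in PageIndex this step would be an LLM call over each child’s title and summary, but a crude keyword overlap keeps the sketch runnable:

```python
def choose_child(question: str, children: list[TreeNode]) -> TreeNode | None:
    # Stand-in for the per-layer LLM reasoning call: the real system would
    # prompt a model with the question plus each child's title and summary.
    q = set(question.lower().split())
    scored = [(len(q & set(f"{c.title} {c.summary}".lower().split())), c)
              for c in children]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best if best_score > 0 else None

def navigate(root: TreeNode, question: str) -> TreeNode:
    # Descend one layer at a time, one reasoning step per layer, until a leaf
    # is reached or no child looks relevant. Every step is recordable: which
    # children were considered, which was selected.
    node = root
    while node.children:
        chosen = choose_child(question, node.children)
        if chosen is None:
            break
        node = chosen
    return node  # read node.page_range to answer the question
```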
The claimed advantage is precision over similarity. When a question asks for a certification date, vector search might return a certifications table — topically related, semantically close, but functionally useless. Tree-based reasoning can navigate to the timeline section instead, because it understands the question’s intent in the context of the document’s structure.
Similarity finds content that looks like the answer. Reasoning finds content that is the answer. The distinction matters more than most benchmarks reveal.
The FinanceBench result, examined
VectifyAI’s Mafin 2.5, powered by PageIndex, achieved 98.7% accuracy on FinanceBench — a benchmark of financial questions answered from SEC filings. For context: GPT-4o alone scores roughly 31%. Perplexity reaches about 45%. Standard vector RAG pipelines land somewhere between 30% and 60%, depending on chunking strategy and retrieval configuration. The gap is not marginal.
Two things make this result credible. First, Mafin 2.5 was evaluated on 100% of the dataset, while some competitors test on only 66.7%. Second, the architectural explanation is sound — financial documents are deeply hierarchical, full of cross-references and structured tables that fixed-size chunking destroys. PageIndex preserves exactly the structure that vector pipelines discard.
But one thing makes this result insufficient as evidence for a general paradigm shift: FinanceBench tests single-document question answering. Each question targets one specific report. This is precisely the scenario PageIndex was designed for.
The scalability wall
This is where the honest analysis begins.
Alden Do Rosario, founder of CustomGPT.ai, ran PageIndex against a multi-document benchmark — Google’s SimpleQA-Verified dataset, 1,000 questions across 2,795 documents. The tree indexing could not scale. The system fell back to FAISS vector search — the very approach it claims to replace.
VectifyAI has been transparent about this. Their team acknowledged on X that PageIndex is currently designed for single long-document question answering, and that for more than five documents they support retrieval via “other customized techniques.” The open-source version uses sequential indexing that the team describes as a proof of concept, not an enterprise-ready system.
A proof of concept that achieves 98.7% on its target benchmark is impressive engineering. Presenting it as a general retrieval paradigm is a different claim entirely — and one the evidence does not yet support.
This is not a criticism of the project. It is a criticism of how the result has been received. The community’s eagerness to declare vector search dead reveals more about fatigue with the current paradigm than about the maturity of alternatives.
The metrics that are missing
Tao An made an observation that deserves more attention than it has received: after examining PageIndex’s GitHub repository, its documentation, and every other public source available, An found zero published data on latency, throughput, or cost per query.
This is not a minor documentation gap. The architecture requires multiple sequential LLM inference calls per retrieval — the model must load the tree structure, reason about which branch to follow, descend, evaluate, and potentially repeat. These calls cannot be parallelised. There is no apparent caching mechanism.
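To make the stakes concrete, a deliberately crude back-of-envelope helps. Every number below is an assumption, because no real figures exist to plug in:

```python
# Back-of-envelope cost of sequential tree descent versus one ANN lookup.
# Every figure here is a guess for illustration; PageIndex publishes none.
tree_depth = 4                 # layers to descend in a long, nested filing
seconds_per_llm_call = 2.0     # one reasoning call per layer, strictly sequential
seconds_per_ann_lookup = 0.05  # typical approximate-nearest-neighbour search

tree_latency = tree_depth * seconds_per_llm_call   # 8.0 seconds per query
slowdown = tree_latency / seconds_per_ann_lookup   # ~160x, if the guesses hold
print(f"{tree_latency:.1f}s per query, roughly {slowdown:.0f}x a vector lookup")
```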
Publishing 98.7% accuracy without corresponding efficiency data is a benchmark result, not an engineering evaluation. The difference matters for anyone considering production deployment.
For a system that explicitly trades efficiency for accuracy — PageIndex’s own documentation describes tree search as prioritising “accuracy over speed” — the absence of efficiency quantification makes informed architectural comparison impossible. At what cost does that accuracy come? Ten times the latency of vector search? A hundred times? We do not know, because nobody has published the numbers.
In research, we have a term for results presented without the methodology to evaluate their tradeoffs. We call them incomplete.
Where this sits in the landscape
PageIndex is not the first system to use tree structures for retrieval. RAPTOR (Sarthi et al., ICLR 2024) builds hierarchical trees through recursive clustering and summarisation — but bottom-up, from embeddings, rather than top-down from document structure. Microsoft’s GraphRAG constructs knowledge graphs through entity extraction and community detection — powerful for multi-hop reasoning, but with construction costs that scale painfully and no support for incremental updates.
What makes PageIndex architecturally distinct is the top-down approach: deriving the tree from the document’s own hierarchy rather than constructing one from embeddings or entity relationships. This is elegant for well-structured documents. It is also the source of the scalability constraint — you need a document with structure to preserve.
One Hacker News commenter put it well: “A function called setup() might mask something important… your input data structure could build bad summaries the LLM misses.” Poorly structured documents produce poor trees. The system’s greatest strength — structural fidelity — becomes a liability when the structure is absent or misleading.
The real contribution
Strip away the “RAG killer” rhetoric and PageIndex makes a genuine, specific contribution: it demonstrates that LLM reasoning over document structure can substantially outperform similarity search for within-document retrieval on structured professional documents.
That is a narrower claim than “vector databases are obsolete.” It is also a more defensible one.
The practical future, as Do Rosario suggests, likely involves hybrid architectures: vector retrieval for document discovery across large corpora, tree-based reasoning for precise extraction within the top candidates. This is not a revolutionary idea — it is how human researchers work. You search the library catalogue to find the book, then you use the table of contents to find the chapter.
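A sketch of that catalogue-then-contents pattern, reusing the navigate helper from earlier. The ann_index interface is assumed, standing in for any FAISS-style index rather than any real PageIndex or CustomGPT API:

```python
def hybrid_retrieve(question: str, ann_index,
                    trees: dict[str, TreeNode], k: int = 3):
    # Stage 1 (the catalogue): vector search over document-level embeddings
    # to shortlist candidate documents from a large corpus.
    doc_ids = ann_index.search(question, k)  # assumed interface, FAISS-style
    # Stage 2 (the table of contents): tree reasoning within each candidate,
    # reusing the navigate() sketch above for precise extraction.
    results = []
    for doc_id in doc_ids:
        leaf = navigate(trees[doc_id], question)
        results.append((doc_id, leaf.page_range))
    return results  # precise page ranges to hand to the generator
```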
Darbit argued recently that most RAG fails at the R, not the G. PageIndex is evidence for that thesis — not because it solves retrieval universally, but because it shows how much accuracy is left on the table when retrieval ignores document structure. The 98.7% result is not proof that tree reasoning is the future of RAG. It is proof that chunking-and-embedding is leaving significant performance on the floor for an important class of documents.
That should be enough to take seriously — without pretending it is more than it is.
The measure of a research contribution is not whether it replaces everything that came before. It is whether it changes how we think about the problem. PageIndex does that. The field should engage with the idea, not the hype.
Designing a retrieval architecture that fits your documents, your queries, and your constraints — not a benchmark? Interlusion builds RAG systems grounded in engineering tradeoffs, not hype cycles. Let’s talk.