Getting Answers from a Big PDF with RubyLLM

Some API vendors give you an API doc in a giant custom-edited PDF file. In my case it’s >1200 pages, with a “helpful” table of contents that itself spans about 20 pages.

Well, I dislike reading giant PDF docs and love writing Ruby. There’s an awesome RubyLLM gem, and Gemini supports PDF parsing, so maybe I can just throw together a quick CLI tool that answers questions for me? Alas, Gemini is limited to 1000 pages. Either way, it would probably be too wasteful to send the entire doc with every question. RubyLLM supports tools, so I decided to try that out instead.

Reading PDF Text Locally

My doc is mostly text, and there aren’t any pictures in it that I care about, so this part is easy. A quick search turned up a gem called pdf-reader. Perfect for a tool.

bin/ask_api_doc

#!/usr/bin/env ruby

require 'ruby_llm'
require 'pdf-reader'

class PdfPageReader < RubyLLM::Tool
  DOC = PDF::Reader.new('docs/big-doc.pdf')

  description 'Read the text of any set of pages from the doc.'
  param :page_numbers,
    desc: 'Comma-separated page numbers (first page: 1). (e.g. "12, 14, 15")'

  def execute(page_numbers:)
    puts "\n-- Reading pages: #{page_numbers}\n\n"
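    # Parse "12, 14, 15" into [12, 14, 15].
    page_numbers = page_numbers.split(',').map { _1.strip.to_i }
    # Pages are 1-based. Guard against 0 or negative numbers, which would
    # otherwise index from the end of the array (0 - 1 => -1 => last page).
    pages = page_numbers.map { [_1, _1.positive? ? DOC.pages[_1 - 1] : nil] }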
    {
      pages: pages.map { |num, p|
        # There are lines drawn with dots in my doc.
        # So I squeeze them to save tokens.
        { page: num, text: p&.text&.squeeze('.') }
      }
    }
  rescue => e
    { error: e.message }
  end
end

Now my LLM can use the tool to extract text from any page.
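To see what the tool does to its input without loading a PDF, here is a minimal sketch of the two string steps in isolation. The helper names `parse_page_numbers` and `compact_leaders` are made up for this illustration and are not part of the script above.

```ruby
# Hypothetical helper: mirrors the parsing step in PdfPageReader#execute.
# Turns a comma-separated string into an array of integers.
def parse_page_numbers(raw)
  raw.split(',').map { |s| s.strip.to_i }
end

# Hypothetical helper: mirrors the dot-squeezing step.
# String#squeeze('.') collapses every run of dots into a single dot.
def compact_leaders(text)
  text.squeeze('.')
end

p parse_page_numbers('12, 14, 15')           # => [12, 14, 15]
p compact_leaders('Statuses ........ 1123')  # => "Statuses . 1123"
```

Keep in mind that `to_i` turns non-numeric input into 0, and that `squeeze('.')` also shortens legitimate runs of dots such as an ellipsis, which is fine for my doc but may not be for yours.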

And We’re Basically Done

Unlike “draw the rest of the owl”, the rest of the code is actually pretty straightforward (goes after the above):

# Grab key from my 1Password. Backticks keep the trailing newline, so strip it.
GEMINI_API_KEY = `op read "op://Private/Google Gemini API Personal/credential"`.strip

RubyLLM.configure do |config|
  config.gemini_api_key = GEMINI_API_KEY
end

chat =
  RubyLLM
    .chat(model: 'gemini-2.5-pro-preview-03-25') # Pick a model.
    .with_tool(PdfPageReader.new) # Add the tool.
    .with_instructions(<<~TEXT) # Add general instructions.
      Use provided tool to find requested info in the multi-page doc. Ask for
      multiple pages at a time to avoid roundtrips.

      Respond only with results of your findings. Don't do ascii tables, I prefer
      text and bullet points.

      To find info, use table of contents. Make sure you scan the full table of
      contents before you give up. Don't go to irrelevant parts of the doc unless
      absolutely needed.

      Total number of pages: 1249
      Table of contents is on pages: 31-49
    TEXT

response = chat.ask(ARGV.join(' ')) { |chunk|
  print chunk.content
}

# Some stats at the end
puts "\n\n-----------\n"
puts "Input tokens: #{response.input_tokens}"
puts "Output tokens: #{response.output_tokens}"
puts "Total tokens: #{response.input_tokens.to_i + response.output_tokens.to_i}"

That’s it.

Now I can ask a question and sit back, watching the LLM scan the table of contents, read the relevant pages, and spit out a tailored response. Pretty nice!

(Below is just sample output, not what’s really in my doc.)

❯ bin/ask_api_doc "what are all available statuses?"

-- Reading pages: 31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49

-- Reading pages: 1123

The available statuses are:
- `ACTIVE`: The default status for a new object.
- `INACTIVE`: The object is inactive and cannot be used.
- `PENDING`: The object is pending approval or activation.
- `ARCHIVED`: The object has been archived and is no longer active.
- `DELETED`: The object has been deleted and cannot be recovered.
- `SUSPENDED`: The object has been suspended and cannot be used.
- `EXPIRED`: The object has expired and is no longer valid.

-----------
Input tokens: 95288
Output tokens: 643
Total tokens: 95931

I bet there are more involved “talk to your docs” solutions out there, but this was quick and easy, and I can tweak it as needed. Speaking of which, let me know if you have any ideas for improving this.
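One possible tweak, sketched here as an assumption rather than something the script above does: in the sample run the model spells out every page from 31 to 49, so the page parameter could also accept ranges like "31-49". The helper name `expand_page_spec` is hypothetical.

```ruby
# Hypothetical helper: expands a page spec such as "31-33, 1123"
# into individual page numbers. Plain numbers pass through unchanged.
def expand_page_spec(spec)
  spec.split(',').flat_map do |part|
    part = part.strip
    if (m = part.match(/\A(\d+)\s*-\s*(\d+)\z/))
      # "31-33" becomes [31, 32, 33]; a reversed range yields [].
      (m[1].to_i..m[2].to_i).to_a
    else
      [part.to_i]
    end
  end
end

p expand_page_spec('31-33, 1123') # => [31, 32, 33, 1123]
```

If you try this, the tool's `desc:` would need to mention the range syntax so the model knows it is allowed.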

Update (2025-05-26): Since I wrote this, I slightly extended it with a search tool based on pdfgrep:

require 'shellwords' # Provides String#shellescape, used below.

class PdfPageSearch < RubyLLM::Tool
  DOC_PATH = 'docs/big-doc.pdf'
  description 'Get page numbers by a PCRE regular expression.'
  param :regex, desc: 'PCRE Regular expression to search by, case insensitive.'

  def execute(regex:)
    command = "pdfgrep --color never -inP #{regex.shellescape} #{DOC_PATH}"
    puts "\n-- Running: #{command}\n\n"
    output = `#{command}`
    pages = output.split("\n").map { _1.split(':').first.to_i }.uniq
    puts "\n-- Found results on: #{pages.size} page(s)\n\n"
    { pages: pages }
  rescue => e
    { error: e.message }
  end
end

and added it to RubyLLM like this:

chat =
  RubyLLM
    .chat(model: 'gemini-2.5-pro-preview-03-25')
    .with_tool(PdfPageReader.new)
    .with_tool(PdfPageSearch.new) # <------- HERE
    .with_instructions(<<~TEXT)
      ...
    TEXT

I also switched from Google Gemini to OpenAI o3. Together, these changes considerably improved the search performance.
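For the curious, the page-number extraction in PdfPageSearch can be tried on its own. The sketch below runs the same split/uniq chain on made-up text, assuming that `pdfgrep -n` on a single file prints each match as `<page>:<matching line>`. Both the sample output and the helper name `pages_from_pdfgrep` are invented for illustration.

```ruby
# Made-up sample of `pdfgrep -n` output for a single file.
sample_output = <<~OUT
  1123:  Status values: ACTIVE, INACTIVE
  1123:  The default status is ACTIVE
  1140:  See also: status transitions
OUT

# Hypothetical helper: same logic as in PdfPageSearch#execute.
# Takes the text before the first colon on each line as the page number
# and removes duplicates, since one page can match several times.
def pages_from_pdfgrep(output)
  output.split("\n").map { |line| line.split(':').first.to_i }.uniq
end

p pages_from_pdfgrep(sample_output) # => [1123, 1140]
```

Returning only page numbers keeps the tool result tiny, and the model can then fetch the pages it cares about with PdfPageReader.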