Inspired by _DavidSmith's Podcast Search, I transcribed every episode of
the Hello Internet podcast using whisper.cpp. After cleaning up the
transcripts into well-formatted sentences, I grouped consecutive related sentences into
'chunks', using this approach. I then created embeddings for
each chunk using OpenAI's text-embedding-ada-002.
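
For illustration, the embedding step can be sketched roughly as below using OpenAI's Node SDK; the `Chunk` shape and the `embedChunks` helper are assumptions for the example, not the exact code behind this site.

```typescript
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Assumed chunk shape: a few consecutive, related sentences plus metadata.
interface Chunk {
  episode: string;
  startTime: number; // seconds into the episode
  text: string;
}

// Embed a batch of chunks; text-embedding-ada-002 returns a
// 1536-dimensional vector for each input string.
async function embedChunks(chunks: Chunk[]): Promise<number[][]> {
  const response = await openai.embeddings.create({
    model: "text-embedding-ada-002",
    input: chunks.map((c) => c.text),
  });
  return response.data.map((d) => d.embedding);
}
```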
These are stored in a Postgres database with the pgvector extension. When you enter a search query, an embedding is generated for that query and the database
is searched for the most similar chunks. Notably, this works well even when your query doesn't
contain the exact words found in the transcript.
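
Query time in this original Postgres setup works along these lines; the `chunks` table and its columns below are placeholders rather than the real schema.

```typescript
import OpenAI from "openai";
import { Pool } from "pg";

const openai = new OpenAI();
const pool = new Pool(); // connection settings come from the PG* environment variables

async function search(query: string, limit = 10) {
  // Embed the query with the same model used for the chunks.
  const { data } = await openai.embeddings.create({
    model: "text-embedding-ada-002",
    input: query,
  });
  const queryVector = `[${data[0].embedding.join(",")}]`;

  // pgvector's `<=>` operator orders rows by cosine distance to the query vector,
  // so the closest (most similar) chunks come back first.
  const { rows } = await pool.query(
    `SELECT episode, start_time, text
       FROM chunks
      ORDER BY embedding <=> $1::vector
      LIMIT $2`,
    [queryVector, limit]
  );
  return rows;
}
```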
Update:
- I re-chunked the episodes using an approach similar to LLMChunker.
- I contextualised the chunks using Anthropic's Contextual Retrieval approach.
- Google's text-embedding-004 is now used for embeddings.
- Cloudflare Vectorize and KV are used for storage, not Postgres. This is fast and free.
- There is now an LLM (currently Gemini 1.5 Flash) that re-ranks the most relevant chunks and provides a helpful written response (a rough sketch of the updated query path follows this list).
- I hope to integrate BM25 for better ranking in the future.
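
Putting the updated pieces together, the query path now looks something like the Cloudflare Worker sketch below. The binding names (`VECTORIZE`, `CHUNKS_KV`), the prompt wording, and the response handling are illustrative assumptions, not the production code.

```typescript
export interface Env {
  VECTORIZE: VectorizeIndex; // Vectorize index holding the chunk embeddings
  CHUNKS_KV: KVNamespace;    // chunk id -> contextualised chunk text
  GOOGLE_API_KEY: string;
}

const GOOGLE_API = "https://generativelanguage.googleapis.com/v1beta";

// Embed the query with text-embedding-004 via the Generative Language REST API.
async function embedQuery(query: string, apiKey: string): Promise<number[]> {
  const res = await fetch(
    `${GOOGLE_API}/models/text-embedding-004:embedContent?key=${apiKey}`,
    {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ content: { parts: [{ text: query }] } }),
    }
  );
  const json = (await res.json()) as { embedding: { values: number[] } };
  return json.embedding.values;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const query = new URL(request.url).searchParams.get("q") ?? "";

    // 1. Embed the query and pull the nearest chunks from Vectorize.
    const vector = await embedQuery(query, env.GOOGLE_API_KEY);
    const { matches } = await env.VECTORIZE.query(vector, { topK: 10 });

    // 2. Look up the chunk text in KV (Vectorize holds ids and vectors; KV holds the text).
    const chunks = (
      await Promise.all(matches.map((m) => env.CHUNKS_KV.get(m.id)))
    ).filter((c): c is string => c !== null);

    // 3. Ask Gemini 1.5 Flash to re-rank the chunks and write a short answer.
    const prompt =
      `Query: ${query}\n\nChunks:\n${chunks.join("\n---\n")}\n\n` +
      "Rank the chunks by relevance to the query and write a brief, helpful answer.";
    const gemini = await fetch(
      `${GOOGLE_API}/models/gemini-1.5-flash:generateContent?key=${env.GOOGLE_API_KEY}`,
      {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ contents: [{ parts: [{ text: prompt }] }] }),
      }
    );
    return new Response(await gemini.text(), {
      headers: { "Content-Type": "application/json" },
    });
  },
};
```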