I have a large set of phrases obtained via Azure Fast Transcription, and I need to group them into coherent semantic chunks (to use later in a RAG pipeline).
Initially, I tried grouping phrases by speaker pauses (merging consecutive phrases when the pause between them is below a fixed threshold), but this approach doesn’t generalize: pause patterns differ a lot between speakers (some pause for 0.5 s, others for 2 s), even within the same recording, so no single threshold works.
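For reference, here is a minimal sketch of that pause-based merge. The field names (`text`, `start`, `end`, all in seconds), the function name `merge_by_pause`, and the `max_pause` value are simplified placeholders for illustration, not the actual Azure response schema or my exact code.

```python
from typing import Dict, List


def merge_by_pause(phrases: List[Dict], max_pause: float = 1.0) -> List[Dict]:
    """Merge consecutive phrases whose silence gap is below max_pause seconds.

    Each phrase is assumed (hypothetically) to look like
    {"text": str, "start": float, "end": float}, sorted by start time.
    """
    chunks: List[Dict] = []
    for phrase in phrases:
        # Gap between the end of the current chunk and the start of this phrase.
        if chunks and phrase["start"] - chunks[-1]["end"] < max_pause:
            chunks[-1]["text"] += " " + phrase["text"]
            chunks[-1]["end"] = phrase["end"]
        else:
            # Copy so the input phrases are not modified.
            chunks.append(dict(phrase))
    return chunks


# Made-up sample data, only to show the behavior.
phrases = [
    {"text": "So the first topic is billing.", "start": 0.0, "end": 2.0},
    {"text": "Invoices go out monthly.", "start": 2.5, "end": 4.0},
    {"text": "Next, onboarding.", "start": 6.0, "end": 7.5},
]
chunks = merge_by_pause(phrases, max_pause=1.0)
```

With `max_pause=1.0` the first two phrases merge and the third starts a new chunk. The weakness is that a slower speaker who pauses 2 s mid-thought would be split at that same point, while raising the threshold merges unrelated topics for faster speakers.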
Due to project constraints, I can’t use complex NLP models or embeddings, so I’m looking for a lightweight or heuristic-based approach to merge consecutive phrases into semantically meaningful chunks.