Natural conversation includes interruptions and people talking over each other, which is hard for an LLM to model as a single autoregressive sequence. I’m sure you can get pretty far by creating a text sequence with movie-script-like breaks mid-sentence, but it seems like the real solution would involve parallel streams of listening and thinking, with talking queued up for pauses or rising to an interruption priority. Intermixing tokens from different streams and doing something custom with attention seems plausible.
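
To make the idea a bit more concrete, here is a rough Python sketch of the two pieces described above: merging timestamped tokens from separate streams into one sequence, and a speech queue that waits for a pause unless something is urgent enough to interrupt. Everything in it (`Token`, `interleave`, `SpeechQueue`, the `interrupt_threshold` value, the stream names) is hypothetical and only for illustration. It does not touch the attention side at all, and it is not a claim about how any real model works.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Token:
    time: float                           # when the token was produced (seconds)
    stream: str = field(compare=False)    # hypothetical stream tag, e.g. "listen" or "think"
    text: str = field(compare=False)


def interleave(*streams):
    """Merge per-stream token lists (each already sorted by time) into one sequence."""
    return list(heapq.merge(*streams))


@dataclass(order=True)
class Utterance:
    neg_priority: float                   # negated so the min-heap pops the highest priority first
    text: str = field(compare=False)


class SpeechQueue:
    """Holds things the model wants to say until there is a pause, or they are urgent."""

    def __init__(self, interrupt_threshold=0.8):  # assumed cutoff, chosen arbitrarily
        self.interrupt_threshold = interrupt_threshold
        self._heap = []

    def push(self, text, priority):
        heapq.heappush(self._heap, Utterance(-priority, text))

    def next_to_speak(self, other_is_talking):
        """Return the next utterance to say now, or None to keep listening."""
        if not self._heap:
            return None
        top_priority = -self._heap[0].neg_priority
        # While the other party is talking, only speak if it is worth interrupting.
        if other_is_talking and top_priority < self.interrupt_threshold:
            return None
        return heapq.heappop(self._heap).text


listen = [Token(0.0, "listen", "so"), Token(0.4, "listen", "anyway")]
think = [Token(0.2, "think", "<recall>")]
sequence = interleave(listen, think)

queue = SpeechQueue()
queue.push("nice point", 0.3)
queue.push("wait, stop!", 0.9)
```

With this setup, `queue.next_to_speak(other_is_talking=True)` hands back the high-priority interruption right away, while the low-priority remark stays queued until a call with `other_is_talking=False`.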