Infrastructure · 2023–24

Conversation Topic Modeling & Auto-Labeling Pipeline (BERTopic + LLM)

An enterprise conversational-AI platform

Overview

A Python batch pipeline that mines chatbot conversation logs for topics: it clusters message embeddings with BERTopic and uses an Azure-hosted LLM to generate human-readable topic labels, writing results back to Firestore. It powers the conversation-analytics layer of the enterprise chatbot platform.

The Challenge

At enterprise scale, chatbots accumulate huge volumes of conversations. Product and customer teams need to know what users are actually asking about, but raw message logs are unstructured. This pipeline turns that firehose into discoverable, labeled topics for analytics.

What We Built

A containerized Python job (index.py, custom_label.py, poc_fixed_categories.py) reads each project's messages from Firestore in pages via firebase-admin, clusters the conversations into topics with BERTopic, and uses Azure OpenAI (gpt-35-turbo via the AzureOpenAI client) as the topic representation and labeling model. Calls to the LLM are wrapped in tenacity retries, and an .autolabel SQLite cache supports labeling. Results are written back to Firestore in 500-operation commit batches. The job is built from a Dockerfile and runs on Kubernetes as scheduled work (job.yaml, cronjob-main.yaml, cronjob-dev.yaml), with secrets managed via secret.yaml and a service-account key.
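The 500-operation commit batching can be sketched roughly as follows. This is a minimal illustration, not the project's code: the helper names (chunked, write_topics) and the "topics" collection name are hypothetical, and db is only assumed to behave like a firebase-admin Firestore client, exposing batch() and collection(name).document(id).

```python
from typing import Any, Iterable, Iterator, List, Tuple

# Commit size taken from the description above (500 operations per commit).
FIRESTORE_BATCH_LIMIT = 500


def chunked(items: Iterable[Any], size: int) -> Iterator[List[Any]]:
    """Yield consecutive lists of at most `size` items."""
    chunk: List[Any] = []
    for item in items:
        chunk.append(item)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:  # flush the final, possibly shorter, chunk
        yield chunk


def write_topics(
    db: Any,
    rows: Iterable[Tuple[str, dict]],
    collection: str = "topics",  # hypothetical collection name
) -> int:
    """Write (doc_id, data) pairs in batched commits; return the commit count.

    `db` is assumed to be Firestore-client-like: db.batch() returns an object
    with set(ref, data) and commit().
    """
    commits = 0
    for chunk in chunked(rows, FIRESTORE_BATCH_LIMIT):
        batch = db.batch()
        for doc_id, data in chunk:
            batch.set(db.collection(collection).document(doc_id), data)
        batch.commit()
        commits += 1
    return commits
```

Because rows can be any iterable, results can be streamed from a generator without holding a whole project's output in memory at once.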

Technologies & Approach

BERTopic handles unsupervised topic discovery over conversation embeddings, and an LLM (Azure OpenAI GPT-3.5) generates readable cluster labels: a hybrid classical-ML + LLM approach. Firestore is both source and sink, and Kubernetes CronJobs make it a recurring, hands-off analytics process.
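The labeling step, with a cache in front of the LLM and retries around it, might look like the sketch below. It is an illustration under stated assumptions rather than the pipeline's implementation: the production job uses tenacity for retries, whereas this version substitutes a plain standard-library backoff loop so it runs on its own. The table schema, the keyword-hash cache key, and the generate callable (standing in for the Azure OpenAI call) are all hypothetical.

```python
import hashlib
import sqlite3
import time
from typing import Callable, Sequence


def cache_key(keywords: Sequence[str]) -> str:
    """Stable key for a topic: hash of its sorted, lower-cased keywords."""
    normalized = "|".join(sorted(k.strip().lower() for k in keywords))
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def open_cache(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a label cache; the schema here is an assumption."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS labels (key TEXT PRIMARY KEY, label TEXT NOT NULL)"
    )
    return conn


def label_topic(
    conn: sqlite3.Connection,
    keywords: Sequence[str],
    generate: Callable[[Sequence[str]], str],  # stand-in for the LLM call
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Return a cached label if present; otherwise call `generate` with retries."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    key = cache_key(keywords)
    row = conn.execute("SELECT label FROM labels WHERE key = ?", (key,)).fetchone()
    if row is not None:
        return row[0]  # cache hit: no LLM call

    for attempt in range(attempts):
        try:
            label = generate(keywords)
            break
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # exponential backoff

    conn.execute(
        "INSERT OR REPLACE INTO labels (key, label) VALUES (?, ?)", (key, label)
    )
    conn.commit()
    return label
```

Keying the cache on the topic's keywords means a cluster that reappears unchanged on the next scheduled run would not trigger another LLM call.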

Outcome / Impact

The pipeline gave the platform automated, recurring insight into conversation themes per project, feeding the dashboards that let customer teams understand what users ask. It demonstrates a production data-engineering and ML pipeline rather than a one-off notebook.

Capabilities Demonstrated

  • Topic modeling over conversational data (BERTopic)
  • LLM-assisted topic labeling with retry resilience
  • Batch processing of large Firestore datasets
  • Containerized, scheduled Kubernetes job pipelines