LLM-Powered ETL Pipeline

LLM, FastAPI, AWS DynamoDB, Airflow, Zilliz, RAG, Lark

Overview

For WT Asset Management, I delivered a long-running, production-grade ETL pipeline that combines multi-source data ingestion with LLM processing and powers analysis for 800+ companies.

The Challenge

  • Aggregate multi-source data daily
  • Deliver insights reliably for hundreds of companies
  • Enable real-time querying of corporate knowledge

Architecture Decisions

LLM-Powered ETL

  • Web scraping with Selenium and cron jobs
  • FastAPI services orchestrating daily ingestion
  • Data stored in AWS DynamoDB
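
A minimal sketch of how a daily ingestion loop of this shape could be structured. This is an illustration, not the production code: `InsightRecord`, `run_daily_ingestion`, and the `fetch`, `summarize`, and `store` callables are hypothetical names. In the real pipeline they would correspond to a Selenium scraper, an LLM call, and a DynamoDB write; here they are injected so the sketch runs with the standard library alone.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class InsightRecord:
    """One stored item per company per day (hypothetical schema)."""
    company_id: str  # assumption: would serve as the partition key
    as_of: str       # assumption: ISO date, would serve as the sort key
    summary: str     # LLM output for that day's raw data


def run_daily_ingestion(
    company_ids: Iterable[str],
    fetch: Callable[[str], str],             # stand-in for the scraper
    summarize: Callable[[str], str],         # stand-in for the LLM call
    store: Callable[[InsightRecord], None],  # stand-in for the database write
    as_of: date,
) -> List[str]:
    """Process every company and return the ids that failed.

    One company's failure is isolated so it cannot stop the rest of the run.
    """
    failed: List[str] = []
    for company_id in company_ids:
        try:
            raw = fetch(company_id)
            record = InsightRecord(company_id, as_of.isoformat(), summarize(raw))
            store(record)
        except Exception:
            # Collect the failure for a later retry or alert instead of aborting.
            failed.append(company_id)
    return failed
```

Returning the failed ids, rather than raising on the first error, lets a scheduler re-run only the companies that need it.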

Chatbot with RAG

  • Integrated with Lark for real-time Q&A
  • Used Zilliz vector DB for retrieval
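
The retrieval step of a RAG chatbot can be sketched as follows. This is a simplified stand-in under stated assumptions: an in-memory cosine-similarity search replaces the Zilliz query, embeddings are assumed to be precomputed, and `cosine`, `top_k`, and `build_prompt` are hypothetical helper names rather than part of any library.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    corpus: Sequence[Tuple[str, Sequence[float]]],  # (passage text, embedding)
    k: int = 2,
) -> List[str]:
    """Return the k passages whose embeddings are most similar to the query."""
    ranked = sorted(corpus, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question: str, passages: Sequence[str]) -> str:
    """Assemble the retrieved passages and the question into one LLM prompt."""
    context = "\n".join("- " + p for p in passages)
    return (
        "Answer using only the context below.\n\n"
        "Context:\n" + context + "\n\n"
        "Question: " + question
    )
```

A vector database performs the same ranking with an approximate index so that it scales beyond what a linear scan can handle; the prompt assembly step is unchanged.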

Key Learnings

  • LLM pipelines must be productionized with monitoring and retries.
  • Reliability matters more than flashy models in enterprise workflows.
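
One common way to add retries with basic monitoring is a wrapper that logs each failure and backs off exponentially. The sketch below is an assumption about how such a wrapper might look, not the pipeline's actual code; `with_retries`, the `etl` logger name, and the default delays are hypothetical.

```python
import logging
import time
from typing import Callable, TypeVar

T = TypeVar("T")
log = logging.getLogger("etl")  # hypothetical logger name


def with_retries(
    fn: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,  # injectable for tests
) -> T:
    """Call fn, retrying on any exception with exponential backoff.

    Waits base_delay, then 2 * base_delay, and so on between attempts.
    The last exception is re-raised once all attempts are used up.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as exc:
            # Each failure is logged so monitoring can alert on repeated errors.
            log.warning("attempt %d/%d failed: %s", attempt, attempts, exc)
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")  # the loop always returns or raises
```

A flaky scraper or LLM call could then be wrapped as `with_retries(lambda: fetch(company_id))`, where `fetch` is whatever function performs the call.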

Metrics

  • 14+ months continuous production uptime
  • 800+ companies ingested daily
  • Deployed chatbot answering analyst questions in real time