LLM-Powered ETL Pipeline
LLM, FastAPI, AWS DynamoDB, Airflow, Zilliz, RAG, Lark
Overview
For WT Asset Management, I delivered a production-grade ETL pipeline that integrates LLMs with daily multi-source data ingestion. It has run continuously for 14+ months and powers analysis for 800+ companies.
The Challenge
- Aggregate multi-source data daily
- Deliver insights reliably for hundreds of companies
- Enable real-time querying of corporate knowledge
Architecture Decisions
LLM-Powered ETL
- Web scraping with Selenium and cron jobs
- FastAPI services orchestrating daily ingestion
- Data stored in AWS DynamoDB
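The daily ingestion flow described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `CompanyRecord`, `run_daily_ingestion`, and the `fetch` / `summarize` callables are hypothetical names, and a plain dictionary stands in for the DynamoDB table so the sketch runs without external services.

```python
from dataclasses import dataclass, asdict
from datetime import date
from typing import Callable, Dict, Iterable, Tuple


@dataclass
class CompanyRecord:
    company_id: str
    as_of: str    # ISO date; plays the role of a sort key in this sketch
    summary: str  # LLM-produced text


def run_daily_ingestion(
    company_ids: Iterable[str],
    fetch: Callable[[str], str],         # assumption: e.g. a Selenium-backed scraper
    summarize: Callable[[str], str],     # assumption: e.g. a call to an LLM
    store: Dict[Tuple[str, str], dict],  # assumption: stand-in for a DynamoDB table
    as_of: date,
) -> int:
    """Fetch, summarize, and store one record per company; return the count written."""
    written = 0
    for company_id in company_ids:
        raw = fetch(company_id)
        record = CompanyRecord(company_id, as_of.isoformat(), summarize(raw))
        # Keying on (company_id, date) means a re-run overwrites instead of duplicating.
        store[(record.company_id, record.as_of)] = asdict(record)
        written += 1
    return written
```

Keying each record on company and date is one way to keep a daily job idempotent, so a scheduler can safely re-run a failed day.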
Chatbot with RAG
- Integrated with Lark for real-time Q&A
- Used Zilliz vector DB for retrieval
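The retrieval step behind such a chatbot can be illustrated with a small sketch. It assumes embeddings are already computed and uses an in-memory list in place of the Zilliz collection; `cosine`, `top_k`, and `build_prompt` are hypothetical helpers, not part of any Zilliz or Lark API.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query_vec: Sequence[float],
    index: List[Tuple[str, Sequence[float]]],  # (passage text, embedding) pairs
    k: int = 2,
) -> List[str]:
    """Return the k passages whose embeddings are most similar to the query."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question: str, passages: List[str]) -> str:
    """Assemble a grounded prompt from the retrieved passages."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

In a deployed system the brute-force `top_k` would be replaced by a vector database query, and the prompt would be sent to the LLM before the answer is posted back to the chat.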
Key Learnings
- LLM pipelines must be productionized with monitoring and retries.
- In enterprise workflows, reliability matters more than flashy models.
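As one illustration of the monitoring-and-retries point, a flaky step (a scrape or an LLM call, say) can be wrapped in a retry helper with exponential backoff and logging. This is a generic sketch under assumed defaults; `with_retries` and its parameters are hypothetical, not the pipeline's actual implementation.

```python
import logging
import time
from typing import Callable, TypeVar

T = TypeVar("T")
log = logging.getLogger("etl")


def with_retries(
    task: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,  # injectable so tests need not wait
) -> T:
    """Run task, retrying on any exception with exponentially growing delays."""
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except Exception as exc:
            # Each failure is logged so a monitoring system can alert on it.
            log.warning("attempt %d/%d failed: %s", attempt, attempts, exc)
            if attempt == attempts:
                raise  # out of attempts: surface the last error to the caller
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("attempts must be >= 1")
```

Re-raising after the final attempt keeps failures visible to the scheduler instead of silently dropping a company's data for the day.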
Metrics
- 14+ months continuous production uptime
- 800+ companies ingested daily
- Deployed chatbot serving analysts in real time