Technical Dossier: NLP Sentiment Pipeline
Subject: Quantifying Public Trust in Law Enforcement within Multilingual (English/Swahili/Sheng) Civic Ecosystems.
I engineered this pipeline to bridge the gap between global NLP tools and the reality of Kenyan civic tech. The work covers the full Machine Learning lifecycle: from cleaning 10,000+ noisy rows of social metadata to persisting a production-ready model artifact. I used transformer-assisted labeling to bootstrap initial signals before building a domain-adapted baseline optimized for precision in law-enforcement contexts.
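A domain-adapted baseline of this kind is often a character n-gram TF-IDF feeding a linear classifier, since character n-grams degrade gracefully on code-switched English/Swahili/Sheng text where word tokens are noisy. The sketch below illustrates that setup; the sample texts, labels, and hyperparameters are illustrative assumptions, not the project's actual data or configuration.

```python
# Minimal baseline sketch: char n-gram TF-IDF + logistic regression.
# All data below is illustrative, not the project's real dataset.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

texts = [
    "polisi wametusaidia sana leo",          # illustrative positive
    "the officers ignored our report",       # illustrative negative
    "hawa makarao ni wa maana",              # illustrative positive (Sheng)
    "service at the station was terrible",   # illustrative negative
]
labels = ["positive", "negative", "positive", "negative"]

pipeline = Pipeline([
    # char_wb n-grams are robust to code-switching and spelling variation
    ("tfidf", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipeline.fit(texts, labels)

preds = pipeline.predict(texts)
print(list(preds))
```

A pipeline like this keeps vectorizer and classifier in one serializable object, which is what makes single-file joblib persistence practical later.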
This tool lowers the "cost of listening" for public institutions. By automating the screening of large-scale civic feedback, small teams can identify trust-erosion signals in real-time without expensive manual labor. Built entirely on an open-source Python stack, the system is highly portable and requires minimal server overhead, making it a sustainable choice for resource-constrained public service monitoring.
1) Install dependencies with:
python -m pip install -r requirements-dev.txt
2) Re-run training with:
sentiment-train --data-path kenya_labeled_with_flagged_sentiment.csv --output-dir artifacts
3) Confirm the generated outputs:
artifacts/metrics.json, artifacts/classification_report.txt, and artifacts/confusion_matrix.csv
4) Run baseline integrity checks with:
pytest tests/test_train_smoke.py tests/test_portfolio.py
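The confusion matrix produced in step 3 supports per-class error analysis like the sketch below. The CSV layout (header row of predicted labels, one row per true label) and the counts are assumptions for illustration, not the pipeline's real results.

```python
# Error-analysis sketch over a confusion_matrix.csv-style file.
# Layout assumed: rows = true label, columns = predicted label.
# Sample counts are illustrative, not real results.
import csv
import io

sample_csv = io.StringIO(
    "label,negative,neutral,positive\n"
    "negative,120,15,5\n"
    "neutral,10,90,20\n"
    "positive,4,12,130\n"
)

reader = csv.reader(sample_csv)
header = next(reader)
rows = [(r[0], [int(v) for v in r[1:]]) for r in reader]

# Diagonal entries are correct predictions; everything else is an error.
total = sum(sum(counts) for _, counts in rows)
correct = sum(counts[i] for i, (_, counts) in enumerate(rows))
print(f"accuracy over {total} samples: {correct / total:.3f}")
```

The off-diagonal cells are where code-switched samples typically land, so inspecting them per true-label row is the fastest way to spot systematic confusions.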
- [JSON] artifacts/metrics.json - Performance validation against domain baselines.
- [CSV] artifacts/confusion_matrix.csv - Error analysis on code-switched samples.
- [TXT] artifacts/classification_report.txt - Per-class precision/recall metrics.
- [BIN] artifacts/model.joblib - Serialized trained model for reproducible inference.
- [PY] src/sentiment_demo/api.py - FastAPI inference engine for model serving.
- [PY] src/sentiment_demo/train.py - End-to-end training pipeline used to regenerate artifacts.
- [TEST] tests/test_train_smoke.py - Smoke validation for training workflow stability.
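A model.joblib-style artifact can be verified with a minimal persistence round-trip: train, dump, reload, and confirm identical predictions. The tiny estimator and texts below are stand-ins, not the project's actual model or data.

```python
# Round-trip check for a joblib model artifact: the reloaded pipeline
# must reproduce the original's predictions exactly.
# Estimator and data are illustrative stand-ins.
import tempfile
from pathlib import Path

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["huduma nzuri sana", "very poor response",
         "asante kwa msaada", "no help at all"]
labels = [1, 0, 1, 0]

model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3)),
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model.joblib"
    joblib.dump(model, path)       # serialize the fitted pipeline
    restored = joblib.load(path)   # reload it for inference

match = list(model.predict(texts)) == list(restored.predict(texts))
print("round-trip OK" if match else "round-trip FAILED")
```

Persisting the whole pipeline (vectorizer plus classifier) in one file is what keeps inference reproducible: the serving layer never re-fits or re-configures the vectorizer.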