Technical Dossier: NLP Sentiment Pipeline
Subject: Quantifying Public Trust in Law Enforcement within Multilingual (English/Swahili/Sheng) Civic Ecosystems.
I engineered this pipeline to bridge the gap between global NLP tools and the reality of Kenyan civic tech. The work covers the full Machine Learning lifecycle: from cleaning 10,000+ noisy rows of social metadata to persisting a production-ready model artifact. I used transformer-assisted labeling to bootstrap initial signals before building a domain-adapted baseline optimized for precision in law-enforcement contexts.
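A domain-adapted baseline of this kind is often a character n-gram TF-IDF feeding a linear classifier, since character n-grams degrade gracefully on code-switched English/Swahili/Sheng text where word tokens are noisy. The sketch below illustrates that setup; the sample texts, labels, and hyperparameters are illustrative assumptions, not the project's actual data or configuration.

```python
# Minimal baseline sketch: char n-gram TF-IDF + logistic regression.
# All data below is illustrative, not the project's real dataset.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

texts = [
    "polisi wametusaidia sana leo",          # illustrative positive
    "the officers ignored our report",       # illustrative negative
    "hawa makarao ni wa maana",              # illustrative positive (Sheng)
    "service at the station was terrible",   # illustrative negative
]
labels = ["positive", "negative", "positive", "negative"]

pipeline = Pipeline([
    # char_wb n-grams are robust to code-switching and spelling variation
    ("tfidf", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipeline.fit(texts, labels)

preds = pipeline.predict(texts)
print(list(preds))
```

A pipeline like this keeps vectorizer and classifier in one serializable object, which is what makes single-file joblib persistence practical later.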
This tool lowers the "cost of listening" for public institutions. By automating the screening of large-scale civic feedback, small teams can identify trust-erosion signals in real-time without expensive manual labor. Built entirely on an open-source Python stack, the system is highly portable and requires minimal server overhead, making it a sustainable choice for resource-constrained public service monitoring.
1) Install dependencies with:
python -m pip install -r requirements-dev.txt
2) Re-run training with:
sentiment-train --data-path kenya_labeled_with_flagged_sentiment.csv --output-dir artifacts
3) Confirm the generated outputs:
artifacts/metrics.json, artifacts/classification_report.txt, and artifacts/confusion_matrix.csv
4) Run baseline integrity checks with:
pytest tests/test_train_smoke.py tests/test_portfolio.py
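The confusion matrix produced in step 3 supports per-class error analysis like the sketch below. The CSV layout (header row of predicted labels, one row per true label) and the counts are assumptions for illustration, not the pipeline's real results.

```python
# Error-analysis sketch over a confusion_matrix.csv-style file.
# Layout assumed: rows = true label, columns = predicted label.
# Sample counts are illustrative, not real results.
import csv
import io

sample_csv = io.StringIO(
    "label,negative,neutral,positive\n"
    "negative,120,15,5\n"
    "neutral,10,90,20\n"
    "positive,4,12,130\n"
)

reader = csv.reader(sample_csv)
header = next(reader)
rows = [(r[0], [int(v) for v in r[1:]]) for r in reader]

# Diagonal entries are correct predictions; everything else is an error.
total = sum(sum(counts) for _, counts in rows)
correct = sum(counts[i] for i, (_, counts) in enumerate(rows))
print(f"accuracy over {total} samples: {correct / total:.3f}")
```

The off-diagonal cells are where code-switched samples typically land, so inspecting them per true-label row is the fastest way to spot systematic confusions.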
- [JSON] artifacts/metrics.json - Performance validation against domain baselines.
- [CSV] artifacts/confusion_matrix.csv - Error analysis on code-switched samples.
- [TXT] artifacts/classification_report.txt - Per-class precision/recall metrics.
- [BIN] artifacts/model.joblib - Serialized trained model for reproducible inference.
- [PY] src/sentiment_demo/api.py - FastAPI inference engine for model serving.
- [PY] src/sentiment_demo/train.py - End-to-end training pipeline used to regenerate artifacts.
- [TEST] tests/test_train_smoke.py - Smoke validation for training workflow stability.
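A model.joblib-style artifact can be verified with a minimal persistence round-trip: train, dump, reload, and confirm identical predictions. The tiny estimator and texts below are stand-ins, not the project's actual model or data.

```python
# Round-trip check for a joblib model artifact: the reloaded pipeline
# must reproduce the original's predictions exactly.
# Estimator and data are illustrative stand-ins.
import tempfile
from pathlib import Path

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["huduma nzuri sana", "very poor response",
         "asante kwa msaada", "no help at all"]
labels = [1, 0, 1, 0]

model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3)),
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model.joblib"
    joblib.dump(model, path)       # serialize the fitted pipeline
    restored = joblib.load(path)   # reload it for inference

match = list(model.predict(texts)) == list(restored.predict(texts))
print("round-trip OK" if match else "round-trip FAILED")
```

Persisting the whole pipeline (vectorizer plus classifier) in one file is what keeps inference reproducible: the serving layer never re-fits or re-configures the vectorizer.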