Portfolio
Shahriar Shayesteh
PhD Student, Pennsylvania State University | Responsible AI & Applied NLP
My work develops AI systems that reveal how personal data flows through digital services, from website privacy notices to LLM-based agents that process user data and connect to third-party services. I also work on safety and robustness in tool calling for LLMs and AI agents, addressing their susceptibility to unsafe or malicious tool use.
Highlights
- SoACer: Tool that classifies websites into a set of predefined sectors at scale.
- SoAC Corpus: Released dataset for sector-based website classification.
- PrivaSeer: Contributed to PrivaSeer, a privacy-policy search engine that indexes around 3 million documents.
- Semi-supervised NLP: Designed a GAN with negative data augmentation as part of my Master’s project to improve text classification with limited labels.
Projects
SoACer — Sector-Aware Website Classification
Code: Repo · Dataset: SoAC Corpus · Paper: ACM DocEng 2025
Problem. Classify websites into real-world sectors despite noisy/long HTML and heterogeneous content.
Contributions.
- LexRank summarization (graph-centrality sentence ranking) to extract the most important sentences from site text.
- LLaMA embeddings + a lightweight classifier head for efficient sector prediction.
- Released dataset + evaluation tailored to multi-sector web signals.
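The summarization step above can be illustrated with a minimal LexRank-style sketch. This is not the SoACer implementation: it uses raw term counts instead of TF-IDF, a naive sentence splitter, and hypothetical function names and parameter values (`threshold`, `damping`, `iters`) chosen only for illustration. The embedding and classifier stages are not shown.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [p for p in re.split(r"(?<=[.!?])\s+", text.strip()) if p]


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def lexrank(sentences, threshold=0.1, damping=0.85, iters=50):
    """Score sentences by centrality in a thresholded similarity graph."""
    bags = [Counter(re.findall(r"[a-z0-9]+", s.lower())) for s in sentences]
    n = len(bags)
    if n == 0:
        return []
    # Connect two sentences when their similarity exceeds the threshold.
    neighbors = [
        [j for j in range(n) if j != i and cosine(bags[i], bags[j]) > threshold]
        for i in range(n)
    ]
    scores = [1.0 / n] * n
    for _ in range(iters):  # PageRank-style power iteration
        new = [(1.0 - damping) / n] * n
        for j, outs in enumerate(neighbors):
            # Isolated sentences spread their score evenly over all sentences.
            targets = outs if outs else range(n)
            share = damping * scores[j] / len(targets)
            for i in targets:
                new[i] += share
        scores = new
    return scores


def summarize(text, k=3):
    """Return the k most central sentences, kept in document order."""
    sentences = split_sentences(text)
    scores = lexrank(sentences)
    top = sorted(range(len(sentences)), key=lambda i: -scores[i])[:k]
    return [sentences[i] for i in sorted(top)]
```

Calling `summarize(page_text, k=3)` on text extracted from a page returns up to three sentences that share the most vocabulary with the rest of the page, which is the kind of compact input an embedding model can then consume.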
Figure: Pipeline
End-to-end: HTML → LexRank → Embeddings → Classifier.
Image needed: soacer_pipeline.png — End-to-end pipeline (HTML→Summarize→Embed→Classify)
Figure: Corpus Stats
Distribution of samples by sector (coverage/imbalance).
Image needed: soac_corpus_stats.png — Top-10 sectors bar chart (#sites)
PrivaSeer — Privacy Policy Search & Analytics
Poster/Paper: SOUPS 2025
Personal report: PrivaSeer_report.pdf (role & contributions)
Problem. Make privacy policies searchable and analyzable at scale; support exploration of practices across sectors.
Contributions.
- Ingestion/crawling, indexing, and query experience for millions of policies.
- Visualization of entities, data types, and purposes, organized by an ontology (a domain-specific concept list).
- Sector tagging and analytics to reveal normative vs. outlier practices.
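To illustrate the indexing and query contribution above, here is a minimal inverted-index sketch. It is a toy under stated assumptions, not PrivaSeer's actual stack: the `InvertedIndex` class, the `tokenize` helper, and the AND-only query semantics are all hypothetical, and a production system serving millions of policies would rely on a dedicated search engine with ranking.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


class InvertedIndex:
    """Maps each token to the set of document ids that contain it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        """Index one document under the given id."""
        for token in tokenize(text):
            self.postings[token].add(doc_id)

    def search(self, query):
        """Return sorted ids of documents containing every query term."""
        terms = tokenize(query)
        if not terms:
            return []
        # .get avoids creating empty postings for unseen terms.
        sets = [self.postings.get(t, set()) for t in terms]
        return sorted(set.intersection(*sets))


index = InvertedIndex()
index.add("policy_a", "We share location data with advertisers.")
index.add("policy_b", "We collect email addresses for account recovery.")
hits = index.search("location data")
```

Here `hits` contains only `policy_a`, since it is the one document that includes both query terms.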
Figure: System Architecture
Crawler → Indexer → Search UI → Visual Analytics.
Image needed: privaseer_arch.png — High-level architecture block diagram
Master’s: GAN with Negative Data Augmentation (Semi-supervised NLP)
Paper: FLAIRS · Thesis: uOttawa
Problem. Improve text classification when labeled data is scarce.
Contributions.
- GAN framework augmented with “negative” examples to sharpen decision boundaries.
- Classification gains in semi-supervised (limited-label) settings.
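The idea of feeding "negative" examples to the discriminator can be sketched as follows. Everything here is an assumption made for illustration rather than the method from the paper or thesis: negatives are built by shuffling the tokens of real sentences, the discriminator is assumed to have one extra "fake" class (index `num_classes`), and the function names are hypothetical. No neural network is involved; the sketch only shows how the training targets could be assembled.

```python
import random


def make_negative(tokens, rng):
    """Corrupt a real sentence by shuffling its tokens (assumed strategy)."""
    corrupted = list(tokens)
    rng.shuffle(corrupted)
    return corrupted


def build_discriminator_batch(labeled, generated, num_classes, rng):
    """Pair each example with a discriminator target.

    labeled:   list of (tokens, class_id) pairs for real, labeled text
    generated: list of token lists produced by the generator
    Real examples keep their class; generated and negative examples
    are both assigned the extra "fake" class, index num_classes.
    """
    fake_class = num_classes
    batch = [(tokens, class_id) for tokens, class_id in labeled]
    batch += [(tokens, fake_class) for tokens in generated]
    # Negative augmentation: corrupted real sentences also count as fake.
    batch += [(make_negative(tokens, rng), fake_class) for tokens, _ in labeled]
    return batch


rng = random.Random(0)
labeled = [(["great", "movie", "overall"], 1), (["boring", "plot"], 0)]
generated = [["movie", "the", "the"]]
batch = build_discriminator_batch(labeled, generated, num_classes=2, rng=rng)
```

Under these assumptions the discriminator sees near-miss text that resembles real data but should be rejected, which is one way negative examples might tighten its decision boundary.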
Figure: Training Flow
Generator/Discriminator with negative sampling path.
Image needed: gan_neg_aug.png — GAN + negative augmentation flow
Publications & Posters
- SoACer — ACM DocEng 2025. DOI
- PrivaSeer — SOUPS 2025 Poster. PDF
- Generative Adversarial Learning with Negative Data Augmentation — FLAIRS. Paper
- Master’s Thesis — University of Ottawa. PDF
Datasets & Open Source
- SoAC Corpus — Multi-sector website dataset for classification and analysis.
Link: https://huggingface.co/datasets/Shahriar/SoAC_Corpus
Skills
- Programming: Python, PyTorch, HF Transformers, IR tooling
- ML/NLP: Semi-supervised learning, summarization, embeddings, evaluation
- Systems/IR: Crawling, indexing, large-scale text processing, visualization
- Communication: Academic writing, posters, talks, collaborative research
Talks / Teaching
- DocEng 2025 — SoACer presentation (slides TBD).
- SOUPS 2025 — PrivaSeer poster (link above).