Portfolio

Shahriar Shayesteh

PhD Student, Pennsylvania State University | Responsible AI & Applied NLP

My work develops AI systems that reveal how personal data flows through digital services, from website privacy notices to LLM-based agents that process user data and connect to third-party services. I also work on safety and robustness in tool calling for LLMs and AI agents, addressing their susceptibility to unsafe or malicious tool use.

Highlights

SoACer: A tool that classifies websites into a predefined set of sectors at scale.
SoAC Corpus: Released dataset for sector-based website classification.
PrivaSeer: Contributed to PrivaSeer, a privacy-policy search engine that indexes around 3 million documents.
Semi-supervised NLP: Designed a GAN with negative data augmentation for my Master’s project to improve text classification with limited labels.

Projects

SoACer — Sector-Aware Website Classification

Code: Repo · Dataset: SoAC Corpus · Paper: ACM DocEng 2025

Problem. Classify websites into real-world sectors despite noisy/long HTML and heterogeneous content.
Contributions.

LexRank summarization (graph-based sentence centrality) to extract the most important sentences from site text.
LLaMA embeddings + a lightweight classifier head for efficient sector prediction.
Released dataset + evaluation tailored to multi-sector web signals.
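
The LexRank step listed above can be illustrated with a small, dependency-free sketch. This is not the SoACer implementation: it assumes raw term-frequency vectors instead of TF-IDF, a fixed similarity threshold, and hypothetical names (`tokenize`, `cosine`, `lexrank`) and example sentences chosen only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase a sentence and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def lexrank(sentences, top_k=2, threshold=0.1, damping=0.85, iterations=50):
    """Return the top_k most central sentences, kept in document order."""
    n = len(sentences)
    if n == 0:
        return []
    vectors = [Counter(tokenize(s)) for s in sentences]
    # Connect two sentences when their similarity reaches the threshold.
    adj = [
        [
            1.0 if i != j and cosine(vectors[i], vectors[j]) >= threshold else 0.0
            for j in range(n)
        ]
        for i in range(n)
    ]
    degree = [sum(row) for row in adj]
    scores = [1.0 / n] * n
    # PageRank-style power iteration over the sentence graph.
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping
            * sum(adj[j][i] / degree[j] * scores[j] for j in range(n) if degree[j] > 0)
            for i in range(n)
        ]
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:top_k]
    return [sentences[i] for i in sorted(top)]


# Hypothetical site text: three on-topic sentences and one cookie banner.
sentences = [
    "We sell running shoes and sports apparel online.",
    "Our online store ships running shoes worldwide.",
    "Accept cookies to continue.",
    "Sports apparel and running shoes are our main products.",
]
summary = lexrank(sentences, top_k=3)
print(summary)
```

In this toy input the cookie banner shares no vocabulary with the other sentences, so it has no edges in the graph and is left out of the summary. That is the kind of boilerplate filtering a centrality-based summarizer is meant to provide before embedding.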

Figure: Pipeline
SoACer Pipeline End-to-end: HTML → LexRank → Embeddings → Classifier.
Image needed: soacer_pipeline.png — End-to-end pipeline (HTML→Summarize→Embed→Classify)

Figure: Corpus Stats
SoAC Corpus Stats Distribution of samples by sector (coverage/imbalance).
Image needed: soac_corpus_stats.png — Top-10 sectors bar chart (#sites)

PrivaSeer — Privacy Policy Search & Analytics

Poster/Paper: SOUPS 2025
Personal report: PrivaSeer_report.pdf (role & contributions)

Problem. Make privacy policies searchable and analyzable at scale; support exploration of practices across sectors.
Contributions.

Ingestion/crawling, indexing, and query experience for millions of policies.
Visualization of entities, data types, and purposes, organized by an ontology (a domain-specific list of concepts).
Sector tagging and analytics to reveal normative vs. outlier practices.
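
As a rough illustration of the indexing and sector-filtered querying described above, the following toy inverted index shows the general idea. It is a sketch only: the actual PrivaSeer stack is not described here, and the class name `InvertedIndex`, its methods, the document IDs, and the sample policies are all hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class InvertedIndex:
    """Toy term -> document-ID index with a sector tag per document."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of doc IDs
        self.sectors = {}                 # doc ID -> sector label

    def add(self, doc_id, text, sector):
        """Index one policy document under the given sector."""
        self.sectors[doc_id] = sector
        for term in tokenize(text):
            self.postings[term].add(doc_id)

    def search(self, query, sector=None):
        """Return sorted IDs of documents containing every query term."""
        terms = tokenize(query)
        if not terms:
            return []
        # .get avoids creating empty postings for unseen terms.
        hits = set.intersection(*(self.postings.get(t, set()) for t in terms))
        if sector is not None:
            hits = {d for d in hits if self.sectors[d] == sector}
        return sorted(hits)


# Hypothetical policies, for illustration only.
index = InvertedIndex()
index.add("p1", "We share location data with advertising partners.", "retail")
index.add("p2", "We do not share location data.", "health")
index.add("p3", "Email addresses are used for advertising.", "retail")

print(index.search("location data"))
print(index.search("advertising", sector="retail"))
```

A production system at the scale of millions of policies would typically rely on a dedicated search engine with ranking, rather than an in-memory boolean index like this one.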

Figure: System Architecture
PrivaSeer Architecture Crawler → Indexer → Search UI → Visual Analytics.
Image needed: privaseer_arch.png — High-level architecture block diagram

Master’s: GAN with Negative Data Augmentation (Semi-supervised NLP)

Paper: FLAIRS · Thesis: uOttawa

Problem. Improve text classification when labeled data is scarce.
Contributions.

GAN framework augmented with “negative” examples to sharpen decision boundaries.
Gains under semi-supervised settings.
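
One way to picture the role of "negative" examples is the data-preparation sketch below. It rests on assumptions that are not taken from the paper or thesis: negatives are produced by shuffling the tokens of real sentences, and the discriminator is a (K+1)-way classifier whose extra class collects fake and negative inputs. The function names are hypothetical.

```python
import random


def make_negative(tokens, rng):
    """Return a shuffled copy whose token order differs from the input."""
    original = list(tokens)
    if len(set(original)) < 2:
        # Shuffling cannot change a sequence of identical tokens.
        return original
    shuffled = list(original)
    while shuffled == original:
        rng.shuffle(shuffled)
    return shuffled


def build_discriminator_batch(labeled, num_classes, rng):
    """Pair each labeled example with a negative assigned to the extra class.

    Assumption: class index `num_classes` is the "fake/negative" class of a
    (K+1)-way discriminator; real examples keep their labels in 0..K-1.
    """
    fake_class = num_classes
    batch = []
    for tokens, label in labeled:
        batch.append((list(tokens), label))
        batch.append((make_negative(tokens, rng), fake_class))
    return batch


# Hypothetical labeled sentences (1 = positive, 0 = negative sentiment).
rng = random.Random(0)
labeled = [
    (["the", "movie", "was", "great"], 1),
    (["terrible", "plot", "and", "acting"], 0),
]
batch = build_discriminator_batch(labeled, num_classes=2, rng=rng)
for tokens, label in batch:
    print(label, " ".join(tokens))
```

The intuition is that negatives sit close to real data but outside its distribution, so training the discriminator to reject them gives it a tighter boundary around genuine text than generator samples alone would.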

Figure: Training Flow
GAN with Negative Augmentation Generator/Discriminator with negative sampling path.
Image needed: gan_neg_aug.png — GAN + negative augmentation flow

Publications & Posters

SoACer — ACM DocEng 2025. DOI
PrivaSeer — SOUPS 2025 Poster. PDF
Generative Adversarial Learning with Negative Data Augmentation — FLAIRS. Paper
Master’s Thesis — University of Ottawa. PDF

Datasets & Open Source

SoAC Corpus — Multi-sector website dataset for classification and analysis.
Link: https://huggingface.co/datasets/Shahriar/SoAC_Corpus

Skills

Programming: Python, PyTorch, HF Transformers, IR tooling
ML/NLP: Semi-supervised learning, summarization, embeddings, evaluation
Systems/IR: Crawling, indexing, large-scale text processing, visualization
Communication: Academic writing, posters, talks, collaborative research

Talks / Teaching

DocEng 2025 — SoACer presentation (slides TBD).
SOUPS 2025 — PrivaSeer poster (link above).