Portfolio

Shahriar Shayesteh

PhD Student, Pennsylvania State University | Responsible AI & Applied NLP

GitHub · HF Datasets · Google Scholar

My work develops AI systems that reveal how personal data flows through digital services, from website privacy notices to LLM-based agents that process user data and connect to third-party services. I also work on the safety and robustness of tool calling in LLMs and AI agents, addressing their susceptibility to unsafe or malicious tool use.

Highlights

Projects

SoACer — Sector-Aware Website Classification

Code: Repo · Dataset: SoAC Corpus · Paper: ACM DocEng 2025

Problem. Classify websites into real-world sectors despite noisy/long HTML and heterogeneous content.
Contributions.

Figure (soacer_pipeline.png): SoACer pipeline, end to end: HTML → LexRank → Embeddings → Classifier.
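
To make the shape of the HTML → LexRank → Embeddings → Classifier pipeline concrete, here is a minimal, standard-library-only sketch. It is not the SoACer implementation: the LexRank step is a simplified continuous variant over raw term counts, the "embedding" is a bag-of-words stand-in for a learned encoder, and the classifier is a nearest-prototype lookup. All names (`TextExtractor`, `lexrank_summary`, `classify`, `sector_prototypes`) are hypothetical and chosen for illustration only.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def lexrank_summary(sentences, top_k=2, damping=0.85, iters=50):
    """Simplified LexRank: PageRank over a sentence-similarity graph."""
    n = len(sentences)
    if n == 0:
        return []
    vecs = [Counter(tokenize(s)) for s in sentences]
    sim = [[cosine(vecs[i], vecs[j]) if i != j else 0.0 for j in range(n)]
           for i in range(n)]
    row_sum = [sum(row) for row in sim]
    scores = [1.0 / n] * n
    for _ in range(iters):
        scores = [
            (1 - damping) / n
            + damping * sum(scores[j] * sim[j][i] / row_sum[j]
                            for j in range(n) if row_sum[j])
            for i in range(n)
        ]
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:top_k]
    return [sentences[i] for i in sorted(top)]  # keep original order


def classify(html, sector_prototypes, top_k=2):
    """HTML -> summary -> bag-of-words vector -> nearest sector prototype."""
    parser = TextExtractor()
    parser.feed(html)
    text = " ".join(parser.chunks)
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    summary = " ".join(lexrank_summary(sentences, top_k))
    embedding = Counter(tokenize(summary))  # stand-in for a learned embedding
    return max(sector_prototypes,
               key=lambda s: cosine(embedding, sector_prototypes[s]))
```

Summarizing before embedding is what keeps long, noisy pages manageable: only the most central sentences reach the encoder, so boilerplate such as contact lines tends to drop out.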

Figure (soac_corpus_stats.png): SoAC Corpus stats: distribution of samples by sector (coverage/imbalance), shown as a top-10 sectors bar chart (#sites).


PrivaSeer — Privacy Policy Search & Analytics

Poster/Paper: SOUPS 2025
Personal report: PrivaSeer_report.pdf (role & contributions)

Problem. Make privacy policies searchable and analyzable at scale; support exploration of practices across sectors.
Contributions.

Figure (privaseer_arch.png): PrivaSeer architecture, high-level block diagram: Crawler → Indexer → Search UI → Visual Analytics.
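
As a rough illustration of the Indexer → Search step in the architecture above, the sketch below builds a tiny inverted index over policy texts and answers boolean AND queries. This is an assumption-laden toy, not PrivaSeer's actual indexer: the function names (`build_index`, `search`) and the example site IDs are hypothetical, and a production system would use a real search engine with ranking.

```python
import re
from collections import defaultdict


def build_index(policies):
    """Map each lowercase term to the set of policy IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in policies.items():
        for term in re.findall(r"[a-z]+", text.lower()):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return IDs of policies containing every query term (boolean AND)."""
    terms = re.findall(r"[a-z]+", query.lower())
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical crawled policies, keyed by site.
policies = {
    "a.example": "We share location data with advertisers.",
    "b.example": "We never share data.",
}
index = build_index(policies)
matches = search(index, "share location")
```

Here `matches` contains only the policy that mentions both terms; the same index could feed sector-level analytics by grouping the matching IDs.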


Master’s: GAN with Negative Data Augmentation (Semi-supervised NLP)

Paper: FLAIRS · Thesis: uOttawa

Problem. Improve text classification when labeled data is scarce.
Contributions.

Figure (gan_neg_aug.png): GAN with negative augmentation, training flow: generator/discriminator with negative sampling path.


Publications & Posters

Datasets & Open Source

Skills

Talks / Teaching