Research & Services
Services
Served as Program Committee Member/Invited Reviewer at some of the leading conferences in Machine Learning:
Book
Sculpting Data for ML: The first act of Machine Learning
- Supported by Julian McAuley, Associate Professor at UC San Diego, Laurence Moroney, AI Lead Advocate at Google, and Mengting Wan, Senior Applied Scientist at Microsoft
- Abstract: In the contemporary world of Artificial Intelligence and Machine Learning, data is the new oil. For Machine Learning algorithms to work their magic, it is imperative to lay a firm foundation with relevant data. Sculpting Data for ML introduces readers to the first act of Machine Learning, Dataset Curation. This book puts forward practical tips to identify valuable information from the extensive amount of crude data available at our fingertips. The step-by-step guide is accompanied by Python code examples for extracting real-world datasets and illustrates ways to hone the skill of extracting meaningful datasets. In addition, the book dives deep into how data fits into the Machine Learning ecosystem and highlights the impact good quality data can have on a Machine Learning system’s performance.
Publications
Addressing Marketing Bias in Product Recommendations
Mengting Wan, Jianmo Ni, Rishabh Misra, Julian McAuley, in Proceedings of 2020 ACM Conference on Web Search and Data Mining (WSDM’20), Houston, TX, USA, Feb. 2020. (15% acceptance rate)
Fine-Grained Spoiler Detection from Large-Scale Review Corpora
Mengting Wan, Rishabh Misra, Ndapa Nakashole, Julian McAuley, in Proceedings of 57th Annual Meeting of the Association for Computational Linguistics 2019 (ACL’19), Florence, Italy, Jul. 2019. (18% acceptance rate)
Paper | Dataset | Poster | Media: TechCrunch, NBC, Gizmodo, Geek.com, UCSD News/UC News, TechXplore
Preprints
Datasets
IMDB Spoiler Dataset [Released: May 2019]
- User-generated reviews are often our first point of contact when we consider watching a movie or a TV show. However, beyond telling us about the qualitative aspects of the item we want to consume, reviews may inevitably contain undesired revelatory information (i.e. ‘spoilers’), such as the surprising fate of a character in a movie or the identity of the murderer in a crime-suspense movie. For users who are interested in consuming the item but are unaware of the critical plot twists, spoilers may decrease the excitement that comes from the pleasurable uncertainty and curiosity of media consumption. Therefore, a natural question is how to identify these spoilers in entertainment reviews, so that users can navigate review platforms more effectively. This dataset is collected from IMDB and contains metadata about items as well as user reviews, with information on whether each review contains a spoiler. (2k+ downloads on Kaggle)
Please cite these articles if you use the dataset:
```
@dataset{dataset,
  author = {Misra, Rishabh},
  year = {2019},
  month = {05},
  pages = {},
  title = {IMDB Spoiler Dataset},
  doi = {10.13140/RG.2.2.11584.15362}
}
```
Clothing Fit Dataset for Size Recommendation [Released: August 2018]
- Product size recommendation and fit prediction are critical for improving customers’ shopping experiences and reducing product return rates. However, modeling customers’ fit feedback is challenging due to its subtle semantics, arising from the subjective evaluation of products, and its imbalanced label distribution (most feedback is “Fit”). These datasets, collected from ModCloth and RentTheRunWay, are the only fit-related datasets publicly available at this time and could be used to address these challenges and improve the recommendation process. (6k+ downloads on Kaggle)
Please cite these articles if you use the data:
```
@inproceedings{misra2018decomposing,
  title={Decomposing fit semantics for product size recommendation in metric spaces},
  author={Misra, Rishabh and Wan, Mengting and McAuley, Julian},
  booktitle={Proceedings of the 12th ACM Conference on Recommender Systems},
  pages={422--426},
  year={2018},
  organization={ACM}
}

@book{book,
  author = {Misra, Rishabh and Grover, Jigyasa},
  year = {2021},
  month = {01},
  pages = {},
  title = {Sculpting Data for ML: The first act of Machine Learning},
  isbn = {978-0-578-83125-1}
}
```
News Headlines Dataset For Sarcasm Detection [Released: June 2018]
- Past studies in Sarcasm Detection mostly make use of Twitter datasets collected using hashtag-based supervision, but such datasets are noisy in terms of labels and language. Furthermore, many tweets are replies to other tweets, and detecting sarcasm in these requires the availability of contextual tweets. To overcome the limitations related to noise in Twitter datasets, this News Headlines dataset for Sarcasm Detection is collected from two news websites. TheOnion aims at producing sarcastic versions of current events, and we collected all the headlines from its News in Brief and News in Photos categories (which are sarcastic). We collected real (and non-sarcastic) news headlines from HuffPost. (31k+ downloads on Kaggle)
Please cite these articles if you use the data:
```
@article{misra2019sarcasm,
  title={Sarcasm Detection using Hybrid Neural Network},
  author={Misra, Rishabh and Arora, Prahal},
  journal={arXiv preprint arXiv:1908.07414},
  year={2019}
}

@book{book,
  author = {Misra, Rishabh and Grover, Jigyasa},
  year = {2021},
  month = {01},
  pages = {},
  title = {Sculpting Data for ML: The first act of Machine Learning},
  isbn = {978-0-578-83125-1}
}
```
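For readers who want a quick first look at a headline dataset of this kind, the following is a minimal sketch of loading records and checking the balance between sarcastic and non-sarcastic labels. It assumes a JSON Lines layout with `headline` and `is_sarcastic` fields; those field names, the helper functions, and the placeholder records are illustrative assumptions, not a description of the released files.

```python
import json
from collections import Counter
from io import StringIO

# Hypothetical stand-in for the dataset file: one JSON object per line.
# The field names ("headline", "is_sarcastic") and the records themselves
# are placeholders chosen for illustration, not real data.
SAMPLE = StringIO(
    '{"headline": "placeholder sarcastic headline", "is_sarcastic": 1}\n'
    '{"headline": "placeholder real headline", "is_sarcastic": 0}\n'
    '{"headline": "another placeholder real headline", "is_sarcastic": 0}\n'
)


def load_headlines(handle):
    """Parse a JSON Lines stream into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in handle if line.strip()]


def label_counts(records, label_key="is_sarcastic"):
    """Count how many records carry each label value."""
    return Counter(record[label_key] for record in records)


records = load_headlines(SAMPLE)
counts = label_counts(records)
print(f"{len(records)} headlines, label counts: {dict(counts)}")
```

To run the same check on a downloaded file, replace `SAMPLE` with an open file handle and adjust the field names to match the file you actually have.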
News Category Dataset [Released: June 2018]
- This dataset contains around 200k news headlines from 2012 to 2018 obtained from HuffPost. It could be used to produce interesting linguistic insights about the type of language used in different news articles, or simply to identify tags for untracked news articles. (30k+ downloads on Kaggle)
Please cite these articles if you use the data:
```
@dataset{dataset,
  author = {Misra, Rishabh},
  year = {2018},
  month = {06},
  pages = {},
  title = {News Category Dataset},
  doi = {10.13140/RG.2.2.20331.18729}
}

@book{book,
  author = {Misra, Rishabh and Grover, Jigyasa},
  year = {2021},
  month = {01},
  pages = {},
  title = {Sculpting Data for ML: The first act of Machine Learning},
  isbn = {978-0-578-83125-1}
}
```
Missing Citations
- MetaPrompting: Learning to Learn Better Prompts
- Sarcasm Detection in News Headlines using Supervised Learning
- A Novel Perspective to Look At Attention: Bi-level Attention-based Explainable Topic Modeling for News Classification
- Zoom Out and Observe: News Environment Perception for Fake News Detection
- Understanding the Properties of Generated Corpora
- Classifier Data Quality – A Geometric Complexity Based Method for Automated Baseline And Insights Generation
- Classical Sequence Match is a Competitive Few-Shot One-Class Learner
- On the Usefulness of Embeddings, Clusters and Strings for Text Generator Evaluation
- Learning Label Modular Prompts for Text Classification in the Wild
- A Classification Model of Legal Consulting Questions Based on Multi-Attention Prototypical Networks
- CTRLEval: An Unsupervised Reference-Free Metric for Evaluating Controlled Text Generation
- Fake News Analysis Modeling Using Quote Retweet
- Holistic Sentence Embeddings for Better Out-of-Distribution Detection
- Improving Library Book Retrieval By Using Topic Modeling
- MN-DS: A Multilabeled News Dataset for News Articles Hierarchical Classification
- The Best Techniques to Deal with Unbalanced Sequential Text Data in Deep Learning
- Using IBM’s Watson to automatically evaluate student short answer responses
- Meta-learning with fewer tasks through task interpolation
- Few-Shot Text Classification with Triplet Networks, Data Augmentation, and Curriculum Learning
- Wild-Time: A Benchmark of in-the-Wild Distribution Shift over Time
- Few-shot Text Classification with Distributional Signatures
- Short-Text Classification Using Unsupervised Keyword Expansion
- CoCon: A Self-Supervised Approach for Controlled Text Generation
- Types of Out-of-Distribution Texts and How to Detect Them
- Let the CAT out of the bag: Contrastive Attributed explanations for Text
- Adapting Language Models for Zero-shot Learning by Meta-tuning on Dataset and Prompt Collections
- PCC: Paraphrasing with Bottom-k Sampling and Cyclic Learning for Curriculum Data Augmentation