Research & Services


Services

Served as Program Committee Member/Invited Reviewer at several leading Machine Learning conferences.

Book

Sculpting Data for ML: The first act of Machine Learning

The book is endorsed by Julian McAuley, Professor at UC San Diego, Laurence Moroney, AI Lead Advocate at Google, and Mengting Wan, Senior Applied Scientist at Microsoft.

Abstract: In the contemporary world of Artificial Intelligence and Machine Learning, _data is the new oil_. For Machine Learning algorithms to work their magic, it is imperative to lay a firm foundation with relevant data. Sculpting Data for ML introduces readers to the first act of Machine Learning, Dataset Curation. The book puts forward practical tips for identifying valuable information in the extensive amount of crude data available at our fingertips. The step-by-step guide is accompanied by Python code examples drawn from the extraction of real-world datasets and illustrates ways to hone the skill of extracting meaningful datasets. In addition, the book dives deep into how data fits into the Machine Learning ecosystem and highlights the impact that good-quality data can have on a Machine Learning system's performance.

Publications

  • Springer Nature's Deep Learning for Social Media Data Analytics
    • Do Not ‘Fake It Till You Make It’! Synopsis of Trending Fake News Detection Methodologies Using Deep Learning
      Book Chapter by Rishabh Misra and Jigyasa Grover, accepted for publication in Springer Nature Book "Deep Learning for Social Media Data Analytics", September 2022, ISBN: 978-3-031-10868-6.
    • Book Chapter
    • Citation Information
      Text format:
      1. Misra, Rishabh and Jigyasa Grover. "Do Not ‘Fake It Till You Make It’! Synopsis of Trending Fake News Detection Methodologies Using Deep Learning." Deep Learning for Social Media Data Analytics (2022).

      BibTex format:
      @incollection{misra2022not,
        title={Do Not ‘Fake It Till You Make It’! Synopsis of Trending Fake News Detection Methodologies Using Deep Learning},
        author={Misra, Rishabh and Grover, Jigyasa},
        booktitle={Deep Learning for Social Media Data Analytics},
        pages={213--235},
        year={2022},
        publisher={Springer}
      }
  • WSDM'20
    • Addressing Marketing Bias in Product Recommendations
      Mengting Wan, Jianmo Ni, Rishabh Misra, Julian McAuley, in Proceedings of 2020 ACM Conference on Web Search and Data Mining (WSDM'20), Houston, TX, USA, Feb. 2020. (15% acceptance rate)
    • Paper | Data and Code
    • Citation Information
      Text format:
      1. Wan, Mengting, Jianmo Ni, Rishabh Misra, and Julian McAuley. "Addressing Marketing Bias in Product Recommendations." Proceedings of the 13th International Conference on Web Search and Data Mining (2020).

      BibTex format:
      @inproceedings{wan2020addressing,
        title={Addressing marketing bias in product recommendations},
        author={Wan, Mengting and Ni, Jianmo and Misra, Rishabh and McAuley, Julian},
        booktitle={Proceedings of the 13th international conference on web search and data mining},
        pages={618--626},
        year={2020}
      }
  • ACL'19
    • Fine-Grained Spoiler Detection from Large-Scale Review Corpora
      Mengting Wan, Rishabh Misra, Ndapa Nakashole, Julian McAuley, in Proceedings of 57th Annual Meeting of the Association for Computational Linguistics 2019 (ACL'19), Florence, Italy, Jul. 2019. (18% acceptance rate)
    • Paper | Dataset | Media: NBC, Gizmodo, Geek.com, UCSD News/ UC News, TechXplore
    • Citation Information
      Text format:
      1. Wan, Mengting, Rishabh Misra, Ndapa Nakashole, and Julian McAuley. "Fine-Grained Spoiler Detection from Large-Scale Review Corpora." Proceedings of the 57th Conference of the Association for Computational Linguistics (2019).

      BibTex format:
      @article{wan2019fine,
        title={Fine-grained spoiler detection from large-scale review corpora},
        author={Wan, Mengting and Misra, Rishabh and Nakashole, Ndapa and McAuley, Julian},
        journal={arXiv preprint arXiv:1905.13416},
        year={2019}
      }
  • RecSys'18
    • Decomposing Fit Semantics for Product Size Recommendation in Metric Spaces
      Rishabh Misra, Mengting Wan, Julian McAuley, in Proceedings of 2018 ACM Conference on Recommender Systems (RecSys'18), Vancouver, Canada, Oct. 2018. (25% acceptance rate)
    • Paper | Code | Datasets
    • Citation Information
      Text format:
      1. Misra, Rishabh, Mengting Wan, and Julian McAuley. "Decomposing fit semantics for product size recommendation in metric spaces." Proceedings of the 12th ACM Conference on Recommender Systems (2018).

      BibTex format:
      @inproceedings{misra2018decomposing,
        title={Decomposing fit semantics for product size recommendation in metric spaces},
        author={Misra, Rishabh and Wan, Mengting and McAuley, Julian},
        booktitle={Proceedings of the 12th ACM Conference on Recommender Systems},
        pages={422--426},
        year={2018},
        organization={ACM}
      }
  • MUSE'15
    • Scalable Bayesian Matrix Factorization
      Avijit Saha*, Rishabh Misra*, Balaraman Ravindran, in Proceedings of the 6th International Workshop on Mining Ubiquitous and Social Environments (MUSE) @ PKDD/ECML, Porto, Portugal, Sep. 7, 2015, pp. 43-54. (* equal contribution)
    • Paper | Code
    • Citation Information
      Text format:
      1. Saha, Avijit, Rishabh Misra, and Balaraman Ravindran. "Scalable Bayesian Matrix Factorization." Proceedings of 6th International Workshop on Mining Ubiquitous and Social Environments (MUSE), co‑located with the ECML PKDD (2015).

      BibTex format:
      @inproceedings{saha2015scalable,
        title={Scalable Bayesian matrix factorization},
        author={Saha, Avijit and Misra, Rishabh and Ravindran, Balaraman},
        booktitle={Proceedings of the 6th International Conference on Mining Ubiquitous and Social Environments-Volume 1521},
        pages={43--54},
        year={2015}
      }
  • Preprints
    • Sarcasm Detection using Hybrid Neural Network
      Rishabh Misra and Prahal Arora, arXiv preprint arXiv:1908.07414 (2019). Paper | Code
      Citation Information
      Text format:
      1. Misra, Rishabh and Prahal Arora. "Sarcasm Detection using Hybrid Neural Network." arXiv preprint arXiv:1908.07414 (2019).

      BibTex format:
      @article{misra2019sarcasm,
        title={Sarcasm Detection using Hybrid Neural Network},
        author={Misra, Rishabh and Arora, Prahal},
        journal={arXiv preprint arXiv:1908.07414},
        year={2019}
      }
    • Hotel Recommendation System
      Aditi A Mavalankar*, Ajitesh Gupta*, Chetan Gandotra*, Rishabh Misra*, arXiv preprint arXiv:1908.07498 (2019). (* equal contribution) Paper
      Citation Information
      Text format:
      1. Mavalankar, Aditi A, Ajitesh Gupta, Chetan Gandotra, and Rishabh Misra. "Hotel recommendation system." arXiv preprint arXiv:1908.07498 (2019).

      BibTex format:
      @article{mavalankar2019hotel,
        title={Hotel recommendation system},
        author={Mavalankar, Aditi A and Gupta, Ajitesh and Gandotra, Chetan and Misra, Rishabh},
        journal={arXiv preprint arXiv:1908.07498},
        year={2019}
      }
    • Scalable Variational Bayesian Factorization Machine
      Avijit Saha, Rishabh Misra, Ayan Acharya, and Balaraman Ravindran, preprint 2017. Paper | Code
      Citation Information
      Text format:
      1. Saha, Avijit, Rishabh Misra, Ayan Acharya, and Balaraman Ravindran. "Scalable variational Bayesian factorization machine." ResearchGate, DOI: 10.13140/RG.2.2.31607.73126 (2017).

      BibTex format:
      @article{saha2017scalable,
        title={Scalable variational Bayesian factorization machine},
        author={Saha, Avijit and Misra, Rishabh and Acharya, Ayan and Ravindran, Balaraman},
        year={2017},
        doi={10.13140/RG.2.2.31607.73126}
      }

Datasets

  • Politifact Fact Check Dataset (New)
    • We present a high-quality fact-check dataset collected from the popular fact-checking website PolitiFact. The dataset contains 21,152 statements that have been fact-checked by experts. Each statement is assigned to one of 6 categories: true, mostly true, half true, mostly false, false, and pants on fire. Along with various details about each fact check, we also include the sources where the statement appeared, which could be crucial for extracting insights about fact checking. Furthermore, we provide links to the fact-check articles published on PolitiFact so that additional text about each published fact-check story can be extracted if needed.
    • Link to Kaggle page
    • Please cite these articles if you use the dataset
      Text format:
      1. Misra, Rishabh and Jigyasa Grover. "Do Not ‘Fake It Till You Make It’! Synopsis of Trending Fake News Detection Methodologies Using Deep Learning." Deep Learning for Social Media Data Analytics (2022).
      2. Misra, Rishabh. "Politifact Fact Check Dataset." DOI: 10.13140/RG.2.2.29923.22566 (2022).

      BibTex format:
      @incollection{misra2022not,
        title={Do Not ‘Fake It Till You Make It’! Synopsis of Trending Fake News Detection Methodologies Using Deep Learning},
        author={Misra, Rishabh and Grover, Jigyasa},
        booktitle={Deep Learning for Social Media Data Analytics},
        pages={213--235},
        year={2022},
        publisher={Springer}
      }

      @dataset{misra2022politifact,
        author = {Misra, Rishabh},
        year = {2022},
        month = {09},
        pages = {},
        title = {Politifact Fact Check Dataset},
        doi = {10.13140/RG.2.2.29923.22566}
      }
  • News Headlines Dataset For Sarcasm Detection
    • Past studies in Sarcasm Detection mostly make use of Twitter datasets collected using hashtag-based supervision, but such datasets are noisy in terms of labels and language. Furthermore, many tweets are replies to other tweets, and detecting sarcasm in these requires the availability of contextual tweets. To overcome the limitations related to noise in Twitter datasets, this News Headlines dataset for Sarcasm Detection is collected from two news websites. TheOnion aims at producing sarcastic versions of current events, and we collected all the headlines from its News in Brief and News in Photos categories (which are sarcastic). We collected real (and non-sarcastic) news headlines from HuffPost.
    • Link to Kaggle page (33k+ downloads on Kaggle)
    • Please cite these articles if you use the dataset
      Text format:
      1. Misra, Rishabh and Prahal Arora. "Sarcasm Detection using Hybrid Neural Network." arXiv preprint arXiv:1908.07414 (2019).
      2. Misra, Rishabh and Jigyasa Grover. "Sculpting Data for ML: The first act of Machine Learning." ISBN 978-0-578-83125-1 (2021).

      BibTex format:
      @article{misra2019sarcasm,
        title={Sarcasm Detection using Hybrid Neural Network},
        author={Misra, Rishabh and Arora, Prahal},
        journal={arXiv preprint arXiv:1908.07414},
        year={2019}
      }

      @book{misra2021sculpting,
        author = {Misra, Rishabh and Grover, Jigyasa},
        year = {2021},
        month = {01},
        pages = {},
        title = {Sculpting Data for ML: The first act of Machine Learning},
        isbn = {978-0-578-83125-1}
      }
  • News Category Dataset
    • People rely on daily news to know what is happening around the world. In today’s world, when the proliferation of fake news is rampant, a large-scale and high-quality source of authentic news articles with published category information would be valuable for learning the Natural Language syntax and semantics of authentic news. This dataset contains around 200k news headlines from the years 2012 to 2018 obtained from HuffPost. To make it more useful, I have included the source links of the news articles so that more data can be extracted as needed. The utility of this dataset is multifold: it could be used to produce interesting linguistic insights about the language used in different news articles or simply to identify untracked news articles.
    • Link to Kaggle page (37k+ downloads on Kaggle)
    • Please cite these articles if you use the dataset
      Text format:
      1. Misra, Rishabh. "News Category Dataset." arXiv preprint arXiv:2209.11429 (2022).
      2. Misra, Rishabh and Jigyasa Grover. "Sculpting Data for ML: The first act of Machine Learning." ISBN 978-0-578-83125-1 (2021).

      BibTex format:
      @article{misra2022news,
        title={News Category Dataset},
        author={Misra, Rishabh},
        journal={arXiv preprint arXiv:2209.11429},
        year={2022}
      }

      @book{misra2021sculpting,
        author = {Misra, Rishabh and Grover, Jigyasa},
        year = {2021},
        month = {01},
        pages = {},
        title = {Sculpting Data for ML: The first act of Machine Learning},
        isbn = {978-0-578-83125-1}
      }
  • Clothing Fit Dataset for Size Recommendation
    • Product size recommendation and fit prediction are critical for improving customers’ shopping experiences and reducing product return rates. However, modeling customers’ fit feedback is challenging due to its subtle semantics, arising from the subjective evaluation of products and an imbalanced label distribution (most of the feedback is "Fit"). These datasets, which are the only fit-related datasets publicly available at this time, were collected from ModCloth and RentTheRunWay and could be used to address these challenges and improve the recommendation process.
    • Link to Kaggle page (7k+ downloads on Kaggle)
    • Please cite these articles if you use the dataset
      Text format:
      1. Misra, Rishabh, Mengting Wan, and Julian McAuley. "Decomposing fit semantics for product size recommendation in metric spaces." In Proceedings of the 12th ACM Conference on Recommender Systems, pp. 422-426. 2018.
      2. Misra, Rishabh and Jigyasa Grover. "Sculpting Data for ML: The first act of Machine Learning." ISBN 978-0-578-83125-1 (2021).

      BibTex format:
      @inproceedings{misra2018decomposing,
        title={Decomposing fit semantics for product size recommendation in metric spaces},
        author={Misra, Rishabh and Wan, Mengting and McAuley, Julian},
        booktitle={Proceedings of the 12th ACM Conference on Recommender Systems},
        pages={422--426},
        year={2018},
        organization={ACM}
      }

      @book{misra2021sculpting,
        author = {Misra, Rishabh and Grover, Jigyasa},
        year = {2021},
        month = {01},
        pages = {},
        title = {Sculpting Data for ML: The first act of Machine Learning},
        isbn = {978-0-578-83125-1}
      }
  • IMDB Spoiler Dataset
    • User-generated reviews are often our first point of contact when we consider watching a movie or a TV show. However, beyond telling us about the qualitative aspects of the item we want to consume, reviews may inevitably contain undesired revelatory information (i.e. 'spoilers'), such as the surprising fate of a character in a movie or the identity of the murderer in a crime-suspense movie. For users who are interested in consuming the item but are unaware of the critical plot twists, spoilers may diminish the pleasurable uncertainty and curiosity of media consumption. Therefore, a natural question is how to identify these spoilers in entertainment reviews, so that users can navigate review platforms more effectively. This dataset is collected from IMDB and contains metadata about items as well as user reviews with information on whether a review contains a spoiler or not.
    • Link to Kaggle page (2k+ downloads on Kaggle)
    • Please cite these articles if you use the dataset
      Text format:
      1. Misra, Rishabh. "IMDB Spoiler Dataset." DOI: 10.13140/RG.2.2.11584.15362 (2019).

      BibTex format:
      @dataset{misra2019imdb,
        author = {Misra, Rishabh},
        year = {2019},
        month = {05},
        pages = {},
        title = {IMDB Spoiler Dataset},
        doi = {10.13140/RG.2.2.11584.15362}
      }