Abstract
Massive collection of user or device data is a growing trend for personalizing services. Cheap and scalable storage is key to enabling advanced analysis in the long run. In this context, block (HDFS) and columnar (Dremel, Parquet) stores are increasingly leveraged. Data protection acts, or simply the desire for transparency toward users, require that data be opted out on demand. Unfortunately, these data stores have departed from traditional databases and do not provide efficient access to, or deletion of, specific pieces of data. In this paper, we study how to cost-efficiently opt out user data from these stores. We apply two intuitive strategies (systematic erasure and encryption) to the context of big data systems. We model their respective costs and show that, for a service running atop Amazon Web Services, neither strategy wins in general (except in the special case where data cannot be compressed). Application constraints, such as data arrival and user opt-out rates, should therefore be considered when selecting the most cost-efficient opt-out strategy, while the practical levers are the sharding policy and a careful setting of block sizes.