A Top-Down k-Anonymization Implementation for Apache Spark
Abstract
Data science evolves continually, adapting to an exponentially increasing amount of data. This progress makes it easier to extract meaningful information from huge amounts of data in various domains, including individual, public health, micro-blogging, and sensor data. The ability to process huge volumes of data and extract valuable information sometimes scares people, especially when sensitive individual data is concerned. Many privacy-preserving techniques have been developed to address these fears. Over the years, these techniques have been adapted to meet emerging types and increasing volumes of data. For instance, coping with today's big data requires more scalable and efficient methods, and big data platforms such as Apache Hadoop and Apache Spark are widely used for this purpose. In this paper, we study the k-anonymization problem in the context of big data and develop a top-down specialization anonymization solution for the Apache Spark platform. An extensive experimental evaluation has been carried out, and the efficiency results are presented.