Competition tasks - iDASH Privacy & security workshop 2017

Three tracks of competition tasks

Track 1: De-duplication for Global Alliance for Genomics and Health (GA4GH)
The goal of this track is to develop a privacy-preserving patient linkage (PPRL) technique on top of the existing European ENCCA Unified Patient Identifier (EUPID) framework to facilitate the deduplication task in GA4GH.
Experimental setting: Given hashed patient attributes (identification number, first and last name, gender, etc.), participants should develop efficient secure multiparty (>=3 parties) linkage protocols that scale well to real-world applications (e.g., thousands of centers and millions of records in total). The solution needs to ensure secure communication throughout the whole process.
Challenge: We will simulate input data for the following scenario: 1000 centers, each containing ~10,000 records, with ~0.1% duplicated records between most centers, although some pairs of centers may contain up to 10% duplicates. Assuming the records are entered at a random pace locally, these centers will communicate with a set of k (>=3) servers (assumed to have an honest majority) to jointly determine whether the current record has been observed somewhere else. The final output should be a list of unique IDs representing all duplicated records among different centers. We consider a semi-honest security model in this secure multiparty computation setup. All communication should be done using TLS.
Evaluation Criteria: We will evaluate the accuracy, communication cost, computation cost, and overall turnaround time with increasing number of records at each center.
Note: A submitted solution may designate one special server among the k servers that is required in every SMC deduplication run. Some special computation can be assigned to this server, but the overall secure computation should remain highly distributed among all k servers.
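To make the expected output concrete, here is a minimal plaintext baseline in Python that reports the IDs of records whose hashed attributes appear at more than one center. It is not a secure protocol: it pools every hash in one place, which is exactly what a multiparty solution should avoid, so it is only useful as a reference for checking accuracy. The use of HMAC-SHA256, the shared key, the normalization step, and the (center_id, record_id, attributes) record layout are all assumptions made for illustration; the actual hashing used in the competition data is not specified here.

```python
import hashlib
import hmac
from collections import defaultdict


def normalize(value):
    """Lowercase and collapse whitespace so trivially different spellings hash alike."""
    return " ".join(str(value).lower().split())


def keyed_hash(attributes, key):
    """HMAC-SHA256 over the normalized attributes (hypothetical hashing scheme)."""
    message = "\x1f".join(normalize(a) for a in attributes).encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def find_duplicates(records, key):
    """Return sorted record IDs whose hash occurs at two or more different centers.

    records: iterable of (center_id, record_id, attributes) tuples (assumed layout).
    """
    seen = defaultdict(list)
    for center_id, record_id, attributes in records:
        seen[keyed_hash(attributes, key)].append((center_id, record_id))

    duplicates = []
    for entries in seen.values():
        centers = {center for center, _ in entries}
        if len(centers) > 1:  # same patient seen at different centers
            duplicates.extend(record_id for _, record_id in entries)
    return sorted(duplicates)


# Toy data: the first two records describe the same patient at different centers.
records = [
    ("center_a", "a-1", ("12345", "Ada", "Lovelace", "F")),
    ("center_b", "b-7", ("12345", "ada ", "LOVELACE", "f")),
    ("center_b", "b-8", ("67890", "Alan", "Turing", "M")),
]
duplicate_ids = find_duplicates(records, key=b"shared-demo-key")
print(duplicate_ids)
```

A secure solution would need to produce the same list while distributing the comparison across the k servers so that no single server sees all hashes.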
Dataset: Download here

Track 2: Software Guard Extension (SGX) based whole genome variants search
The goal of this track is to develop scalable solutions using secure hardware (i.e., SGX) to enable secure whole genome variants search among multiple individuals.
Experimental setting: Given a database of WGS VCFs of 500 records (labeled with case/control), participants should use SGX to generate top K most significant SNPs (based on chi-squared test).
Challenge: The size of the data will be more than 4GB, which is beyond the capacity of current SGX (including the paging model). It is therefore necessary to devise a partitioning mechanism (e.g., horizontal, vertical, etc.) to load the data into the enclave for securely calculating allele frequencies, chi-squared statistics, and the top K most significant SNPs. In this process, it might be beneficial to use compression, multiple enclaves, or a multi-threading model to speed up computation. These are real-world challenges that must be handled in this task in order to demonstrate good performance.
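The statistical part of this task can be sketched in plain Python, outside any enclave. The example below computes the Pearson chi-squared statistic (without continuity correction) for a 2x2 table of allele counts and keeps only the K best SNPs in a bounded heap while streaming over partitions, so that no more than one partition plus K results has to be held in memory at a time. The tuple layout (snp_id, case_alt, case_ref, ctrl_alt, ctrl_ref), the function names, and the toy counts are assumptions for illustration; parsing VCF files and all SGX-specific code are deliberately left out.

```python
import heapq


def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared for the table [[a, b], [c, d]], no continuity correction."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:  # a zero row or column total carries no signal
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def top_k_snps(partitions, k):
    """Stream over partitions and return the k SNPs with the largest statistic.

    partitions: iterable of lists of
        (snp_id, case_alt, case_ref, ctrl_alt, ctrl_ref)  # assumed layout
    """
    heap = []  # min-heap of (statistic, snp_id), never larger than k
    for partition in partitions:
        for snp_id, case_alt, case_ref, ctrl_alt, ctrl_ref in partition:
            stat = chi_squared_2x2(case_alt, case_ref, ctrl_alt, ctrl_ref)
            if len(heap) < k:
                heapq.heappush(heap, (stat, snp_id))
            elif stat > heap[0][0]:
                heapq.heapreplace(heap, (stat, snp_id))
    return [(snp_id, stat) for stat, snp_id in sorted(heap, reverse=True)]


# Toy allele counts split into two partitions (hypothetical values).
partitions = [
    [("rs1", 30, 70, 10, 90), ("rs2", 20, 80, 20, 80)],
    [("rs3", 50, 50, 10, 90)],
]
top = top_k_snps(partitions, k=2)
print(top)
```

The bounded heap is one possible design choice: it keeps the top-K state small and independent of the total number of SNPs, which matters when enclave memory is the bottleneck.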
Evaluation: We will check the security compliance (at least 128-bit security level), memory usage and speed. Note that we might increase the cohort size to evaluate the scalability and our labeling (case/control) will be different from the one provided online to ensure that the solution is generalizable (not customized to a single setting of parameters).
Dataset: Download here

Track 3: Homomorphic encryption (HME) based logistic regression model learning
The goal of this track is to develop HME based secure solutions for building a machine learning model (i.e., logistic regression) over encrypted data.
Experimental setting: Given the genotype/phenotype data about two cohorts (disease vs. healthy), devise a machine learning model to predict the disease.
Challenge: Develop homomorphic algorithms for training a logistic regression model (e.g., estimating its parameters). Participants can implement any optimization algorithm (e.g., as discussed in Tom Minka's tutorial) to solve the learning problem of logistic regression. In this process, it might be beneficial to use an approximation of the sigmoid function and a fixed-point representation (to support fractional arithmetic).
Evaluation: The homomorphically learned model needs to demonstrate good performance (e.g., well-estimated parameters when compared with the original parameters learned on plaintext) on a private training/testing dataset, which is different from the one provided online. We will evaluate the speed, storage/memory cost, and generalizability of each solution. The dataset will include up to 200 records and at least 5 binary covariates (we may test the scalability of the solution with up to 1000 records and 100 covariates). An 80-bit security level is required for all encrypted data and all computation on it.
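The two hints above can be illustrated with a plaintext Python sketch; no encryption library is involved. The sigmoid is replaced by its degree-3 Taylor expansion around zero, 1/2 + x/4 - x^3/48, so that each gradient descent step uses only additions and multiplications, which are the operations a homomorphic scheme can evaluate. This expansion is only accurate for small |x| and is one possible choice among several. The fixed-point helpers show how fractions can be carried as scaled integers. The scale of 16 bits, the learning rate, the iteration count, and the toy data are assumptions for illustration.

```python
SCALE_BITS = 16  # assumed fixed-point precision


def to_fixed(x):
    """Encode a real number as an integer scaled by 2**SCALE_BITS."""
    return int(round(x * (1 << SCALE_BITS)))


def from_fixed(v):
    """Decode a fixed-point integer back to a float."""
    return v / (1 << SCALE_BITS)


def fixed_mul(a, b):
    """Multiply two fixed-point values and rescale the product."""
    return (a * b) >> SCALE_BITS


def sigmoid_poly(x):
    """Degree-3 Taylor approximation of the sigmoid, valid for small |x|."""
    return 0.5 + x / 4.0 - x ** 3 / 48.0


def train(X, y, lr=1.0, iters=500):
    """Batch gradient descent for logistic regression using sigmoid_poly.

    Returns [intercept, w_1, ..., w_d].
    """
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)
    for _ in range(iters):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            row = [1.0] + list(xi)  # prepend the intercept term
            z = sum(wj * xj for wj, xj in zip(w, row))
            err = sigmoid_poly(z) - yi
            for j in range(d + 1):
                grad[j] += err * row[j]
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


# Toy data: one binary covariate; 1 of 4 positives when x=0, 3 of 4 when x=1.
X = [[0], [0], [0], [0], [1], [1], [1], [1]]
y = [1, 0, 0, 0, 1, 1, 1, 0]
weights = train(X, y)
print(weights)

# Fixed-point round trip: 0.5 * 0.25 computed on scaled integers.
product = from_fixed(fixed_mul(to_fixed(0.5), to_fixed(0.25)))
print(product)
```

On this toy data the exact maximum-likelihood solution is an intercept of -ln 3 and a slope of 2 ln 3; the polynomial version lands close to these values because the linear predictor stays in the range where the Taylor expansion is accurate. On data where the predictor grows large, this particular approximation degrades and a different polynomial or input scaling would be needed.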
Dataset: Download here

Evaluation environments
For Track 2, all submissions will be evaluated on SGX-enabled Intel Skylake Xeon CPUs. For Tracks 1 and 3, all submissions will be evaluated on VMs running on Ivy Bridge or newer generation Xeon CPUs.

FAQ: See Google Doc Here
