Competition tasks - IDASH PRIVACY & SECURITY WORKSHOP 2019

Three tracks of competition tasks

Track I: Secure multi-label Tumor classification using Homomorphic Encryption

The goal of this track is to develop a homomorphic encryption (HE) based multi-label classification method for predicting the classes of tumor samples based on genetic information.

This is a secure outsourcing scenario. We expect the model to be trained in plaintext using the challenge dataset. The trained plain model will be used for evaluation on reserved homomorphically encrypted data (assuming it can be hosted on untrusted servers) for performance measurement.

Experimental setting: Given a genetic variant dataset from tumor samples of unknown type and origin, design an efficient and accurate HE method for classifying the types of the tumor samples.
Challenge:

Tumor classification is important for understanding the molecular composition of tumor cells and for proposing diagnosis and treatment for cancer patients. For this track, the tumors are genetically profiled and the mutations in the tumors are identified. The mutations are stored in a text file. The text file will contain the chromosome, position, mutation type, and mutation alleles (i.e., the DNA letters in the mutation). In addition, for each tumor, the challenge dataset will contain the tumor type and other information (such as location).

The participants will use the provided challenge data to build classification models that predict the tumor type and other information in the challenge dataset. The participants can choose any method that they would like to apply. Common choices would be multi-label logistic regression or SVM, and there are also deep learning models [1,2] developed in the bioinformatics community.
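As an illustration of why linear models such as logistic regression are a common choice here, the sketch below computes per-class scores using only additions and multiplications, the operations that HE schemes evaluate natively. It runs on plaintext numbers and is not an HE implementation; the names `linear_scores` and `predict_label`, the toy weights, and the choice to take the argmax after decryption are assumptions made for this example only.

```python
from typing import List, Sequence


def linear_scores(weights: Sequence[Sequence[float]],
                  bias: Sequence[float],
                  x: Sequence[float]) -> List[float]:
    """Per-class scores W.x + b, built from additions and multiplications only.

    In an HE pipeline x would be a ciphertext and the model a plaintext;
    here everything is a plain float so the arithmetic can be inspected.
    """
    scores = []
    for w_row, b in zip(weights, bias):
        acc = b
        for w, xi in zip(w_row, x):
            acc = acc + w * xi  # one multiply and one add per feature
        scores.append(acc)
    return scores


def predict_label(scores: Sequence[float]) -> int:
    """Argmax over the scores; assumed to run on the client after decryption."""
    return max(range(len(scores)), key=lambda k: scores[k])


# Toy model with 2 classes and 3 mutation features (hypothetical values).
W = [[1.0, -1.0, 0.5],
     [-0.5, 2.0, 0.0]]
b = [0.0, -1.0]
x = [1, 0, 1]  # presence/absence of three mutations in one tumor

scores = linear_scores(W, b, x)
print(scores, predict_label(scores))
```

Comparisons and the sigmoid are not native HE operations, which is why this sketch leaves the final decision to the party holding the secret key.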

In addition, the methods need to be trained and parameterized using the training data that we provide. To keep the comparison of model performance fair, please do not use external or private data. The performance of the methods will be evaluated in terms of classification accuracy, with the micro-AUC as the main metric of comparison, and timing (see below).

The participants are required to provide the training algorithm code to ensure fair usage of training data among participants. The participants can use any homomorphic encryption library of their choice in their implementation.

1. Yuan Y, Shi Y, Li C, Kim J, Cai W, Han Z, Feng DD. DeepGene: an advanced cancer type classifier based on deep learning and somatic point mutations. BMC Bioinformatics 2016 Dec 23;17(Suppl 17):476. PMID:28155641
2. Sun Y, Zhu S, Ma K, Liu W, Yue Y, Hu G, Lu H, Chen W. Identification of 12 cancer types through genome deep learning. Sci Rep 2019 Nov 21;9(1):17256. PMID:31754222

File Formats:

We will provide the tumor mutation information in a tab-delimited file where each row corresponds to a mutation and the columns correspond to the tumor samples. The tumor type and other information will be provided in a separate metadata file where each row corresponds to a tumor sample and the columns correspond to the tumor information. The metadata is only needed for training.
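A minimal sketch of reading such a mutation matrix into per-sample feature vectors is given below. The exact layout is an assumption: a header row whose first cell is a label followed by the sample IDs, and data rows whose first cell is a mutation identifier followed by numeric values. The function name `read_mutation_matrix` and the demo identifiers are hypothetical and should be adapted to the released files.

```python
import csv
import io
from typing import Dict, List, TextIO, Tuple


def read_mutation_matrix(handle: TextIO) -> Tuple[List[str], Dict[str, List[float]]]:
    """Parse a tab-delimited matrix with mutations in rows and samples in columns.

    Returns the mutation identifiers (row order) and, for each sample ID,
    its feature vector over those mutations.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    sample_ids = header[1:]  # assumed: first header cell labels the row-ID column
    mutation_ids: List[str] = []
    columns: List[List[float]] = [[] for _ in sample_ids]
    for row in reader:
        if not row:
            continue  # tolerate blank lines
        mutation_ids.append(row[0])
        for j, value in enumerate(row[1:]):
            columns[j].append(float(value))
    return mutation_ids, dict(zip(sample_ids, columns))


# Tiny in-memory stand-in for the real file (hypothetical content).
demo = (
    "mutation\tS1\tS2\n"
    "chr1:100:SNV:A>T\t1\t0\n"
    "chr2:200:DEL:G>-\t0\t1\n"
)
mutation_ids, features = read_mutation_matrix(io.StringIO(demo))
print(mutation_ids)
print(features)
```

Transposing to one vector per sample matches the way the metadata file is organized (one row per tumor sample), which makes it straightforward to pair features with labels during training.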

After the output file is decrypted, we expect it to be in the same format as the metadata file, with the predicted tumor type and other predicted information for each tumor in the rows. The documentation must clearly describe how to encrypt the mutation matrix, run the classification model, and parse the output to extract the predicted tumor information for each tumor sample.

Encryption Requirement: The security level of HE schemes must be set to at least 128 bits. We request teams to use the parameter settings in “5.4 TABLES of RECOMMENDED PARAMETERS” of the HE standardization white paper. If a team wants to use sparse secret keys (which are less explored than non-sparse secrets), the team needs to explain the security in detail (e.g., how the parameters are obtained using Martin's estimator).
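A parameter check of this kind can be sketched as follows. The table values are an assumption: they are transcribed from memory of the white paper's 128-bit classical-security column for uniform ternary secrets and must be confirmed against the white paper itself before use. The function name `meets_128_bit_bound` and the example modulus chains are hypothetical.

```python
from typing import Sequence

# Assumed upper bounds on log2(q) for 128-bit classical security with a
# uniform ternary secret, keyed by ring dimension n. Verify against the
# HE standardization white paper before relying on them.
MAX_LOG_Q_128 = {
    1024: 27,
    2048: 54,
    4096: 109,
    8192: 218,
    16384: 438,
    32768: 881,
}


def meets_128_bit_bound(poly_degree: int, modulus_bits: Sequence[int]) -> bool:
    """Return True if the total coefficient modulus size fits the table bound."""
    if poly_degree not in MAX_LOG_Q_128:
        raise ValueError(f"no table entry for n = {poly_degree}")
    return sum(modulus_bits) <= MAX_LOG_Q_128[poly_degree]


# A 200-bit modulus chain at n = 8192 fits the assumed 218-bit bound.
print(meets_128_bit_bound(8192, [60, 40, 40, 60]))
# A 120-bit chain at n = 4096 exceeds the assumed 109-bit bound.
print(meets_128_bit_bound(4096, [60, 60]))
```

The check covers only non-sparse ternary secrets; sparse-secret parameter sets fall outside the table and need the separate security explanation requested above.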

Code submission and license agreement: We do not enforce the public release of code or binaries, but we encourage teams to make their solutions available to the entire community under an open-source software license.

Evaluation Environments: All submissions will be evaluated in Docker containers on physical servers. During evaluation, each container will be capped at 4 CPU cores (Intel Xeon Platinum 8180 CPU @ 2.50GHz), 32 GB memory, and 500 GB storage.

Evaluation Criteria: The solutions will be evaluated in terms of classification performance and efficiency. We cap the total round-trip time (including encryption, computation, and decryption) for all solutions at 5 minutes for the multi-label classification task. For solutions meeting the efficiency criterion, we will measure the micro-AUC of the tumor classification and produce a final ranking. See link.
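For reference, micro-averaged AUC can be computed by flattening the label indicator matrix and the score matrix into two long vectors and computing a single ROC AUC over all (sample, class) pairs. The sketch below does this with the rank-sum formulation and average ranks for ties. It is an illustration of the metric, not the organizers' evaluation script; the function names and toy scores are assumptions.

```python
from typing import List, Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Binary ROC AUC via the Mann-Whitney rank sum, with average ranks for ties."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative")
    pairs = sorted(zip(scores, labels))
    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        # Tied scores occupy 1-based ranks i+1 .. j; use their average.
        avg_rank = (i + 1 + j) / 2.0
        rank_sum_pos += avg_rank * sum(label for _, label in pairs[i:j])
        i = j
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def micro_auc(label_rows: Sequence[Sequence[int]],
              score_rows: Sequence[Sequence[float]]) -> float:
    """Micro-average: pool every (sample, class) cell, then compute one AUC."""
    flat_labels: List[int] = [v for row in label_rows for v in row]
    flat_scores: List[float] = [v for row in score_rows for v in row]
    return roc_auc(flat_labels, flat_scores)


# Three tumors, three classes; rows are samples, columns are classes.
y_true = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
y_score = [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.3, 0.3, 0.4]]
print(micro_auc(y_true, y_score))
```

Because all cells are pooled, classes with many samples weigh more heavily than rare classes, which is the main practical difference from a macro average.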

Output API: Each submission should write the classification output for each tumor to a .CSV file. The timing results (round trip, encryption, computation, decryption) should be written to a separate file. When multiple models are submitted, they should share the same output API and allow users to select which model to use for classification.
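One possible shape for these two output files is sketched below. The column names, the stage labels, and the `encrypt`, `classify`, and `decrypt` placeholders are all hypothetical; a real submission would call its HE library in those three functions and should follow the metadata file's format for the prediction columns.

```python
import csv
import io
import time
from typing import Callable, Dict, List, TextIO, Tuple

STAGES = ("encryption", "computation", "decryption")


def run_stage(timings: Dict[str, float], name: str, func: Callable, arg):
    """Run one pipeline stage and record its wall-clock duration in seconds."""
    start = time.perf_counter()
    result = func(arg)
    timings[name] = time.perf_counter() - start
    return result


def write_predictions(handle: TextIO, predictions: List[Tuple[str, str]]) -> None:
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["sample_id", "predicted_type"])  # hypothetical column names
    writer.writerows(predictions)


def write_timings(handle: TextIO, timings: Dict[str, float]) -> None:
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["stage", "seconds"])
    for stage in STAGES + ("round_trip",):
        writer.writerow([stage, f"{timings[stage]:.6f}"])


# Placeholders standing in for the real HE steps (no cryptography here).
def encrypt(sample_ids):
    return list(sample_ids)


def classify(ciphertexts):
    return [(sample_id, "type_A") for sample_id in ciphertexts]


def decrypt(encrypted_results):
    return list(encrypted_results)


timings: Dict[str, float] = {}
ciphertexts = run_stage(timings, "encryption", encrypt, ["S1", "S2"])
encrypted = run_stage(timings, "computation", classify, ciphertexts)
predictions = run_stage(timings, "decryption", decrypt, encrypted)
timings["round_trip"] = sum(timings[s] for s in STAGES)

pred_file, timing_file = io.StringIO(), io.StringIO()  # stand-ins for real files
write_predictions(pred_file, predictions)
write_timings(timing_file, timings)
print(pred_file.getvalue())
print(timing_file.getvalue())
```

Keeping the timing logic in one wrapper makes it easy to expose the same output API for several models, since only the three stage functions change.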

Dataset: link

Link to submission: link

FAQ: link (Please make comments to the question id.)

Track II: Privacy-preserving clustering of single-cell transcriptomics data in SGX

The competitors are expected to implement an unsupervised clustering algorithm to cluster single-cell gene expression data under the protection of SGX, Intel’s trusted execution environment.

Background: In the past few years, single-cell RNA-seq technologies have advanced rapidly. Unsupervised learning methods such as dimension reduction and clustering algorithms are now widely used to group cells of the same type or subtype based on the gene expression profiles of hundreds to thousands of single cells, e.g., from tumor or normal tissues. Previous research has shown that gene expression patterns may reveal identifiable information about the donor of the tissues; therefore, proper protection should be in place when these data are re-analyzed (e.g., in a meta-study) by an untrusted user. A Trusted Execution Environment (TEE, e.g., Intel's SGX) provides an ideal infrastructure for hosting such privacy-preserving analyses of single-cell transcriptomics data by a data user, in which 1) only the computing task (i.e., clustering of single-cell gene expression profiles in this case) approved by the owner of the input data is allowed to be performed on the data; and 2) the data user does not see the content of the input data, which are encrypted in a way that only the client and the TEE can decrypt. For this purpose, an efficient implementation of a clustering algorithm for a trusted execution environment is the key, given the limited computing resources available inside the TEE. The purpose of this task is therefore to test the efficiency of an unsupervised clustering algorithm in SGX when applied to massive single-cell RNA-seq data.
Challenge: We challenge participating teams to implement a given clustering algorithm (CIDR [1]) on the Intel SGX platform so that the algorithm can operate inside the SGX enclave. The implementation should protect all data: any input, intermediate, and output data should be encrypted outside the enclave. However, we do not consider side-channel leaks in this task.
Experimental setting: We will provide a testing dataset along with the plain implementation of the expected clustering algorithm. Each team is challenged to implement the model under SGX. For this purpose, the team is allowed to develop approximation algorithms so that the model can work in the enclave, as long as its accuracy is largely preserved and privacy is fully protected (except for side-channel leaks). The testing dataset will be used to evaluate the implemented model. The solution may utilize computational resources outside the enclave, including the CPU, memory, and hard disk, as long as all the data and the model are fully protected (encrypted at a security level of at least 128 bits). The submitted solution cannot involve any additional party. Pre-computing time will be measured as part of the performance overhead.
Evaluation Environment: All submissions will be evaluated on a single node with Intel Xeon E3-1280 v5 processor (4 physical cores, Hyper-Threading enabled) and 64 GiB memory. The PRM size is 128 MiB.
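To give a feel for what the 128 MiB PRM implies, the sketch below estimates how many cells of a dense expression matrix fit into a given in-enclave memory budget and how many blocks the matrix would have to be split into. The 64 MiB working budget, the matrix dimensions, and the function names are assumptions for illustration; the usable enclave memory is smaller than the PRM, and whether block-wise processing suits each step of CIDR is left to the implementer.

```python
import math

MIB = 1024 * 1024


def rows_per_block(n_genes: int, bytes_per_value: int, budget_bytes: int) -> int:
    """Number of cells (rows) of a dense matrix that fit in the memory budget."""
    row_bytes = n_genes * bytes_per_value
    if row_bytes > budget_bytes:
        raise ValueError("a single row does not fit in the budget")
    return budget_bytes // row_bytes


def blocks_needed(n_cells: int, n_genes: int,
                  bytes_per_value: int, budget_bytes: int) -> int:
    """Number of row blocks required to stream the whole matrix through the enclave."""
    return math.ceil(n_cells / rows_per_block(n_genes, bytes_per_value, budget_bytes))


# Hypothetical dataset: 10,000 cells x 20,000 genes stored as 8-byte floats,
# with an assumed 64 MiB working budget inside the enclave.
budget = 64 * MIB
print(rows_per_block(20_000, 8, budget))
print(blocks_needed(10_000, 20_000, 8, budget))
```

Each block that leaves the enclave would have to be encrypted again, so the block count also gives a rough idea of the extra encryption and decryption work.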
Requirement: Each participating team should submit their implementation together with source code. We will provide remote access to an SGX system at Indiana University for the team to install their system for evaluation. The teams are responsible for the compatibility of their implementation and the system. The evaluation will be done using un-released testing data.
Evaluation Criteria: All submissions should meet the security requirements (at least 128-bit security level). We expect the protected model to mostly preserve the accuracy of the original model, and we will compare the performance of different models when their accuracy is close.
Dataset: link[2]
FAQ: link
Submission: We note that we cannot run submitted solutions with root privileges. All participants should therefore provide a way for us to install their dependencies without root privileges, together with a detailed description. For example, you may choose to use a third-party execution environment such as Scone to implement your solution.
We also provide a base Docker image as an option at DockerHub: https://hub.docker.com/repository/docker/idashsgx/idash2020sgx. Please check the README for detailed usage. In this case, you can install and customize all tools, compilers, and dependencies in the Docker image, and we will run your submission in a Docker container without root privileges. Please note that the impact of the Docker container (such as performance overhead) will be included in the evaluation in this case.
Software specifications of the testbed:
      OS: GNU/Linux Ubuntu 18.04 LTS
      SGXSDK version: Intel SGX SDK 2.3
      Compiler: gcc version 7.5.0
Please contact Diyue Bu (diybu@indiana.edu) to get the link for the submission (please include your track number in the email). If a docker image from public resources is used, please submit a description file for the link and password to get the docker image.
References

1. Lin, P., Troup, M. & Ho, J.W. CIDR: Ultrafast and accurate clustering through imputation for single-cell RNA-seq data. Genome Biol 18, 59 (2017).

2. Kim, N., Kim, H.K., Lee, K., Hong, Y., Cho, J.H., Choi, J.W., Lee, J.I., Suh, Y.L., Ku, B.M., Eum, H.H. and Choi, S., 2020. Single-cell RNA sequencing demonstrates the molecular and cellular reprogramming of metastatic lung adenocarcinoma. Nature Communications, 11(1), pp.1-15.

Track III: Differentially private federated learning for cancer prediction model

The competitors are tasked to train a machine learning model on gene expression data for breast tumors, with all the data secretly shared across multiple servers.

Background: Training a modern machine learning model often requires a large amount of data distributed across multiple organizations. Oftentimes, however, data owners are reluctant to share their data (e.g., genome data from human subjects), even in encrypted form, due to the restrictions of their organizational privacy policies. It is therefore highly desirable to allow two or more owners to build a joint ML model while ensuring the privacy protection of the data and policy compliance. This task is designed to understand the feasibility of building such a machine learning model among multiple collaborating parties so that the data of each organization does not leave its premises and the information (e.g., intermediate parameters of the model) exchanged across the parties during the computation is properly protected under differential privacy.
Challenge: We challenge participating teams to implement a federated learning algorithm that can be trained jointly by two parties, each holding its own training dataset, i.e., gene expression levels for a group of patients with known phenotypes (disease or not). The implementation should not share the input data but can exchange intermediate results (e.g., intermediate model parameters) with the other party during the training process; however, noise should be added to the shared intermediate results to ensure differential privacy under a given budget across the whole learning process. Each team can choose any ML algorithm to accomplish the task.
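One round of such a two-party protocol could look like the sketch below, which uses logistic regression with per-example gradient clipping and Gaussian noise added before anything leaves a party. This is an illustrative sketch, not a reference solution: the function names, the toy data, and the noise level `sigma` are assumptions, and `sigma` must be calibrated separately to the privacy budget and to the sensitivity of the shared quantity.

```python
import math
import random
from typing import List, Sequence, Tuple

Party = Tuple[Sequence[Sequence[float]], Sequence[int]]  # (feature rows, labels)


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def clipped_mean_gradient(weights: Sequence[float], rows, labels,
                          clip_norm: float) -> List[float]:
    """Mean logistic-loss gradient with each example's gradient clipped in L2 norm."""
    total = [0.0] * len(weights)
    for x, y in zip(rows, labels):
        err = sigmoid(sum(w * xi for w, xi in zip(weights, x))) - y
        grad = [err * xi for xi in x]
        norm = math.sqrt(sum(g * g for g in grad))
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for k, g in enumerate(grad):
            total[k] += g * scale
    return [t / len(rows) for t in total]


def federated_round(weights: Sequence[float], party_a: Party, party_b: Party,
                    clip_norm: float, sigma: float, lr: float,
                    rng: random.Random) -> List[float]:
    """Each party shares only a noised gradient; both apply the averaged update."""
    shared = []
    for rows, labels in (party_a, party_b):
        grad = clipped_mean_gradient(weights, rows, labels, clip_norm)
        # Noise is added locally, before the vector is sent to the other party.
        shared.append([g + rng.gauss(0.0, sigma) for g in grad])
    averaged = [(a + b) / 2.0 for a, b in zip(*shared)]
    return [w - lr * g for w, g in zip(weights, averaged)]


# Toy data: two genes, one patient per party (hypothetical values).
party_a = ([[1.0, 0.0]], [1])
party_b = ([[0.0, 1.0]], [0])
weights = federated_round([0.0, 0.0], party_a, party_b,
                          clip_norm=1.0, sigma=0.1, lr=1.0,
                          rng=random.Random(0))
print(weights)
```

Every round spends part of the privacy budget, so the number of rounds and the noise level have to be chosen together; the raw feature rows never appear in the exchanged vectors.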
Evaluation Criteria: Submissions qualify if they meet the differential privacy requirement under a given privacy budget. Qualified solutions will be ranked based on their performance, including prediction accuracy, total running time, and communication cost (the number of rounds and the size of the data exchanged). The evaluation team will run the training code on the released data for up to 24 hours. Solutions that do not complete within 24 hours will be disqualified.
Differential Privacy (DP) model: We welcome solutions based on both the conventional ε-DP definition and the (ε,δ)-DP model. Each participating team may choose to submit two solutions, one for each of these DP approaches. Please describe your approach clearly in the documentation. We will evaluate the solutions separately if there are enough submissions in each category.
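The two definitions lead to different standard noise calibrations, sketched below: the Laplace mechanism for ε-DP and the classical Gaussian mechanism for (ε,δ)-DP, together with an even split of the budget over training rounds under basic sequential composition. These are textbook formulas offered as an illustration; the function names and example numbers are assumptions, and the classical Gaussian bound shown here only holds for ε < 1, so larger budgets need a tighter analysis.

```python
import math


def laplace_scale(l1_sensitivity: float, epsilon: float) -> float:
    """Scale b of Laplace noise giving epsilon-DP for a query with this L1 sensitivity."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return l1_sensitivity / epsilon


def gaussian_sigma(l2_sensitivity: float, epsilon: float, delta: float) -> float:
    """Standard deviation for the classical Gaussian mechanism, (epsilon, delta)-DP."""
    if not 0 < epsilon < 1:
        raise ValueError("the classical bound is stated for 0 < epsilon < 1")
    if not 0 < delta < 1:
        raise ValueError("delta must lie in (0, 1)")
    return l2_sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon


def per_round_epsilon(total_epsilon: float, rounds: int) -> float:
    """Even split of the budget; basic composition sums the epsilons of all rounds."""
    if rounds < 1:
        raise ValueError("rounds must be at least 1")
    return total_epsilon / rounds


# Hypothetical setting: sensitivity 1, total budget 5 spread over 10 rounds.
eps_round = per_round_epsilon(5.0, 10)
print(eps_round)
print(laplace_scale(1.0, eps_round))
print(gaussian_sigma(1.0, eps_round, 1e-5))
```

Under basic composition for (ε,δ)-DP the δ values of the rounds add up as well, so δ would need to be split across rounds in the same way as ε.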
Dataset: link[1]
FAQ: link
Baseline experiment: Our experiment used the official implementation from https://github.com/tensorflow/privacy to train a deep neural network on the dataset ‘BC-TCGA’ [1], achieving an accuracy between 0.75 and 0.9 as the privacy budget varies from 3 to 90, while the model trained on the whole dataset achieved an accuracy of around 0.97.
Baseline sample codes: link
Submission: The submitted solution should contain two programs, one for training and one for testing. The training program takes as input a dataset of gene expression profiles in the same format as the given testing dataset and outputs a model file (containing model parameters, etc., in a format recognizable by the testing program). The testing program takes as input the model file and makes a prediction for a given input instance (i.e., a column of gene expression profile). The training program should operate on two separate machines that communicate intermediate results (model parameters) during training. The testing program is a standalone program running on a single computer. The submitted solution should also contain a readme file explaining the differential privacy method implemented in the training program and how to run the programs. Please contact Diyue Bu (diybu@indiana.edu) to get the link for the submission (please include your track number in the email).
Reference

1. Haozhe Xie, Jie Li, Qiaosheng Zhang, and Yadong Wang. Comparison among dimensionality reduction techniques based on random projection for cancer classification. Computational Biology and Chemistry, 65:165–172, 2016.

The submitted solutions will be used by the organizers only for evaluation purposes. The intellectual property of each solution is retained by the submitting team.