Competition tasks - IDASH PRIVACY & SECURITY WORKSHOP 2019

Four tracks of competition tasks

Track I: Distributed Gene-Drug Interaction Data Sharing based on Blockchain and Smart Contracts
The goal of this track is to develop smart contracts on a blockchain network to share gene-drug interaction data in a distributed way.
Experimental setting: Given a set of gene-drug interaction outcome data, design a time/space efficient data structure and mechanisms to share (i.e. store and retrieve) data based on Ethereum Solidity.
Challenge: The gene-drug interaction outcome data is represented using the following columns: gene-name, variant-number (e.g. between 1 - 99), drug-name, outcome (e.g. improved, unchanged, or deteriorated), suspected-gene-outcome-relation (e.g. yes or no), and serious-side-effect (e.g. yes or no). For example, “gene-name=HLA-B, variant-number = 57, drug-name = abacavir, outcome = improved, suspected-gene-outcome-relation = yes, and serious-side-effect = no”. All data and intermediate data (e.g. index or cache) must be saved on-chain (i.e. no off-chain data storage allowed) via smart contracts, programs which execute within the blockchain. We have chosen smart contracts because, compared to traditional off-chain programs, they are transparent (i.e. every node can verify who deployed the program and can make sure it is using the right version of the program) and immutable (i.e. the deployed program is not alterable, and any new versions of the program are recorded and visible to all nodes). We will provide the skeleton of the smart contracts. Note that the implementation of a smart contract is required to allow the insertion of one line of the gene-drug outcome data at a time. Each participant can determine how each line is represented and stored in the smart contract; it does not need to be a plain text copy of the data entry. The query function in the smart contract will allow a user to search using any field of one data line (e.g. gene-name, variant-number, drug-name) as well as any “AND” combination (e.g. gene-name AND variant-number). The returned results will include the gene-name, variant-number, drug-name and the counts and percentages of each outcome, suspected-gene-outcome-relation, and serious-side-effect. The participants should not use any third-party libraries. There will be 4 nodes in the blockchain network, and 4 data files to be stored. Users should be able to query the data from any of the 4 nodes. Participants can implement any algorithm to store, retrieve and present the data correctly and efficiently.
Requirement: The participants must consent to release any code and/or binaries submitted to the competition under the GNU Lesser General Public License v3.0 Open Source license.
Evaluation Criteria: The data sharing system will need to demonstrate good performance (i.e. accurate query results) by using a test dataset, which is different from the one provided online. We will evaluate the speed, storage/memory cost, and scalability of each solution. We will provide the parameters and software versions to construct the Ethereum test bed for fairness. We plan to use Go-Ethereum platform, and no modification of the underlying Go-Ethereum source code is allowed. The submission should include Solidity source code.
Data and Code Skeleton: link
FAQ: link
Evaluation code: link
Link to submission: link

Track II: Secure Genotype Imputation using Homomorphic Encryption
The goal of this track is to develop a homomorphic encryption (HE) based method for performing genotype imputation.
Experimental setting: Given a variant-genotype dataset with missing entries, design an efficient and accurate HE method for imputing missing entries.
Challenge: Genotype imputation is used to predict missing and low-quality variant genotypes in large scale genotyping projects. The basic motivation is to make use of correlations between genotypes of existing variants to estimate genotypes of missing variants. In this challenge, the participants are expected to analyze the tradeoff between complicated (potentially more powerful) machine learning algorithms at (more expensive and challenging) HE evaluation vs. simple (potentially less powerful) machine learning algorithms at (less expensive and easy-to-realize) HE evaluation for the problem of genotype imputation.
The challenge dataset contains two parts: First part is the target SNPs. Target SNPs comprise 500 SNPs whose genotypes will need to be imputed. The second part of the data are the predictor (or tag) SNPs. This is the list of SNPs that will be used as input to the imputation models to predict genotypes of the target SNPs. We provide 2 sets of tag SNP loci (in total of approximately 9,500 SNPs). Training data including target SNPs and 2 sets of tag SNPs will be provided in this task (elaborated in the paragraph after next). The predictor sets differ from each other with respect to the average genomic distance between consecutive SNPs.
For each set of tag SNPs, participants are expected to build an imputation model using the genotypes of the tag SNPs as inputs. These models will then generate genotype predictions for all the target SNPs. The models will be trained in plaintext domain but the final submission must encrypt (unseen/test) tag SNPs and use them as inputs to impute target SNPs (500 at the same position in the provided training data). The participants must provide a linux script that will be used for encrypting to the tag SNP genotype matrix (See below).
The genotype data for target SNPs and 2 sets of predictor SNPs are provided as a training dataset in 3 plaintext matrices where rows are SNPs and columns are individuals. These datasets contain in total around 10,000 SNPs for 2,500 individuals selected from two 10MB long regions on chromosome 1. In the performance assessment, the genotype matrix (from a database that is undisclosed to participants) for each of the 2 sets of tag SNPs will be fed separately into trained models. This unseen dataset contains genotypes for less than 500 individuals. The output from the model is required to be an encrypted file that contains the genotype scores for each SNP and for each individual. We will decrypt the output from the models and perform assessments. For each individual, the imputation model must generate a genotype score/probability (between 0 and 1) for each genotype (0, 1, 2). Note that each set of tag SNPs will be run and assessed independently. These scores will be used to build the AUC scores for the methods.
The genotype imputation methods can be developed based on statistical testing or machine learning models. The participants can choose any method that they would like to apply. In addition, the methods need to be trained and parameterized using the training data that we provide. Please do not use external or private data for the fairness in competing model performance. The performance of the methods for each of the 2 tag SNP lists will be evaluated independently. For the solution, the participants can provide multiple versions of models (up to 3) as long as they are reasonably manageable in terms of size and fits into the docker container of our specified configuration. That is, given each of the 2 tag SNP lists, participant’s solution can return separate outputs of different versions (<=3) of one model.
The participants are required to provide the training algorithm code to ensure fair usage of training data among participants. Also, the models must take into account that there may be small amount of missing values in the tag SNP genotypes. The participants can use any homomorphic encryption library in their implementation. However, the performance will be a metric in evaluation and efficient libraries can increase the overall score of the methods.
File Formats: We do not enforce any file format requirements except for the unencrypted genotype matrices. An example of this can be seen in the training data that we provide among the datasets. The matrix is a tab delimited text file where, first 4 columns are the chromosome, start, end, variant name entries. Starting from 5th column, the genotype is encoded as 0, 1, 2 where 0 denotes homozygous reference, 1 denotes heterozygous, and 2 denoted by homozygous alternate. The training dataset represents an example of the file format that we will use in evaluation. In the evaluation, we will use the 2 tag SNP genotype matrices as inputs to each participants’ models. These matrices will be run separately so participants must exactly specify how to run the prediction model using each tag SNP matrix. The output from the models is required to generate a genotype score for each genotype of each individual for each target SNP. The documentations must clearly describe how to encrypt the tag SNP genotype matrix, run the model(s), and parse the output to extract the genotype score for each genotype for each individual for each target SNP.
Encryption Requirement: The security level of HE schemes must be set at least 128 bits. We request teams to use the parameter settings in “5.4 TABLES of RECOMMENDED PARAMETERS” of HE standardization white paper. If a team wants to use sparse secret keys (less explored than non-sparse secrets), requesting team needs to explain the security in details (e.g., how the parameters are obtained using Martins' estimator)
~~Code submission and license agreement:~~ ~~The participants must consent to release any code and/or binaries submitted to the competition under the GNU Lesser General Public License v3.0 Open Source license.~~ We do not enforce the public release of code/binary but will encourage teams to make available their solutions under open source software license for the entire community.
Evaluation Environments: All submissions will be evaluated using Docker container on physical servers. The container will be capped with 4 CPU cores (Intel Xeon Platinum 8180 CPU @ 2.50GHz), 32 GB memory, and 500 GB storage in evaluation.
Evaluation Criteria: The solutions will be evaluated in terms of imputation performance and efficiency. We cap the ~~running time~~ total roundtrip time (including encryption, computation, and decryption) for all solutions at ~~30 minutes~~ 10 minutes for imputing 500 target SNPs. For solutions meeting the efficiency criterion, we will measure micro-AUC as the sample-level imputation performance (to account for target label imbalance) and produce a final ranking. See link .
Output API: Each submission should output probabilities of each target label (0, 1, 2) for each sample in a .CSV file. The other output will be time (round trip, encryption, computation, decryption) (as a separate file). ~~Three~~ Two different models up to three versions of each model based on different tag SNPs should share the same output API and allow users to select which model to conduct imputation.
Dataset: link
Evaluation code: link
Link to submission: link
FAQ: link

Track III: Privacy-preserving Machine Learning as a Service on SGX

Background: Today trained machine learning (ML) models are offered as a service through a cloud platform (e.g., the AWS), which enable clients who do not have resources or expertise to build their own models to make predictions (e.g., disease susceptibility) on their data using the service. When the clients' data are sensitive (e.g., representing the genome variants of human subjects), however, proper protection should be in place to ensure that they are not exposed to untrusted parties, including the ML service provider. On the other hand, the provider who trained the ML model may also want to protect the model, avoiding disclosure of model parameters and in some cases even the architecture to the client. Trusted Executive Environment (TEE, e.g., Intel's SGX) provides an ideal infrastructure for hosting such privacy-preserving ML inference as a service, where the ML model and the input data can both be protected: 1) only the computing task (i.e., the ML inference in this case) approved by both the provider and the client (i.e., the owner of the input data) is allowed be performed on the data; and 2) the client does not have access to the model, except providing input data and receiving inference results, and the provider cannot see the content of the input and inference result, which are all encrypted in a way that only the client and the TEE can decrypt. For this purpose, an efficient implementation of a trained ML model for a trusted execution environment is the key. So the purpose of this task is to understand the capabilities of today's mainstream TEE in providing real-world support for such a computing mission.
Challenge: We challenge participating teams to implement a given deep learning model on the Intel SGX platform, so the model can operate inside the SGX enclave. The implementation should protect both the ML model and its input data: that is, any input, intermediate and output data (including model parameters) should be encrypted outside the enclave. However, we do not consider side channel leaks in this task.
Experimental setting: We will provide a testing dataset along with a trained deep learning model. Each team is challenged to implement the model under SGX. For this purpose, the team is allowed to optimize the model and adjust it so it can work in the enclave, as long as its accuracy is largely preserved and privacy is fully protected (except side channel leaks). The testing dataset will be used to evaluate the implemented model. The solution may utilize the computational resource outside the enclave, including the CPU, memory and hard disk, as long as all the data and the model are fully protected (encrypted). The submitted solution cannot involve any addition party. Pre-computing time will be measured as part of the performance overhead.
Evaluation Environment: All submissions will be evaluated on a single node with Intel Xeon E3-1280 v5 processor (4 physical cores, Hyper-Threading enabled) and 64 GiB memory. The PRM size is 128 MiB.
Requirement: Each participating team should submit their implementation together with source code. We will provide remote access to an SGX system at Indiana University for the team to install their system for evaluation. The teams are responsible for the compatibility of their implementation and the system. The evaluation will be done using un-released testing data.
Evaluation Criteria: All submissions should meet security requirements. Also we expect that the protected model mostly preserves the accuracy of the original model but will also compare different models' performance when their accuracy comes close.
Dataset: link The dataset contains the expression levels of 12,634 genes (features) profiled using microarray from 100 breast cancer samples: 50 complete response samples (negative samples) and 50 residual disease samples (positive samples). The ML model we release is for predicting the responses of the cancer treatment (response vs. residual)[1,2]
FAQ: link
Submission: We will assign you an account to a remote SGX machine. You may test your submitted program on the machine and ensure it works in the designated environment. You should specify the command line to execute your program in the environment. If you are interested in submitting a solution, please contact Haixu Tang (hatang@indiana.edu) to get an account to the SGX machine -- please include the following information in the email: Team name, Institution, contact person and email address and encrypted password (you can generate it at https://help.sice.indiana.edu/ITG/md5it/ and include the resulting SHA-512 Value string in the email). Please contact Diyue Bu (diybu@indiana.edu) to get the link for the submission (please include your track number in the email).
The testbed of SGX server is as following:
    ISA: x86_64
    OS: GNU/Linux Ubuntu 16.04.6 LTS
    Compiler: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
    SGXSDK version: 1.8
Reference

Haozhe Xie, Jie Li, Qiaosheng Zhang, and Yadong Wang. Comparisonamong dimensionality reduction techniques based on random projectionfor cancer classification.Computational biology and chemistry, 65:165–172,2016.
Christos Hatzis, Lajos Pusztai, Vicente Valero, Daniel J Booser, Laura Es-serman, Ana Lluch, Tatiana Vidaurre, Frankie Holmes, Eduardo Souchon,Hongkun Wang, et al. A genomic predictor of response and survival follow-ing taxane-anthracycline chemotherapy for invasive breast cancer.Jama,305(18):1873–1881, 2011.

Track IV: Secure Collaborative Training of Machine Learning Model

Background: Training a modern machine learning model often require a large amount of data. Oftentimes, however, data owners could be reluctant to share their data (e.g., genome data from human subjects) due to privacy concerns even though they also desire to build a better trained model. Therefore, it becomes highly important to allows two or more owners to build a joint ML model using a secure computing protocol such as Secure Multiparty Computation (SMC). This task is designed to understand the efficiency of the SMC implementation achievable in building a machine learning model for supporting a secure collaboration.
Experimental setting: We will provide two testing datasets, and each participating team will submit an implementation of a general training algorithm so that each testing dataset can be used to train a model. We will provide an ML model directly trained on the data as a benchmark.(ML model link) The solution does not need to use the same model as the benchmark, but it is supposed to perform similarly.
Requirement: The solution should follow the security standard of SMC. The meta-data, i.e. the number of features/records should be made public, as well as the final model. However, any information that can't be inferred from that should not be leaked. Each computing node shouldn't learn anything about the data (i.e., the feature values for each record).
Evaluation Criteria: Submissions are qualified if they fulfill the security requirements. Qualified solutions are ranked based on their performances, including their prediction accuracy (how close it is comparing to the model built by the non-secure algorithm), total running time, and the communication cost (the rounds and sizes of data exchange among computing nodes in the SMC). The evaluation team will run the training code on the released data for up to 24 hours. The solutions that do not complete within 24 hours will be disqualified.
Dataset:link The testing datasets BC-TCGA and GSE2034 are gene expression data obtained from breast cancer patients. BC-TCGA is part of comprehensive molecular portraits of human breast tumors processed by TCGA, collected by using a microarray platform[1]. The dataset consists of expression levels of 17,814 genes (features) from 48 normal tissue samples (negative samples) and 422 breast cancer tissue samples (positive samples)[2]. GSE2034 is obtained by using the Affymetrix Human U133a GeneChips from frozen tumour samples[3]. It contains the expression levels of 12,634 genes from 225 breast cancer samples, in which 142 are recurrence tumor samples (positive samples) and the rest 83 samples are non-recurrence samples (negative samples) [2].
FAQ: link
Submission: Please install your solution and any required software or library in the docker (image can be downloaded at link; see the documentation at link), and submit the resulting docker file. You should make sure your program is executable within the docker configuration. ~~Please set each party's address as 127.0.0.1 and use a different port as the communication channel.~~ For evaluation purpose, we will put each docker container (for each party) on different machine/node. Thus, please do not put all three parties on one machine/node. The testbed of track 4 is running ~~docker CE 19.03.01~~ docker 17.06.0-ce, on ~~Ubuntu 18.04~~ Ubuntu 16.04 with Linux kernel version ~~4.15.0-55-generic x86_64~~ 4.4.0-150-generic x86_64. Please make sure the runtime environments of submissions are the same with the testbed. The memory of the test machine is 8GB with 4G swap and its CPU is ~~i5-3470 @3.2GHz with 4 processors~~ Intel Xeon E3-1280 v5. As for multi-party evaluation, your implementation should allow us to config/specify/assign/modify the IP address of your docker image, so that we could deploy your submission in a right way on our cluster. Please contact Diyue Bu (diybu@indiana.edu) to get the link for the submission (please include your track number in the email).
Reference

Cancer Genome Atlas Network et al. Comprehensive molecular portraits ofhuman breast tumours.Nature, 490(7418):61, 2012.
Haozhe Xie, Jie Li, Qiaosheng Zhang, and Yadong Wang. Comparisonamong dimensionality reduction techniques based on random projectionfor cancer classification.Computational biology and chemistry, 65:165–172,2016.
Yixin Wang, Jan GM Klijn, Yi Zhang, Anieta M Sieuwerts, Maxime PLook, Fei Yang, Dmitri Talantov, Mieke Timmermans, Marion E Meijer-vanGelder, Jack Yu, et al. Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer.The Lancet, 365(9460):671–679, 2005