Three tracks of competition tasks
Track I: Data Sharing Consent for Health-Related Data Using Contracts on Blockchain
Introduction: The goal of this track is to develop smart contracts on a blockchain network that manage patients' data-sharing preferences for research studies.
Experimental setting: Given a set of patients' sharing preferences for certain types of data, design a time/space-efficient data structure and mechanisms to share (i.e., store and retrieve) the data using Ethereum Solidity.
Challenge: A patient's data-sharing record is represented using the following columns: patient id, study id, record time (a Unix timestamp), and a patient preferences array. For example: "patient id = 3380, study id = 8, record time = 1614200009134, patient preferences = [demographics = true, genetic = true, …]". All data and intermediate data (e.g., indexes or caches) must be saved on-chain via smart contracts; no off-chain data storage is allowed. We will provide the skeleton of the smart contracts and scripts. Note that the smart contract must allow the insertion of one patient data-sharing record at a time. Each participant can decide how each record is represented and stored in the smart contract; it does not need to be a plain-text copy of the data entry. The query function takes a study id and a preferences array as arguments and should return the list of patient ids whose records for that study set all of the given preferences to true; only the record with the latest record time for each (patient id, study id) pair should be considered. Participants may not use any third-party libraries. There will be four nodes in the blockchain network and four data files to be stored, and users should be able to query the data from any of the four nodes. Participants can implement any algorithm that stores, retrieves, and presents the data correctly and efficiently.
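To make the expected query semantics concrete, here is a minimal plain-Python sketch of what the contract's query should compute; the record layout and names are illustrative and this is not part of the provided skeleton (the actual solution must implement this logic in Solidity):

```python
# Plain-Python sketch of the expected query semantics (illustrative record
# layout; the real solution must implement this logic inside the contract).

def query(records, study_id, required_prefs):
    # Keep only the record with the latest record time per
    # (patient id, study id) pair.
    latest = {}
    for pid, sid, ts, prefs in records:
        key = (pid, sid)
        if key not in latest or ts > latest[key][0]:
            latest[key] = (ts, prefs)
    # Return patient ids whose latest record for study_id sets every
    # requested preference to true.
    return [pid for (pid, sid), (_, prefs) in latest.items()
            if sid == study_id and all(prefs.get(p, False) for p in required_prefs)]

records = [
    (3380, 8, 1614200009134, {"demographics": True, "genetic": True}),
    (3380, 8, 1614200009999, {"demographics": True, "genetic": False}),
]
print(query(records, 8, ["genetic"]))  # [] -- the later record revoked "genetic"
```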
License Requirement: This track requires every participating team to share their code and/or binaries under the BSD 3-Clause "New" or "Revised" Open Source License. The track co-organizers will upload all submitted code and/or binaries to a GitHub repository under the BSD 3-Clause "New" or "Revised" Open Source License right after the competition results are announced. By submitting the code and/or binaries, the participants automatically consent to allow the track co-organizers to release the code and/or binaries under the BSD 3-Clause "New" or "Revised" Open Source License.
Evaluation Criteria: The data sharing system will need to demonstrate good performance (i.e., accurate query results) on a test dataset different from the one provided online. We will evaluate the efficiency of each solution as follows. First, we will insert the patient consent records from four nodes for exactly one hour in a four-node blockchain network. Then, we will query all four nodes to verify that the records were inserted correctly. Finally, we will compute the average number of records inserted per second as the final evaluation metric. We plan to evaluate the results using Go-Ethereum 1.10.1, Solidity 0.8.4 (with ABI Encoder v2 enabled), and Ubuntu 18.04; for fairness, submissions will not be evaluated on any other platforms/versions. No modification of the underlying Go-Ethereum source code is allowed. The submission should consist of a single Solidity source file.
Data and Code Skeleton: link
FAQ: link
Track II: Homomorphic Encryption-based Secure Viral Strain Classification
Introduction: Detecting and tracking viral strains is one of the central tasks in managing the COVID-19 epidemic. The idea is to sequence the viral sample from a patient and classify it as one of the known strains, which is important both for tracking viral strains and providing effective treatment and for research purposes. However, there are major barriers to sharing strain data, since patient privacy and data confidentiality may not allow it. The task in this challenge is to develop secure methods that classify a given viral genome (i.e., a COVID-19 genome) into one of 4 different strains.
Track Description: The track requires participants to use homomorphic encryption (HE)-based techniques and security guarantees for secure strain classification. We will provide each registered team a challenge dataset that contains 2,000 full genomes for each of the 4 strains (8,000 genomes in total) together with their strain labels; this serves as the training data. After the challenge period, the solutions will be evaluated on a hold-out dataset containing 500 genomes per strain (2,000 genomes in total) in terms of accuracy and time/CPU usage. We assume that the challenge dataset is publicly available and does not need to be encrypted. However, the input genome must be encrypted (and stay encrypted), since it is assumed to be sensitive information. Thus, participants will train methods that accept encrypted genome data and output encrypted strain probabilities (4 probabilities, one per strain) quantifying how likely the input genome is to belong to each strain.
Numerous methods exist for comparing viral strains. One popular example is the tool "Mash" (https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0997-x), which computes distances between viral and bacterial genomes using a very efficient string-comparison approach. We provide Mash only as an example or starting point: the distance computation is not central to the classification task, so participants do not have to constrain themselves to Mash-like approaches and are encouraged to tackle the secure classification in their own way.
The main challenge is to develop a classifier that can securely and efficiently perform the classification task on the encrypted input genome. We understand that it may be necessary to preprocess the input genome before encryption, for example when a method needs to extract features from the genomes before the encryption step. Since genomes are fairly large, we allow participants to perform any preprocessing they like, and they are free to define how the input genomes are encrypted. Unlike last year, we place no constraints on the preprocessing of the input genome before it is encrypted, as long as the following time/memory conditions are met (a preprocessing sketch follows the list):
- Memory usage of preprocessing does not exceed 1 gigabyte for the 2,000 genomes in the evaluation dataset.
- Time usage of preprocessing does not exceed 1 minute for the 2,000 genomes on 1 CPU core. We will perform the tests on an Intel Xeon CPU (see below for specifications) with no interference on the OS, i.e., no other programs running while methods are evaluated.
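As an illustration of preprocessing that comfortably fits these limits, here is a hypothetical k-mer frequency extractor; using k-mers as features is our assumption, not a requirement of the track:

```python
from collections import Counter

def kmer_features(genome, k=6, vocabulary=None):
    """Count k-mers in one genome, skipping windows that contain 'N'.

    A ~29,000-letter genome yields ~29,000 windows, so 2,000 genomes are
    processed in seconds on one core, well within the stated limits.
    """
    counts = Counter()
    for i in range(len(genome) - k + 1):
        kmer = genome[i:i + k]
        if "N" not in kmer:  # one possible way to handle unspecified letters
            counts[kmer] += 1
    if vocabulary is None:
        return counts
    total = sum(counts.values()) or 1
    return [counts[km] / total for km in vocabulary]  # fixed-length vector

# Example: a fixed-length feature vector to be encrypted afterwards.
print(kmer_features("AGATCTGTTNTCGA", k=3, vocabulary=["AGA", "GAT", "CTG"]))
```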
Challenge Dataset and Files: The challenge dataset contains complete genomes that are publicly available. There is one FASTA file containing genomic sequences for the 4 viral strains; for each strain, we provide 2,000 example genomes. The strains are:
- "B.1.427"
- "B.1.1.7"
- "P.1"
- "B.1.526"
Each entry in the FASTA file has the following two-line form:
>B.1.427_5729
AGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTGGCT....
This genome is of strain "B.1.427". The leading ">" character and the number after the underscore (i.e., "5729" in the example above) carry no information and can be discarded. The second line is the actual sequence of the viral genome, as one line. Each genome is around 29,000 base pairs (i.e., letters from {A,C,G,T,N}). In the example, the genome starts with "AGATCTGTT...".
Note that 'N' denotes an unspecified letter in the genome; teams should decide how to handle these missing letters.
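A minimal parser for this two-line-per-record format might look as follows (the helper name and file name are ours):

```python
def read_strain_fasta(path):
    """Yield (strain, sequence) pairs from the challenge FASTA file.

    Assumes the two-line format shown above: a header such as
    '>B.1.427_5729' followed by the full genome on a single line.
    """
    with open(path) as fh:
        for header in fh:
            if not header.startswith(">"):
                continue  # skip blank or stray lines
            sequence = next(fh).strip()
            # Drop '>' and the uninformative number after the underscore.
            strain = header.strip().lstrip(">").rsplit("_", 1)[0]
            yield strain, sequence

# for strain, seq in read_strain_fasta("challenge.fa"):  # hypothetical file name
#     print(strain, len(seq))
```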
The strain identifier defines the "class" of the viral genome, so the classifiers should learn to classify a new genome into 1 of the 4 strains. The ordering of the strains does not matter as long as it is clearly described in the output.
The evaluation of the methods will be performed on a hold-out dataset that contains 500 viral genomes for each strain, i.e., 2,000 genomes in total.
Encryption Requirement: The security level of the HE scheme must be at least 128 bits. We request that teams use the parameter settings in "5.4 TABLES of RECOMMENDED PARAMETERS" of the HE standardization white paper (link). If a team wants to use sparse secret keys (less explored than non-sparse secrets), the team needs to explain the security in detail (e.g., how the parameters were obtained using Martin Albrecht's LWE estimator).
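For instance, a 128-bit CKKS context could be configured as below using the TenSEAL library; the library choice and parameters are our illustration, and any scheme and implementation meeting the recommended parameters is acceptable:

```python
import tenseal as ts  # assumed dependency: pip install tenseal

# CKKS parameters at (at least) 128-bit security: poly_modulus_degree 8192
# with a ~200-bit coefficient modulus stays within the white paper's tables.
context = ts.context(
    ts.SCHEME_TYPE.CKKS,
    poly_modulus_degree=8192,
    coeff_mod_bit_sizes=[60, 40, 40, 60],
)
context.global_scale = 2 ** 40
context.generate_galois_keys()

features = [0.12, 0.08, 0.31]                     # preprocessed plaintext features
enc_features = ts.ckks_vector(context, features)  # encrypted model input
enc_scores = enc_features * [0.5, -1.0, 2.0]      # homomorphic elementwise product
print(enc_scores.decrypt())                       # only the key holder can decrypt
```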
Code submission and license agreement: We do not enforce the public release of code/binaries, but we encourage teams to make their solutions available to the entire community under an open-source software license.
Evaluation Environments: All submissions will be evaluated in a Docker container on physical servers. The container will be capped at 4 CPU cores (Intel Xeon Platinum 8180 CPU @ 2.50GHz), 32 GB memory, and 500 GB storage during evaluation.
Evaluation Criteria: The solutions will be evaluated in terms of classification performance and efficiency. We cap the total round-trip time (including encryption, computation, and decryption) for all solutions at 30 minutes for the multi-label classification task. For solutions meeting this efficiency criterion, we will measure the micro-AUC of the strain classification performance and produce a final ranking.
Output API: Each submission should output the classification result for each strain (i.e., the probabilities) in a .csv file, plus the timings (round trip, encryption, computation, decryption) as a separate file. We are happy to accommodate slight changes to the output CSV file as long as they are clearly described in the submission. When multiple models are submitted, they should share the same output API and allow users to select which model performs the classification.
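A hypothetical writer for these two output files is sketched below; the exact column layout is not mandated, so the layout shown is an assumption that should be documented in the submission:

```python
import csv

# Hypothetical column layout; describe whatever layout you use in the submission.
def write_outputs(probs, timings,
                  probs_path="probabilities.csv", time_path="timing.csv"):
    with open(probs_path, "w", newline="") as fh:
        w = csv.writer(fh)
        w.writerow(["genome_index", "B.1.427", "B.1.1.7", "P.1", "B.1.526"])
        for i, row in enumerate(probs):           # one probability per strain
            w.writerow([i, *row])
    with open(time_path, "w", newline="") as fh:  # separate timing file
        w = csv.writer(fh)
        w.writerow(["phase", "seconds"])
        for phase in ("roundtrip", "encryption", "computation", "decryption"):
            w.writerow([phase, timings[phase]])

write_outputs([[0.7, 0.1, 0.1, 0.1]],
              {"roundtrip": 12.0, "encryption": 3.0,
               "computation": 8.0, "decryption": 1.0})
```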
Dataset: link
Link to submission: link
FAQ: link
Track III: Confidential Computing
Background: Training a modern machine learning (ML) model often requires a large amount of data distributed across multiple organizations. Sometimes, however, data owners may be unable to share their data, even in encrypted form, due to restrictions in their organizational privacy policies: e.g., hospitals often disallow medical records from leaving their premises, even under cryptographic protection. It therefore becomes critical to enable two or more owners to jointly build an ML model without undermining the privacy protection of their individual data and in full compliance with their individual policies. This task is designed to study the feasibility of building such an ML model among multiple collaborating parties, so that the data of each organization never leaves its premises and the information exchanged between the parties during the computation (e.g., intermediate model parameters) is properly protected under differential privacy.
ML task: Transthyretin amyloid cardiomyopathy is a treatable yet often invisible cause of heart failure, so it is crucial to identify the potential risk of the condition in patients with various health conditions. Huda et al. [1] proposed a machine learning approach to predict the risk of wild-type transthyretin amyloid cardiomyopathy using medical claims data.
Challenge: We challenge participating teams to implement a federated learning algorithm that can be trained jointly by two parties, each holding its own training dataset, to determine whether patients are at potential risk of wild-type transthyretin amyloid cardiomyopathy, given known phenotypes (disease or not). The implementation must not share the input data, but may exchange intermediate results (e.g., intermediate model parameters) between the parties during training; however, noise should be added to the intermediate results to ensure differential privacy under a given budget across the whole learning process. Each team can choose any ML algorithm to accomplish the task.
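One possible shape for such a solution is sketched below as a single-process simulation of the two parties, with Laplace noise added to clipped gradients; all names, the budget split, and the noise scale are illustrative assumptions, not a prescribed method:

```python
import numpy as np

def dp_fed_logreg(X1, y1, X2, y2, epsilon=1.0, rounds=20, lr=0.5, clip=1.0):
    """Two-party federated logistic regression with noisy gradient exchange.

    Each party computes a gradient on its local data, clips it, adds Laplace
    noise, and only the noisy gradient crosses the party boundary. The noise
    scale below is schematic; a real submission must derive the gradient's
    L1 sensitivity and account for composition over all rounds.
    """
    rng = np.random.default_rng(0)
    w = np.zeros(X1.shape[1])
    eps_round = epsilon / rounds  # naive sequential composition of the budget
    for _ in range(rounds):
        noisy_grads = []
        for X, y in ((X1, y1), (X2, y2)):  # executed locally by each party
            p = 1.0 / (1.0 + np.exp(-X @ w))
            g = X.T @ (p - y) / len(y)
            g *= min(1.0, clip / (np.linalg.norm(g) + 1e-12))  # clip to bound sensitivity
            g += rng.laplace(scale=clip / (eps_round * len(y)), size=g.shape)
            noisy_grads.append(g)  # only this noisy vector is exchanged
        w -= lr * np.mean(noisy_grads, axis=0)
    return w
```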
Evaluation Criteria: Submissions qualify if they meet the differential privacy requirement under the given privacy budget. Qualified solutions will be ranked based on their performance: prediction accuracy, total running time, and communication cost (the rounds and sizes of data exchanges). All solutions will be tested with the same privacy budgets and ranked by prediction accuracy; solutions with similar accuracy (within a 1% difference) will be further compared on running time and communication cost. The evaluation team will run the training code on the released data for up to 24 hours; solutions that do not complete within 24 hours will be disqualified.
Differential Privacy (DP) model: You can use either the conventional ε-DP definition (preferred) or the (ε,δ)-DP model. Each participating team may choose to submit two solutions, one for each DP approach. Please describe your approach clearly in the documentation. We will evaluate the solutions in each category separately if there are enough submissions.
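For reference, the textbook noise calibrations for the two models are sketched below; the sensitivities must be derived for your own statistics:

```python
import numpy as np

rng = np.random.default_rng()

def laplace_mech(value, l1_sensitivity, epsilon):
    # epsilon-DP: Laplace noise with scale Delta_1 / epsilon
    return value + rng.laplace(scale=l1_sensitivity / epsilon)

def gaussian_mech(value, l2_sensitivity, epsilon, delta):
    # (epsilon, delta)-DP for epsilon < 1: the classic analytic calibration
    sigma = l2_sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    return value + rng.normal(scale=sigma)
```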
Dataset: link [1]
FAQ: link
Baseline experiment: Our experiment used a random forest model [1] on the whole training dataset and achieved an accuracy of ~0.84.
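For orientation, a comparable non-private, centralized baseline can be built along these lines with scikit-learn (the file and column names are placeholders for the released data's schema); the official baseline code is linked below:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("train.csv")                    # placeholder for the released data
X, y = df.drop(columns=["label"]), df["label"]   # placeholder label column name
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```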
Baseline sample codes: link
Evaluation Environment: All submissions will be evaluated on two nodes, each with an Intel Xeon E3-1280 v5 processor (4 physical cores, Hyper-Threading enabled) and 64 GiB of memory.
Submission: The submitted solution should contain two programs, for training and testing respectively. You may choose to use docker containers for submission. The training program takes as input a dataset of patients' heart-condition medical claims in the same format as the given testing dataset, and outputs a model file (containing model parameters, etc., in a format recognizable by the testing program). The testing program takes the model file as input and makes a prediction for a given input instance (i.e., a new patient's heart conditions). The training program should run on two separate machines that communicate intermediate results (model parameters) during training; the testing program is a standalone program running on a single computer. The submitted solution should also contain a readme file explaining the differential privacy method implemented in the training program and how to run the programs. Please contact Diyue Bu (diybu@indiana.edu) to get the submission link (include your track number and team name in the email).
Reference
- Huda, A., Castaño, A., Niyogi, A. et al. A machine learning model for identifying patients at risk for wild-type transthyretin amyloid cardiomyopathy. Nat Commun 12, 2725 (2021).
The submitted solutions will be used by the organizers only for evaluation purposes. The intellectual property of the solutions is retained by the submitting team.