Three competition tasks
Track 1: Secure Relative Detection in (Forensic) Databases
BACKGROUND
In this year's challenge, participants are asked to present a secure method using homomorphic encryption (HE) for detecting individuals in genomic databases. The method should take as input a query genome and a genetic database, then output whether the queried individual has relatives in the database. This approach enables a secure search for the target individual without compromising the privacy of the query individual or the genomic database. It also makes consent management more modular, as individuals can consent to secure searches but not cleartext searches. This is particularly relevant in the forensic domain, where using genetic genealogy databases (e.g., GEDMatch) to identify suspects and their relatives raises complex ethical issues, such as using genomic data without consent for forensic purposes.
GOAL
In this challenge, there are 3 entities:
- QE: Querying entity (such as law enforcement) that holds the genome of a target individual.
- DE: Database owner, who manages a genetic genealogy database.
- CE: Non-colluding trusted computing entity that performs genome detection using the encrypted data from QE and DE.
QE wants to find out whether the genome of the target individual (or a relative) is in the database. Neither QE nor DE is allowed to reveal its genomic information to the other party. The main challenge is to perform this search securely using HE-based queries: the information exchanged between the entities must be encrypted at every step.
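As a point of reference, the following sketch shows, in cleartext, the kind of statistic an HE pipeline would need to compute under encryption: a relatedness score between a query genotype vector and a database genome. The function name and the choice of score (a simple genotype correlation) are illustrative assumptions; the actual relatedness statistic (e.g., a kinship estimator) is a design choice for each team.

```python
import math

def relatedness(query, genome):
    """Cleartext reference: correlation between two genotype vectors
    (values 0/1/2). An HE solution must compute a comparable statistic
    on encrypted inputs; the exact estimator is up to the participants."""
    n = len(query)
    mq = sum(query) / n
    mg = sum(genome) / n
    num = sum((q - mq) * (g - mg) for q, g in zip(query, genome))
    den = (math.sqrt(sum((q - mq) ** 2 for q in query))
           * math.sqrt(sum((g - mg) ** 2 for g in genome)) + 1e-12)
    return num / den
```

Identical genomes score close to 1; unrelated genomes score near 0. A correlation-style score is HE-friendly because it reduces to sums of products, which map directly onto homomorphic additions and multiplications.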
CHALLENGE
In the challenge period, the participants will be provided with:
- A genomic database (DE) containing 2,000 genomes, each comprising genotypes for 16,000 genetic variants.
- A query genotype dataset (QE) containing 400 genomes with genotypes for the same set of 16,000 variants.
- The matching individual identifiers between the 2,000 database genomes and the 400 query genomes.
The participants must provide a query mechanism that takes the genomes from the query dataset and securely evaluates whether any of the query individuals are related to any individuals in the genomic database (DE).
Reference
- Albrecht et al., Homomorphic Encryption Standard, https://eprint.iacr.org/2019/939.pdf
EVALUATION CRITERIA
In the evaluation period, submissions will be evaluated on the accuracy (auROC) of detecting the 400 query individuals within a separate DE comprising 2,000 genomes. The query and database genomes will differ from the challenge-period dataset but will have similar characteristics.
All submissions will also be benchmarked with respect to time usage in minutes.
A final score will be used to rank all submissions:
Final score = auROC / exp(wall_time_in_minutes / 5)
- exp(wall_time_in_minutes / 5) denotes e^(wall_time_in_minutes / 5) for the submission.
- Submissions that require more than 10 minutes of wall time will be excluded from further evaluation.
- Submissions that use more than 1 gigabyte (1024x1024x1024 bytes) of intermediate data (excluding the cleartext input data) will be excluded from further evaluation.
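The scoring rule above can be written as a small helper, which also makes the runtime trade-off concrete: the exponential penalty means a faster, slightly less accurate submission can outrank a slower, more accurate one. The function name and the ValueError for the 10-minute cutoff are illustrative choices, not part of the official evaluation harness.

```python
import math

def track1_score(auroc: float, wall_time_min: float) -> float:
    """Final Track 1 score: auROC discounted exponentially by wall time."""
    if wall_time_min > 10:
        # Submissions over 10 minutes are excluded from evaluation.
        raise ValueError("excluded: wall time exceeds 10 minutes")
    return auroc / math.exp(wall_time_min / 5.0)

# The exponential penalty favors speed: 0.90 auROC in 2 minutes
# outscores 0.95 auROC in 6 minutes.
```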
LICENSE
This track requires every participating team to share their code and/or binaries under the open-source BSD 3-Clause License. The track co-organizers will upload all submitted code and/or binaries to a GitHub repository under the BSD 3-Clause License right after the competition results are announced. By submitting the code and/or binaries, the participants automatically consent to allow the track co-organizers to release the code and/or binaries under the BSD 3-Clause License.
DATA USAGE AND PUBLICATION AGREEMENT
By registering and/or participating in this challenge and receiving restricted access to the challenge dataset, members of all teams agree to abide by the following rules of data usage:
- They will not share the challenge dataset with others.
- They will not use the challenge dataset in any publications until after the iDASH23 Workshop concludes.
These are set up to ensure fairness among the participating teams.
Data for Track 1 is deposited here: https://drive.google.com/file/d/1GkpLSdJ0xAO8gDxh_BiuEgGO2WA5S0qt
FAQ: https://docs.google.com/document/d/1K9or7E-ICvsYKRU_9zBdx3BoWll8wvjL1xzWWGnicQg
Track 2: Dynamic Patient Consent Management for Healthcare and Genomic Research Data Sharing Using Smart Contracts
Goal
To develop blockchain-based smart contracts for managing dynamic and hierarchical patient consent.
Experimental setting
Given a set of patients' consents that can be changed over time for hierarchical data elements, design a time/space efficient data structure and mechanisms to share (i.e., store and retrieve) data based on Ethereum Solidity.
Challenge
The input data include patients' consent records, in which patients may change their decisions over time. The data elements are grouped into categories. All data and intermediary data (e.g., an index or cache of the original data) must be saved on-chain (i.e., no off-chain data storage is allowed) via smart contracts. We will provide the skeleton of the smart contracts and scripts. Note that the smart contract implementation must allow the insertion of one line of the patient consent record at a time. Each participant can determine how each insertion is represented and stored in the smart contract. Participants can implement any algorithm to store, retrieve, and present the data correctly and efficiently. There will be two query functions: one for researchers and one for patients. Users should be able to query the data from any of the blockchain nodes.
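To make the query semantics concrete, here is a minimal in-memory model in Python. It assumes (our reading of the task, not an official specification) that the latest record per patient and category before the query time is the one that applies. The actual submission must implement equivalent logic on-chain in Solidity; this sketch only illustrates the expected behavior of the two query functions.

```python
class ConsentStore:
    """Off-chain reference model of the consent semantics (illustrative)."""

    def __init__(self):
        self.records = []  # (timestamp, patient_id, category, allowed)

    def insert(self, timestamp, patient_id, category, allowed):
        # One consent record is inserted at a time, as the track requires.
        self.records.append((timestamp, patient_id, category, allowed))

    def patient_query(self, patient_id, at_time):
        """Latest decision per category for one patient."""
        latest = {}
        for ts, pid, cat, ok in sorted(self.records):
            if pid == patient_id and ts <= at_time:
                latest[cat] = ok
        return latest

    def researcher_query(self, category, at_time):
        """Patients whose latest decision for a category is 'allow'."""
        latest = {}
        for ts, pid, cat, ok in sorted(self.records):
            if cat == category and ts <= at_time:
                latest[pid] = ok
        return sorted(p for p, ok in latest.items() if ok)
```

The efficiency challenge on-chain is exactly the part this sketch glosses over: a linear scan over all records is too expensive in gas, so the submitted contract needs an index structure that keeps insertion and both queries cheap.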
Evaluation Criteria
The data-sharing system will need to demonstrate satisfactory performance (i.e., 100% accurate query results) on a test dataset, which will be different from the training set provided online. We will evaluate the efficiency of each solution using insertion, researcher query, and patient query times. We will insert patient consent for a specified time frame before conducting the queries.
Additional Rules
The submission should include one file of Solidity source code per team. Reuse of any external code/library must comply with the license agreements of both the code/library and our track (please see the License section below for more details), and the reused code blocks must be clearly and explicitly cited in Solidity comments. One person may participate in only one team. All team members' names and emails must be listed in a Solidity comment in the submitted code file, and team members cannot be changed after the submission due date. Although multiple submissions are allowed before the due date, only the last submission will be evaluated. Any solution-related communication across teams must also be disclosed in comments. Solutions from teams not following the rules above will not be evaluated, for fairness consideration.
License
This track requires every participating team to share their code and/or binaries under the open-source BSD 3-Clause License. The track co-organizers will upload all submitted code and/or binaries to a GitHub repository under the BSD 3-Clause License right after the competition results are announced. By submitting the code and/or binaries, the participants automatically consent to allow the track co-organizers to release the code and/or binaries under the BSD 3-Clause License.
Data Usage and Publication Agreement
By registering and/or participating in this challenge and receiving restricted access to the challenge dataset, members of all teams agree to abide by the following rules of data usage:
- They will not share the challenge dataset with others.
- They will not use the challenge dataset in any publications until after the iDASH23 Workshop concludes.
These are set up to ensure fairness among the participating teams.
Reference, Data, and Code Skeleton
Please use this link to access Data and Readme files: Click here
FAQ: Click here
Reference:
Kim J, Kim H, Bell E, Bath T, Paul P, Pham A, Jiang X, Zheng K, Ohno-Machado L. Patient perspectives about decisions to share medical data and biospecimens for research. JAMA network open. 2019;2(8):e199550-e. doi: 10.1001/jamanetworkopen.2019.9550. PubMed PMID: 31433479.
Track 3: Confidential Computing for Pangenome-based Genome Inference
Organized by Haixu Tang and Xiaofeng Wang
Background
In the past few years, owing to advances in single-molecule long-read sequencing technologies, de novo assembled, haplotype-resolved human genomes have become available. As a result, the variation-aware pangenome graph has been proposed as a better representation of the human reference genome than a linear genome sequence; it can be constructed from hundreds to thousands of haplotype-resolved individual human genomes [1]. On the other hand, short-read sequencing remains the practical solution for large-cohort human genomic studies because of its high throughput and low cost. Therefore, the genome inference algorithm was proposed as an efficient alternative to the conventional read-mapping algorithm: it generates haplotype-resolved genomes by directly comparing short-read sequences against the reference pangenome graph [2]. At the same time, it is well known that the sequencing reads from an individual human genome contain the complete genetic information of the human subject and thus can be used to infer the subject's identity. Therefore, proper data protection should be in place when these data are re-analyzed (e.g., in a meta-study) by an untrusted user and/or in a public computing environment (e.g., on a public cloud). A Trusted Execution Environment (TEE, e.g., AMD's SEV), available from major cloud service providers (such as Azure), provides an ideal infrastructure for hosting privacy-preserving pangenome-based genome inference by a data user, in which 1) only the computing task (i.e., genome inference) approved by the owner of the input short-read sequencing data is allowed on the data; and 2) the data user does not see the content of the input sequencing data, which are encrypted in a way that only the TEE can decrypt.
The purpose of this task is to test how efficiently the pangenome-based genome inference algorithm can be executed in a TEE, in particular using multiple enclaves (secure VMs) provided by AMD's Secure Encrypted Virtualization (SEV) platform.
[1] Wang, Ting, et al. "The Human Pangenome Project: a global resource to map genomic diversity." Nature 604.7906 (2022): 437-446.
[2] Ebler, Jana, et al. "Pangenome-based genome inference allows efficient and accurate genotyping across a wide spectrum of variant classes." Nature Genetics 54.4 (2022): 518-525.
Challenge
We challenge participating teams to implement the parallel genome inference algorithm PanGenie [2] on AMD's SEV, so that the algorithm can operate on two virtual machines (VMs), using up to 4 threads in each VM. Note that here we attempt to simulate an elastic computing environment in a public cloud service, in which a single CPU package may be used by multiple jobs from different users, so that only limited resources (e.g., 4 cores) are allocated to each VM when it is initiated. The implementation should protect the input human genome sequencing data as well as all intermediate and output data: any input, intermediate, and output data should be encrypted outside the enclave (including data communicated between VMs and data temporarily stored on the hard disk). However, we do not require the implementation of attestation, and we do not consider side-channel leaks in this task. We expect the solution to have a small Trusted Computing Base (TCB) so that it can be more easily verified if needed and can be efficiently executed on SEV.
Experimental setting
We will provide a testing dataset along with the reference implementation of the PanGenie algorithm for genome inference outside SEV. Each team is challenged to implement an executable pipeline of PanGenie (including all its dependent libraries, such as jellyfish) using SEV. For this purpose, teams are allowed to revise the existing implementation of PanGenie as well as the methods used to exchange data and intermediate results (which should be encrypted under the security requirement) between virtual machines (VMs), as long as the genome inference results are largely preserved and the data are fully protected (except for side-channel leaks). Testing datasets containing a larger number of reads will be used to evaluate the efficiency of the submitted solutions. The solution may utilize computational resources outside the enclave, including the CPU, memory, and hard disk, as long as all the data and the model are fully protected (encrypted at a security level of at least 128 bits). The submitted solution cannot involve any additional party. The constructed pangenome graph will be provided as input to PanGenie, and thus its construction will not be counted toward the performance of the solution. Additional computational time for pre-processing will be measured as part of the performance overhead.
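For the encrypted inter-VM channel, one natural option is a TLS socket between the two secure VMs. The sketch below shows a client-side context using Python's standard library; it assumes that TLS 1.3 (whose cipher suites are all AEAD at 128-bit strength or above) satisfies the security requirement, and that each VM provisions its own certificate (certificate paths are deployment-specific and omitted). The function name is ours, not part of the provided skeleton.

```python
import ssl

def inter_vm_client_context() -> ssl.SSLContext:
    """Client-side TLS context for the encrypted inter-VM channel.

    Assumption: pinning the protocol to TLS 1.3 meets the >= 128-bit
    requirement; certificate provisioning is left to the deployment.
    """
    ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_3
    return ctx
```

The same context would wrap the plain socket used to stream intermediate results (e.g., k-mer counts) between the two VMs, so nothing crosses the enclave boundary in cleartext.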
Requirement
Each participating team should submit their implementation together with its source code. We will provide remote access to an SEV system for teams to install their implementation for evaluation. Teams are responsible for the compatibility of their implementation with the system. The evaluation will be done using unreleased testing data.
Evaluation Criteria
All submissions should meet the security requirements (at least a 128-bit security level). The exchange of intermediate data between the VMs should follow the Transport Layer Security (TLS) standard. The solutions will then be evaluated based on the quality of the genome inference results compared with the reference results generated by the original PanGenie algorithm: a qualified solution should output variants (in the VCF file) that match at least 99% of the reference output, and the size of its executable code (i.e., the TCB of the solution) must be below a threshold of 1 MB. For qualified solutions, we will then compare performance (execution time) to determine the winners. When performances are close (within a 1% difference), we will compare the solutions' TCB (Trusted Computing Base), measured by the size of the whole executable code (including the libraries) running in the SEV; the solution with the smaller TCB is preferred.
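Teams can self-check the 99% criterion before submitting with a simple concordance measure between their VCF output and the reference output. The sketch below keys variants by (CHROM, POS, REF, ALT); this keying, and the function name, are our simplifying assumptions, and the organizers' actual comparison procedure may differ (e.g., in how multi-allelic sites are handled).

```python
def vcf_concordance(reference_vcf_lines, candidate_vcf_lines):
    """Fraction of reference variants reproduced by a candidate VCF.

    Simplified sketch: variants are keyed by (CHROM, POS, REF, ALT);
    header lines (starting with '#') are ignored.
    """
    def keys(lines):
        out = set()
        for line in lines:
            if line.startswith("#"):
                continue
            f = line.split("\t")
            out.add((f[0], f[1], f[3], f[4]))  # CHROM, POS, REF, ALT
        return out

    ref = keys(reference_vcf_lines)
    return len(ref & keys(candidate_vcf_lines)) / len(ref) if ref else 1.0
```

A submission would pass the quality gate when this fraction is at least 0.99 against the reference PanGenie output.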
Test Platform Information
We will use a virtual machine (VM) with 32 vCPUs (3rd-generation AMD EPYC "Milan" CPU), 256 GB of memory, and 300 GB of disk space. We plan to release a testing platform on Azure for participating teams to test their solutions before the final submission. The platform could be ready by Aug 15. If you are interested in using it, please contact Hongbo Chen (hc50@iu.edu). Instructions for the final submission will be ready soon. Please check for updates on the website.
System overview:
- Size: Standard DC32ads v5
- vCPUs: 32
- RAM: 128 GB
- Disk: 1.2 TB
- OS: Ubuntu 22.04 LTS