Three tracks of competition tasks
Track 1: Blockchain-based immutable logging and querying for cross-site genomic dataset access audit trail
The goal of this track is to develop blockchain-based ledgering solutions to log and query the user activities of accessing genomic datasets (e.g., GTEx) across multiple sites.
Experimental setting: Given a genomic data access log file, design a time/space efficient data structure and mechanisms to store and retrieve the logs based on MultiChain version 1.0.4 (https://www.multichain.com/download-install/).
Challenge: Each line in the data access log must be saved individually as one transaction (i.e., participants cannot save the entire file in just one transaction), and all log data and intermediate data (such as indexes or caches) must be saved on-chain (no off-chain data storage is allowed). It is up to you to determine how to represent and store each line in transactions; it does not need to be a plain-text copy of the log entry. The query implementation should allow a user to search, via a command-line interface, on any field of a log line (i.e., node, id, user, resource, activity, timestamp, and a "reference id" referring to the id of the original resource request), on any "AND" combination of fields (e.g., node AND id AND user AND resource), and on any timestamp range (e.g., from 1522000002418 to 1522000011441). The user should also be able to sort the returned results in ascending/descending order by any field (e.g., timestamp). There will be 4 nodes in the blockchain network and 4 log files to be stored, and users should be able to query the data from any of the 4 sites. Participants can implement any algorithm that stores, retrieves, and presents the log data correctly and efficiently.
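The on-chain representation is up to each team. As one illustration only, here is a minimal Python sketch of a per-field indexing scheme for MultiChain streams; the comma-separated field order and the `idx_*` stream names are assumptions for the example, not part of the provided data format. MultiChain 1.0 stream items carry a single key, so an "AND" query over several fields is answered by intersecting the per-field result sets on the client side.

```python
import json

# Assumed field order of one log line (hypothetical; adapt to the real format).
FIELDS = ["node", "id", "user", "resource", "activity", "timestamp", "ref_id"]

def encode_entry(line):
    """Parse one comma-separated log line into the hex payload that would be
    published on-chain, plus one (stream, key) index pair per field."""
    values = line.strip().split(",")
    entry = dict(zip(FIELDS, values))
    payload = json.dumps(entry, separators=(",", ":")).encode().hex()
    index_keys = [("idx_" + f, entry[f]) for f in FIELDS]
    return payload, index_keys

def decode_entry(payload_hex):
    """Recover the original log entry from the on-chain hex payload."""
    return json.loads(bytes.fromhex(payload_hex).decode())
```

Each (stream, key) pair would then back one `multichain-cli <chain> publish <stream> <key> <payload-hex>` call, and retrieval would go through `liststreamkeyitems`; range and sort operations over timestamps would be done client-side after the intersection step.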
Requirement: Submission requires open source code.
Evaluation Criteria: The logging/querying system needs to demonstrate good performance (i.e., accurate query results) on a testing dataset that is different from the one provided online. We will evaluate the speed, storage/memory cost, and scalability of each solution. For fairness, we will use the binary version of MultiChain 1.0.4 on 64-bit Ubuntu 14.04 with the default parameters as the test bed; no modification of the underlying MultiChain source code is allowed. Participants must consent to release any code or binaries submitted to the competition under the GNU General Public License v3.0 open-source license. The submitted executables should be non-interactive (i.e., they depend only on command-line parameters, with no input required while running) and should be accompanied by a readme file specifying the parameters. We will test all submissions using 4 virtual machines, each with a 2-core CPU, 8GB RAM, and 100GB storage.
Dataset: link
Track 2: Secure Parallel Genome Wide Association Studies using Homomorphic Encryption
The goal of this track is to develop a secure outsourcing solution to compute Genome Wide Association Studies (GWAS) based on homomorphically encrypted data.
Experimental setting: Given encrypted genome data containing thousands of Single Nucleotide Polymorphisms (SNPs) and hundreds of samples, participants are asked to outsource the storage and computation to a third-party server that carries out the Genome Wide Association Study (based on linear or binary logistic regression) to compute the p-values of the different SNPs.
Challenge: Training logistic regression models on encrypted data was a task in our 2017 competition. Although significant performance improvements over existing solutions were demonstrated, it remains quite computationally intensive. A direct implementation of linear- or logistic-regression-based GWAS would require building one model per SNP, which takes a lot of time. For this competition we will instead use a semi-parallel algorithm and rely on an approximation to reduce the necessary rounds of computation. The challenge is to develop efficient packing and parallelization algorithms that make full use of the available memory and computation resources.
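For intuition, the semi-parallel idea from the referenced paper can be sketched on plaintext data: residualize the phenotype and all SNP columns against the covariates once, after which every per-SNP slope is a single dot-product ratio (the Frisch-Waugh-Lovell theorem). This is a plain-NumPy, linear-regression-only sketch; the actual task is to carry out the analogous computation under homomorphic encryption.

```python
import numpy as np

def semi_parallel_gwas(G, C, y):
    """Per-SNP effect estimates for y ~ intercept + covariates + SNP,
    computed for all SNPs at once.
    G: (n, m) genotype matrix, C: (n, p) covariates, y: (n,) phenotype."""
    n = len(y)
    X = np.column_stack([np.ones(n), C])           # intercept + covariates
    Q, _ = np.linalg.qr(X)                         # orthonormal basis of span(X)
    y_r = y - Q @ (Q.T @ y)                        # residualized phenotype
    G_r = G - Q @ (Q.T @ G)                        # residualized genotypes
    gTg = (G_r * G_r).sum(axis=0)
    beta = (G_r * y_r[:, None]).sum(axis=0) / gTg  # all m slopes in one pass
    rss = y_r @ y_r - beta**2 * gTg                # per-SNP residual sum of squares
    df = n - X.shape[1] - 1                        # full-model degrees of freedom
    se = np.sqrt(rss / df / gTg)
    return beta, beta / se                         # effect sizes and t-statistics
```

The returned slopes and t-statistics agree exactly with fitting a separate full regression per SNP, but the per-SNP work collapses to vectorized matrix operations, which is what makes the approach attractive for batched homomorphic evaluation.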
Reference: https://www.ncbi.nlm.nih.gov/pubmed/23711206
Example code: https://bitbucket.org/ksikorska/gwasp (check fast_covariates.r)
Evaluation Criteria: Each team should submit their solution in executable form (and source code if possible). We will check security compliance (at least a 128-bit security level), memory usage, and speed. The solution should be non-interactive, and we will vary the input size (number of SNPs and cohort size) to test the scalability of the submitted solutions.
Dataset: link
Track 3: Secure search of DNA segments in large genome databases
Databases of whole genome-scale genotypes are becoming available. Searching these databases for matched DNA segments (consecutive variant matches) will allow identification of distant relatives. However, privacy concerns must be addressed before a practical querying system can be adopted. The goal of this track is to develop a secure two-party computation solution to identify set-maximal matches [1] (longer than a threshold) between a query genome and a large database of genotypes, X = {x_1, x_2, …, x_M}, where x_i represents the vector of genotypes of the i-th genome over a vector of SNPs, S, sorted with respect to their locations on the genome. Given X and a query genotype vector z, a list of consecutive SNPs between SNP indices a and b (S_{[a,b]}, a < b) is set-maximal if:
- There is an x_i \in X such that the genotypes of x_i and z match exactly: x_{i,k} = z_k for k \in [a,b]
- S_{[a,b]} cannot be extended without a genotype mismatch between x_i and z, i.e., x_{i,(a-1)} \neq z_{a-1} or x_{i,(b+1)} \neq z_{b+1}
- There is no x_j \in X (j \neq i) such that the genotypes of z and x_j match exactly over a longer interval containing S_{[a,b]}.
Challenge: The task assumes that the database genomes do not contain errors; as a result, the output genomic segments represent perfect matches between the corresponding database and query genomes.
For testing purposes, we provide a database of 1,000 genomic sequences, each genome with 100,000 SNPs. We also provide a few sample queries, as well as the expected output for these queries.
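As a plaintext reference for the set-maximal definition, here is a brute-force, non-secure Python sketch suitable only for tiny inputs (real solutions would build on the PBWT of [1] inside a two-party protocol):

```python
def maximal_match_intervals(x, z):
    """All maximal runs [a, b] (inclusive) of positions where x[k] == z[k]."""
    runs, a = [], None
    for k, (xk, zk) in enumerate(zip(x, z)):
        if xk == zk:
            if a is None:
                a = k          # run starts here
        elif a is not None:
            runs.append((a, k - 1))
            a = None
    if a is not None:
        runs.append((a, len(z) - 1))
    return runs

def set_maximal_matches(X, z, min_len=1):
    """(i, a, b) triples: maximal matches of x_i with z, at least min_len SNPs
    long, that are not strictly contained in a longer match of another genome."""
    all_runs = [(i, a, b) for i, x in enumerate(X)
                for a, b in maximal_match_intervals(x, z)]
    out = []
    for i, a, b in all_runs:
        if b - a + 1 < min_len:
            continue
        dominated = any(j != i and aj <= a and b <= bj and (bj - aj) > (b - a)
                        for j, aj, bj in all_runs)
        if not dominated:
            out.append((i, a, b))
    return out
```

For example, with X = [[0,1,0,1,1,0], [1,1,0,0,1,0]] and z = [0,1,0,0,1,0], genome 0 matches z on [0,2] and [4,5], and genome 1 matches on [1,5]; the run [4,5] is swallowed by the longer match [1,5], so the set-maximal matches are (0, 0, 2) and (1, 1, 5).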
Reference:
[1] Durbin, Richard. "Efficient haplotype matching and storage using the positional Burrows–Wheeler transform (PBWT)." Bioinformatics 30.9 (2014): 1266-1272.
[2] Shimizu, Kana, Koji Nuida, and Gunnar Rätsch. "Efficient privacy-preserving string search and an application in genomics." Bioinformatics 32.11 (2016): 1652-1661.
Example code (for non-secure computation): https://github.com/richarddurbin/pbwt
Evaluation Criteria: Each team should submit their solution in executable form (and source code if possible). Each team can choose to preprocess (e.g., index) the database, which may be implemented as a non-secure algorithm; in this case, the preprocessing algorithm and the querying algorithm (for secure two-party computation) should be implemented and submitted separately. We will evaluate the submitted solutions on datasets of similar or larger size (different from the provided testing data) to test their scalability. We will check security compliance (at least a 128-bit security level), accuracy of the results (expected to be the same as, or very close to, the results of the non-secure algorithm), speed (database preprocessing time and query time are considered separately), and the size of the transferred data. The solution should be non-interactive, and we will vary the input size (number of SNPs and cohort size) to test scalability.
Dataset: link
Accessing the AWS instance(s): We provide "m5d.xlarge" instance(s) on AWS for the tasks above. The participants are expected to access the machine(s) via SSH and the private key ("kpr-2.pem") is provided upon request.
US east:
ssh -i "kpr-2.pem" cadet@ec2-18-222-20-158.us-east-2.compute.amazonaws.com
US west:
ssh -i "kpr-2.pem" cadet@ec2-54-153-102-241.us-west-1.compute.amazonaws.com