Four competition tasks
Task 1: Blockchain-based Recording of Biomedical and Genomic Compliance Training Certificates
(organized by Tsung-Ting Kuo and Lucila Ohno-Machado)
Background: Training certificates help ensure that researchers accessing sensitive protected data have received sufficient training to handle it in compliance with all applicable regulations. Many such certificates are still issued as PDF files (e.g., CITI [1]). The information contained in these certificates can be critical for education, research, and other practical purposes. Individuals have access to their certificates and provide them to requesting institutions. Centralized solutions are not ideal, since certificates are issued by various authorities and individuals want to retain control over them. Additionally, centralized solutions such as databases risk becoming a single point of failure [2]. The purpose of this task is to benchmark the use of blockchain, a decentralized technology, for storing and retrieving certificates.
Goal: The goal of this track is to develop smart contracts on a blockchain network to record and query human subjects’ training certificates or similar documents. Specifically, given a set of training certificate PDF files, each split into multiple sections of binary data (called “chunks”) together with metadata (participant name, certificate name, date of completion, etc.), contestants will design time-efficient data structures and algorithms in Ethereum Solidity to store and retrieve such data.
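To make the data model concrete, here is a minimal Python sketch of splitting one certificate PDF into fixed-size chunks paired with its metadata. The 4 KB chunk size and the field names are illustrative assumptions, not part of the official challenge format.

```python
# Illustrative only: the chunk size and metadata fields are assumptions,
# not the official challenge data format.
from dataclasses import dataclass
from typing import List

CHUNK_SIZE = 4096  # assumed chunk size in bytes

@dataclass
class ChunkRecord:
    participant: str   # participant name
    certificate: str   # certificate name
    completed: str     # date of completion, e.g., "2022-08-01"
    chunk_index: int   # position of this chunk within the PDF
    total_chunks: int  # number of chunks for this PDF
    data: bytes        # raw binary chunk

def chunk_pdf(path: str, participant: str, certificate: str,
              completed: str) -> List[ChunkRecord]:
    """Split one certificate PDF into fixed-size binary chunks."""
    with open(path, "rb") as f:
        blob = f.read()
    chunks = [blob[i:i + CHUNK_SIZE] for i in range(0, len(blob), CHUNK_SIZE)]
    return [
        ChunkRecord(participant, certificate, completed, i, len(chunks), c)
        for i, c in enumerate(chunks)
    ]
```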
Challenge: All data (including chunks and their metadata) and intermediary data (e.g., an index or cache of the original data) must be saved on-chain (i.e., no off-chain data storage is allowed) via smart contracts. We will provide the skeleton of the smart contracts and scripts. Note that the smart contract must implement the insertion of one chunk and its accompanying metadata per call; each participant can decide how each insertion is represented and stored in the contract. The two query functions should return (1) all metadata fulfilling a search criterion, and (2) the latest PDF file fulfilling a search criterion. Participants may not use any third-party libraries. There will be n nodes (e.g., n = 5) in the blockchain network, and users should be able to query the data from any of them. Participants can implement any algorithm that stores, retrieves, and presents the data correctly and efficiently.
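For orientation, a hedged sketch of how a client script might drive such a contract from Python with web3.py. The contract address, ABI, and function names (insertRecord, queryMetadata, queryLatestPDF) are placeholders, not the skeleton's actual interface, which is fixed by the organizers.

```python
# Hypothetical client-side driver. The contract address, ABI, and function
# names below are placeholders for whatever the official skeleton defines.
from web3 import Web3

w3 = Web3(Web3.HTTPProvider("http://localhost:8545"))  # any of the n nodes
CONTRACT_ADDRESS = "0x0000000000000000000000000000000000000000"  # placeholder
CONTRACT_ABI = []  # load the ABI compiled from your Solidity source
contract = w3.eth.contract(address=CONTRACT_ADDRESS, abi=CONTRACT_ABI)

# Insert one chunk plus its metadata (one transaction per chunk).
tx = contract.functions.insertRecord(
    "Alice Smith", "CITI Human Subjects", "2022-08-01", 0, 3, b"%PDF-chunk-0"
).transact({"from": w3.eth.accounts[0]})
w3.eth.wait_for_transaction_receipt(tx)

# Query 1: all metadata fulfilling a search criterion (read-only call).
records = contract.functions.queryMetadata("Alice Smith").call()

# Query 2: the latest PDF fulfilling a criterion, reassembled from chunks.
chunks = contract.functions.queryLatestPDF("Alice Smith").call()
pdf = b"".join(chunks)
```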
Evaluation: The data sharing system will need to demonstrate satisfactory performance (i.e., accurate query results) on a test dataset, which will be different from the training set provided online. The efficiency of each solution will be evaluated as follows. First, we will insert the certificate chunks and metadata lines from the nodes for a fixed period (e.g., 30 minutes) in the blockchain network. Then, we will query all nodes to verify that the insertions were done correctly. Finally, we will combine the average number of records inserted per second (33.33%), the average time to run the first query (33.33%), and the average time to run the second query (33.33%) into our final evaluation metric. We plan to evaluate the results using Go-Ethereum 1.10.7, Solidity 0.8.12 (with ABI Encoder v2 enabled), and Ubuntu 18.04.5; for fairness, submissions will not be evaluated on any other platforms/versions. No modification of the underlying Go-Ethereum source code is allowed. The submission should consist of a single file of Solidity source code.
License: This track requires every participating team to share their code and/or binaries under the BSD 3-Clause open-source license. The track co-organizers will upload all submitted code and/or binaries to a GitHub repository under the BSD 3-Clause License right after the competition results are announced. By submitting code and/or binaries, participants automatically consent to the track co-organizers releasing them under the BSD 3-Clause License.
Data and Code Skeleton: link
FAQ: link
References: [1] Collaborative Institutional Training Initiative (CITI), https://about.citiprogram.org/en/homepage/
[2] Tellew J, Kuo T-T. CertificateChain: decentralized healthcare training certificate management system using blockchain and smart contracts. JAMIA Open. 2022;5(1). doi: 10.1093/jamiaopen/ooac019. PubMed PMID: 35571362.
Task 2: Secure Model Evaluation on Homomorphically Encrypted Genotype Data
(organized by Miran Kim, Arif Harmanci, Xiaoqian Jiang)
Background: Prediction of phenotypes from high-throughput genomic data is an important problem in genomics. Especially with the emergence of large genotype-phenotype databases, phenotype prediction models can provide accurate estimates of phenotype information and disease risk. Third parties such as genealogy companies routinely use genetic data to estimate and report phenotype information to customers. Because of the sensitive nature of genetic and phenotypic data, protecting the confidentiality of individuals, patients, and database participants is a major challenge.
Challenge: We challenge participants to build a secure outsourcing protocol for phenotype prediction in which both the genotypes and the model parameters are protected. Participants will be asked to develop the prediction models (training in plaintext) and the secure model evaluation protocol (testing in ciphertext). We expect the model to be trained in plaintext on the challenge dataset, which contains genotypes and multiple phenotypes. The phenotypes will be simulated by the organizers as complex functions of the genotypes (due to the restricted-access requirements of existing datasets, we are not able to provide real data for this challenge). The encrypted trained model will then be evaluated on homomorphically encrypted data (assuming it can be hosted on untrusted servers) for performance measurement. The main challenge stems from jointly optimizing model accuracy and evaluation performance: more complex models may yield better predictions, but they may incur a computational cost high enough to render the evaluation practically infeasible.
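To illustrate the plaintext training side, a minimal sketch using scikit-learn. The file names and data layout are assumptions; a linear model is chosen here only because its inner-product form is cheap to evaluate homomorphically, not because any particular model is mandated.

```python
# Plaintext training sketch. File names, data layout, and the choice of a
# linear model are assumptions; any model trainable in plaintext is allowed.
import numpy as np
from sklearn.linear_model import Ridge

# Assumed layout: rows = individuals, columns = variants (genotypes 0/1/2).
X = np.loadtxt("genotypes.csv", delimiter=",")
Y = np.loadtxt("phenotypes.csv", delimiter=",")  # one column per phenotype

models = []
for j in range(Y.shape[1]):
    m = Ridge(alpha=1.0).fit(X, Y[:, j])
    models.append((m.coef_, m.intercept_))

# A linear model keeps homomorphic evaluation to a single inner product
# plus an addition, i.e., multiplicative depth 1.
```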
Experimental setting: In this scenario, there are three parties: Client, Modeler, and Evaluator. The Client wants to use their sensitive genotype data to perform phenotype prediction using the Modeler's models. The Modeler builds phenotype prediction models that take genotypes as input; because these models contain sensitive information, they cannot be shared in plain form and are released only in encrypted form. The Evaluator performs the outsourced model evaluation, taking encrypted genotypes and encrypted model parameters as input. The challenge involves building the models (for the Modeler) and securely evaluating the models on encrypted genotype data (for the Evaluator). As described above, the models and genotype data are sensitive and must be encrypted before they are sent to the Evaluator.
Submission site: https://uthtmc.az1.qualtrics.com/jfe/form/SV_8rlYH102Sl4e7qu
Requirement: This year, we do not allow any preprocessing of the genotype data before encryption; that is, there can be no plaintext preprocessing of the genotype data before it is encrypted, as such preprocessing can leak information about the prediction models. Participants can use any encryption scheme they wish as long as 128-bit security is satisfied per the homomorphic encryption standard [1].
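To make the ciphertext side concrete, a sketch of evaluating one linear model under CKKS with the TenSEAL library. This is not a mandated stack: the parameters shown follow the library's common tutorial settings, and participants must verify the security level of their own scheme and parameters against the standard [1]. For brevity the model weights are left in plaintext here, whereas in the task the model parameters must be encrypted as well.

```python
# Encrypted evaluation sketch using TenSEAL (CKKS). Parameters are
# illustrative; verify 128-bit security for your own scheme per [1].
# NOTE: weights are plaintext here for brevity; the task requires the
# model parameters to be encrypted too.
import tenseal as ts

context = ts.context(
    ts.SCHEME_TYPE.CKKS,
    poly_modulus_degree=8192,
    coeff_mod_bit_sizes=[60, 40, 40, 60],
)
context.global_scale = 2 ** 40
context.generate_galois_keys()

genotypes = [0.0, 1.0, 2.0, 1.0]   # one individual's (toy) genotype vector
weights = [0.5, -0.2, 0.1, 0.3]    # trained model coefficients
bias = 0.05

enc_x = ts.ckks_vector(context, genotypes)  # Client encrypts genotypes
enc_score = enc_x.dot(weights) + bias       # Evaluator computes blindly
print(enc_score.decrypt())                  # only the key holder can decrypt
```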
Evaluation Criteria: All submissions will be tested on a hold-out dataset of 198 genomes in an isolated environment. Accuracy and time/memory requirements will be used for benchmarking and ranking the submissions. Submissions will be required to complete the entire protocol in under 30 minutes. Please refer to the README file for more details.
Data and Code Skeleton: link
FAQ: link
License: This track requires every participating team to share their code and/or binaries under the BSD 3-Clause open-source license. The track co-organizers will upload all submitted code and/or binaries to a GitHub repository under the BSD 3-Clause License right after the competition results are announced. By submitting code and/or binaries, participants automatically consent to the track co-organizers releasing them under the BSD 3-Clause License.
Reference: [1] Albrecht et al., Homomorphic Encryption Standard, https://eprint.iacr.org/2019/939.pdf
Task 3: Confidential Computing for Clustering Single-Cell Transcriptomics Data
(organized by Haixu Tang and Xiaofeng Wang)
Background: In the past few years, single-cell RNA-seq technologies have advanced rapidly. Unsupervised learning methods such as dimension reduction and clustering algorithms are now widely used to discover different types or subtypes of cells among hundreds to thousands of single cells (e.g., from tumor or normal tissues) based on their gene expression profiles. Previous research has shown that gene expression patterns may reveal identifiable information about the donor of the tissue; therefore, proper protection of such data should be in place when the data are re-analyzed (e.g., in a meta-study) by an untrusted user and/or in a public computing environment (e.g., on a commercial cloud). A Trusted Execution Environment (TEE, e.g., Intel's SGX) provides an ideal infrastructure for hosting such privacy-preserving analyses of single-cell transcriptomics data by a data user, in which 1) only the computing task approved by the owner of the input data (i.e., clustering of single-cell gene expression profiles in this case) is allowed to be performed on the data; and 2) the data user does not see the content of the input data, which is encrypted in a way that only the client and the TEE can decrypt. In 2020, we organized a challenge for the efficient implementation of a single-cell clustering algorithm within a single enclave using Intel's SGX technology. The results showed that the secure algorithm is efficient for clustering thousands of cells, but does not scale well to tens of thousands of cells. As current single-cell RNA-seq analyses routinely involve tens to hundreds of thousands of cells, scalable clustering algorithms implemented using multiple enclaves (i.e., allowing for parallel computing) are required. The purpose of this task is to test the scalability of a parallel unsupervised clustering algorithm using multiple SGX enclaves when applied to massive single-cell RNA-seq data.
Challenge: We challenge participating teams to implement a parallel clustering algorithm (such as Para-DPMM [1], though a different clustering algorithm may be used) on the Intel SGX platform, so that the algorithm can operate on multiple enclaves using up to 4 threads in each enclave. Note that we attempt to simulate an elastic computing environment in a public cloud, where a single CPU package may be shared by jobs from different users, so only limited resources (i.e., 4 threads) are allocated to each enclave when it is initiated. The implementation should protect the data end to end: any input, intermediate, or output data should be encrypted whenever it is outside an enclave (including when communicated between enclaves). However, we do not consider side-channel leaks in this task.
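As one possible parallel pattern (not Para-DPMM itself, purely illustrative), the sketch below shards cells across worker processes, clusters each shard locally, and merges the shard-level centroids. In a real submission each worker would be an enclave capped at 4 threads and every exchanged array would be encrypted.

```python
# One possible parallel pattern (divide-and-conquer k-means), with plain
# processes standing in for enclaves. Not Para-DPMM; purely illustrative.
import numpy as np
from multiprocessing import Pool
from sklearn.cluster import KMeans

K = 10          # assumed number of cell clusters
N_WORKERS = 4   # each worker stands in for one enclave (<= 4 threads each)

def cluster_shard(shard: np.ndarray) -> np.ndarray:
    """Local clustering inside one 'enclave'; returns shard centroids."""
    return KMeans(n_clusters=K, n_init=10).fit(shard).cluster_centers_

if __name__ == "__main__":
    cells = np.random.rand(40000, 50)          # toy expression matrix
    shards = np.array_split(cells, N_WORKERS)  # partition cells across workers
    with Pool(N_WORKERS) as pool:
        local_centroids = pool.map(cluster_shard, shards)
    # Merge step: cluster the pooled shard centroids into the final K centers.
    merged = KMeans(n_clusters=K, n_init=10).fit(np.vstack(local_centroids))
    final_centers = merged.cluster_centers_
```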
Experimental setting: We will provide a testing dataset along with a reference implementation of the Para-DPMM clustering algorithm that does not use SGX. Each team is challenged to implement a parallel clustering algorithm (not necessarily DPMM) using SGX. For this purpose, a team may develop its own parallel algorithms as well as its own methods to exchange data and intermediate results between enclaves (encrypted per the security requirement), as long as the overall clustering accuracy is largely preserved and the data are fully protected (except for side-channel leaks). Testing datasets of different sizes (numbers of cells and genes) will be used to evaluate the scalability of the submitted solutions. A solution may utilize computational resources outside the enclave, including CPU, memory, and hard disk, as long as all the data and the model remain fully protected (encrypted at a security level of at least 128 bits). The submitted solution cannot involve any additional party. Pre-computing time will be measured as part of the performance overhead.
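Since any data leaving an enclave must be protected at the 128-bit level, here is a minimal sketch of sealing an intermediate result with AES-128-GCM via the Python cryptography package. In a real SGX solution the key would be provisioned through remote attestation or sealing, rather than generated locally as shown.

```python
# Sketch of protecting intermediate results that leave an enclave, using
# AES-128-GCM. In a real SGX deployment the key would come from remote
# attestation / sealing, not a local generate_key() call.
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=128)  # 128-bit security level
aead = AESGCM(key)

intermediate = b"sufficient statistics from enclave 0"
nonce = os.urandom(12)  # must be unique per message under the same key
sealed = aead.encrypt(nonce, intermediate, b"enclave0->enclave1")

# The receiving enclave (holding the same key) authenticates and decrypts.
restored = AESGCM(key).decrypt(nonce, sealed, b"enclave0->enclave1")
assert restored == intermediate
```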
Requirement: Each participating team should submit their implementation together with its source code. We will provide remote access to an SGX system on which teams can install their implementations for evaluation. Teams are responsible for the compatibility of their implementation with the system. The evaluation will be done using unreleased testing data.
Evaluation Criteria: All submissions must meet the security requirements (at least a 128-bit security level). The secure algorithms will then be evaluated based on the quality of their clustering results, measured by the Adjusted Rand Index (ARI). We will also compare the performance (running time) of submissions whose accuracies are close.
Test Platform Information
OS: Ubuntu 20.04
Kernel: 5.15.0-46-generic
CPU: Intel(R) Xeon(R) Platinum 8352Y (2-socket)
Memory: 256 GB
EPC Size: 128 GB
Data and Code Skeleton (password: iDash2022): link
FAQ: link
Reference: [1] Duan, T., Pinto, J.P. and Xie, X., 2019. Parallel clustering of single cell transcriptomic data with split-merge sampling on Dirichlet process mixtures. Bioinformatics, 35(6), pp.953-961.
Task 4: Secure Record Linkage
(organized by Haixu Tang and Xiaofeng Wang in collaboration with Ruiyu@Meta)
Background: Record linkage, which aims to link records from two or more distinct sources, is commonly adopted in biomedical informatics, e.g., to link the electronic health records (EHRs) of the same patient across medical institutions in order to build a longitudinal health profile, or to de-duplicate individual records across multiple databases. Due to concerns about patient confidentiality and/or institutional data sharing policies, patient records, in particular the identity keys (e.g., the patient's name, SSN, address, etc.), cannot be directly shared with the other party. Therefore, a secure algorithm is needed to achieve record linkage while the protected keys are shared only in encrypted form. In a typical scenario of secure record linkage, two parties each hold a private dataset of patient records and wish to identify the patients shared by both datasets based on matched identity keys (e.g., first/last name, email, phone number, SSN, address/zip code, birth date, gender, etc.). Notably, record linkage is non-trivial in practice because 1) some keys may be missing, while others may be inaccurate in both datasets (e.g., the address may have changed, or the name may contain typos); and 2) one dataset may contain multiple records from the same or very similar (e.g., from the same family) patients. Here, the task is to implement a secure record linkage algorithm for two parties, each holding a private dataset of patient records, to identify the shared records using multiple, possibly non-unique, keys without sharing either party's dataset with the other in plaintext. We expect the algorithm to scale to datasets of millions of records and not to involve a trusted third party in the computation.
Challenge: We challenge participating teams to implement a secure two-party computation algorithm that uses a scoring function over multiple identity keys to identify matched records. The scoring function should reflect the different weights of different keys: some keys carry deterministic weight when they are not missing (e.g., SSN), while other keys (e.g., birth date, email address) carry lower weights depending on their potential for error. The error rate and the probability of a missing value for each key will be given in advance. The algorithm will output a list of matched records (IDs) ranked by score. Participating teams may consider (but are not required) using the MPC framework developed by Meta (https://github.com/facebookresearch/fbpcf) to implement their solutions.
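For reference, a plaintext Python sketch of the kind of weighted scoring function the secure protocol must reproduce. The weights and the exact-match comparison are illustrative assumptions; the organizers supply the actual per-key error and missingness rates.

```python
# Plaintext reference for the scoring logic a 2PC protocol would reproduce.
# The weights and exact-match test below are illustrative assumptions; the
# organizers provide the actual per-key error/missingness rates.
WEIGHTS = {
    "ssn": 10.0,        # near-deterministic when present
    "email": 4.0,
    "birth_date": 3.0,  # lower weight: typos and errors possible
    "last_name": 2.0,
    "zip": 1.0,
}

def match_score(rec_a: dict, rec_b: dict) -> float:
    """Sum the weights of keys that are present in both records and agree."""
    score = 0.0
    for key, w in WEIGHTS.items():
        a, b = rec_a.get(key), rec_b.get(key)
        if a is None or b is None:  # missing value: key contributes nothing
            continue
        if a == b:                  # exact match; a fuzzy match could
            score += w              # instead add a discounted weight
    return score

# Pairs scoring above a threshold would be reported, ranked by score.
```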
Input format: A database of patient records (each a set of keys) for each party. Each record in either database is assigned a unique ID, which should be used only in the output. Some records have missing data, and it is not rare for multiple records to share some identity keys (e.g., the same first/last name or the same address).
Output format: The two parties obtain a secret-shared list of IDs representing the matched records.
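To clarify what "secret-shared" means here, a minimal sketch of additively sharing the output ID list modulo 2^32: each ID is split into two uniformly random-looking shares, one per party, so neither party alone learns the matched IDs. The modulus is an assumption for illustration.

```python
# Minimal illustration of additively secret-sharing the output ID list
# modulo 2**32 (modulus assumed); neither share alone reveals the IDs.
import secrets

MOD = 2 ** 32

def share(ids):
    shares_a, shares_b = [], []
    for x in ids:
        r = secrets.randbelow(MOD)      # party A's share: uniform random
        shares_a.append(r)
        shares_b.append((x - r) % MOD)  # party B's share
    return shares_a, shares_b

def reconstruct(shares_a, shares_b):
    return [(a + b) % MOD for a, b in zip(shares_a, shares_b)]

matched_ids = [1017, 2044, 9998]
a, b = share(matched_ids)
assert reconstruct(a, b) == matched_ids
```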
Security requirement: The two semi-honest parties should learn no extra information except:
1. Number of records in the peer's database
2. Number of matched records
3. Number of missing values for each identity key
Evaluation Criteria: Each team should submit a single solution. If you plan to submit multiple solutions that use substantially different approaches, please contact the organizers for approval. All submitted solutions must fulfill the security requirement; otherwise, the submission will be disqualified. Qualified solutions will be ranked by their accuracy (i.e., outputting the correct list of matched records). For solutions with the same or very close accuracy (i.e., within a 1% difference), we will evaluate the speed of the algorithm. For details of the data, please see the FAQ. Guidelines for submitting your solution will be posted on the website later. (To submit your result, please send one email to zhu11@iu.edu that includes your team name; we will create a OneDrive folder for your team, to which you should upload your solution. Each team can submit only one solution, so please send only one email.) We will run each program for no more than 24 hours, with up to 32 GB of memory on a quad-core CPU.
Data and Code Skeleton (updated Aug 11, password: iDash2022): link
FAQ: link
Submission: Each participating team should submit its solution as two containers based on the official Ubuntu Docker image, each of which can be run on an independent server running Ubuntu. The containers should set up a communication channel to exchange the (protected) input and other necessary data. Each team is also expected to submit the source code of their solution; the source code will be used only for evaluation purposes (i.e., checking security compliance) by the organizers.
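As a bare-bones illustration of such a channel, the sketch below opens a plain TCP connection between the two containers. The host name and port are assumptions, and in a real solution the payload would be the protocol's already-protected messages.

```python
# Bare-bones TCP channel between the two containers; host/port are assumed.
# The payload exchanged would be the protocol's already-protected messages.
import socket

HOST, PORT = "party_b", 9000  # e.g., the peer container's service name

# Party B (server container)
def serve():
    with socket.create_server(("0.0.0.0", PORT)) as srv:
        conn, _ = srv.accept()
        with conn:
            msg = conn.recv(4096)
            conn.sendall(b"ack:" + msg)

# Party A (client container)
def connect_and_send(payload: bytes) -> bytes:
    with socket.create_connection((HOST, PORT)) as s:
        s.sendall(payload)
        return s.recv(4096)
```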