Question: What is the configuration of the VM?
Reply: The VM image (link) has 4 cores, 4 GB of memory, and a 200 GB disk, and can be launched using VMware Player.
Question: What are the criteria when judging the submissions?
Reply: The implemented programs will be tested on a VM. Performance will be measured by the overall running time (including communication for challenge 2) and the amount of data exchanged, as well as the precision of the calculated statistics.
Question: What is the hardware specification used for the judgment? Does it support hardware AES-NI? How many cores are in the CPU? What is the network bandwidth?
Reply: We provide a VM template (link). The image has 4 cores, 4 GB of memory, and a 200 GB disk. It does not support hardware AES-NI, and the bandwidth is up to 10 Mbps.
For challenge 1:
Question: It is asked to use homomorphic encryption: does this mean specifically *fully* (or "somewhat") homomorphic encryption, semi-homomorphic, or either?
Reply: The participating teams can use any kind of homomorphic encryption scheme (e.g., fully, semi-, or partially homomorphic encryption schemes).
Question: Besides the initial encryption, is any interaction allowed between the parties (random share protocols, etc.)?
Reply: Challenge 1 mainly focuses on homomorphic encryption. Interaction between the data owner and the remote cloud is limited to uploading the data and obtaining the result. In contrast, challenge 2, with the same tasks, allows each participating team to use any SMC protocol.
Question: If not, it seems like certain tasks such as returning the MAF would not be possible without breaking basic security assumptions.
Reply: As challenge 1 focuses on secure outsourced computation, the data owner can be involved in certain calculations. For example, the data owner can encrypt the genotype data and store it in a remote cloud. The untrusted remote cloud can then compute the allele frequencies, as well as the numerator and the denominator of the chi-square test. Finally, the data owner can help compare the MAF and calculate the final chi-square statistic. We also encourage other advanced designs; the goal of this task is to maximize the outsourced computation and minimize the interaction/communication.
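As a plaintext illustration of this division of labor (a minimal sketch only: in a real submission the cloud-side arithmetic would be carried out over homomorphically encrypted counts, and the function names and counts below are made up):

```python
# Sketch of the client/cloud split for the 2x2 allele-count chi-square test.
# In an actual solution the cloud would evaluate these sums and products
# over encrypted values; here everything is shown in the clear.

def cloud_side(case_minor, case_major, ctrl_minor, ctrl_major):
    """Untrusted cloud: compute the numerator and denominator of the
    chi-square statistic, avoiding any division."""
    a, b, c, d = case_minor, case_major, ctrl_minor, ctrl_major
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator, denominator

def client_side(numerator, denominator):
    """Data owner: decrypt the two values and perform the final division."""
    return numerator / denominator

num, den = cloud_side(30, 170, 10, 190)  # hypothetical allele counts
chi_square = client_side(num, den)       # ~11.11
```

Keeping the division on the client side reflects the constraint that most homomorphic schemes support addition and multiplication but not division over ciphertexts.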
Question: Finally, we are not sure what is meant by "80-bit security level": which could refer to many different things, entirely dependent on the choice of crypto system.
Reply: An 80-bit security level means that it will take a computer, on average, approximately 2^80 operations to break the encryption. For different cryptosystems, the participating teams should find a set of parameters that achieves an equivalent security level.
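As a rough guide for choosing such parameters (rule-of-thumb equivalences along the lines of NIST SP 800-57, not figures supplied by the organizers; always check the security analysis of your chosen scheme):

```python
# Approximate parameter sizes commonly cited as equivalent to 80-bit security
# (rule-of-thumb values only; verify against the scheme's own analysis).
security_80bit = {
    "symmetric key": 80,               # bits
    "RSA / DH modulus": 1024,          # bits
    "elliptic-curve group order": 160, # bits
}
```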
Question: For task 1, will the provided data be the data used in the judgment?
Reply: No, we will use reserved data for testing; its format and scale will stay consistent with the provided data.
Question: For chi-square, what is the desired precision level? 1e-3? 1e-6? Relative error or absolute error?
Reply: The minimum precision for chi-square is 10^-3. The desired precision is around 10^-7, measured as absolute error.
For challenge 2:
Question: Data format. Will the data come in a file in the same format as the provided files? In particular, for task 1, both provided files are said to include data from two contributing institutions, while I would expect institution 1 (and similarly institution 2) to have only its own data. Can we separate the data into two files?
Reply: Yes, when we test your program, we will use data in the same format as the example shown on the website. You are right that for task 1 the current files contain data for both institutions; these will be partitioned into two sets consisting of the genotypes of non-overlapping human subjects. We will update the website with an example of partitioned data (two datasets, each for one participating institution).
Question: How many nodes are allowed for SMC?
Reply: We will allow up to three remote nodes, instantiated from our VM template; the third node should carry as little computation as possible.
Question: Input size. Could you comment on the size of the inputs (or at least their upper bounds) on which the programs will be run? In particular, for task 1, the combined number of participants (from institution 1 and institution 2) and the number of SNPs define the input size. Can we assume that the entire input can be read into the memory (and processed within the memory, where the size of allocated memory will have to be larger than the input size)?
Reply: The training set has 200 individuals. The test set will be no more than 300. Yes, all the input can be read into the memory.
Question: MAF computation. Is it correct that for MAF computation there will be only one input file per institution (as opposed to separate case-group and control-group files)? Is it also OK to produce output in the form 4/10 (or, say, 54/200) as in your example? Is it OK to assume that the program will need to process all SNPs in the input file and then output a sequence of frequencies (e.g., 3/10, 4/10, 1/10, ...)? Lastly, is a SNP guaranteed to have exactly two alleles?
Reply: There will be a partition in each of the case and control groups, which should be combined to compute the MAF (please see my comments above).
4/10 (or 54/200) is okay; in fact, the total number of human subjects in the combined data is assumed to be public information. Your program can output a sequence of frequencies (or counts) in the same order as the SNPs in the input file, or alternatively output a table with two columns for the SNP ID and the frequency, respectively. Finally, in our testing case, we assume SNPs always have two alleles.
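As a plaintext sketch of the accepted output format (the SNP IDs and counts below are made up; the total allele count is treated as public, per the reply above):

```python
# Emit minor-allele frequencies as count fractions, one per SNP, in the
# same order as the input; a two-column (SNP ID, frequency) table is an
# acceptable alternative.

snp_counts = [("rs0001", 54), ("rs0002", 3), ("rs0003", 41)]  # hypothetical
total_alleles = 200  # e.g., 100 subjects x 2 alleles; assumed public

table = [(snp_id, f"{count}/{total_alleles}") for snp_id, count in snp_counts]
for snp_id, frac in table:
    print(snp_id, frac)
```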
Question: Chi-square test. Similar to the MAF computation above, I assume that one value per SNP will need to be reported after processing all SNPs in the input files. As you mentioned earlier, 10^-7 precision is sufficient for this test, which means that I need to compute 24 bits of the chi-squared value. Is my understanding correct?
Reply: We expect a precision of at least 10^-3. The more precise the better, but 10^-7 is sufficient for the task. This will also be part of the final evaluation.
Question: Execution. Is it correct to assume that institution 1 will have its input data stored in a file (or files) and institution 2 will have its input data stored in a file when the execution begins, but they can't exchange input-dependent data prior to that?
Reply: Yes.
Question: Similarly, is it correct that any third party used in the computation is assumed to have no input-dependent data when the computation commences?
Reply: Yes.
Question: Will the output need to be produced by both input parties or is having one of them produce the output going to suffice?
Reply: One party is enough.
Question: Will the programs start at all distributed locations at exactly the same time for timing purposes? If not, what would the mechanism be for discounting idle time spent waiting for other parties to start? (We could include time measurements of meaningful work from inside the program, but there may need to be a signaling mechanism to indicate that all parties are online.)
Reply: We will evaluate the execution time on both sides. We acknowledge that there will be variation due to synchronization and network delays, and therefore plan to rerun the programs to suppress the randomness.
Most importantly, we will evaluate all participants' code in the same way, through multiple trials, to ensure fairness.
Question: Can you provide information about how each program can locate the other parties (e.g., by including their IP addresses in the program), to ensure that the parties can compute together? (IP addresses are not needed at this point, but rather the assurance that the parties will know each other's IP addresses.)
Reply: We will arrange this during the evaluation. Note that we will use domain names instead of IP addresses, which can be dynamic.
Question: Will we be allowed to compute any information "offline" (modeling computation done before the two parties interact, or before the genomes are received)?
Reply: For task 1 in challenge 1, the participating teams can implement different designs for the genotype data encryption, but the allele frequencies and the chi-square test must be calculated on the server side over the encrypted genotype data.
For task 2 in challenge 1, each team can preprocess the VCF file before encryption, but the comparison operation (i.e., max(D(x), D(y))) and the distance calculation (e.g., D(x) = 10 + len(x)) must be done on the server side using homomorphic encryption. For example, you can choose how to encrypt 'x' for calculating 'len(x)'.
We can tolerate some minimal client participation after the encrypted outcomes are received (to decrypt them and do some minimal calculation, e.g., a division to obtain the final results), but preference will be given to solutions that can do everything in encrypted form.
Question: You mention we will be graded on both performance and precision. What is the tradeoff in terms of evaluation criteria? Will competitors with equal precision be compared together?
Reply: We prefer better performance, as long as the results achieve the minimum precision of 10^-3. Teams with similar performance in terms of execution time will be further compared on precision.
Question: Should our submissions be in a specific format in order to interface with your evaluation framework? In addition, are we allowed to amend the packages installed on the VM (which would ultimately need to be installed on the test machine as well)?
Reply: You can install any package in the VM. We will evaluate the performance of your submission on separate testing data in the VM; therefore, we would prefer that you upload your pre-configured VM for evaluation.
Question: Are both parties intended to run on the same virtual machine for challenge one? Will the evaluation be done on a single machine as well?
Reply: For challenge one, we can put both the 'client' and the 'server' in the same VM, where the 'client' encrypts the data using homomorphic encryption and the 'server' calculates either task 1 or task 2 over the encrypted data. Finally, the client decrypts the results. A single VM will run on one physical machine.
Question: What kind of testing data will be used for evaluating Edit distance calculation?
Reply: We will use two sequences for testing:
Seq1: 4,542,542 variations
Seq 2: 4,368,847 variations
There are 2,336,793 common variations between them. The remaining ~2M variations are individual variations (including SNP, SUB, INS, and DEL). There are around 400k indels (insertions + deletions) in each of these sequences. The position range and variation count on each chromosome of the two sequences are listed below.
seq 1
chrom ID, [min position, max position], # of variations
1 [30867, 249238538] 328786
2 [18505, 243185845] 364396
3 [60596, 197931689] 312841
4 [12837, 191043251] 324701
5 [12224, 180789191] 289287
6 [67960, 171050996] 294629
7 [12782, 159128336] 260517
8 [17404, 146301337] 241479
9 [31018, 141142517] 183632
10 [60968, 135508256] 223465
11 [66909, 134945709] 218189
12 [61098, 133841236] 208803
13 [19021373, 115109786] 161210
14 [19025578, 107289204] 145224
15 [20006753, 102521054] 132244
16 [61266, 90284744] 145509
17 [302, 81194198] 125940
18 [11256, 78017072] 128253
19 [61292, 59113910] 106176
20 [61794, 62964221] 98898
21 [9411601, 48119802] 70692
22 [16051294, 51230502] 58494
M [72, 16518] 61
X [60002, 155259520] 115149
Y [2660546, 59033299] 3967
seq2
chrom ID, [min position, max position], # of variations
1 [38231, 249239872] 328792
2 [11319, 243168368] 358811
3 [60595, 197948402] 299315
4 [10430, 191043338] 320977
5 [12008, 180853773] 268619
6 [60210, 171051046] 274716
7 [12361, 159128633] 238612
8 [15716, 146301337] 228112
9 [10868, 141144794] 176402
10 [60968, 135522699] 213151
11 [66909, 134945709] 228990
12 [82810, 133841395] 204254
13 [19021222, 115109786] 162472
14 [19000059, 107288875] 146057
15 [20000994, 102520787] 131262
16 [60213, 90289070] 133985
17 [302, 81194198] 122052
18 [10658, 78016911] 117993
19 [61292, 59118039] 96119
20 [61097, 62965184] 89682
21 [9411784, 48119668] 73137
22 [16052238, 51224645] 59226
M [72, 16361] 28
X [63533, 155255660] 94029
Y [2661693, 59033431] 2054
We will extract up to 100k variations (sampled uniformly at random within a selected region of the ~4M indexes from each sequence) to generate the VCF files used to test your algorithm. We will upload the positions of both full sequences to the website:
http://www.humangenomeprivacy.org/2015/data/vcf/sequencePosition.zip (the first column is the chromosome; the second is the position)
Sample Python code for calculating the edit distance on our provided VCF files is also provided here.
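For orientation, here is a minimal plaintext sketch of scoring two variant sets under the rule quoted earlier (D(x) = 10 + len(x), with max(D(x), D(y)) where both sequences carry different variants at the same position). This is only our reading of the task description, not the organizers' sample code:

```python
# Distance between two variant sets keyed by (chromosome, position).
# Assumed rule: a variant present in only one sequence costs 10 + len(ALT);
# differing variants at the same position cost the max of the two costs;
# identical variants cost nothing.

def var_cost(alt):
    return 10 + len(alt)

def variant_distance(vars1, vars2):
    """vars1, vars2: dicts mapping (chrom, pos) -> ALT string."""
    total = 0
    for key in vars1.keys() | vars2.keys():
        x, y = vars1.get(key), vars2.get(key)
        if x is not None and y is not None:
            if x != y:
                total += max(var_cost(x), var_cost(y))
        else:
            total += var_cost(x if x is not None else y)
    return total

d = variant_distance({("1", 100): "A", ("1", 200): "TT"},
                     {("1", 100): "A", ("1", 300): "G"})  # 12 + 11 = 23
```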
Question: Task 1: We assume that the total number of subjects in case and control groups is public, which implies that the number of subjects from each organization is also public (since it is computable from one's input and the total number of subjects). Is it fair to make this assumption?
Reply: Yes, this is reasonable.
Question: Task 1: Is there an expected number of records (or an upper bound on the number of records) in files on which this task will be run?
Reply: We can assume an upper bound is given (for example, in this case, say, 10 million).
Question: Task 2: Will full padding be necessary, or will the parties know certain information about the expected percentage of records of each type, in which case the number of padding records can be smaller? For example, if records of the DEL and INS types are always present and constitute, say, no less than 10% of the total number of records, then we need to pad the input to 90% of the file size instead of the full 100%. Similarly, if SUB records are not very common and their number never exceeds a certain percentage (e.g., 20%), we would like to have different procedures for processing SNP and SUB records, where both lists are padded to sufficient sizes to hide the exact number of entries of each type. Can we make such assumptions and have each organization enter two lists of SNP and SUB records, respectively (padded to hide the exact number of records)? If so, what would be safe padding bounds?
Reply: It is reasonable to assume a (relatively wide) range for the fraction of SUB/SNP records in each file. For example, we can assume SUB records will constitute 5%-30% of the records, if that is useful. But we cannot assume a very specific range.
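As an illustration of padding against such a bound (a sketch with made-up records; a real protocol would need dummy records that are indistinguishable from genuine ones):

```python
import math

def pad_subs(sub_records, total_records, max_fraction=0.30, dummy="PAD"):
    """Pad the SUB list up to the public upper bound (here the assumed 30%
    of all records) so the true SUB count stays hidden."""
    target = math.ceil(max_fraction * total_records)
    assert len(sub_records) <= target, "true SUB count exceeds assumed bound"
    return sub_records + [dummy] * (target - len(sub_records))

padded = pad_subs(["SUB1", "SUB2"], total_records=20)  # padded to 6 entries
```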
Question: Task 2: In a previous email we asked about the maximum length of the REF and ALT sequences that may appear in the input file; your answer set a somewhat arbitrary maximum size of 100. However, we need the exact maximum length, since it affects our work. Is there a maximum length for these fields?
Reply: It can only be estimated ad hoc right now. According to the 1000 Genomes Project, a novel sequence insertion can occasionally be quite long (e.g., >1000 bp), so an absolute upper bound for the real-world case could be a quite large number. However, for our testing purposes, we can ensure that no insertion sequence will be longer than 100 bp.
Question: As you may know, certain SMC techniques (MPC in the so-called preprocessing model) require a preprocessing phase before the secure computation can be done. The preprocessing generates some shared data between the parties that will be used during the secure computation itself. However, this preprocessing does not depend on the input to the computation; e.g., in this case the preprocessing phase is completely independent of the concrete VCF files you would use to test our solution. This means we could potentially do this preprocessing step before we deliver our solution to you, to increase the performance of the solution. My question is whether this will be allowed in the competition.
Reply: The preprocessing time will not be counted if it is a one-time computation that can be reused by any instance of SMC between the same parties.
If it is required for each SMC computation, then it will be counted.