Track 2: Federated dataset selection for collaborative machine learning using biomedical data

Organized by Haixu Tang, XiaoFeng Wang, Yongming Fan, Zihao Wang and Rui Zhu

Background

Machine learning (ML) models have been developed to predict the risks of complex diseases such as cancer from phenotype data [1, 2, 3, 4]. In practice, building an ML model often requires collaboration across multiple cohorts held by different organizations. However, organizations are often reluctant to share human-subject data because of restrictions in their data sharing policies. It is therefore highly desirable to enable two or more participants (clients) to collaboratively build an ML model without directly sharing data, a scenario often termed federated learning or collaborative learning. Current literature indicates that federated learning (FL) can integrate multicenter phenotype data while protecting privacy, thus enhancing the accuracy of cancer prediction models [5, 6, 7]. Indeed, FL has shown significant advantages in cancer subtype classification, drug response prediction, and tumor biomarker discovery.

One of the primary challenges in applying FL to disease prediction is the potential bias or skewness in local data from various clients. Without appropriate data filtering or correction, integrating these skewed datasets may result in an ML model that underperforms those trained solely on local data [8, 9, 10]. This issue arises because the model attempts to generalize across skewed data inputs, leading to potentially inaccurate or unfair predictions. Currently, an effective solution to mitigate skewness involves assigning different weights to the intermediate results uploaded by clients. Clients that may negatively impact the final result are assigned lower weights, while those that are likely to positively influence the final model performance receive higher weights. However, determining which clients are more likely to have a beneficial impact on the final model’s performance remains a challenging task.
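
As a minimal sketch of the weighting idea described above, assuming each client uploads a flat parameter vector, the server could combine client updates as a normalized weighted average. The function name and data layout are illustrative and are not taken from the challenge code.

```python
from typing import List


def weighted_average(client_params: List[List[float]], weights: List[float]) -> List[float]:
    """Combine per-client parameter vectors using normalized client weights.

    Hypothetical helper: the real framework may exchange gradients or other
    intermediate results instead of raw parameter vectors.
    """
    if len(client_params) != len(weights):
        raise ValueError("need exactly one weight per client")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    dim = len(client_params[0])
    aggregated = [0.0] * dim
    for params, w in zip(client_params, weights):
        share = w / total  # normalize so the shares sum to 1
        for j, p in enumerate(params):
            aggregated[j] += share * p
    return aggregated


# A client with weight 3 pulls the result three times as hard as one with weight 1.
print(weighted_average([[1.0, 2.0], [3.0, 4.0]], [1.0, 3.0]))
```

Lowering a client's weight shrinks its share of the aggregate, which is how a skewed client can be down-weighted without being excluded outright.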

Challenge

Our challenge focuses on federated survival analysis using the Cox model [11, 12], i.e., predicting the survival rate from the input phenotypes of patients. We invite participating teams to propose a weighted aggregation algorithm that demonstrates high generalizability. The goal is to optimize the aggregation by assigning each client a different weight according to its data quality or bias [13, 14, 15].
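
For reference, the standard Cox proportional hazards model and its partial log-likelihood (without tie handling) can be written as below. The symbols are conventional notation rather than names from the challenge code, and the provided framework may use a variant such as the discrete-time formulation in [11].

```latex
h(t \mid x_i) = h_0(t)\,\exp\!\left(\beta^{\top} x_i\right),
\qquad
\ell(\beta) = \sum_{i:\,\delta_i = 1}
  \left[ \beta^{\top} x_i - \log \sum_{j:\,t_j \ge t_i} \exp\!\left(\beta^{\top} x_j\right) \right]
```

Here h_0(t) is the baseline hazard, x_i is the phenotype vector of patient i, β is the coefficient vector, t_i is the observed time, and δ_i is 1 if the event was observed and 0 if the patient was censored.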

We will provide phenotype datasets (including labels) from multiple clients and a predefined FL framework with preset hyperparameters. Teams are tasked with developing an algorithm that assigns each client a weight between 0 and 1. The baseline assigns weights according to the number of samples each client holds.
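
The sample-count baseline mentioned above might look like the following sketch. It assumes the weights are normalized to sum to 1, which keeps each weight between 0 and 1; the function name is hypothetical and the actual baseline in the repository may differ.

```python
from typing import List


def sample_count_weights(num_samples: List[int]) -> List[float]:
    """Weight each client by its share of the total number of samples."""
    total = sum(num_samples)
    if total <= 0:
        raise ValueError("at least one client must hold samples")
    return [n / total for n in num_samples]


# Three hypothetical clients holding 100, 300 and 600 samples.
print(sample_count_weights([100, 300, 600]))
```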

We provide the code and the training data for our federated Cox analysis at https://github.com/idash2024/iDash2024. Everything except the weighted aggregation algorithm will be fixed. The parameters and features passed into the weighted aggregation algorithm can be adjusted based on your needs. Participants can use the baseline code to fine-tune their weighted aggregation algorithm and aim to achieve the best possible performance on an unreleased testing dataset.

Evaluation Criteria

We will use two evaluation criteria:

  • The generalizability of the submitted weighted aggregation is evaluated with the concordance index (c-index). It measures how well the predicted risk scores rank patients relative to their times-to-event on an unreleased testing dataset. A high c-index indicates accurate inverse ranking: patients with higher risk scores tend to have shorter times-to-event.
  • The efficiency of the submitted weighted aggregation is evaluated by its computation and communication cost, measured as the training time required by the method. Lower times indicate higher efficiency.
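
To make the c-index concrete, here is a small pure-Python sketch of Harrell's concordance index. It is an illustration only: the organizers' evaluation code may handle ties and censoring differently, and the argument names are assumptions.

```python
from typing import Sequence


def concordance_index(times: Sequence[float], events: Sequence[int],
                      risk_scores: Sequence[float]) -> float:
    """Harrell's c-index: fraction of comparable pairs ranked correctly.

    A pair (i, j) is comparable when patient i had an observed event and
    times[i] < times[j]. It is concordant when the patient with the shorter
    time has the higher risk score; tied scores count as half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # censored patients cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Higher risk for shorter survival gives a perfect inverse ranking.
print(concordance_index([1, 2, 3], [1, 1, 1], [3.0, 2.0, 1.0]))
```

The quadratic loop is adequate for small cohorts; dedicated survival analysis libraries provide faster implementations.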

If the c-index values of two submissions differ by less than 1%, we will compare their efficiency for the final ranking. Otherwise, we will use the c-index alone for ranking.
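
The pairwise ranking rule can be sketched as follows. This assumes that "1%" means an absolute c-index difference of 0.01 and that efficiency is summarized by training time in seconds; both are interpretations, not statements from the organizers.

```python
from typing import Tuple

Submission = Tuple[float, float]  # (c_index, training_seconds), hypothetical layout


def better_submission(a: Submission, b: Submission, tie_margin: float = 0.01) -> Submission:
    """Return the higher-ranked of two submissions under the rule above."""
    if abs(a[0] - b[0]) < tie_margin:
        # c-index values are close, so the faster method ranks higher.
        return a if a[1] <= b[1] else b
    # Otherwise the c-index alone decides.
    return a if a[0] > b[0] else b


print(better_submission((0.705, 120.0), (0.700, 90.0)))  # close c-index: faster one ranks higher
print(better_submission((0.720, 300.0), (0.700, 90.0)))  # clear c-index gap: higher c-index ranks higher
```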

Experimental setting

We will provide the clients' datasets (D1, D2, …, DC) along with an FL framework. Each team needs to design a weighted aggregation algorithm that computes a value between 0 and 1 for each Di (i=1, 2, …, C). We will run the source code on our server to evaluate the algorithm’s efficiency and produce the resulting model. We will then test this model on an unreleased holdout testing dataset (Dtest) to assess the algorithm’s effectiveness.

Test Platform Information OS

TBD

Data Skeleton

The dataset provided contains various attributes related to patients, including demographic information, clinical features, and specific measurements labeled as ‘E’ and ‘T’. It is formatted as a .csv file with the header indicating the following 39 attributes:

  • Demographic features: age_at_index; ethnicity (not_hispanic_or_latino, not_reported); race (asian, black_or_african_american, not_reported, white).
  • Clinical features: ajcc_pathologic (m_MX, n_N0, n_N0 (i-), n_N1, n_N1a, n_N1b, n_N1mi, n_N2, n_N2a, n_N2b, n_N3, n_N3a, n_N3b, n_N3c, t_T1, t_T1a, t_T1b, t_T1c, t_T2, t_T3, t_T4, t_T4a, t_T4b, t_T4d); treatment_or_therapy (not_reported, yes); tumor_stage (stage_i, stage_ia, stage_iia, stage_iib, stage_iiia, stage_iiic).
  • Specific measurements: E and T.
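
A file with this layout could be split into features and labels along the lines of the sketch below. It assumes that E is an event indicator and T a time-to-event, which is the usual survival-analysis convention but is not stated in this description. The toy rows and the single feature column are made up for illustration, and the exact header spelling in the real file may differ.

```python
import csv
import io
from typing import Dict, List, Tuple


def split_features_and_labels(
    csv_text: str,
) -> Tuple[List[Dict[str, float]], List[int], List[float]]:
    """Separate the E and T columns from the remaining feature columns."""
    features, events, times = [], [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        events.append(int(float(row.pop("E"))))  # assumed: 1 = event observed
        times.append(float(row.pop("T")))        # assumed: time-to-event
        features.append({name: float(value) for name, value in row.items()})
    return features, events, times


# Toy data, not taken from the challenge dataset.
toy_csv = "age_at_index,E,T\n60,1,365\n45,0,1200\n"
features, events, times = split_features_and_labels(toy_csv)
print(features, events, times)
```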

FAQ:

For questions and clarifications, please check and post on FAQ @

https://docs.google.com/document/d/17eZwmLRLEe_3GjAdplPi5Hi--bpP1iaNGZwHZovu5Tc/edit?usp=sharing

If your question is not in the FAQ, please contact us at fan322@purdue.edu.

Submission:

Please send your submission to fan322@purdue.edu with the email subject “iDash track 2 submission from team xxx,” where xxx is your registered team name. Please attach a zip file containing the complete, runnable code (based on https://github.com/idash2024/iDash2024). Ensure that it can be executed by running `python3 federated.py`. Additionally, include a brief description of your solution. We will test it accordingly.

Terms of Use

The data terms can be found at https://gdc.cancer.gov/access-data/data-access-processes-and-tools. Please note that we only use unrestricted data, but we do not guarantee that the use of this data is completely free for the user. It is mandatory to check the applicability of the license associated with this data before using it.

In particular, according to the GDC data access policy at https://gdc.cancer.gov/about-gdc/gdc-policies, users must not attempt to identify individual human research participants from whom the data were obtained.

In line with TCGA policies (https://gdc.cancer.gov/egc/research/genome-sequencing/tcga/history/ethics-policies), special care has been taken to ensure the privacy protection of research subjects, including compliance with HIPAA regulations. Please note that we do not use the genetic data from TCGA, as its access is restricted due to its sensitivity.

References

  1. Alfayez, Asma Abdullah, Holger Kunz, and Alvina Grace Lai. "Predicting the risk of cancer in adults using supervised machine learning: a scoping review." BMJ open 11, no. 9 (2021): e047755.
  2. Liu, Jiaqi, Hengqiang Zhao, Yu Zheng, Lin Dong, Sen Zhao, Yukuan Huang, Shengkai Huang et al. "DRABC: deep learning accurately predicts optimal immune pathogenic mutation status in breast cancer patients based on phenotype data." Genome Medicine 14, no. 1 (2022): 21.
  3. Zou, Dex, Lixin Yang, Yu Jin, Huan Qi, Yahu Li, and Li Ren. "Machine learning-based models for the prediction of breast cancer recurrence risk." BMC Medical Informatics and Decision Making 23, no. 1 (2023): 276.
  4. Gharib, Badr, and Aleksander Vakanski. "Machine learning methods for cancer classification using gene expression data: a review." Bioengineering 10, no. 2 (2023): 173.
  5. Buol, Constanza, Can Yousef, Tural Iqra, Meach Mahmoud, and Eric W. Traerl. "Differentially private federated learning for cancer prediction." arXiv preprint arXiv:2107.02997 (2021).
  6. Almutraf, Marah Fahaad, Noshira Tariq, Mamoona Humayun, and Bushra Almas. "A Federated Learning Approach to Breast Cancer Prediction in a Collaborative Learning Framework." Healthcare 11, no. 24 (2023): 3185.
  7. Yiong, Goudong, Ming Xie, Tao Shen, Tianyi Zhou, Xianzhi Wang, and Yong Ding. "Multi-center federated learning: clients clustering for better personalization." World Wide Web 26, no. 1 (2023): 481-500.
  8. Rajendran, Suraj, Zhenxing Xu, Weishen Pan, Arnab Ghosh, and Fei Wang. "Data heterogeneity in federated learning: clients clustering for better personalization." PLOS Digital Health 2, no. 3 (2023): e000017.
  9. Guo, Yongxin, Xiaoyang Tao, and Bo Tian. "FedBr: improving federated learning on heterogeneous data via local learning bias reduction." In International Conference on Machine Learning, pp. 12034-12045. PMLR, 2023.
  10. Abay, Aminu, Yi Zhou, Nathalie Baracaldo, Shashank Rajanomi, Ebube Chuba, and Heiko Ludwig. "Mitigating bias in federated learning." arXiv preprint arXiv:2012.02447 (2020).
  11. Andrex, Xiaodong, André Manoel, Romuald Menzuel, Charlie Saillard, and Chloé Simpson. "Federated survival analysis with discrete-time cox models." arXiv preprint arXiv:2006.08997 (2020).
  12. Liu, Jianfang, Fan Lichtenberg, Katherine A. Hoadley, Liara M. Poisson, Alexander J. Lazar, Andrew Shenkler, Albrecht J. Karol, et al. "An integrated TCGA pan-cancer clinical data resource to drive high-quality survival outcome analyses." Cell 173, no. 2 (2018): 400-416.
  13. Ji, Zixi, Tao Lin, Xinyi Shang, and Chao Wu. "Revisiting weighted aggregation in federated learning with neural networks." In International Conference on Machine Learning, pp. 19767-19788. PMLR, 2023.
  14. Ye, Rui, Mingkai Xu, Jinyu Wang, Chenxu Xin, Sheng Chen, and Yanfeng Wang. "Feddisco: federated learning with discrepancy-aware collaboration." In International Conference on Machine Learning, pp. 39879-39902. PMLR, 2023.
  15. Chen, Ajit, Bertram Ng, Mengfei Cui, and Yong Xia. "Think Twice Before Selection: Federated Evidential Learning for Medical Image Analysis with Domain Shifts." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11439-11449. 2024.