May 9, 2018 3:35 pm
Last year we claimed that medical decision making in healthcare is being transformed by big data and that you, as a data scientist, can have a big impact on how doctors treat patients. At the Beyond Banking Days 2017, four teams of data scientists showed how we can contribute to this movement! A big shout-out to the teams of Quantillion, Deloitte, our DIA Lab and our risk model validation department. Curious what happened? Read this (Dutch) article in Het Financieele Dagblad. Curious what's next? Read this post and submit your team!
The enormous amounts of data that researchers and doctors gather these days mean two things. First, they could be an answer to the problem that treatments are always developed for an average patient, while nobody is an average patient. Or worse, treatments are developed for the average patient in the trial. Deborah Schrag, a medical oncologist at the Dana-Farber Cancer Institute, said: "The average age of patients in a colorectal cancer trial is 55, but the average age of my patients in the clinic is 71. The clinical trial results aren't really relevant to my decision making for my patients" (at the Harvard / Personalized Medicine Coalition conference, November 2017). The answer to this problem is to further personalize treatments, and the large amounts of data that are entering health care enable just that.
Secondly, it means that doctors and researchers, while highly educated and trained in statistical analysis, really need the skills of talented data scientists. If you have 200 patients and 56,000 data points per patient, a linear regression or MANOVA is not going to tell you what to look for. A random forest might. And where your average data scientist won't be impressed by a data set of 200 x 56,000 data points or by analytics like random forest, your average researcher or doctor has never been trained in random forest (or anything like it) and considers 56,000 data points a huge number of variables. So what will happen when you join forces? Medical researchers can focus on their strength, biology, and you can go all out with your data science on a huge medical data set. This is why, at the Beyond Banking Days, we invite you to work with some of the great minds in our medical system and accelerate their research.
So what will happen at the hackathon on 8, 9 & 10 June? In short, we will give you datasets about the cells of lung cancer patients and hope you can tell us what to look for at the multi-omic level. Don't worry, you'll know what the multi-omic level is by the time you have finished this blog. Basically, we hope your data analysis can tell us more about why one patient survives a lot longer than another. Intrigued? Are you good with huge amounts of data? Read this post and submit your team!
Enabling personalized healthcare
Later on, we will tell you more about the specifics of the data and the research questions. To give you an idea of how valuable genomic data can be for a cancer patient, let me first take you back to the biology classes of your final days in high school, and to some things you would have learned had you chosen to study medicine or biology. One of the things we learned from last year's hackathon is the importance of mutual understanding between data scientists and medical professionals. So brace yourself for the biology of cancer and the central dogma in molecular biology.
Biology of cancer
Cancer is uncontrolled cell division. Cells divide when their environment needs them to; this is, for example, what determines the shape of your body and enables you to grow, to heal and to respond to infections. Biologically speaking, this is the essence of life. Two concepts and a dogma are relevant to understanding the value of the data you will receive: how cancer and DNA intertwine, what metastases are, and the central dogma in molecular biology.
DNA & Cancer
Each cell in your body contains the same DNA, comprising roughly 23,000 genes. When a cell divides, your DNA gets copied into the new cell. If the DNA in a particular cell is damaged (due to sunlight, smoking or just bad luck), the cell starts using its 23,000 genes in a slightly different way. This is not a problem if your body can dispose of these damaged cells, if the damage gets fixed by a process called DNA repair, or if the damage has no impact on the behavior of the cell. However, if these processes get out of balance and cells keep on dividing uncontrollably, we call this cancer. If this happens in pulmonary cells, we call it lung cancer. There are different types of lung cancer. The two major classes are distinguished based on the morphology of the cancer cells (what the cells look like under a microscope): small-cell lung cancer and non-small-cell lung cancer. We focus on non-small-cell lung cancer, as it accounts for the majority of lung cancers.
Primary tumor and metastasis
The DNA in a cell needs a certain amount of damage before the cell turns into a tumor cell. When this happens for the first time in a patient, we call this the primary tumor. Thanks to the work of many great scientists, there is often a treatment for a primary tumor. The problem, however, is that sometimes a tumor reappears in the body, but with more damage in the DNA. This damage will affect not only the process of cell division, but also other processes such as cell migration. Migration is the ability of cells to move from one part of the body to another, a very convenient skill when, for example, an immune cell needs to get to an infection somewhere in the body. However, when a cancerous cell migrates to a different part of the body, we call this a metastasis.
Central dogma in molecular biology
The central dogma is the basis for all medical scientists. It describes the flow of genetic information in every cell: from DNA via RNA to protein. Molecular data can be obtained by measuring these different levels. These are called omics. At the top of the dogma is genomics. This omic measures the changes in the DNA. You can measure the sequence of nucleotides (the building blocks of DNA) and look for mutations (changes) by comparing it to a reference sequence (e.g. a cancerous cell vs. a normal cell). The DNA of cancer cells, as explained above, has lots of mutations that only occur in the tumor and not in the rest of the body. These mutations are called somatic mutations.
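To make the idea of comparing a tumor sequence with a reference concrete, here is a minimal sketch in Python. It assumes the two sequences are already aligned and equally long, which is a strong simplification of real mutation calling; the function name and the toy sequences are hypothetical and are not taken from the hackathon data.

```python
from typing import List, Tuple


def find_point_mutations(reference: str, sample: str) -> List[Tuple[int, str, str]]:
    """Return (position, reference_base, sample_base) for every mismatch.

    Assumption: both sequences are already aligned and have the same length.
    Real pipelines first align sequencing reads and also handle insertions
    and deletions, which this sketch ignores.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (position, ref_base, alt_base)
        for position, (ref_base, alt_base) in enumerate(zip(reference, sample))
        if ref_base != alt_base
    ]


# Hypothetical toy sequences: a normal cell as reference, a tumor cell as sample.
normal_cell = "ATGGCCTTAGC"
tumor_cell = "ATGGACTTAGT"

somatic_mutations = find_point_mutations(normal_cell, tumor_cell)
print(somatic_mutations)
```

Positions are counted from zero. Mismatches found only in the tumor and not in the patient's normal cells are the candidates for somatic mutations.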
The next thing you can measure in genomics is copy number. Simply put, this describes the amount of DNA per gene. Normally, you have two copies of a gene, one inherited from your mother and one from your father, but cancer cells are strange: for some genes they have more than two copies and for others they have fewer than two.
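As a small illustration of the copy number idea, the sketch below labels each gene as a gain, a loss or normal relative to the usual two copies. The gene names and values are made up for illustration, and the format of the real copy number data may well differ (for example, continuous ratios rather than whole numbers).

```python
NORMAL_COPIES = 2  # one copy from each parent


def classify_copy_number(copies: int) -> str:
    """Label a gene's copy number relative to the normal two copies."""
    if copies > NORMAL_COPIES:
        return "gain"
    if copies < NORMAL_COPIES:
        return "loss"
    return "normal"


# Hypothetical copy numbers for three genes in one tumor sample.
tumor_copy_numbers = {"GENE_A": 5, "GENE_B": 2, "GENE_C": 0}

copy_number_calls = {
    gene: classify_copy_number(copies)
    for gene, copies in tumor_copy_numbers.items()
}
print(copy_number_calls)
```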
The second level of the dogma is epigenomics. This omic describes the changes on the DNA. By adding certain molecules to the genome, the cell can regulate which genes are switched on and off. In general, if these molecules are present on a gene, the gene is switched off; if they are absent, the gene is switched on.
The next level of the dogma is transcriptomics. We say farewell to the DNA and have now arrived at the RNA. There are different types of RNA and you can measure them all. mRNA is the best-known type: by measuring it, you measure the activity or inactivity of the individual genes that are encoded in our DNA. These measurements are what we call an expression profile, an overview of the activity levels of all 23,000 genes that are encoded in our DNA. In other words, we can measure how DNA expresses itself at the cellular level. This distinguishes, for example, a particular brain cell from a heart muscle cell, but it also distinguishes normal cells from their cancerous counterparts. In fact, nowadays you can even measure the expression of exons, the building blocks of genes. These exons get pasted together to form an RNA molecule. By skipping one or more exons, the function of the RNA molecule and of the protein can change. Another type of RNA is miRNA. These are very small molecules that play a role in regulation processes.
The last level of the dogma is the proteomics. We have now arrived at the protein level. Proteins have various functions and determine the environment of a cell. In the end, proteins are responsible for all processes happening in the cell. The quantity of these proteins can also be measured.
In conclusion, our DNA (and RNA) is unique and personal. The positions of the DNA damage that causes uncontrolled cell division (the cancer) are also unique and personal for each patient, and we can measure that on different levels. This means that the way patients should be treated might be different and personal for each patient, which is why making sense of all this medical data is so relevant. In other words, this is why we strive for personalized medicine and why we need your help.
Still with us? No stress, before and during the hackathon we organize sessions to help you understand this further. And medical professionals feel just as challenged to understand your data science, so you will be doing the explaining during the hackathon. Let's leave the biology for a second and go back to the challenge at hand.
Why did we choose lung cancer?
Globally, each year more than one million people receive the diagnosis of lung cancer and one million people die from lung cancer. Indeed, the prognosis is very poor despite chemotherapy and the introduction of molecular targeted therapies and immunotherapy. Step by step, we are beginning to understand the development of cancer. Things go wrong not only in the tumor; we now know that the immune system is slacking as well. Much as it does with bacteria and viruses, the immune system removes suspicious cells of the body's own. However, sometimes tumor cells escape from the immune system and patients develop cancer. Immunotherapy aims to reverse this process: it reactivates the immune system to recognize the cancer cells. In recent years, immunotherapy has emerged as a promising treatment for non-small-cell lung cancer (from now on we will just say NSCLC). NSCLC comprises squamous cell carcinoma, adenocarcinoma and a few rarer types, and it is one of the leading causes of cancer deaths worldwide. It accounts for approximately 85% of all lung cancer cases. Immunotherapy has revolutionized the way patients with this type of cancer are managed.
This looks very promising for patients with NSCLC. Unfortunately, the response rate remains limited (about 20% of patients respond well to this therapy) and, as with all therapies, side effects are a big problem. Ideally, we would only give a treatment to patients who will benefit from it. Eventually we hope to find a way to treat patients individually: the right treatment for the right patient. And that is where you come in.
Time to talk about the data that you will be given at the hackathon. As explained earlier, molecular data can be obtained from different levels, the so-called omics. Most of the time, molecular medical researchers measure one of these omics to study a disease. Rarely are multiple platforms measured in one patient. However, we've got data from multiple platforms. We chose to combine two subtypes of lung cancer: squamous cell lung cancer and adenocarcinoma. In more than 1,000 patients, genomic, epigenomic, transcriptomic and proteomic data have been acquired.
Genomics. We follow the central dogma of molecular biology downwards. At the top of the dogma we have genomics (remember?). Genomic data consists of DNA mutations and copy number. The dataset with DNA mutations has roughly 400,000 measurements in total that represent somatic mutations. The copy number data has more than 23,000 measurements, as these are measured per gene.
Epigenomics. Epigenomic data consists of DNA methylation data. For more than 450,000 positions per patient, the data indicates whether that spot is methylated.
Transcriptomics. Transcriptomic data consists of miRNA, mRNA and exon expression data. More than 2,000 miRNA molecules have been measured per patient. Another form of RNA, mRNA, has also been measured. For all ~23,000 genes, we’ve got the expression values. The exon expression data is similar to regular mRNA expression data, but instead of ~23,000 measurements, exon expression data has more than 200,000 measurements per patient!
Proteomics. The final level of the central dogma is the proteomics. For a few proteins (< 300 per patient) there is data on the expression. Current theories estimate that there are roughly 100,000 different proteins in the human proteome. So 300 is not so impressive, but these 300 proteins can nevertheless contain a lot of information!
Phenotype. The last dataset contains phenotypic, clinical, prognostic patient information. For all patients, we know the gender, age, whether the cancer has metastasized, the approximate location of the tumor, the stage of the cancer, the vital status and time to event, type of treatment and even more details.
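Since these datasets describe the same patients measured on several platforms, a natural first step is to check which patients appear in every dataset. The sketch below shows one way to do that with plain Python sets. The dataset names and patient identifiers are hypothetical; the actual files and ID format at the hackathon may look different.

```python
from typing import Dict, List, Set


def patients_in_all(datasets: Dict[str, Set[str]]) -> List[str]:
    """Return the sorted patient IDs that occur in every dataset."""
    if not datasets:
        return []
    return sorted(set.intersection(*datasets.values()))


# Hypothetical patient IDs per dataset (assumption: one shared ID scheme).
patient_ids = {
    "mutations": {"P001", "P002", "P003", "P004"},
    "copy_number": {"P001", "P002", "P004"},
    "mrna": {"P001", "P002", "P003", "P004", "P005"},
    "phenotype": {"P001", "P002", "P003", "P004", "P005"},
}

shared_patients = patients_in_all(patient_ids)
print(shared_patients)
```

Restricting an integrative analysis to the shared patients is only one option. Depending on the question, keeping patients with partial data and handling the gaps explicitly may be preferable.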
Now that you hopefully have an impression of the data you will be given at the hackathon, let’s take a look at where your experience and expertise can make the difference. As explained earlier, molecular researchers usually focus on one of the omics to study a disease. You will receive data from four different omics in eight datasets. We know that these omics are connected, both through the central dogma and because the same patients were measured. So we challenge you:
- Can you show and visualize the correlations and concepts between the different datasets?
- Can you accurately predict one of the omics based on the other?
- We know lung cancer has many subtypes that we can already distinguish, and probably many more that we cannot distinguish at the moment. Can you stratify the patients into subgroups based on all the omics data?
- Can you visualize to what extent each of the data variables at DNA (mutations, CNV), RNA (expression, mutations) or protein level contributes to the phenotype-based clinical sub-stratification of the patients?
- Can you find an underlying structure in the omics using time to event as a reference?
- Can you find an underlying structure when combining the omics data, that is, starting from DNA and subsequently combining DNA with RNA, methylation and proteins, along the line of time to event?
- Can you identify a signature based on an integrative approach that can predict response to chemotherapy?
- Can you identify a signature that correlates with the prognosis of chemotherapy?
- Can you select a list of most important variables that drive the predictions?
- Can you select a list of most important variables distinctive for each patient subgroup?
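Several of the challenges above use time to event as a reference. A common starting point for that kind of data is the Kaplan-Meier estimate of the survival curve, sketched below in plain Python. This is an illustrative sketch and not a method prescribed by the organizers; the follow-up times and event flags are made up, and the coding of vital status in the real phenotype dataset is an assumption.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def kaplan_meier(times: Sequence[float], events: Sequence[int]) -> List[Tuple[float, float]]:
    """Return (time, survival probability) at each time a death was observed.

    times:  follow-up time per patient
    events: 1 if the patient died at that time, 0 if censored
            (still alive at last follow-up)
    """
    if len(times) != len(events):
        raise ValueError("times and events must have the same length")
    deaths = Counter(t for t, e in zip(times, events) if e == 1)
    leaving = Counter(times)  # every patient leaves the risk set at their own time
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = deaths.get(t, 0)
        if d:
            # Multiply by the fraction of at-risk patients who survive time t.
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        at_risk -= leaving[t]
    return curve


# Hypothetical follow-up times (in months) and event flags for five patients.
followup_months = [5, 8, 8, 12, 20]
died = [1, 1, 0, 1, 0]

survival_curve = kaplan_meier(followup_months, died)
print([(t, round(s, 3)) for t, s in survival_curve])
```

Computing such a curve separately for each candidate patient subgroup and comparing the curves is one simple way to see whether a stratification relates to survival.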
First of all, know that some very sick people will have more to hope for thanks to your work. Your model will need further validation and additional research, which may lead to a publication in a scientific journal. As a token of appreciation, Prof. Dr. Ing. Peter van der Spek (Erasmus MC, Rotterdam), Prof. Dr. Harry Groen (UMCG, Groningen) and Prof. Dr. Joachim Aerts (Erasmus MC, Rotterdam) warrant that your contribution will be acknowledged with co-authorship of this scientific paper. And of course, eternal glory falls upon you and your team as you also compete for one of the cool prizes of the hackathon! So join our community and submit your team.
See you there,
Daan Hurkmans (Erasmus University Medical Center)
Menno Tamminga (University Medical Center Groningen)
Rogier van Wijck (Erasmus University Medical Center)
Tjebbe Tauber (ABN AMRO)
This post was written by BasDV