Healthcare Datasets – Good Data is Gold

One of the greatest challenges in medical research is getting good data. With good data, you can train high-performing AI/ML models.

How do you do this?

  1. Have access to patient data
  2. Select the right features
  3. Perform the correct statistical analysis/modeling

For the rest of this post, I’m going to go through each of these steps in a bit more detail.

Access to Data

One of my mentors who does a lot of AI/ML medical research once told me that the greatest point of leverage he has as a doctor working in the AI/ML medical field is his access to data. This is a challenge if you are not a trained physician. It is no secret that high-quality data is what makes or breaks a model. But even as a physician with access to patient data, you have to put in the effort to retrieve and use that data in a manner that is compliant with the IRB, HIPAA, and other privacy protection laws.
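
To make that step concrete, here is a minimal sketch of stripping direct identifiers from an extract, assuming a pandas workflow. The column names are hypothetical, and real de-identification follows your IRB protocol and a HIPAA standard (e.g., Safe Harbor), not a ten-line script.

```python
# A minimal sketch of stripping direct identifiers before analysis.
# All column names are hypothetical stand-ins for a real EHR extract.
import pandas as pd

# Stand-in for a raw extract pulled from the EHR.
raw = pd.DataFrame({
    "name": ["Jane Doe", "John Roe"],
    "mrn": ["123-45", "678-90"],
    "age": [62, 47],
    "has_diabetes": [1, 0],
})

PHI_COLUMNS = ["name", "mrn", "ssn", "address", "phone", "date_of_birth"]

# Drop whichever direct identifiers are present, then add a study key.
deidentified = raw.drop(columns=[c for c in PHI_COLUMNS if c in raw.columns])
deidentified.insert(0, "study_id", range(len(deidentified)))
print(deidentified)
```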

Selecting the Right Features

Once you've gone through the trouble of accessing patient data, you need to make sure that the right features are queried into a dataset. For instance, if I were trying to build an ML model that predicts whether or not a patient will get sepsis (i.e., when an infection spreads throughout the body), I would have to pick which features I thought would be helpful for the model. Diabetes and age are features that are likely to help in predicting sepsis, but type of insurance may not be that helpful. It takes someone with an intimate understanding of the problem, in this case an infectious disease doctor, to inform the data collection process.
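
As a sketch of what that looks like in practice, here is a hypothetical pandas query that pulls clinically motivated features into a modeling table. The column names are made up; the real work is the clinical judgment behind the FEATURES list.

```python
# A hedged sketch of feature selection for a sepsis model.
# Column names are hypothetical; a domain expert picks the columns.
import pandas as pd

records = pd.DataFrame({
    "age": [62, 47, 71],
    "has_diabetes": [1, 0, 1],
    "wbc_count": [14.2, 7.8, None],           # 10^3 cells/uL
    "insurance_type": ["PPO", "HMO", "PPO"],  # probably not predictive
    "developed_sepsis": [1, 0, 1],
})

FEATURES = ["age", "has_diabetes", "wbc_count"]  # chosen with an ID doctor
LABEL = "developed_sepsis"

dataset = records[FEATURES + [LABEL]].dropna()   # drop incomplete rows
print(dataset)
```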

Even after the right kind of data is collected – let's say we are dealing with imaging data now, just to mix things up – it needs to be structured in the right way. In the case of CT images that go into training a deep-learning imaging model, you need to format each individual 2D slice so that the dimensions are consistent and the boundaries drawn in are well demarcated. Oftentimes, different medical research teams develop a pipeline that does this data preprocessing in a standardized format.
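
Here is a minimal sketch of what one step of such a pipeline might look like, assuming the CT slices are already loaded as 2D NumPy arrays (e.g., via pydicom). The target size and interpolation choice are arbitrary stand-ins, not a published protocol.

```python
# A sketch of standardizing CT slices to a fixed size and intensity range.
import numpy as np
from scipy.ndimage import zoom

TARGET_SHAPE = (256, 256)  # arbitrary choice for this sketch

def standardize_slice(slice_2d: np.ndarray) -> np.ndarray:
    """Resample a 2D slice to TARGET_SHAPE and scale intensities to [0, 1]."""
    factors = (TARGET_SHAPE[0] / slice_2d.shape[0],
               TARGET_SHAPE[1] / slice_2d.shape[1])
    resized = zoom(slice_2d, factors, order=1)    # bilinear interpolation
    lo, hi = resized.min(), resized.max()
    return (resized - lo) / (hi - lo + 1e-8)      # avoid divide-by-zero

dummy = np.random.rand(512, 512)                  # stand-in for a DICOM slice
print(standardize_slice(dummy).shape)             # -> (256, 256)
```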

Perform the Correct Statistical Analysis

The next step is actually the easiest part. Training the model itself is usually only a couple of lines of code these days because you rarely have to build an AI model from scratch anymore, thanks to awesome ML Python packages that are freely available. All you need to do is tweak the parameters for your model and you are set. However, I do think it is helpful to build something like a logistic regression model, with cost functions and all that under-the-hood math, from scratch at least once if you are serious about learning ML/AI. Andrew Ng has a great course called Machine Learning on Coursera where he teaches this.
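
To illustrate both points with synthetic data: first the "couple of lines" version using scikit-learn, then a from-scratch logistic regression with the sigmoid and cross-entropy cost written out, in the spirit of what that course teaches.

```python
# Synthetic data only; this is a sketch, not a clinical model.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                     # 500 patients, 5 features
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=500) > 0).astype(int)

# The easy way: an off-the-shelf model in two lines.
clf = LogisticRegression().fit(X, y)
print("sklearn accuracy:", clf.score(X, y))

# The from-scratch way: batch gradient descent on the cross-entropy cost.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b, lr = np.zeros(X.shape[1]), 0.0, 0.1
for _ in range(1000):
    p = sigmoid(X @ w + b)                        # predicted probabilities
    grad_w = X.T @ (p - y) / len(y)               # d(cost)/d(w)
    grad_b = np.mean(p - y)                       # d(cost)/d(b)
    w, b = w - lr * grad_w, b - lr * grad_b

print("from-scratch accuracy:", ((sigmoid(X @ w + b) > 0.5) == y).mean())
```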

Something I’ve noticed is that many medical research papers that use AI/ML models don’t have a standardized way of measuring performance – how to split the datasets and which types of models to compare for different use cases – but I am hoping this will change as more studies are produced in the area. Where we ultimately need to get to is the point where every model is tested for generalizability against a completely different dataset that captures a completely different patient population. This would be the gold standard for an excellent model. However, this is an ambitious goal because there is still a lot of resistance to data sharing across hospitals. Some efforts are being made to open-source databases, and I think that is great for the medical research community.
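
Here is a hedged sketch of the evaluation pattern I would like to see standardized: an internal held-out test split plus an external cohort from a different population. Both cohorts below are synthetic stand-ins.

```python
# Internal vs. external validation on synthetic cohorts.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(800, 5))
y = (X[:, 0] + rng.normal(size=800) > 0).astype(int)

# Internal validation: a held-out split from the same population.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
model = LogisticRegression().fit(X_train, y_train)
internal_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

# External validation: a shifted cohort standing in for a completely
# different patient population at another hospital.
X_ext = rng.normal(loc=0.5, size=(300, 5))
y_ext = (X_ext[:, 0] + rng.normal(size=300) > 0).astype(int)
external_auc = roc_auc_score(y_ext, model.predict_proba(X_ext)[:, 1])

print(f"internal AUROC={internal_auc:.2f}, external AUROC={external_auc:.2f}")
```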

It is a bit strange to me that it costs so much money to access healthcare data for the sole purpose of research. One database that I tried to get access to in the past, with a mentor and solely for research purposes, is the Healthcare Cost and Utilization Project (HCUP) data. Keep in mind, this database is produced by a branch of the US government, and they were charging me, a broke medical student, $500-1,000 for access to a single dataset (they have a ton of datasets, by the way). It’s not clear to me why there is such a lack of cooperation when it comes to data sharing in the medical world. Yes, data privacy is a real concern, but these datasets are already de-identified.

I will say that there are some large medical databases that are open-sourced, like the FDA’s FAERS, which reports adverse drug reactions, or MIMIC-IV, which reports patient outcomes from Beth Israel Deaconess Medical Center. However, the trade-off with these databases is that the data is quite messy in its current state or just lacks quality control. I’ve worked with both databases, and that was my impression.
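
For what it’s worth, here is the kind of quick data-quality audit I end up running first on these databases; the tiny DataFrame below is a synthetic stand-in for a real FAERS or MIMIC-IV export.

```python
# A quick first-pass data-quality audit on a synthetic stand-in table.
import pandas as pd

df = pd.DataFrame({
    "drug_name": ["aspirin", "Aspirin ", "ASA", None],
    "reaction": ["nausea", "nausea", None, "rash"],
})

print(df.isna().mean())                        # missingness per column
print("duplicate rows:", df.duplicated().sum())
# Free-text fields often hide inconsistent spellings of the same drug.
print(df["drug_name"].dropna().str.strip().str.lower().nunique(),
      "unique drug names after normalization")
```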

I don’t think it will be long before ML models are used in hospitals to make real-time predictions that guide treatment plans, but that will definitely demand better data management than what we are seeing today.

 


