Conducting medical research is a difficult endeavor. It requires significant time, resources and work just to get to the core activity of using your talent as an algorithm or machine learning researcher.
Clinicians provide us with interesting research questions, and we are eager to use our abilities to make an impact. It’s not just a cliché. But then we face the big problem of medical data. Web-based data, such as ads, clicks, impressions and posts, is abundant. Access to large-scale medical data, by contrast, is limited, and the data itself is complex to use, often stored in non-standard formats and on legacy platforms.
So the first problem we face is collecting data at a scale that is relevant to our research, in a format that actually allows us to conduct that research. Accessing hospital data directly is very difficult: it requires IRB and Helsinki approvals, integration with PACS, RIS and other clinical systems, and then anonymizing, ingesting, indexing and structuring the data for organized research. This process, and the barriers associated with it, prevent most researchers, even interested ones, from moving forward with meaningful research. It also limits research to those in academic medical institutions, where some of those barriers are a little easier to manage.
Those of us who still want to conduct medical research often resort to public datasets. For example, for breast cancer research, there are 2,620 studies available from the DDSM (the Digital Database for Screening Mammography), 699 studies from the Wisconsin Breast Cancer database, and an additional set of 286 studies from the Oncology Institute at the University Medical Center in Ljubljana, Yugoslavia. For lung cancer, the LIDC (Lung Image Database Consortium) contains 1,010 studies, and the large dataset of the NLST (National Lung Screening Trial) provides up to 26,254 studies, but with limited access, and in most cases without any control group and with very few annotations.
Assuming we manage to gather the data we need, imaging research, and machine learning in particular, requires a high-performance computing environment with access to that data. This gets expensive, even with free, open source tools: CPU, GPU and storage costs add up quickly when dealing with large quantities of data.
At Zebra, we’ve tried to address the above hurdles through our open Research Platform. We’ve taken over 12 million imaging studies and associated clinical information, anonymized them, indexed them, and placed them in a free, openly accessible cloud. The platform houses over 6 million conventional radiographs (CR & DR), nearly 1 million CTs, 400,000 mammograms, and hundreds of thousands of pathology-proven outcome reports. We’ve curated data cohorts for important clinical challenges, including annotations of the relevant findings. Within the platform you’ll find web applications for viewing the data, and tools for annotating it and managing research datasets. We provide free storage and computing power, enabling researchers to focus on the core: research. Over the next few weeks and months we’ll tell you more about the platform and its capabilities. In the meantime, feel free to check it out at www.zebra-med.com.
Stay tuned for our exciting news at HIMSS, February 29 to March 1, 2016, in Las Vegas. Come see us at Booth #4416.