SEGMENT 1: Dr. Matthew Freiberg discusses the role of Big Data in his work and the scientific potential of Big Data.

Dr. Gary H. Gibbons, Director, National Heart, Lung, and Blood Institute: Hello. I am Gary Gibbons. Today my guest is Matthew Freiberg, an associate professor of medicine and epidemiology at the University of Pittsburgh School of Medicine. Matt is also involved in NHLBI's HIV/Cardiovascular Disease Collaborative, which fosters collaborative research into the mechanisms of the metabolic and anthropometric abnormalities seen in HIV infection. He is highly active in research on antiretroviral therapy and its relationship to cardiovascular disease risk, and he and his team have made a series of provocative observations. As you know, we are in an age where we hope to leverage the digital revolution and electronic medical records. Many people are talking about big data and the confluence of biological data with this kind of clinical tracking. Could you say a little bit about the state of play, both the challenges and the opportunities?

Dr. Matthew Freiberg, University of Pittsburgh: Sure. In our work on HIV and cardiovascular disease, one of the advantages we have is that the VA healthcare system cares for more HIV-infected people than any other group in the United States. That gave us a unique opportunity to examine HIV and cardiovascular risk with thousands and thousands of HIV-infected people in care. One of the issues we struggle with, and everyone who deals with big health system data does, is how you get that data out of the system and move away from what we would call administrative data, ICD-9 codes, to the actual data that clinicians order and see, like an LDL cholesterol and an HDL cholesterol, so that you can use that information to understand the associations you are looking at, like HIV and cardiovascular risk. It's not as simple, if you will, as walking through a grocery store and picking it out. You have to build that infrastructure. But once you do, what you end up with is terabytes of data that physicians have already collected, sitting in the records, that you can use for research.

Let me give you an example. In our cohort, instead of having an administrative code for high cholesterol, renal disease, or obesity, we actually have patients' LDL, HDL, and triglyceride levels. We have their estimated glomerular filtration rates. We have their blood pressures and their body mass index measures. We have their actual clinical data. And we have it not just on 5,000 people; we have it on 82,000 people. It's that ability which allows you to ask questions you might not have been able to ask in the past.

For example, in HIV, one of the big underlying concerns has been: is it really HIV, or is it the stuff that travels with HIV, that's driving the risk? For instance, 60-70 percent of the HIV-infected people that we have in the VA and in other cohorts may be current smokers. So is the virus itself really doing something that's driving this risk? Or is the prevalence of smoking so high that it's the smoking that is driving this excess risk? And why does that matter? It matters because you would intervene, or change what you do as a clinician, very differently depending on the answer. To really answer that question when the prevalence is so high, what you would like to do is have a group of people who were never smokers, so smoking is not even in the equation.
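In analytic terms, that kind of restriction is simply a filter on the cohort followed by the usual comparison of event rates. A minimal sketch, assuming a pandas DataFrame with hypothetical column names (none of these come from the actual VA data), might look like this:

```python
# Illustrative sketch only -- not the VA cohort's actual code or schema.
# Assumes one row per person with hypothetical columns:
#   hiv (0/1), smoking_status ('never'/'former'/'current'),
#   mi_event (0/1), followup_years (float)
import pandas as pd

def mi_rates_in_never_smokers(cohort: pd.DataFrame) -> pd.Series:
    """Restrict to never smokers, then compare crude MI incidence
    (events per 1,000 person-years) between HIV-infected and uninfected people."""
    never = cohort[cohort["smoking_status"] == "never"]
    grouped = never.groupby("hiv")
    rates = 1000 * grouped["mi_event"].sum() / grouped["followup_years"].sum()
    return rates.rename("mi_per_1000_person_years")

# Toy data purely to show the shape of the call.
toy = pd.DataFrame({
    "hiv":            [1, 0, 1, 0],
    "smoking_status": ["never", "never", "current", "never"],
    "mi_event":       [1, 0, 0, 0],
    "followup_years": [4.0, 5.0, 3.0, 6.0],
})
print(mi_rates_in_never_smokers(toy))
```

Restriction removes the confounder from the comparison entirely rather than adjusting for it, which is why starting from 82,000 people matters: the restricted group has to stay large enough to accrue hard events.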
Well, if you want to tie those kinds of people to hard events, meaning a heart attack (not just whether they have subclinical atherosclerosis or coronary calcium, but whether they actually had an event), you need thousands of people. In our study, because we start with 82,000 people, we could restrict it down to never smokers, which is still 15,000, and we were able to show clearly that even when smoking wasn't present, these HIV-infected people still carried excess risk. These big health care systems allow you to ask questions you couldn't answer with a population of 5,000 people, because if you restricted that to never smokers you might be left with 1,000 people, and the analysis would be woefully underpowered.

Now, our group has recognized that some of the material in these data sets is easier to get out than other material. What do I mean by that? You may have a laboratory file that lets you pull out, say, LDL cholesterol and HDL cholesterol. But let's say you want to do something a little different. I will give you an example. Say you wanted to look at heart failure, and more specifically at types of heart failure, preserved ejection fraction versus reduced ejection fraction. The doctor may document that the ejection fraction is 35% somewhere in a note, or it may come through in a telephone call or a fax from an outside hospital, but it's not something that sits in a field that is easily retrievable from the EMR [electronic medical record]. So it's a case of "water, water everywhere, not a drop to drink." We know it's in there. But imagine trying to get that data out.

So we asked for supplemental money to build a tool that could sift through any kind of record within the EMR looking for components of cardiac structure and function. We now have a tool, validated against manual chart review, that can pull this data out for us, and it's accurate 94-95 percent of the time; for ejection fraction in particular, that rises to 98 or 99 percent. We can now take, say, a code for heart failure and link it to the patient's actual cardiac structure and function data from the record. We believe you can now identify heart failure much faster and much more efficiently than you ever could before.

Let me translate this into real numbers. Assume, for argument's sake, that it takes a person one hour to look through an entire chart for measures of cardiac structure and function. Some of these charts are pretty thick, as you can imagine, so one hour may be a real underestimate. We have 120,000 people in our whole data set, people with and without prevalent cardiovascular disease. If each chart got an hour, that is 120,000 hours, which works out to roughly 60 FTEs, if my calculations are right, of people working full time for a year to retrieve that data. Now say somebody costs 30,000 dollars a year to do this: 60 FTEs at 30,000 dollars is about 1.8 million dollars just to retrieve and organize that data. This tool, which was sponsored by the NHLBI, can do that work in three weeks, and it cost us about $400,000 to build. But now it's built; it can be used and reused. And importantly, if done right, and this is something I think is critical to understand, if we start working as teams, it's not just the VA. What about EpiCare Partners, or EpiCare at Pitt, or the EpiCare system at Kaiser?
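The core of such a tool is pattern matching over free text. The validated, NHLBI-sponsored tool described above is far more sophisticated than this, but a purely illustrative sketch of pulling an ejection fraction out of note text, with hypothetical phrasings and a made-up function name, might look like the following:

```python
# Minimal sketch of the core idea only -- not the actual validated NHLBI tool.
import re
from typing import Optional

# Matches phrasings such as "EF 35%", "LVEF: 40-45%", "ejection fraction of 55 %".
EF_PATTERN = re.compile(
    r"\b(?:LV\s*)?(?:EF|ejection\s+fraction)\b[^0-9]{0,15}"
    r"(\d{1,2})(?:\s*-\s*(\d{1,2}))?\s*%",
    re.IGNORECASE,
)

def extract_ef(note_text: str) -> Optional[float]:
    """Return the first ejection fraction found in a note, as a percent.
    If the note reports a range (e.g. 40-45%), return the midpoint."""
    match = EF_PATTERN.search(note_text)
    if not match:
        return None
    low = float(match.group(1))
    high = float(match.group(2)) if match.group(2) else low
    return (low + high) / 2

print(extract_ef("Echo today: LVEF 35%, moderate MR."))            # 35.0
print(extract_ef("Ejection fraction of 40-45% on outside echo."))  # 42.5
print(extract_ef("No echocardiogram available."))                  # None
```

In the real tool, the 94-99 percent accuracy figures quoted above come from validation against manual chart review; nothing in this sketch has been validated, and it is only meant to make the idea of mining free-text notes concrete.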
If these tools can be put on multiple platforms, not just the VA platform but, say, the EpiCare platform or Cerner's platform, then you're talking about a tool that can be reused by other people. Now you're not just talking about Matt Freiberg's cohort or Amy Justice's cohort of 120,000 people in the VA. We may be talking about being able to look at cardiac structure and function in millions of people across many healthcare systems. It's not just veterans; it's non-veterans, people from the Northwest, African Americans in the South where the prevalence may be higher, more women. That's what we are really talking about. That's what we want to do.

Where I think we're moving as a group, and our HIV and cardiovascular risk work is a model for what we've done and built here within the VA, is this: if these tools can be put into a proverbial toolbox for others to use on top of other platforms, then we have the ability to harness all the data that is already being collected in these electronic medical records as part of care. You don't have to worry so much about whether every single piece is easily extractable through some kind of structured field. Not that you don't want that, but this approach lets you harness the power of what's already in there. And while this is a tool for cardiac structure and function, you can imagine building a tool for pulmonary function tests for the lung portion of the NHLBI, or a tool that helps you look for cancer for NCI. This isn't just about this project. Hopefully it spurs innovation, so people see how we can maximize the data that we're collecting every day.

And as we push, as a community, for these electronic medical records to improve health care, setting aside the research component for a moment, consider this: if most people in this country are associated with a health care system that uses one of these records, then, if we build these tools right, we may be able to harness that data so that you're really looking at something like a virtual U.S., where everybody can potentially contribute, obviously in a de-identified way. Then, when we're using these data to design clinical trials or other cohorts, everybody may contribute to whom we select and how we select, not just the smaller subset of people who choose to enroll in these studies. Everybody may participate. That's exciting, at least from my perspective.

To be really clear, the tool development was the work of multiple people. Scott DuVall at Utah, Cindy Brandt at Yale, and I worked together, and Scott actually started the tool as just an ejection fraction extractor before I suggested we could do a lot more with it. Think of that version as the Model T Ford; we really want the Ferrari version. Now that the tool is out there, we're working on a paper describing what it can do. What we really need to do next is ask ourselves: can we take this tool and apply it to another healthcare system? For example, one thing I would like to do here at Pittsburgh is see whether we could take the VA tool and apply it to EpiCare at the University of Pittsburgh. And if we can't, what do we need to modify, change, or alter so that we can?
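One way to think about "putting the tool on multiple platforms" in software terms is to keep the extraction logic platform-agnostic and wrap each EMR behind a small adapter that knows how to supply note text. This is a hypothetical sketch of that separation, not a description of how the actual tool is structured; the class and function names are invented.

```python
# Hypothetical structure only -- the interview does not describe the tool's
# real architecture. The idea: the extractor stays the same; only the
# platform-specific adapter that yields note text changes.
from typing import Iterable, Iterator, Protocol

class NoteSource(Protocol):
    """Anything that can stream de-identified note text for a cohort."""
    def iter_notes(self) -> Iterator[str]: ...

class VaNoteSource:
    def __init__(self, notes: Iterable[str]):
        self._notes = list(notes)      # stand-in for VA-specific plumbing
    def iter_notes(self) -> Iterator[str]:
        return iter(self._notes)

class EpicNoteSource:
    def __init__(self, notes: Iterable[str]):
        self._notes = list(notes)      # stand-in for EpiCare-specific plumbing
    def iter_notes(self) -> Iterator[str]:
        return iter(self._notes)

def run_extraction(source: NoteSource, extract) -> list:
    """Apply the same extractor (e.g. extract_ef from the earlier sketch)
    to notes from any platform adapter."""
    return [value for note in source.iter_notes()
            if (value := extract(note)) is not None]
```

Under this kind of design, the "French and Spanish" translation work lives entirely in the adapters, while the extractor, the part that was validated, stays unchanged.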
Because if we show this tool can work on an EpiCare platform, say, then we can ask: who are all the groups across the country that have EpiCare? Would they be interested in this tool? Would they like to use it for research? The immediate next step, now that we believe this works in the VA, is to ask whether it can be used on another platform, work with the bioinformatics people who know that platform well, and ask: does the language carry over, so to speak, or are we speaking French and Spanish and need to translate a little? If we need to translate, let's figure out how and do it. My hope is that places like the NHLBI and other institutes, if they see the value of this, will put out initiatives that say, we really want to do this; we want to dedicate funding so that these health platforms can really talk to one another. Then, whether you've got a Subaru, a Ford, or a Chevy, because you have Cerner, EpiCare, or the VA, the bottom line is that these tools do the work and get you where you want to go.

If we can do that, we can really start thinking about pooling some of these data together, because we know we're extracting the data in a very similar, if not identical, way. We're not comparing apples and oranges on the data side; we're hopefully comparing oranges to oranges, and the only thing that's really different is the tool. Why do I say that? A lot of things, such as echocardiogram reports, are relatively standardized. They're certainly not identical from provider to provider or institution to institution, but they're similar enough, I'd bet, that when you're pulling out ejection fraction, an ejection fraction at Kaiser is an ejection fraction at Pitt is an ejection fraction at the VA.

If we think big, really big-data big, what ends up happening is that you've got multiple health care systems with all of these data pulled in a similar fashion and everyone using this proverbial toolbox. Then you can imagine pooling our data to look at really big questions that you could never see even with 120,000 people. There may be patterns or trends in disease, or rare diseases for which you need enormous numbers to find enough cases; if you linked multiple health care systems across the country, you might be able to study diseases that were too rare to study before. We all know that when you're trying to understand disease, sometimes the rarest diseases are the ones that give you the most information, because they're so unusual. We see that at the basic science level as well as the clinical level. With data sets of millions of people, you can imagine looking at things in a whole new way.

And how do we know that's important? Look at what we are learning in genetics. It used to be that the Framingham Heart Study had a data set, CHS [Cardiovascular Health Study] had a data set, and WHI [Women's Health Initiative] had a data set; now we have created consortia of genetic data. Why? Because pooling it all together gives more power to look at observations you couldn't see in one cohort. The way geneticists have thought about their science is, in some ways, exactly how we should be thinking about clinical research from an epidemiologic or health services perspective. It's analogous.
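At the data level, "comparing oranges to oranges" across systems amounts to each site exporting its extracted measurements in an agreed schema and units, so that pooling is just concatenation plus a check that the fields line up. A hedged sketch, with hypothetical field and site names:

```python
# Illustrative only -- hypothetical schema, column names, and sites.
from typing import Dict
import pandas as pd

COMMON_SCHEMA = ["site", "patient_id", "ejection_fraction_pct", "measure_date"]

def pool_sites(site_frames: Dict[str, pd.DataFrame]) -> pd.DataFrame:
    """Concatenate per-site extractions after checking they share the schema."""
    pooled = []
    for site, df in site_frames.items():
        missing = set(COMMON_SCHEMA) - set(df.columns)
        if missing:
            raise ValueError(f"{site} is missing fields: {sorted(missing)}")
        pooled.append(df[COMMON_SCHEMA])
    return pd.concat(pooled, ignore_index=True)

# With enough pooled people, even uncommon phenotypes accumulate usable counts:
# pooled = pool_sites({"VA": va_df, "Pitt": pitt_df, "Kaiser": kaiser_df})
# n_low_ef = (pooled["ejection_fraction_pct"] < 30).sum()
```

The harmonization happens upstream, in how each site runs the extraction; once the exported tables agree on fields and units, the pooled analysis itself is straightforward.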
It's just that they have these cool tools to go in and grab all this DNA, even deep sequencing, and then pool it together. We're arguing for an analogous process, if you will, but on the clinical phenotypic data. And if you take it one step further, the geneticists really want that phenotypic data, as detailed as they can get it, because it's the linking of the phenotypic and the genotypic data that really starts to give you answers to certain questions. It is funny; as epidemiologists, we're all really looking at the elephant in slightly different ways, but it's still the elephant. My hope is that these tools provide a way for people to look at the whole elephant. Theoretically, on the phenotypic side anyway, anything can be looked at if we build the tool right. There's no reason you can't build a tool for any disease you want; it's just an issue of money, resources, time, and working together.
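The phenotype-genotype linkage he describes is, at the data level, a join on a shared de-identified subject key. A minimal sketch with hypothetical identifiers and columns:

```python
# Illustrative only -- hypothetical, de-identified identifiers and columns.
import pandas as pd

phenotypes = pd.DataFrame({
    "subject_key": ["A1", "A2", "A3"],
    "ejection_fraction_pct": [60.0, 35.0, 55.0],
    "hiv": [0, 1, 1],
})
genotypes = pd.DataFrame({
    "subject_key": ["A1", "A2", "A4"],
    "risk_allele_count": [0, 2, 1],   # e.g. allele dosage at a variant of interest
})

# An inner join keeps only subjects with both phenotypic and genotypic data,
# which is the set where genotype-phenotype questions can actually be asked.
linked = phenotypes.merge(genotypes, on="subject_key", how="inner")
print(linked)
```

The richer and better-harmonized the phenotypic side is, the more such a join can support, which is the point of building the extraction tools in the first place.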