Learn both the powers and pitfalls of Large Language Models and hear about groundbreaking research where AI is already improving clinical outcomes. This isn’t science fiction; it’s technology you could be using tomorrow.
My name's Bev Brzezinski. I'm a neonatologist, and I'm the chief medical officer and the vice president for quality and safety at St. Louis Children's Hospital. I was telling Doctor Dunnigan that it is always a very humbling experience, and makes me very proud, to introduce one of my young partners to talk about something that I never learned about when I was training. So I'm very happy and honored to introduce Doctor Zach Basilis today, who is a neonatologist and co-director of the NeuroNICU program at St. Louis Children's Hospital and an assistant professor of pediatrics at Washington University. His research focuses on computational modeling of the neonatal brain at risk for injury, strategies for novel neuroprotection, and of course AI applications to neurodevelopmental outcomes. His primary interest lies in determining the causes of neonatal brain injury and developing tools to improve long-term outcomes. Today he's going to talk to us about speaking AI's language: practical applications for better patient outcomes. Zach.

Thank you. If it makes you feel any better, Doctor Brzezinski, this didn't exist when I was training either, so this is all very, very new technology. I appreciate the opportunity to speak today. Hopefully you find the next hour or so interesting, maybe a little funny; we'll see if my jokes work or not. I have a few disclosures, mostly just different sources of support for the research that we do.

There are three objectives that I'm planning to cover today. First, I'm going to talk about large language models. That's a special type of AI, different from what we heard a little bit earlier today: where it came from, some of its advantages, some of its disadvantages. I'm then going to switch to some examples of how AI is being used in clinical practice and research today in the neonatal ICU, because that's where I work, and finally discuss a potential framework. These technologies are very exciting, and they're also a little scary. How can we incorporate what we know about AI today into our practice tomorrow?

You heard about this a little earlier: the Gartner hype cycle. This is the idea that with any new technology, and AI is certainly one of them, but it applies to things like iPhones or new clothing items, people get really excited as the technology comes out, and there are all sorts of expectations before people have a chance to see it. Once reality sets in, that excitement diminishes. So as I talk about generative AI today, I'm hopefully going to take you through this hype cycle. We're starting at the beginning; this is just where the technology starts.

AI you can really think of as an onion, with the broad concept of AI as an umbrella describing all sorts of different predictive computer technologies, and as you get deeper and deeper into that onion, the models get more sophisticated and they have different capabilities. We're talking all the way at the core of that onion right now: generative AI. Rather than making predictions, this is creating new things, new content, new text, new videos, new pictures, from prompts that we give the computer. And that really raises a lot of questions about intelligence.
So AI has the word intelligence in it, but what does that mean for a machine compared to a human being? This is something computer scientists have been grappling with for a long time, since the 1950s. Alan Turing, the famous computer scientist who helped crack the Nazi codes during World War II and later wrote a lot of really important ethical and philosophical discussions on computer science, came up with the idea of a test, now called the Turing test, to identify whether AI is really intelligent or not. The test he came up with was: can an outside observer tell, from a transcript of a conversation between a person and another person, or a person and a computer, which one is the person and which one is the computer? If you can't tell the difference between the two of them, for all intents and purposes, that's intelligence.

This was updated a little bit later by John Searle, a philosopher, in the 1980s, who came up with an idea that he called the Chinese room. In this example, and this will make more sense in a minute when I talk about large language models, he imagines a room, and inside this room is a man sitting with a big pile of books. On one side of the room there's a slot where people can put papers in, and on the other side of the room there's another slot where he can put out his output, another piece of paper. On one side, a person is putting in a note written in Mandarin that's asking a question or saying something. The man inside the room does not know Mandarin, doesn't speak a single word of it, but he has this whole set of encyclopedias that say: if you see this symbol and that symbol and this other symbol, then write this, that, and this other word. So he does that. He follows the directions and puts the paper out the slot on the other side. If you don't know who's in that room, you think it's a native Mandarin speaker; you can't tell the difference. It's a black box to the outside observer, but inside the room, we know this guy doesn't know a single word and is just following instructions. And in a lot of ways, that is kind of what some of these AI models are doing.

So large language models are very new. 2022 is really when they became widely available to the public. There were research versions available sooner, but the ones accessible to the consumer, without a lot of university or expensive computing resources, really didn't come until that time. There are many of them available now. ChatGPT is probably the best known, but there are others, and they are produced by many of the companies you're very familiar with: Microsoft, Facebook, Twitter, and so on. One of the earliest uses that got a lot of public press, and I have some of the news articles clipped at the bottom here, was people who had chronic medical problems that nobody had ever been able to diagnose. They went into the chat box and said, you know, I get a rash every month and I have a fever every 14 days and I get bumps on my toes; what do I have? And this is something nobody had ever been able to figure out. The computer spit out exactly the diagnosis they had, right away. It was shocking to many people that this was able to figure out what doctors and medical teams hadn't been able to figure out for years.
So how are they actually working? If we open up the hood and look underneath, what's inside that room I was talking about earlier? Really, what they're doing is using probability. The model is trying to predict what the next word, the next sentence, the next paragraph is, based on the input you already gave it. A very, very simple example: take the prompt, complete this sentence, "once upon a blank." It's going to go through the entire dictionary and calculate the probability of every single word. "Once upon a frog," "once upon a pizza," things that don't really make sense, it's going to give pretty low probabilities. But the word "time," "once upon a time," it's going to say that's the highest probability, that's the word to put in that sentence. From the outside, again, it looks like it knows that's the classic phrase to start a storybook, but it doesn't. It's just calculating probabilities.
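To make that concrete for anyone who wants to see it in code, here's a toy sketch in Python. The little vocabulary and the probabilities are invented for illustration; a real model scores tens of thousands of subword tokens with a neural network rather than looking words up in a five-entry table.

```python
# Toy illustration of next-word prediction. A real large language model scores
# every token in a vocabulary of tens of thousands of pieces with a neural
# network; here the candidate words and their probabilities are simply made up.
next_word_probs = {
    "time": 0.92,    # "once upon a time": by far the most likely continuation
    "dream": 0.03,
    "hill": 0.02,
    "frog": 0.001,   # grammatically possible, but very unlikely
    "pizza": 0.0005,
}

def complete(prompt, probs):
    """Greedy decoding: append the single highest-probability next word."""
    best_word = max(probs, key=probs.get)
    return f"{prompt} {best_word}"

print(complete("once upon a", next_word_probs))  # -> "once upon a time"

# Generating a whole paragraph is just this step repeated: predict a word,
# append it, and predict again. Nothing in the loop "knows" what a storybook
# is; it is probability all the way down.
```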
But the results of that can be really impressive. This is an example that I generated in ChatGPT, and all I gave it was the prompt at the beginning: I asked it to give me a short segment of a podcast talking about the impact of AI on healthcare. In addition to generating text, it also generates audio, so I'll play that in a minute. These systems do learn; ChatGPT has figured out that I'm a neonatologist, so it's going to have a neonatology bent in this example, but overall, this is all created by the computer. It's not something I generated on my own.

"Like, is it actually changing things, or is this just another tech buzzword?" "Oh no, it's real. I mean, if we don't screw it up. It's not like it's replacing doctors or anything, but it's making them way better. Radiology, pathology, even stuff like catching patterns in patient data that we totally miss. Like in the NICU where I work, AI can pick up tiny shifts in a baby's vitals hours before they actually look sick. That's, I mean, that's huge." "OK, but I don't know, there's got to be a catch, right?" "Oh yeah, tons. Bias is a big one. If AI is trained on bad or, like, incomplete data, it just makes bad decisions, especially for patients who aren't well represented in the data. And trust is another thing. No doctor is going to just blindly follow what an algorithm spits out. If we don't know why it's making a call, we're not using it. Honestly, AI is already here and it's powerful. But if we don't get it right, if it just adds noise instead of actually helping, then what's the point? Get it right, though? Total game changer."

So in that hype cycle, that's hopefully the peak of inflated expectations. I think this is pretty impressive: I can put in that sentence, I didn't even capitalize the words or use punctuation, and it gave me this entire segment of a very realistic podcast, and even the voices are generated by the AI.

It's also very impressive in other domains of medicine. In many places it matches or even exceeds the performance of doctors. In ophthalmology, if you train it on images of diabetic retinopathy, it can identify it better than ophthalmologists can. If you show it pictures of skin cancer, it can identify that better than dermatologists can. If you show it chest X-rays, it can identify pneumonia faster and better than a radiologist can. It's very, very good at these classification tasks. It also turns out to be very good at taking board exams.

People have taken these large language models, and these are just general models, not ones designed specifically for medicine, and given them various board exams. You can see in this table that I made the different types of exams, initial certification and so on, and the performance has improved impressively. It used to get less than half of the questions right; it probably would have failed out of medical school. But most recently, it gets 90% of the questions right or higher, which is, you know, pretty darn good.

So you know this is coming: we're going to hit the trough of disillusionment. There's got to be a catch to this. There has to be a problem. And there definitely are some real downsides. To us, the way these things work often looks like magic, like there is some thinking, breathing person inside the box coming up with these very impressive creations out of whole cloth, and a lot of that is really because of the training. ChatGPT, the current version, 4o, was trained on 1.7 trillion pieces of information. No human being, of course, could incorporate that much information into our brain over an entire lifespan. But at the end of the day, it's not really understanding the content. It is still working on probability. It is just trying to predict what the next word, the next sentence, the next paragraph is, and it's giving us very, very good guesses. And it turns out that if you're really good at guessing, it almost looks like you know what you're talking about.

But it's trained from the internet, which is great because there's a lot of stuff on the internet. It learns from news articles, from blogs, from various other pieces of literature that are published, from music and pictures. But it also means it has all the problems of the internet. Not everything on the internet is accurate, which shouldn't be a surprise to anybody. The other place with a big issue is technical domains. A lot of technical content is locked away. That's very true in medicine: a lot of journal articles, for example, are locked behind paywalls; you have to subscribe to a journal to access that content. So it doesn't learn from that information. It learns from free things instead.

There are a number of ways people are already using large language models, and I've color coded them here, at least in my framework of whether I think they're OK, questionable, or really not a good idea. Someplace I think would be a fantastic application, and it's already starting to happen, is writing patient education material. Those documents that we print out right now in the after-visit summary, which are very generic for a condition, could be designed to match exactly what's going on with the patient: customized to their age, their sex, their other pre-existing conditions, so that the information being provided is very specific, not so generic. Any of us who have ever received a transfer patient from another hospital and gotten that stack of papers several inches thick knows it takes a really long time to go through; a tool like this would be fantastic for reading through all of that and providing a summary. Most of the information on those thousand sheets of paper is not helpful, but some of it is, and this can find that helpful information very quickly. It's already being used in a lot of EMR systems to do billing codes. It's reading the charts.
It's looking at labs, and it's coming up with the appropriate billing code. In many cases it's replacing billers and coders. It's also very helpful for writing tasks. Something I have found it incredibly helpful for is when you have a word limit: if you need to write an abstract that's 150 words, for example, and you wrote something that's 157 words, and you just can't figure out how to get those last 7 words out of there. It's so creative, and it can figure out exactly how to do that in ways you wouldn't have even thought of. And it's still your own writing; it's still something you came up with. This is basically just super spell check or super grammar check.

Places where it starts to get questionable are things like generating daily progress notes, or any of the notes we write in an electronic medical record. It would be very, very good at doing that. It could generate a fantastic-looking note every single day, but those notes are supposed to reflect our medical decision making and the things we actually did for the patient. If the computer's writing it, it's not really reflecting that anymore; it's what the computer is doing. It also is very good at generating ideas, and if you're writing papers or creating new things, I think it's probably an OK idea to use these tools to generate ideas, as a brainstorming session. But when it starts actually writing the material for you, that's different, and that's much more of a gray zone. There are definitely some repetitive writing tasks that we all do, writing letters of recommendation or reviews or evaluations. It would be very helpful to have a computer do these things for us, but then it's not really us evaluating people anymore. And in some cases, places like some medical journals or the NIH have actually banned the use of these tools; you're not allowed to use them at all.

Another place we run into problems is what are called hallucinations. Large language models are making predictions, and they're making those predictions based on the information they know. Sometimes, because the model doesn't understand the content, it comes up with things that are humorous, and I'll show some funny examples here, but they can also be dangerous if we're counting on it in a situation where patient safety is involved. On the left here, I've asked ChatGPT, what's the world record for crossing the English Channel on foot? Now, that's a body of water between England and France, so you can't cross it on foot. But not only does ChatGPT humor me and say, yeah, this is possible, it actually gives me the name of some guy, Christoph Wanderstracht of Germany, who did this in 14 hours and 51 minutes, which, you know, is completely made up. But if you didn't know this, if you didn't know what the English Channel was, this would sound like a very reasonable answer to the question.

For a while, ChatGPT also had a lot of trouble with the spelling of states. You would ask it to identify a state that has the letter Q in it, and it would tell you Connecticut. I guess it has that sound in it; I'm not sure where it came up with that. And then something that was a very popular meme going around the internet for a while: the Google AI was recommending everybody eat one small rock every day for minerals. Somebody tracked this down, and it turned out it came from a humorous blog somebody had posted, satire recommending that people eat rocks every day.
But the Google AI thought that was a very reasonable recommendation, which is why I put it in there. This is true even in domain-specific tools. There are certain large language models being sold right now; the legal field has many of these. You can actually buy them, they're very expensive, and they also hallucinate despite being something you're paying a lot of money for. Over on the right here are two examples from a scholarly publication. On the left, they asked why Justice Ginsburg dissented in Obergefell, which is a famous Supreme Court case, and if any of you know that one, Justice Ginsburg did not dissent in that case, even though this chatbot agreed with the prompt that she did. The response is also something about copyright law; this was the Supreme Court case that granted same-sex marriage, and it had nothing to do with copyright law at all. But if you didn't know the case, the answer it provides seems very reasonable and realistic. Same thing on the right. The prompt is something about special laws or regulations in Connecticut, I don't know why Connecticut again, for online dating services, and the chatbot gives this answer about Connecticut General Statute 42-290. Connecticut, number one, has no regulations about online dating services, and it also doesn't have a 42-290; it doesn't exist, it's not even a law. And there are many high-profile examples of lawyers who have been burned in court, where they used AI to generate their briefs and submitted them to the court containing inaccuracies like this that were generated by AI, and they have gotten in quite a bit of trouble.

Another somewhat humorous example: the Google AI really likes it when you ask for comparisons between things. I think part of it is its algorithm trying to sell you things, but you can put anything in there. In this case, we're comparing an air fryer to the Ottoman Empire, and it's happy to generate a table for us comparing the pros and cons of both.

In medicine, there's a group that's come up with a tool called Med-HALT, a tool designed to help us test medical large language models for this brittleness, this inaccuracy. It's built using a very large database of medical information, the kind of expert content that the general models can't really get access to. In this particular case, they're testing one right here, and they give an example about a pregnant patient who was bitten by a tick and seems to have a tick-borne illness. The answer the computer gives is tetracycline, which is a perfectly reasonable antibiotic for a tick-borne illness, unless you're pregnant, in which case it's really the wrong antibiotic to give. Amoxicillin is the correct antibiotic in that case, but the computer gets it wrong. It gives a very reasonable answer, but the wrong answer. And so this is one of the places we need to be really careful with these kinds of tools.

So far, many of the LLMs used in medicine have been these off-the-shelf tools. They're not specially trained for medicine, which means they're not trained on the things locked behind paywalls, the journals, but they're also not supposed to be trained on patient data.
EHR data, protected health information, isn't supposed to be in these training data sets. Although, interestingly, as time has gone on, we've found out that sometimes it is, accidentally. This is an article I clipped from Ars Technica. This journalist found her face in a training database. It turns out she has a particular dermatologic condition, and the dermatologist takes pictures of every single one of their patients to monitor progress as they go through treatment, and that database of faces got leaked online. It was then used to train large language models, or other image recognition models, on dermatologic conditions. So there's a lot of stuff happening right now that's still very quasi-legal, or not legal, and very poorly controlled or regulated, and I'll talk more about regulations a little bit later.

But that's not stopping people. Epic is pushing AI quite a bit in many of their new initiatives. Most recently, they have a new tool that's available in testing, not all hospitals have it yet, for MyChart. Any of you who get MyChart messages and respond to them know that takes a lot of time, particularly if you have a very busy practice. This new chat tool in Epic reads the note from the patient, reads the patient's chart, and actually generates a draft response for you. You don't even have to write the response message to the patient; it comes up with it on its own. It doesn't send it on its own; it still needs to be reviewed by the doctor before it gets sent out. But interestingly, when they survey patients where this has been tested, the patients actually vastly prefer the AI-generated response to the doctor-generated response.

OK, so we've been through, maybe, the trough of disillusionment. Let's talk a little bit, with some more positive examples, about getting to the slope of enlightenment. I'm going to talk about a few research and clinical examples in the NICU space, where I spend most of my time.

This is a project we did a number of years ago, where we were trying to improve an MRI scoring system. One of the patient populations we take care of are those with hypoxic-ischemic encephalopathy, or HIE, also known as birth asphyxia, and part of the standard clinical care for these patients is to get an MRI. This is a really important part of their care, something parents are really counting down the days for; it's usually around day 4 or 5 before they get that MRI. The standard clinical workflow is that the MRI is done, a radiologist subjectively interprets it and writes a report, and then we subjectively interpret that report for the parents. So there's a lot of opportunity for loss of information along those steps, and for variation between providers. A number of years ago, one of our fellows, Schmikerretti, who's in Chicago now, developed an image scoring system. It's a research-based system, and it's really sophisticated. It looks at both sides of the brain, all the different sequences, and all the different parts of the brain, and it has really good predictive performance for outcomes, but it has 81 different elements in it. So nobody uses it. It's used in research only; it's not used in clinical practice because it just takes too much time. No neuroradiologist has time to fill out 81 individual elements for every single MRI they look at.
So my thought was that we could probably simplify this. We can use AI to help us identify which parts of the scoring system are the highest performing and take out things that are duplicative or don't perform very well. In this table, you can see we didn't just come up with one model, we actually came up with five, but I'll draw your attention to the top one, the original system, which has about a 40% accuracy for predicting cerebral palsy when children are older. The AI model all the way down at the bottom, our most refined version, is 85% accurate. So we more than doubled the accuracy, and we also reduced it from 81 elements to 3 elements, dramatically reducing the amount of time somebody would need to spend scoring these MRIs. Actually, one of the three elements is gestational age, which is something a radiologist doesn't even need to look up; they only need to score two of the elements. So it really improves both the accuracy and the time spent on interpretation.
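For anyone curious what that kind of paring-down looks like in code, here is a minimal sketch of supervised feature selection. The data are synthetic, the column meanings are placeholders, and the choice of logistic regression with recursive feature elimination is just one reasonable approach; none of this is the actual model or the actual variables from our study.

```python
# Sketch of paring a large scoring system down to its most predictive elements.
# Hypothetical data: each row is an infant, the columns are 81 MRI score elements
# plus gestational age, and the label is later cerebral palsy (1) or not (0).
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_infants, n_elements = 200, 81
X = rng.integers(0, 4, size=(n_infants, n_elements)).astype(float)  # 0-3 severity per element
X = np.hstack([X, rng.normal(38, 2, size=(n_infants, 1))])          # gestational age column
y = rng.integers(0, 2, size=n_infants)                               # synthetic outcome labels

# Recursive feature elimination: repeatedly drop the least useful column
# until only three remain, then check accuracy by cross-validation.
model = LogisticRegression(max_iter=1000)
selector = RFE(model, n_features_to_select=3).fit(X, y)
kept = np.where(selector.support_)[0]
accuracy = cross_val_score(model, X[:, kept], y, cv=5).mean()
print(f"kept columns {kept}, cross-validated accuracy {accuracy:.2f}")
```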
This next project was done by Wisam Shali, one of my friends at McGill in Canada. They've developed a tool to improve extubation success. That's one of the challenges in the NICU: when do we take the breathing tube out, especially in our preterm babies? It's hard. We have guidelines we follow, but we're not always right, and we wind up reintubating patients, and they often do worse as a result. With this tool they developed at McGill, they look at vital signs and things from the chart, and they improved their success rate from 82% to 93%, which, you know, may not sound like a very large number, but for something this important, a more than 10% improvement is really dramatic.

Tom Hooven is a neonatologist at Columbia University who is interested in taking a bunch of different areas of research and combining them together under AI. Necrotizing enterocolitis, or NEC, is a problem some of our preterm babies face: a very serious infection of the intestines that can require major operations or even cause death in some babies. One of the big problems we have is that, other than being born premature, we don't have a good way of predicting who's going to get it. It affects 10 to 15% of our patient population, and other than providing breast milk, which reduces the risk but doesn't eliminate it, we don't really have a good way to predict this. Tom looked at research from the microbiome field, which found that babies who go on to develop NEC have very different bacteria living in their gut. That's an interesting finding, but not a predictive tool. He put it together with an AI model, and you can see, going back 60 days before the event occurs, the babies who develop NEC and the babies who don't have a very similar risk score; he calculated this risk score from the microbiome. But over time, these lines start to diverge, and he can tell more than a month before a baby develops NEC that they're going to be distinctly different from those that don't. And this is all based on an AI model. The patients themselves are asymptomatic; you can't tell them apart at all. This is entirely from the tool analyzing, basically, what's in the diaper to figure out what's going to happen more than a month later.

Similarly, Brent Sullivan is one of my colleagues from UVA who developed a tool that predicts sepsis. It uses minute variations in heart rate, the length of time between successive heartbeats, to identify sepsis. It turns out that almost 24 hours before sepsis happens, you can see these changes; the risk score starts going up, which is far more predictive than the other types of models that are out there. In a randomized trial where they used the device they developed, and this is actually the only one of the tools I'm talking about today that's commercially available, they reduced sepsis-related mortality from 20% to 12%, a roughly 40% relative reduction, just by having this tool available.

And then another project, to bring it back to ones we've done locally here: this is another AI tool, this one looking at severe IVH. A lot of these things are related to complications of prematurity, and bleeding in the brain is another really big one. We know some of the risk factors; being born early is of course one, but we have nothing that is dynamic. Low oxygen levels, or hypoxia, is a big risk factor, but it's something we generally only know about later, and when we tried to use tools like pulse oximeters or NIRS monitors to reduce the amount of hypoxia using a threshold level, basically turn up the oxygen when the level drops below X% or Y%, it doesn't help. There have been a bunch of randomized trials showing this is not helpful. Our thought was that maybe we can use the computer to identify patterns of hypoxia, how the numbers are changing over time, to predict these outcomes.

So, as is the case with most AI projects, we tried to get as much data as possible. This is more than 600 babies from our NICU here at Children's, and more than 95 million pulse ox measurements, so a lot of data went into this. We made a fancy AI model, and the whole point of it was to find patterns in the pulse oximeter data that are very common in babies who develop IVH but rarely occur in those who don't. The computer sorts through all these patterns on a second-by-second basis over the first 7 days after the baby is born to see which ones are important.

And we came up with these patterns, and these are not ones I cherry-picked from looking through the record; these are the ones the computer found and told me were important. Interestingly, some of them are patterns we already recognize clinically. The first are severe desaturation events: babies who have normal saturations and then their saturations plummet down into the 60s before recovering again. Those are examples from two different patients there. Another variation found to be important are rapid alternating events, where the sats are normal, then low, then normal, then low, then normal, over very short time periods, 60 to 120 seconds, really representing ischemia-reperfusion events happening over and over again. Sustained hypoxia is another one. The example on the left is a baby whose sats start high and get lower and lower and lower; the time scale is a little bit longer in these pattern classes. In the example on the right, a 16-minute window, the sats stay below our clinical goal of 90% essentially the entire time, maybe briefly touching it for about a minute and then going back down again. And the last one, which was a really interesting pattern, is a rapidly oscillating baseline even without desaturation: just a very unstable oxygen saturation. This is a project that's very much in progress right now, and something we're hoping to move forward even further as the year goes on.
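The model learned these pattern classes on its own, but to give a rough feel for what flagging even one or two of them might look like with plain rules, here is a small sketch over a fake one-hertz SpO2 trace. The thresholds and window lengths are made up for the example; they are not the ones the model actually uses.

```python
# Rule-based sketch: flag two of the pattern classes described above in a
# 1 Hz SpO2 trace: severe desaturation events and sustained hypoxia.
# Thresholds and window lengths here are illustrative only.
import numpy as np

def severe_desat_events(spo2, floor=70):
    """Count excursions where saturation drops below `floor` and later recovers."""
    below = spo2 < floor
    # an "event" is a transition from not-below to below
    return int(np.sum(~below[:-1] & below[1:]))

def sustained_hypoxia(spo2, goal=90, window_s=240):
    """True if saturation stays below `goal` for at least `window_s` consecutive seconds."""
    run = 0
    for value in spo2:
        run = run + 1 if value < goal else 0
        if run >= window_s:
            return True
    return False

# Fake one-hour trace: mostly 95%, one plunge into the 60s, one long stretch in the 80s.
spo2 = np.full(3600, 95.0)
spo2[600:640] = 65        # severe desaturation lasting about 40 seconds
spo2[2000:2400] = 86      # about 7 minutes below the 90% goal

print(severe_desat_events(spo2))   # -> 1
print(sustained_hypoxia(spo2))     # -> True
```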
The final example I'm going to talk about is one that we're hoping to get very close to a device later this year. Opioid exposure is a big problem in the United States, and it's also a problem for babies born to mothers with opioid use disorder. Right now the standard of care, like many of the things I've talked about today, is subjective assessment at the bedside, which means the scores often change from shift to shift or day to day, even if the patient themselves is not really changing; it's just somebody's different subjective interpretation of what's going on at the bedside. Those differences mean that patients often wind up staying longer in the hospital, and we start treatments and then stop treatments right away as the scores fluctuate so much. Our thought is that we can do this objectively; we can use data to assess these patients in a more consistent manner.

The AI model we built was based on heart rate; we wanted to stick with a very simple vital sign. We looked at many different aspects of heart rate and found that there were really three parameters that were different in babies exposed to opioids compared to control babies, those who weren't exposed to opioids at all. The heart rates are higher at baseline, but not abnormally high; they're just higher than the controls. They have more heart rate variability, particularly decelerations, and you can see this along the top here: this baby has these big drops and then comes back up, compared to the baby on the right, where there's not a lot of difference between the high and low values. But in both cases, these heart rates are still normal; it's just the pattern that's different. And you can see at the bottom a histogram of the heart rates: it's skewed very much toward the left, compared to the much more normal bell-curve shape on the right.

So we put this algorithm inside a machine, and it's well calibrated, which is shown on the left here, but this panel right here is the one I think is the most interesting. We developed a risk score out of these heart rates, and not only does the control group, in gray right here, have a lower risk score that differentiates it from the babies with symptomatic withdrawal, in green, but we can also differentiate the babies that are asymptomatic. They are still different from the control babies, even though they are not exhibiting any symptoms we can see clinically at all. So this is detecting silent or hidden differences in patients that we wouldn't otherwise be able to see. This is an example of what the score would look like over a period of hours; the dashed lines are the normal values. The baby looks fine, fine, fine, and then they start withdrawing, and you can see the risk score go way up there.
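Each of those three heart-rate properties, a higher baseline, more deceleration-type variability, and a left-skewed distribution, can be summarized with very simple statistics. Here is a rough sketch of what computing them from a beat-to-beat heart-rate series might look like; the feature definitions and the example numbers are simplified stand-ins, not the model's actual parameters.

```python
# Sketch: summarize a heart-rate series with the three kinds of features
# described above. The definitions are simplified for illustration.
import numpy as np
from scipy.stats import skew

def heart_rate_features(hr):
    diffs = np.diff(hr)
    return {
        "baseline_bpm": float(np.median(hr)),           # tends to be higher in exposed infants
        "decel_fraction": float(np.mean(diffs < -10)),  # share of beat-to-beat drops > 10 bpm
        "skewness": float(skew(hr)),                    # negative = long tail of decelerations
    }

rng = np.random.default_rng(1)
control = rng.normal(150, 5, size=600)                  # steady, roughly symmetric heart rate
exposed = rng.normal(165, 5, size=600)                  # higher baseline
exposed[::50] -= rng.uniform(20, 40, size=exposed[::50].shape)  # periodic sharp decelerations

print(heart_rate_features(control))
print(heart_rate_features(exposed))
```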
So later this year, we're hoping late spring or early summer, we're going to start early pilot testing at three different NICUs in the US. It'll look something like this; this is just a draft drawing, but it's an example of how we can bring these tools to the bedside to hopefully improve outcomes.

The final topic I wanted to talk about is a framework. All of these new technologies are coming at a faster and faster pace; how do we incorporate them into our everyday practice, particularly if we're not deeply in the computer science field? Well, I can tell you, you probably are actually using some AI right now and don't even realize it. Some of the clinical prediction tools we use, things like the Kaiser sepsis score in taking care of babies, the APACHE score in adult medicine and the ICU setting, the NEWS and PEWS scores, those are AI systems. They're rudimentary AI systems, but they are AI nonetheless. Many of the medical devices we use are too. Anybody who's ever gotten an EKG: you see at the top it's got that little printout with the text telling you what's going on in the EKG. That's AI that did that, and it's been around at least since I was in medical school, so it's been around for a while. Some of our other medical devices, like pulse oximeters, use AI in very much a black-box way; we don't realize they're doing it, but they're using it to acquire the signal and improve the signal quality. And AI is also, as I mentioned earlier, a big part of Epic, of how it is trying to improve our billing capture. It doesn't replace a coder yet, but I think we'll probably get there relatively soon.

But where are the guardrails in this, from a regulatory perspective? Unfortunately, they're really all over the place. The FDA, in late 2021 and early 2022, released this diagram here talking about software as a medical device, and their take at that point was that all software that's going to be used for patient care is a medical device and should follow the same approval pathway that every other medical device does. The problem with that, of course, is that when you put one of these tools in Epic at one hospital and in Epic at another hospital, the FDA would tell you that you need separate approval for each one. That would be an enormous amount of regulatory burden, even if it's the exact same tool being used at each center. Not surprisingly, manufacturers pushed very, very hard against that. The FDA rescinded these rules in the fall of last year, and there's been nothing to replace them. So right now, it's really just the wild west; these tools are not regulated in any way, shape, or form.

So I think what we can really do from here is ask these questions as new tools come to us, and this is regardless of where we are in the healthcare system, whether at a top administrative level or at the bedside taking care of patients. These tools are going to keep coming to us, and there are questions we can ask as we evaluate them, and as we decide how to use them, that I think will help us use them in the best, most productive way.
The first is understanding what requires regulatory approval, and, like I just said, that is really an open question. Some of this is regulatory approval from the government itself; some of it, I think, is going to be an internal role for our own healthcare systems' policies about how we use these tools, and those may vary from institution to institution, from hospital to hospital, maybe even unit to unit, depending on the nature of the patients we take care of. Folks who take care of patients with very sensitive issues, certain infectious diseases or psychiatric concerns, may work under a different framework than a more standard outpatient primary care office. There may be differences in how we use these tools based on our settings.

The second one, and one that's really, really important, is how we build trust with providers. These tools are going to represent a very big change in how medicine is practiced. We're going to go from relying entirely on ourselves and our own knowledge, what's in our heads, to something that happens entirely external to us, and we need to trust it. People in general don't have a lot of trust for machines like this, and that's appropriate, given the hallucination examples I showed earlier; I think that's an appropriate distrust to have. But at some point, we need tools to help us. As our patient volume gets bigger and our patients get more complex, we can't handle it. We're fooling ourselves if we think we can handle it all on our own without some assistance.

So the goal, and those of you who know me have heard me say this a lot of times, it's one of my favorite expressions: we should let computers do the things that computers are really good at, and save the things for humans that humans are really good at. We shouldn't waste time having humans do computer tasks or trying to train a computer to do a human task; those things are really just a waste of time at the end of the day. But in order to do that, we need to trust the system, and in many cases that means you actually need to see it happen. There is very much a chicken-or-egg problem in implementation science right now with many medical devices: in order for a provider to trust an algorithm, they actually need to see it operating. But how do you get it in front of them without all of the work, the contracts and the purchasing and everything that goes into making it appear in front of the provider? That's a really difficult problem to solve.

There's an algorithm we use in our NICU that predicts the probability of hypercarbia, a high CO2 level, and it's something we rolled out as part of another platform we have. Very early on after the algorithm came out, I had two different partners come to me and say, this thing says there's a 100% probability of high CO2; I don't believe it, the baby looks completely fine. I told them to go get a blood gas, and it turns out the algorithm was right in both cases. It said something was wrong, and something actually was wrong.
But without going through that real-life learning experience of seeing the algorithm, not trusting it, and then realizing it actually was right when your own instincts were wrong, that's a difficult thing to replicate. At least right now, it almost has to be repeated for each individual person, going through that experiential learning process, to trust the tool in the future.

Another big question, and we're not even at the point of trusting these tools, so we certainly are not ready for them to provide autonomous care, but at some point this is going to start coming up: when can AI systems operate on their own? So it reads your MyChart message and answers the patient without you even knowing. When are we ready for something like that? There is a company I've talked to about a product they're developing. The Department of Defense was looking for a way to stretch their pool of anesthesiologists and CRNAs, particularly in the deployed setting; they have difficulty getting people to fill those roles, so they're looking for technology to expand it. This company was developing an anesthesia pump that would deliver propofol during a procedure, and it was a closed loop. It had a monitor on the brain that measured brain activity level and was wired to the pump. When the patient started to wake up, it would squirt in a little more propofol, and when they got too deeply sedated, it would back off a little on the propofol, all as a closed loop. No human interaction was required. The last time I talked to them, they were still trying to get the FDA to approve this device, not surprisingly. But the point of the story is that it's already here: people are working on these technologies that take humans entirely out of the system.
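Just to make "closed loop" concrete: the structure is the same as a thermostat: measure, compare to a target, adjust, repeat, with no human in the loop. The sketch below is a purely hypothetical simulation with a made-up sedation-depth signal, made-up thresholds, and made-up dose steps; it illustrates the feedback structure only and has nothing to do with the actual device or with safe dosing.

```python
# Hypothetical closed-loop sedation controller, thermostat-style. Everything
# here (the "depth" signal, targets, dose steps) is invented to illustrate the
# feedback structure only; it is not a dosing algorithm.
import random

TARGET_DEPTH = 50        # arbitrary sedation-depth index; lower = more awake
infusion_rate = 5.0      # arbitrary units
depth = 60.0

for minute in range(30):
    # crude simulated patient: more drug deepens sedation, and drug wears off over time
    depth += 0.8 * infusion_rate - 5.0 + random.uniform(-2, 2)

    # the feedback step: compare the measurement to the target and nudge the pump
    if depth < TARGET_DEPTH - 5:          # starting to wake up -> give a little more
        infusion_rate += 0.5
    elif depth > TARGET_DEPTH + 5:        # too deeply sedated -> back off
        infusion_rate = max(0.0, infusion_rate - 0.5)

    print(f"min {minute:2d}: depth {depth:5.1f}, rate {infusion_rate:4.1f}")
```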
And that leads directly into my last question, which is: what happens when it makes mistakes? When people make mistakes, there's a pretty well-established pathway for how we address that internally, through our various quality programs; we have ways of debriefing and doing root cause analysis. But when a computer makes a mistake, what is the framework for figuring out what went wrong and how we can do better next time? And then, of course, there are all the legal concerns that go along with that too. I don't have answers for any of these questions; I just think they're good ones to move forward with.

So, just to wrap things up, there are a few points I'm hoping you take away from this today. First, these models are incredibly powerful. In some instances, they seem as smart as or even smarter than we are. But it's important to remember that, at the end of the day, they're still computers. They will make mistakes, and they don't really understand what they're talking about; they're just predicting probabilities. Second, there's a lot of really exciting research in this area, much of which you will see today over the course of the symposium, and it has a lot of promise for us: in the way it's going to help us focus our efforts on the things we are good at, reduce cognitive burdens that shouldn't be there, provide better and more consistent care between patients, and identify patients who are going to have problems before they have problems.

These are really exciting potential applications for these devices and technologies, ones that are really going to fundamentally change the way we practice medicine. But again, I think this is a time when caution is really urged: we need to be careful and take it slow, but also not too slow at the same time. So with that, I'd be happy to take any questions. Of course, I'd like to acknowledge all the folks in the lab who help make this work possible, and the many sources of funding too.

Hey, I had a question about the IVH SpO2 AI. Does it recognize a false reading? You know how, when a nurse is looking at that, they have to make sure of the waveform, and that it's not just that the pulse ox is off the finger?

Yeah, that's a great question. This model right now is not looking at the actual waveform data itself, but it looks over a very long time period. It's looking for trends; the shortest time period it considers is 4 minutes, and the longest is 30 minutes. Most of those short-term errors are truly short term, they last 10 or 20 seconds, and they don't get picked up as persistent patterns like this. But that's a great question.

Well, one of my concerns is the hallucinations, and the ones you presented were pretty obvious, but there could be just a slight difference that someone might not catch. And then my other thing is hackers, and how we would protect AI from hackers getting into the AI system and changing how we treat the patient.

Yeah. Right, two rocks every day. Those are both excellent points, and with hallucination it's really challenging. I think that's why these tools, at best right now, are really companions for experts. They help us sort through problems or get to answers quickly, but you still need to be a content expert. I use these tools for some of the coding projects I do, when I'm writing computer code, but I already know how to do that; the computer's not doing something I can't already do. So when it makes a mistake, I can tell it made a mistake. But you're right: the more nuanced the error is, the harder it's going to be to pick up.

To the point where I don't believe any pictures I see online anymore; I don't know that they're not AI.

Right, and it's interesting: I know there are universities that use tools now, when students turn in assignments and papers, to try to catch whether the work was generated by AI. But the thing that's fascinating to me is that those tools themselves are AI. So I don't know if the AIs are willing to rat each other out or not, but it's an imperfect system predicting another imperfect system. And sometimes you can identify things in pictures, for example, AI-generated images. AI for some reason has a lot of trouble figuring out that humans have five fingers; it often generates pictures where they have six fingers or four fingers, and that used to be a tell, but as these models get better and better, that problem's going to go away. So we're not going to be able to recognize it
as easily as we could before. And then your second question, about hackers, I think is a great one. The normal information security rules still apply to AI, just like anything else, but there are several aspects of this that I think could be an issue. One: people who put things on the internet and don't want their material incorporated into AI training have developed a technique called poisoning. There are things you can do to change an image or to change text so that AIs get scrambled when they try to interpret it. It actually decreases the accuracy on that kind of information, so it makes the system worse, and the AIs don't realize they're being trained on that data; they just run through it like everything else. The other is that these systems really are giant black boxes; we can't see what's going on inside of them. A newer push that may help with things like this is that the developers are creating ways to display what's called the chain of thought, or chain of reasoning, behind the model. It's like the old-school homework problems where it's not good enough to just write the answer; you also have to explain your answer. That's what these tools are increasingly going to be able to do: justify why they're saying what they're saying. Those things will get better and better over time, but these are absolutely growing pains right now.

I mean, I think we all can appreciate that, you know, neurologists will find neurological issues and cardiologists will find cardiac issues. When we have all these different AI tools potentially up and coming, are there issues with that? Like, hey, this AI is supposed to find sepsis, but there are all sorts of different things that can look like sepsis. Not saying you're exclusively diagnosing based off this information, but I wonder if we'll find that, oh shoot, it could be sepsis, it could be these other five things. If you're building a tool to look for a thing specifically, you're probably going to find that thing more often than you would otherwise. Not so much a question, just a thought.

Yeah, no, it's a great comment, and something that I think speaks to the importance of the teams that develop these being multidisciplinary teams, including people from different areas of medicine, with data science, medical, and engineering backgrounds, to help broaden the scope of these tools. The sepsis prediction algorithm I showed earlier: the folks at UVA have since realized that it is good at predicting sepsis, but it's also very good as a general patient decompensation tool. It's actually better at predicting patient decompensation for any reason than it is at predicting sepsis. So some of this is just experience; as we use these tools, we realize, oh wait, we can actually use it for that thing too, and there will certainly be more of that in the future.

There are a couple of questions in the chat.
The first one is: is there fear of over- or under-treatment in response to treating the machine and not the patient? As touched on, clinically they look horrible but the patient looks fine, or maybe the flip side, the patient looks horrible but the numbers look fine, and wanting to trust that gut feeling.

Yeah, that's a really interesting question. It comes up not just with AI but also with many new medical devices. The standards, the way things are approved now, are rigorous, which is good, but it also means things are held to much higher standards than they were previously. Statistical concepts like positive predictive value and negative predictive value matter here: if the machine flags something as positive, how often is it actually positive, and if it calls something negative, how often is it actually negative? Those are really important metrics to keep track of. But there are things we use commonly, like a CBC, for example, that actually have terrible positive and negative predictive value, and nobody questions ordering a CBC. Maybe some people actually do, and probably should, in the right setting, but it's much more widely accepted than a new technology we're not as familiar with. So I think part of the evaluation process is providing these metrics, to give people trust in the tools. And then I think there is an opportunity to use these tools both to be proactive, to provide interventions, and also to reassure us not to intervene. It's often really hard to not do that sepsis evaluation, especially when it's a machine telling you not to do it.
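Since positive and negative predictive value just came up, here is the arithmetic in miniature. The counts are invented purely to show how the four metrics fall out of a two-by-two table, and why a rare condition can drag the positive predictive value down even when the test itself looks good.

```python
# Made-up 2x2 table for an alarm-style test. Rows are the truth, columns are
# the machine's call; the counts are invented just to show the arithmetic.
tp, fn = 40, 10     # truly positive patients: flagged vs. missed
fp, tn = 90, 860    # truly negative patients: false alarms vs. correctly cleared

sensitivity = tp / (tp + fn)   # of the real positives, how many did it flag?
specificity = tn / (tn + fp)   # of the real negatives, how many did it clear?
ppv = tp / (tp + fp)           # when it flags someone, how often is it right?
npv = tn / (tn + fn)           # when it clears someone, how often is it right?

print(f"sensitivity {sensitivity:.2f}, specificity {specificity:.2f}")
print(f"PPV {ppv:.2f}, NPV {npv:.2f}")
# Here sensitivity is 0.80 and specificity about 0.91, yet PPV is only about
# 0.31: most positive calls are false alarms because true cases are rare.
```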
I actually had a question as well, since this is fantastic. What I noticed is that the models that seem to be most robust are the ones that predict future events, for example cerebral palsy or necrotizing enterocolitis: things where, if you have this information, this could potentially happen. What I'm curious about is, for things that require more immediate decision making, you know, extubating a patient, how does that compare with the decision making of a senior clinician, and how do you account for confirmation bias in all of this?

So, there are not a lot of studies that have been done like that so far, but it certainly is the future of this. In Wisam's paper that I showed, the comparison, the 82% success rate, was expert clinicians; that was what they had before they implemented the model, so it was basically a run-in period. The computer was able to improve on that over time compared to the expert clinicians. But things like sepsis, for example, those are hard, and the more nuanced and complicated the clinical presentation of something is, the harder it's going to be to compare apples to apples.

Any other questions?

You talked about trust in new technologies. What are the guardrails for these new technologies? With the state of the current administration, it seems like there's been some pullback on some of those government-funded programs and even other regulating bodies. Is that a concern for this moment in time, when this technology needs oversight?

So right now, the regulations for any medical technology that contains AI actually follow the type of device it is. If it's an actual medical device, a monitoring device or an implantable device, it follows the existing FDA regulations for those types of devices. But if it's just software, something running in a web browser or inside of Epic, there currently is no regulatory framework for that at all. And so many places are implementing all sorts of algorithms; I think we all see that every day there's some new clinical decision support tool added to Epic, and there's no regulation behind it at all, other than whatever policies people have institutionally.

That's a great question about what's going to happen in the future. I don't know. I do know that many of the parts of the FDA that were dedicated to specific review of AI technology have been let go, so much of the expertise to review these areas has been lost in recent weeks. I don't know if that means these things aren't going to be reviewed at all, or if they're going to be reviewed by people who are not experts, in which case it might delay things quite a bit, because people not understanding this technology has historically been one of the challenges. In the United States, for neonatal seizure detection algorithms, there is a single algorithm that is FDA approved, and it's a rudimentary AI-based model. A number of groups have tried to follow it since then, and the standards they're held to are extremely high. The algorithm is required to exceed the accuracy of independent annotation by four neurologists, and you can't even get four neurologists to agree on what a seizure is most of the time, so I don't know how you're supposed to exceed that. But that's a standard recommendation from somebody who's not in the AI field, and it's not specific to these tools. So I don't know; it's an excellent question, and something that really is very much open right now.

I'm going to build on that question. You talked about seizure detection. Is it better than a dog who detects seizures?

So, it depends on the experience of the doctor. People have compared these algorithms...

Like, people who have companion animals that detect seizures; has anybody ever compared that? I don't know, it would be interesting to see. Some people actually consider that more of a standard than neurologists, right? Families rely on dogs to detect medical events. How do those models compare to something that's actually a lot simpler and seemingly reliable?

Yeah, and I think there are many different sorts of comparisons and validations that can be done. That's, I think, an example of a type of technology, or a way of making predictions, that's nonstandard, nontypical, but still has a lot of value in it. And that all gets at the heart of this, which is learning new ways to do the things that we've done very subjectively in the past.

Anything else? This is a great discussion.
We're going to be headed to lunch. Just as a reminder, the abstracts are in great rooms A and B, and bring your card for the attendance raffle tickets. All right, thanks so much.