Why Huge Gains in Drug Discovery Technology Have Led to Longer Drug Development Times

Two pharma and research experts talk about their surprising explanation of the troubling contrast between huge gains in the brute-force power of drug discovery technologies and huge declines in innovative output efficiency.

A few days ago, Derek Lowe, a prominent, no-frills writer in the pharma space, published a blog post about an important research paper by two biopharma consultants. He called it “one of the best things of its kind I’ve seen in a long time.” The paper discusses the gains we have made in drug discovery technology and the fact that they have not improved drug development times.

We know that drug discovery is becoming harder over time, but these guys have an unusual answer as to why it has become this way. We knew we had to find out more.

We interviewed Jack Scannell and Jim Bosley for a Q&A about the paper they published in PLOS ONE, “When Quality Beats Quantity: Decision Theory, Drug Discovery, and the Reproducibility Crisis.”

Jack and Jim explain how some aspects of drug discovery technology have advanced while other areas have lagged, how this has lowered output, and some of the mathematical reasons behind it. They discuss how pharma companies choose which drugs to pursue and some of the issues behind those choices. They also answer some more general questions about the pharma industry as a whole.

About Jack Scannell

Jack is famous for coining the term “Eroom’s Law.”

Tell us a bit about you. Background, education, work.

I will let Jim tell you about himself. I have had a somewhat meandering career.

I started out studying to be a medical doctor (at Cambridge in the UK), I did a PhD in computational neuroscience (at Oxford) during my medical training and then went down an academic career track for a few years after my PhD. I never qualified in medicine, for which the English National Health Service should probably thank me.

After academia, I worked for a management consultancy firm, the Boston Consulting Group, where I did a lot of healthcare-related work. This led to a job in the UK equivalent of Wall Street (“The City”) working for a firm called Sanford Bernstein. There I was what is called a “sell side equity analyst”. It was my job to guess whether healthcare stock prices were going to go up or down. I was at Bernstein from 2005 to 2012. In 2012, I quit, in part because I had become too interested in the problems of drug R&D to be spending all my time doing such a financially oriented job.

I then spent a short time running drug discovery at a little biotech company near Oxford, which was exploiting some bioinformatics methods I worked on back in my academic days in the 1990s. I left the biotech in 2014, and have been self-employed since then; doing a mixture of academic work (such as the PLOS paper) and consulting (e.g., for investment firms, the drug industry, and the public sector).

About Jim Bosley

Tell us a bit about you Jim. Background, education, work.

I was trained as a Chemical and an Electrical Engineer, am a registered Professional Mechanical Engineer, and practice Biomedical Engineering. The common thread in all of this is an interest in systems behavior.

It turns out that whether you work on centrifuges to enrich uranium for the US DOE, or automate fermenters for Genentech’s early products, or develop diabetes drugs with Pfizer, the systems behavior that matters is comprised of the contributions of many elements. For the last ten years or more, I’ve focused on creating systems models to integrate relevant elements of knowledge of diseases and drug action to yield predictive models, as a consultant to the pharma/biotech industry. Recent examples include working with a group studying acute malnutrition and its effects on early growth and development for the Bill and Melinda Gates Foundation, and modeling components of Ewing’s Sarcoma as both a Steering and Technical Advisory Board member of the KressWorks Foundation. Over the years, many different colleagues and I have presented the results of work with companies like Pfizer and Merck highlighting how predictive mechanistic models support better (higher PV) decisions in drug development.

If you were telling a 13-year-old about your PLOS paper, how would you explain it? (We aim to increase literacy, so it might be tough, but your best effort is appreciated.)

Jack:

Discovering new drugs is a bit like finding small islands in a big ocean. Academia and the drug industry did not bother studying the maths of ocean navigation. Therefore, they invested a huge amount in the boat’s engine (brute force efficiency, or “horsepower”) but neglected the accuracy of the compass (the validity of the experimental methods; how well they mimic the effects of diseases and medicines in man). R&D efficiency declined because people spent too much time travelling at great speed in the wrong direction.

We think that the compasses got worse for two reasons. First, the most valid experimental methods – the ones that accurately model human disease – lead to the discovery of cures, and once you have a bunch of cures for the same disease, it becomes both commercially and academically boring. People stop working on it. Researchers are left with the diseases for which the cures are bad, which tend to be the diseases where the experimental methods are less valid (i.e., they don’t mimic the disease in man, or the effects of medicines in man). Second, there has been a scientific fashion for “reductionism”; the idea that it is generally better to do biomedical research at the molecular level. Our guess is that lots of fashionable “reductionist” approaches fail to predict systems-level behaviour.

To give an analogy here, no-one thinks the “best” way to study aerodynamics is to look at the individual molecules that make up air or the wings of airplanes, or the “best” way to study voting intentions is to look at the molecules that make up voters. In these cases, it is often useful to look at other levels of organisation. But for various reasons, the dominant intellectual model in biomedical research is now “reductionist”. It is easier to get funding to study genes and molecules, than it is to study physiology, or pathology, or the clinical responses of real patients, etc.

To put it yet another way, people over-estimate the performance of current methods because of survivor bias in R&D projects. We celebrate the few projects where the basic science predictions appear to work and drugs succeed in humans, and quietly forget the far more numerous projects where the predictions turned out to be wrong. Creationists who argue that the human eye is so beautifully designed that it cannot be the result of selection acting on variation make the same kind of intellectual error (albeit in a much more extreme way).

What we have not done in the paper is any serious attempt to split the blame; how much is due to exhaustion of the best methods and how much is due to changes in scientific fashion and the rise of “reductionism”? We don’t know the answer to that question yet.

If I have not bored you too much already, Tom Calver at Oxford University has done a good job at summarizing the work in a digestible way on the Oxford science blog, in a short article called “Better Engine, Worse Compass.”

Jim:

I would add one thing: We must create models that allow more effective searches for drugs, and better predictions of how they will perform in a clinical trial. So my motivation has been in inventing, refining, applying and encouraging adoption of better physiological radar systems. That is, high PV systems.

What information were you most surprised by in your analysis of pharma research and trials?

Jack:

I am most surprised that more people are not surprised by the contrast between huge gains in input efficiency and huge falls in output efficiency in the biomedical research enterprise. You might have thought that science funding agencies would realize something was wrong if the activities that experts say are important for drug discovery have become hundreds, thousands, or billions of times cheaper while the cost of discovering drugs has gone up a hundredfold.

Jim:

A paper out of Andrew Lo’s group (Fernandez, Stein, and Lo 2012) points out that a lot of current big pharma activity is focused on minimizing risk and making incremental improvements in operating efficiency. What surprises me is the low investment in fundamentally improving the yield and efficiency of the early R&D process that feeds pharma’s insatiable need for new drugs. Given the current poor average ROI on early research that Jack and others (e.g. SSR Health) have pointed out, an often-mentioned solution for the large, research-based pharma firms seems to be to cut early research, and to make up for fewer internally developed compounds by purchasing those at or near the Proof of Concept clinical stage from putative “small and nimble” early R&D firms. But if the fundamental process used is inefficient and the expected ROI of R&D projects is low on average, no rational firm will be willing to take the risk. And even if “small and nimble” firms do risk investing in research and are smart enough to efficiently develop the new drugs that big pharma can’t, they sure as heck are smart enough to value-price. A respected major pharma CEO recently commented to Reuters that the “Biotech Bubble” in pricing has put mid-size firms out of reach. I’m not sure I know the answer, but I would ask: Are these prices truly a bubble or are they the new normal?

So it’s surprising to me that there is more focus on merger/relocations to minimize tax burdens than there is to fundamentally improve the R&D process. And that the proposed solution seems to have big pharma ceding the R&D field to others, with the best outcome being that their margins shrink, with a possible downside that the industry contracts tremendously.

You mention PM and PV and their ability to bottleneck the process of biopharma development. Can you describe this in more detail/clearly? (It’s very complex and some of us didn’t fully understand it)

Jack:

PM = Predictive Model = any experimental set up (e.g., a depressed rat that you are trying to make less depressed with new antidepressant drugs, or a cancer-derived cell in a petri dish that you are trying to kill with a new anti-cancer drug) that is used to try to guess how well drug candidates will work later in the R&D process, when they are tested in sick humans, for example.

PV = Predictive Validity of a Predictive Model. This is the extent to which the results you get from the Predictive Model would agree with the results you would get if you tested a large set of drug candidates in the model and then in real patients. In practice, you can’t often measure PV for sure because it would be too expensive and unethical.

The statistical analysis we have done shows that the chance of making the right decisions in R&D is exquisitely sensitive to PV; much more sensitive than most biologists or chemists working in drug discovery would have guessed. The quantitative strength of the result is not obvious. You only find it out when you do the decision-theoretic math. This then leads to the idea that a decline in the average PV of the predictive models that people are using could go a long way towards explaining why R&D gets less efficient despite big gains in the brute-force efficiency of the individual scientific activities.

In our paper, we express PV as the correlation coefficient that you would find if you ever tested a large number of drugs in the model and then in man. In very rough terms, you would have the same chance of finding a good drug by testing 100 drug candidates (a very small number) in a model whose output was 0.9 correlated with the human outcome, or 1,000 drug candidates in a model whose output was 0.8 correlated, or 10,000 candidates in a 0.7 correlation model, or 100,000 candidates in a 0.6 correlation model, etc. As the model (PM) gets less valid (that is, as PV falls and the compass gets less accurate), you have to search much harder to find a drug that will work (your small island in the big ocean).
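As a rough illustration of why the accuracy of the "compass" matters so much, the following Python sketch simulates a screen in which the best-scoring candidate is advanced. It is not the calculation from the PLOS ONE paper. It assumes that each candidate's true human effect and its model score are standard normal variables whose correlation equals the PV, and the function name, the success threshold of 2.0 standard deviations, and the trial counts are all illustrative assumptions.

```python
import math
import random


def top_pick_hit_rate(pv, n_candidates, threshold=2.0, trials=1000, seed=0):
    """Estimate how often the top-scoring candidate is truly good.

    pv           -- assumed correlation between model score and human effect
    n_candidates -- number of candidates screened per simulated project
    threshold    -- hypothetical true effect (in SD units) that counts as success
    """
    rng = random.Random(seed)
    noise_scale = math.sqrt(1.0 - pv * pv)
    hits = 0
    for _ in range(trials):
        best_score = -math.inf
        best_true_effect = 0.0
        for _ in range(n_candidates):
            true_effect = rng.gauss(0.0, 1.0)
            # Model score shares a fraction pv of its variation with the truth.
            score = pv * true_effect + noise_scale * rng.gauss(0.0, 1.0)
            if score > best_score:
                best_score = score
                best_true_effect = true_effect
        if best_true_effect > threshold:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    for pv in (0.9, 0.6, 0.3):
        rate = top_pick_hit_rate(pv, n_candidates=100)
        print(f"PV={pv}: top pick succeeds in {rate:.0%} of simulated screens")
```

Under these assumptions, the same 100-candidate screen picks a truly good candidate far more often with a high-PV model than with a low-PV one. Increasing `n_candidates` for the low-PV case shows how slowly extra throughput makes up the difference.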

This then starts to explain the pattern that we have seen in drug R&D. So, for example, a recent review of failures in the discovery of antibacterial drugs says: “Is it not peculiar that the first useful antibiotic, the sulphanilamide drug prontosil was discovered by Gerhard Domagk in the 1930s from a small screen of available [compounds] (probably no more than several hundred), whereas screens of the current libraries, which include ~10,000,000 compounds overall, have produced nothing at all?”

Our PLOS ONE paper shows that this is not necessarily peculiar. Gerhard Domagk tested his few hundred drug candidates in live mice with bacterial infections. The recent attempts started with single bacterial proteins in a tiny dish. If the results of the mice correlated at a 0.9 level with results in infected humans, and the results of the proteins in a dish correlated at the 0.3 level, or less, with results in infected humans, the success of Domagk and the failure of modern methods are what you would expect. If you look at Figure 4A of our PLOS paper, you can compare the positive predictive value for Domagk’s mice (0.9 on the horizontal axis, 2.5 on the vertical axis) with the positive predictive value for the modern attempts (0.3 or less on the horizontal axis, 7 on the vertical axis).

Jim:

I would point out that a predictive model might be a math construct (in fact, there are many cases where it has been: this is my job) to integrate data or a datum that would not otherwise be very predictive. So a diabetes model that includes thousands of data sources and replicates the effects of diet and many different approved drugs in dozens or hundreds of test cases can be used to leverage the likely clinical implications of a limited amount of preclinical data around a novel target. The PV of a novel assay or omics result may be low (as we point out, they often are), but leveraged with other known biology, in a quantitative way, they yield a higher PV. And because the model is testable, and easily documented, folks have confidence in it. This changes the game.

I think many of us thought that mechanistic modeling fit best in early research, where there is a huge uncertainty gap between the lab and the ultimate pivotal human trial outcome. In concrete terms, the green flash in the microtiter tray may have very low correlation with progression-free survival in an oncology trial. The modeling I’m talking about is one way to leverage that green flash with other data to get a better PV. Recently, it’s become clear that in complex diseases the uncertainty gap can persist, even when we have clinical trial results. So we may have differential tumor and germ-line genomics from a patient with Ewing’s sarcoma, but the individual mutations we identify are hard to correlate with anything actionable. We need to do some pretty heavy math to figure out the evolutionary phylogeny of the tumor cell population, and to account for the fact that the biopsy is of a heterogeneous tumor, and to figure out what the genomic differences mean in terms of modulated metabolic pathways. Oh, and we need to account for the fact that we often rush to treat, and only get samples after the initial treatment has changed the tumor cell population. Integrating all this so that we can understand how treatments affect all relevant “hallmarks of cancer” will give us better PMs and better in silico predictions of the effects of novel treatments. A lot of this thinking derives from the very passionate work of my colleague Dr. James Kress, at KressWorks, by the way.

Your paper mentions a reproducibility crisis of sorts, could you elaborate on that?

Jack:

Drug companies, venture capital firms, and academic scientists have realized over the last decade or so that if you try to repeat biomedical experiments that are reported in the best of the science journals (e.g., Nature, Cell, Science), then much of the time you don’t get the same results. People in industry find that the proportion of academic results that are irreproducible is quite high (some estimates are in the 75% range). Clearly, this is a problem. Of course, people argue about the causes, and about the proportion of results that really are irreproducible, and about what it all means.

If you want a good introduction, one of the most influential papers is by Ioannidis and is at: http://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.0020124

Jim:

I would add one point from a control systems theorist’s point of view, and one from a decision analyst’s point of view.

First, it is a provable conclusion from control theory that if one adds delay to the feedback loop, control system performance suffers. You know this if you’ve played the race-car simulator games in a pizza parlor. You turn the wheel, and there’s a delay in the car motion shown on the screen. That delay makes the game harder.
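The effect of feedback delay can be seen in a toy simulation. The Python sketch below is an illustration only: it assumes a simple proportional controller steering a position toward a target, and the function name, the gain of 0.5, and the step count are hypothetical choices rather than anything from the paper.

```python
def tracking_error(delay, gain=0.5, steps=40, target=1.0):
    """Sum of absolute tracking errors when feedback arrives `delay` steps late."""
    positions = [0.0]  # position history, starting away from the target
    total_error = 0.0
    for step in range(steps):
        # The controller only sees where the system was `delay` steps ago.
        observed = positions[max(0, step - delay)]
        new_position = positions[-1] + gain * (target - observed)
        positions.append(new_position)
        total_error += abs(target - new_position)
    return total_error


if __name__ == "__main__":
    for delay in (0, 1, 3):
        print(f"delay={delay}: accumulated error {tracking_error(delay):.2f}")
```

With no delay the position settles quickly on the target. With the same gain and a delay of three steps the controller keeps correcting for errors that are already out of date, so it overshoots and oscillates.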

In managing drug development, stage-gate decisions are critical. The decisions executives make aren’t proven right or wrong for years – sometimes for a decade or more. The “time constant” of people’s careers is shorter than that. So one can’t reward folks using feedback – by the time you could evaluate the decisions they’ve made, they’re often retired, promoted, or have moved to a different firm. Thus, you have to manage folks with surrogate measures that are more timely, but are less correlated with how much they have contributed to actual novel, useful, and approved drugs that have benefitted the firm.

Thus you get the metrics Jack mentions:
How many compounds are screened?
How many candidates were promoted to clinical trials?
Or my favorite: How many papers did the researcher publish?

To use our terminology, none of these metrics has very good predictive validity for the approval of safe, effective new therapies, and the resulting new revenue streams.

The other issue, mentioned in the paper, is that unless we make sham decisions to kill candidates at each stage of drug development (that is, we label some candidates as “killed”, but continue to test all candidates into the clinic) we have a hard time determining actual true and false positive and negative rates.

Since most candidates are failed early and not subsequently tested to see if we were wrong, we have no definitive data to say whether the early fail was a true or false negative. I’m coauthor of a paper out soon describing a promising therapy that was killed and very probably shouldn’t have been. So, a false negative. Then again, several years ago some colleagues worked on an asthma drug that showed strong effects in chimps, but when tested in the clinic turned out not to have any significant effects (an anti-IL-5 mAb as tested on non-eosinophilic asthma). So a false positive. It may be that we can use mechanistic, integrative math models to get a better read on TP, TN, FP, and FN rates.
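To make the TP, TN, FP, and FN terminology concrete, here is a minimal Python sketch of how the rates would be tallied if every candidate's eventual human outcome were known, which is exactly the information that is missing when killed candidates are never tested. The function name and the example records are invented purely for illustration.

```python
def decision_rates(records):
    """Tally decision outcomes.

    records -- iterable of (advanced, worked_in_humans) boolean pairs, where
               `advanced` is the stage-gate decision and `worked_in_humans`
               is the outcome we would only know if every candidate were tested.
    """
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for advanced, worked in records:
        if advanced and worked:
            counts["TP"] += 1
        elif advanced and not worked:
            counts["FP"] += 1
        elif not advanced and worked:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    good = counts["TP"] + counts["FN"]  # candidates that truly work
    bad = counts["TN"] + counts["FP"]   # candidates that truly do not
    return {
        "true_positive_rate": counts["TP"] / good if good else None,
        "false_negative_rate": counts["FN"] / good if good else None,
        "true_negative_rate": counts["TN"] / bad if bad else None,
        "false_positive_rate": counts["FP"] / bad if bad else None,
    }


if __name__ == "__main__":
    # Made-up records, not real trial data.
    example = [(True, True), (True, False), (False, True),
               (False, False), (False, False)]
    print(decision_rates(example))
```

In practice the `worked_in_humans` value is unknown for any candidate that was killed, so the false negative and true negative rates cannot be computed from ordinary pipeline data.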

What are your views on the design of clinical trial inclusion/exclusion criteria and on protocol feasibility? How do we do it today versus how should it be done in the future?

Jack:
Most of the very big and very expensive clinical trials for new drugs are funded by the industry, and the industry can only actively promote drugs for the things that regulators, such as the FDA in the US, believe have been “proven” by the trials.

This arrangement constrains trials to be a messy and expensive mixture of commercial strategy, marketing, science, medicine, and regulation. Sometimes it works well, but sometimes it works badly. One problem is that older, sicker, and more complex patients are excluded from trials, so that the samples of patients recruited into trials end up looking very different from the patients who end up using the drugs in the real world several years later. Another problem is that companies sometimes avoid doing trials that could give the “wrong” commercial answer, even if such trials would answer important medical questions. Yet another problem is that the industry has been known to cheat in trials (and also messes things up accidentally), so there are huge costs related to data integrity and checking that would be entirely unnecessary if society was less concerned about potential conflicts of interest between those producing the data and those consuming the drugs. And yet another problem is that companies are unlikely to fund trials for products that are unlikely to be profitable.

Given these various problems, I am surprised that the payers (large commercial insurance companies in the US, Medicare, and major European public health systems) don’t run more of their own clinical trials and then base prescribing and reimbursement decisions on their own data. So, for example, drug companies sometimes avoid running trials that provide evidence that drugs A, B, and C are therapeutic substitutes. Once health systems know drugs are substitutes, they can be much more effective in their price negotiations with the drug manufacturers. I think that health systems could often save money, and prescribe better, if they were prepared to fund more trials. It does seem pretty strange that the buyers of drugs have handed over nearly all the responsibility for evidence generation to the sellers of drugs.

Finally on this point, and outside my real area of expertise, I suspect that large clinical trials have been given a privileged position within a “hierarchy” of medical knowledge that they do not necessarily deserve (particularly given the fact they are often designed to meet a messy mixture of commercial, scientific, and regulatory constraints). Folks recruited into trials don’t represent typical patients, but there is a strong tendency to generalize from clinical trials to wider patient populations. Observational studies, which often include representative patients, are sometimes dismissed because they can suffer from bias. This general preference for unbiased results from unrepresentative patients (e.g., phase III trials) over biased results from representative patients (e.g., large observational studies) strikes me as arbitrary.

Jim:
Again, if we can get higher PV markers then we will get better targeting for inclusion and exclusion. I got into this field with a company called Entelos. Entelos was formed after two consultants, Sam Holtzman and Tom Paterson, were called on by a sponsor to analyze an approvable decision from the FDA. Here’s my understanding: The sponsor’s trial data for the compound showed that some patients responded well, but others didn’t respond at all. The FDA wanted to know why. Tom and Sam’s modeling showed that there were two different diseases, clearly delineated by a marker. The FDA ended up approving the drug for the effective indication. So the model formed the basis for the approved indication. This approach can be used to find markers for responders and non-responders (or adverse responders) before the trial. So, better inclusion rules.

Can you tell if a trial will fail? How do you do it? We’ve heard very intriguing responses to this question.

Jack:
The total cost of R&D is dominated by the cost of failure. It seems obvious, therefore, that it is very hard to know beforehand whether a trial will succeed or fail. If people really could tell the difference, it would not be the case that ~90% of clinical development projects fail.

Jim:
Yes, this is exactly what the case of Pfizer’s GPR119 diabetes target shows. As Tristan Maurer of Pfizer pointed out in a NY Academy of Sciences talk in 2014, Pfizer (and a team of colleagues and I at the firm I used to work at) developed a diabetes model using hundreds of data sources. Keep in mind, this is a model based upon biological mechanism, not just correlations. So if compound A reacts with B to yield C, I can put a rate equation in to represent this, and not just a correlative curve fit, which means that the model can be better at extrapolation. The fact that the parameters in the model represent physiological “things”, like tissue permeability, or preproinsulin stores in pancreatic beta cells, gives us tremendous ability to use data in the literature. So over 1,000 sources were used. And the model was tested both component-wise and in the whole-body sense. This all gave a model with high PV. Pfizer had generated no clinical GPR119 data. Another company had published a very limited set of data for the target, but it had significant gaps (if I recall correctly, there were no glucose data!), and it was used more to confirm the correctness of the model than to change it. The model predicted a lowering of HbA1c that was less than clinically and commercially interesting. In fact we predicted a maximum reduction of about 0.4%. Pfizer killed the target on this basis. Later, Johnson & Johnson did a GPR119 trial, and they got a reduction in plasma glucose corresponding to 0.3%, quite consistent with our model. This validated the Pfizer decision, and I understand the savings they realized by avoiding a trial to be in the 20 million dollar range.
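Jim's "A reacts with B to yield C" example can be written down as a rate equation and integrated numerically. The Python sketch below is a deliberately tiny illustration, not the Pfizer diabetes model: it assumes simple mass-action kinetics (rate = k·[A]·[B]) and forward Euler integration, and the function name, rate constant, and starting concentrations are hypothetical values.

```python
def simulate_reaction(a0, b0, k, dt=0.01, steps=1000):
    """Integrate A + B -> C with mass-action kinetics using forward Euler.

    a0, b0 -- hypothetical starting concentrations of A and B
    k      -- hypothetical rate constant (a mechanistic parameter, not a curve fit)
    Returns the final concentrations (a, b, c).
    """
    a, b, c = a0, b0, 0.0
    for _ in range(steps):
        rate = k * a * b   # mass-action rate law
        delta = rate * dt  # amount converted during this time step
        a -= delta
        b -= delta
        c += delta
    return a, b, c


if __name__ == "__main__":
    a, b, c = simulate_reaction(a0=1.0, b0=0.5, k=2.0)
    print(f"A={a:.3f}  B={b:.3f}  C={c:.3f}")
```

Because `k` stands for a physical quantity, the same equation can be rerun with different starting concentrations, which is the sense in which a mechanistic model can extrapolate where a purely correlative curve fit cannot.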

What do you make of Ben Goldacre’s push for Failed Trials data to be published? Is there something he’s failing to understand or is his view reasonable but difficult to enforce? (Basically with this question we’re trying to get an informed opinion of Ben Goldacre’s AllTrials campaign)

Jack:

I don’t know the gritty details of the AllTrials campaign. However, it does seem reasonable that the results of clinical trials, both positive and negative, are published in a timely way.

By the way, failure to report trial results does not seem to be a problem that is specific to the drug industry. Publicly funded biomedical research agencies seem to share the problem.

A while ago I made a not-altogether-serious suggestion to the folks at AllTrials that they should use embarrassing “prizes” (like the Razzies for bad films: http://www.razzies.com/ ) for the individuals who run trials that waste the time and good-will of patients.

I suggested one prize for the most duplicative and least informative trial. The “Carbon Copy Prize.” You find that in some therapy areas, inordinate numbers of small and inadequately powered studies are run, all vaguely directed at the same kind of question, but none big enough to provide a definitive answer. Trials that look at the repurposing of a diabetes drug, metformin, as a cancer treatment strike me as a possible example here. Given the total number of patients involved across all the different studies, we really should have a definitive answer by now. I am not sure we do.

I suggested another prize, named after the Harper Lee novel “Go Set a Watchman,” for the trial with the most unconscionable delay between completion and publication (to be awarded posthumously).

However, I would come back to an earlier point. Some of the biggest evidential gaps relate to trials that are not done – because it is in no one’s commercial interest to do them – rather than trials which are done and then not reported.

What do you believe is the elephant in the room for the Pharma industry at large?

Jack:

The economics of private sector drug R&D are largely dependent on drug price inflation in the US.

What do you both believe and understand from the data analysis that you wish more researchers knew?

Jack:

As the title in the PLOS paper says, when it comes to drug discovery, “quality beats quantity”.

Jim:
Quality beats quantity, and the mechanistic modeling we discussed can often combine low PV atomistic data with other biological knowledge to synthesize a high PV model.

The average consumer and patient doesn’t know how complex the pharma and clinical trials industry is. Many of them see it demonized by the press at large. Do you have any stories, thoughts, or data that you think would be useful for them to take a look at?

Jack:

I spent a chunk of time in 2015 thinking about this. My attempt to answer the question was published in Forbes in an article titled “Four Reasons Drugs are Expensive, of Which Two are False”. Some bits may be hard going, but it has better jokes than the average article on the economics of the drug industry.

Jim:
I have worked with people doing clinical trials and can attest to their dedication and complete commitment to the work. I think many of them are as frustrated with the current R&D paradigm as the general public. It’s easy to criticize an industry in stress, as some of the things they do may not be popular. Hopefully, this paper lowers the stress by encouraging folks to come up with ways to do more effective and higher yield R&D.

Key Takeaways:

The amount spent by the drug industry on R&D per new drug brought to market doubled every 9 years from 1950 to 2010
The brute-force efficiency of drug discovery tools and technologies increased spectacularly over the same period
Scannell & Bosley’s work suggests that R&D efficiency declined because “increasing horsepower” was accompanied by a “worsening compass”; too much research going in the wrong direction with great speed
Judging the quality of the “compass” (i.e., the “validity” of experimental models) is very difficult. However, the drug industry should put its brightest and best people onto the problem of model validity
The commercial attractiveness of private sector drug R&D is worryingly sensitive to US drug price inflation
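As a quick check on the first takeaway, the short Python sketch below works out what a doubling every 9 years implies over 1950 to 2010. The function name is an illustrative choice, and the calculation assumes a perfectly steady doubling period, which is a simplification of the real trend.

```python
def cost_multiplier(start_year, end_year, doubling_years):
    """Overall cost increase implied by a constant doubling period."""
    doublings = (end_year - start_year) / doubling_years
    return 2.0 ** doublings


if __name__ == "__main__":
    multiplier = cost_multiplier(1950, 2010, 9)
    # 60 years is a little under 7 doublings, i.e. roughly a hundredfold rise.
    print(f"Cost per new drug rose about {multiplier:.0f}-fold")
```

The result is consistent with the roughly hundredfold rise in the cost of discovering drugs that Jack mentions earlier in the interview.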