The problem with metrics is a big problem for AI (fast.ai)
172 points by QuitterStrip on Oct 19, 2019 | 47 comments


> However, the researchers found that several of the most predictive factors (such as accidental injury, a benign breast lump, or colonoscopy) don’t make sense as risk factors for stroke. So, just what is going on? It turned out that the model was just identifying people who utilize health care a lot.

A good analogous example of this is PCA. If your first component has a dominating effect, it will drown everything else out (you can compensate by looking at components 2 and 3). (Examples: [monetary] inflation, year or month effects.) It's a cool exercise to run PCA on datasets and see whether things like this pop out. This is also why PCA (which does variance maximisation) is an exploratory analysis and should not be used as an authoritative "result" or "metric". You could, but you have to be careful. Even in picking your components you introduce curation and bias.
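To make that concrete, here's a toy sketch in Python (synthetic data, all variable names made up): a single "utilization" factor drives every feature, so it dominates PC1, and the signal you actually care about only shows up in PC2.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1000
    utilization = rng.normal(size=n)   # dominant confounder
    disease = rng.normal(size=n)       # the signal we actually care about

    # 10 features, all loaded on utilization; only one weakly on disease
    X = np.outer(utilization, rng.uniform(1.0, 2.0, size=10))
    X[:, 0] += 0.3 * disease
    X += 0.1 * rng.normal(size=X.shape)  # measurement noise

    # PCA via SVD on the centered data
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt.T
    print("variance explained:", np.round(S**2 / np.sum(S**2), 3)[:3])
    for k in range(2):
        print(f"PC{k+1}: corr w/ utilization = "
              f"{np.corrcoef(scores[:, k], utilization)[0, 1]:+.2f}, "
              f"corr w/ disease = {np.corrcoef(scores[:, k], disease)[0, 1]:+.2f}")

PC1 comes out as essentially pure utilization; the disease signal is invisible until you look at PC2 (up to a sign flip, since PCA components have arbitrary sign).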

Imagine a scenario where we regress to the dark ages and a eugenics-inclined doctor says: "Sorry Stan, your second component value is an outlier, no kids for you; your genes are not considered adequate."


PCA probably means Principal Component Analysis, in case anybody is wondering.

https://en.m.wikipedia.org/wiki/Principal_component_analysis


Thanks :)


In lots of organizations, sets of metrics aren't "held in tension" and that causes all sorts of wild problems. As a trivial example, say you want to stop people from quitting your app. Easy! Just lower the price by 50%. So we often have to set metrics while satisfying all the other constraints. This is the #1 thing I've seen go wrong when teams try to use OKRs or some other metrics-based activity; they fail to figure out the related components and either hold those in place or improve them alongside their initial goals.
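A sketch of what "holding metrics in tension" could look like mechanically (all metric names and thresholds here are invented): accept a change only if the target metric improves and no guardrail metric regresses past a tolerance.

    def accept_change(before, after, target, guardrails):
        if after[target] <= before[target]:
            return False  # the target metric must actually improve
        for metric, max_drop in guardrails.items():
            if before[metric] - after[metric] > max_drop:
                return False  # a guardrail metric regressed too far
        return True

    before = {"retention": 0.62, "revenue_per_user": 10.0}
    after  = {"retention": 0.71, "revenue_per_user": 5.0}  # halved the price!
    print(accept_change(before, after, "retention",
                        guardrails={"revenue_per_user": 0.5}))  # False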


This is a great illustration of the limits of AI — the scope of available data is limited. The data available in medical records is both limited and often garbage.

Without the context of behavioral data and other factors, it’s dangerous to draw conclusions.

In other scenarios, like say predicting the likelihood that a mole is cancerous, AI will eventually be better and more consistent than a human.


What decision making process if not a statistical one would be better when it faces the same data limitations? Or what if we know the best decision making process for a given context, but it’s too expensive to apply?

For example, we may know that evaluating loan applicants is best done with in-person interviews, but it’s simply too costly and delays getting loans to people too long, so we agree to use an automated decision process knowing the cost of it making some mistakes will be part of what is paid for the benefit of being able to process many applications quickly.


> what decision making process if not a statistical

I believe it's called eye-balling. You'd be surprised how good farmers are at this. It's also why I think tech in agriculture is something of an over-sell; farm managers are already some of the most efficient people you'll meet, and even a notebook is sometimes too much to ask of them.

Edit: To answer your question with more elaboration: a statistical model is a proxy for not having full knowledge of the subject matter. But any deep analysis of the subject matter will require personal attention. Farmers (to reuse my example) are specifically good because of their personal knowledge of the farm and the sheer amount of time they spend working it.


If a predictive process discards risk for humans who lack access to healthcare, it’s creating an obvious ethical problem.

There’s a big difference between writing a loan and missing a stroke.


That seems more like an argument for making more data available. Presumably an algorithmic assessment can notify more people at risk of a stroke than would be possible by trying to fit them all through actual physician screenings, no? In a public health sense, it would do more good in terms of lives saved, even if algorithmic errors cost lives in some specific cases.


I’m familiar with programs at scale that have been demonstrated to predict certain negative health/behavior patterns.

From a medical point of view, you can assume a 10-20% probability of a negative outcome. If you layer on other factors or behaviors in the person’s life, the probability of a bad outcome approaches 90%.
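For a sense of how layering factors moves the number (illustrative only; these likelihood ratios are invented, not from any real program), Bayes in odds form gets you from ~15% to ~90% with three moderately informative factors:

    prior = 0.15
    odds = prior / (1 - prior)
    for lr in (4.0, 3.5, 3.0):  # hypothetical likelihood ratio per factor
        odds *= lr
    print(f"posterior = {odds / (1 + odds):.2f}")  # ~0.88 under these assumptions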

Problem is, you need deeply intrusive data at scale. What are the consequences of an organization with Google/Facebook capability that is also your AI doctor? It’s easy to foresee complications.


Just to piggyback on this: one should be very careful combining other techniques with PCA. One error I see a lot in biostatistics is people running statistical significance tests to select variables before running PCA. It's a super cool exercise to try on a dataset, because something WILL show up, and it's very often misleading.
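Here's roughly what that failure looks like on pure noise (a sketch, not anyone's real pipeline): select "significant" features first, then run PCA, and the arbitrary groups separate cleanly on PC1.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n, p = 40, 2000
    X = rng.normal(size=(n, p))            # pure noise, no real signal
    group = np.array([0] * 20 + [1] * 20)  # arbitrary labels

    # The mistake: keep features that "significantly" separate the groups
    pvals = stats.ttest_ind(X[group == 0], X[group == 1]).pvalue
    selected = X[:, pvals < 0.05]          # ~100 features survive by chance
    print("features selected from noise:", selected.shape[1])

    # PCA on the selected features: the groups now separate "beautifully"
    Xc = selected - selected.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    pc1 = Xc @ Vt[0]
    print("PC1 mean, group 0:", pc1[group == 0].mean())
    print("PC1 mean, group 1:", pc1[group == 1].mean())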


Metrics are really meant as a social tool for humans.

For example, a common piece of advice is that you need a "north star" metric. For Facebook this was daily active users (for an arbitrary definition of active). For machine translation research it was the BLEU score (which is also fairly arbitrary and flawed).

Humans need these simplistic metrics because we can't agree on progress otherwise.

But it's also because every human has conflicting goals that we know the guiding metric won't fail too horribly. At all points in time, people are also evaluating progress in terms of their personal values, from their own point of view. Gaming of the metrics won't go unnoticed by another human. Edge cases can be patched. Conflicting goals are surfaced, decided on, and resolved. The guiding metric and the real mission of the organization evolve organically. All of this is done using biological intelligence.

I don't know if it's really possible to build an AI system with the same property. The course correcting of metrics might for a long time remain in the domain of vigilant humans, as we are the only ones who know what our values are.


> Metrics are really meant as a social tool for humans.

This is such an odd statement. I think the article makes a good case that in this context, metrics serve as something like the opposite of a tool - a trap, something a person or group uses that takes them somewhere other than where they intended. Now approximation, generally, is something like a tool, but it can be used well or badly.

Now, a bureaucrat or someone who wants to "just get stuff done" may use a metric to get people moving without caring exactly where. But that's still within a human context, not an AI one. In an AI context, a metric can run a plane into the ground or more or less anything else.

I.e., it's only in some of the contexts (racist AI) that metrics have a social content.

The point about corner cases is reasonable, but it's easy to see by analogy that human correction can't keep up with the problems metrics yield in complex spaces; just consider that a ten-dimensional cube has 2^10 literal corners, and AIs more or less operate in spaces with many more dimensions (of course not all "corner cases" are literal corners, but there's a similar problem of intersecting "areas of concern" or whatever one would want to call them).
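The numbers back this up: in high dimensions, essentially all of a cube's volume sits near its faces and corners, so "edge cases" become the normal case. A quick way to see it (the 5% margin is arbitrary):

    for d in (2, 10, 100, 1000):
        print(f"d={d}: central volume = {0.9 ** d:.2e}, corners = 2^{d}")

At d=1000 the central region (everything at least 5% away from every face) holds about 10^-46 of the volume.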


To clarify, I am trying to say that the only purpose of metrics is for humans in a social context. If an AI optimizing for a metric ends up with unintended consequences, that's a human problem. A human has to go in and redefine a new metric to fix it. There's no other way around it. The human can't anthropomorphize and blame the machine. A machine can't redefine itself because it doesn't understand human values.

So it's important to understand why humans believe in metrics. Humans by default are multi-objective. We don't care about just one or even a few things; we care about a lot of things, all at once, with nobody agreeing on what they are. Identifying the "true direction" we want to go as a collective is impossible. So we pick a good-enough direction so everyone can be somewhat aligned. To make it possible to communicate this vision, we compress the world down to a few dimensions even though we know that it is absurd. This works fine for humans because our "distributed system" is composed of beings with enough intelligence to notice/alert/convince the rest of the system whenever we are veering off course.

Whenever we program AI with these metrics, we forget that we can't direct them the way we direct our sentient brethren. Any simple metric is necessarily doomed to fail because it doesn't have this self-aligning property. So ultimately it is the programmer's responsibility to go and realign the system with human values. If necessary, with some social pressure (hence this article). It doesn't seem like there will be any other way.


Your comment made me think of the Vietnam War. Many of the military leaders in the war came of age during World War II and Korea, where the most obvious metric was land: if Nazis or communists were there, then bad. If allies, then good.

But that completely broke down in Vietnam, where the placid villagers of South Vietnam by day became the ruthless Viet Cong by night. American generals had no idea how to approximate their success as they had in previous wars, so they settled on kill scores, and as long as more Vietnamese died than Americans, the DoD could go on television and claim that America was “winning”, even though the incentives of kill scores resulted in a lot of bystander villages being torched and civilians being killed, which only turned public opinion further against the Americans.

I wonder if a sufficiently advanced AI system could correct its use of (or abandon) a bad metric even when it is incentivized to use it.

The problem I cited above was so hard for the people at the time that it was effectively intractable. I’m not arguing that an AI would have had to solve the problem (setting aside the likelihood that, even had an AI come up with a better solution, inertial forces like politics and sunk cost would have prevented it from being implemented), but any system claiming to be a “strong” AI would at least have to be up to the task of trying.


Vietnam was worse than that. There were no criteria for winning and no visible end, so the conduct of the conflict was insane.

Kill counts were an objective measurement that demonstrated that something was done, and turned into evidence of “victory“.

Intelligence, whether human or artificial, cannot fix problems that cannot be defined.


The problem is the entire "value system" of an AI is extremely likely to be designed around metrics to begin with.

How could it discard something that it solely values?

AI, even Diamond Hard AI, is independent of what it values; intelligence has no bearing on it. Consider humans who are extremely intelligent but highly value brutally murdering and torturing other people: no amount of intelligence is going to change the fact that they value that.


The problem with discussions like this is that they never provide systematic examples of how a portfolio of metrics or qualitative checking can be integrated into a modeling problem. There’s a lot of finger pointing at metrics and complacency about problems, but the solutions are super vague, like the sanctimonious passage in this article about hiring from under-indexed groups in tech companies and just listening to first-person accounts (which is probably a bad idea if you actually want to help).

Ultimately I agree with the underlying idea, but I think to be helpful you have to present case studies of reproducing research but with metric optimization swapped out for a holistic variety of metrics plus qualitative checking.

I recommend the books Bayesian Data Analysis by Gelman et al and Data Analysis Using Regression and Multilevel/Hierarchical Models by Gelman and Hill if you want to read good accounts of doing this in practice with real data sets.

There’s definitely room for a book like this that focuses on more domain specific models in NLP, computer vision and deep neural networks.


> There’s a lot of finger pointing at metrics and complacency about problems, but the solutions are super vague

The solution is obvious, and not vague at all: stop over-relying on metrics, and stop pretending that what matters can in most cases be measured.[1] However, I think you dislike this answer (for reasons given in my other reply in this thread) so you are looking for ways to replace bad metrics with better metrics. Which is worthwhile, but not the immediate answer.

> just listening to first-person accounts (which is probably a bad idea if you actually want to help).

I didn't see in the article where anyone suggested only listening to first-person accounts. However my strong belief is that if you don't listen to and seek out first-person accounts, you have almost no chance whatsoever of doing any good in basically any kind of complex social problem, and a high likelihood of doing harm while patting yourself on the back because some metric you settled on is going up.

[1]: Edit to add: This implies not using AI/ML in some areas where it currently is being used. The infamous example of AI grading essays, linked from the article, is one of the most egregious examples of the misuse of ML for something that, to me, is unimaginably, breathtakingly, forehead-slappingly idiotic to give to anyone but people, and not just any people, but the people closest to the students whose work is being graded.


> “The solution is obvious, and not vague at all: stop over-relying on metrics, and stop pretending that what matters can in most cases be measured.”

That is vague. You, like the article, are not being specific or explaining how this can be systematically applied.


"systematically applied"...

I'm not giving you a system, I'm specifically telling you not to rely on systems and rules in cases where they do not work. I'm arguing that people need to take responsibility for the correspondence of their own actions and values. This is not a system, it's an attitude.

If you're using ML to grade essays, then you need to stop doing that, because the things modern ML can measure are not the things that make an essay good or bad.

If you're using metrics to drive a business process, and the output of that process that corresponds to your values cannot be captured by a metric, then you need to stop using metrics and instead design your process according to principles that reflect your values.


I think the fundamental message here is that there isn't a solution that can be systematically applied.


I’m not so sure it’s a bad idea to listen to first person accounts from underindexed groups. For one, it may help surface new metrics to readjust with; for another, it may still help to get a sampling of “is this helping or hurting in your opinion” and watch how it trends over time.


I understand what you mean, but the problem is that first person accounts are so idiosyncratic and situation-specific. It takes a lot of discipline to distill only the most well-supported high level themes from a set of anecdotal qualitative data.

The discussion also is at high risk of becoming politicized, where more credence will be given to salient first-person accounts that happen to fit well into some prevalent social narrative of the times.

If you can prove that a lot of care was taken to avoid these types of bias in extracting summaries from a collection of qualitative data, then it can be valuable. But often this means you have to be so cautious, erring on the side of rejecting all but the most consistent summaries in the data, that you run afoul of critics who want the data to be used to support their preferred political narrative.

The risk of subverting qualitative data like this into a disingenuous claim that something is data-driven is really high, in addition to all the usual cognitive biases that lead people to read what they want into narrative accounts.


> The risk of subverting qualitative data like this into a disingenuous claim that something is data-driven is really high, in addition to all the usual cognitive biases that lead people to read what they want into narrative accounts.

We have engaged on related ideas before[1], and I think your position represents something that I find quite common and deeply unsettling about modern tech culture, but which I don't yet feel I can fully characterize.

There is something about the need for everything to be "data-driven" and metrics-focused that I think represents an abrogation of moral responsibility to form, apply, and maintain our own value systems. It is easier, apparently, to assign this messy work to some external process or algorithm or metric, which can then be argued to be perfectly rational, correct, data-driven, unbiased[2].

To me this is a dangerous fantasy, closely related to scientism or the general failure to accept any reality outside of what can be measured, counted, or quantified. This is a characteristic failure (in my opinion) of people in their twenties and of STEM majors in our times, so it's not surprising that it would be a characteristic failure of tech, where both those categories are over-represented.

It also seems correlated with a general ignorance of and arrogance towards the humanities among tech workers.

[1]: https://news.ycombinator.com/item?id=21097061

[2]: A related symptom of this disease is in hiring, where companies come up with systems that look very logical and bias free on paper, for example having "objective" scoring criteria and technical challenges (which end up being almost random number generators, but with some bias that tends to reinforce existing imbalances in the industry, e.g. towards recent graduates), and equally "objective" ways of combining results from hiring committee members or interviewers that tend to create and enforce conformity and exclude divergent viewpoints, while giving the appearance of being bias-free, data-driven, and rational. In fact it's a great way that entire companies fool themselves and come up with terrible processes, and then they whine about how hard it is to maintain diversity (of gender, age, whatever) in their ranks. The people that design these systems aren't doing this as some kind of great conspiracy; as far as I can see, they're not evil, just blind.


"an abrogation of moral responsibility to form, apply, and maintain our own value systems."

I think it's a big (and almost certainly invalid) assumption that in the absence of metrics, people "apply and maintain" a consistent value system. It seems to me inconsistent with a belief in the destructive effect of the wrong metrics.

If you think that metrics lead to bad results, it's completely inconsistent to suggest that not having metrics is better. When you have metrics, the harm comes from people ignoring what really should be optimized in favor of what is being measured. If you don't have any metrics, then you're only decreasing the pressure to do whatever may be the right thing.

It's like, no matter how bad the government is, the only solution is better government, not eliminating it.

So, I think your general attitude is ironically capturing the essence of a common intellectual failing of the people you are stereotyping. You know, the glib libertarian tech worker.


> in the absence of metrics, people "apply and maintain" a consistent value system

What on earth makes you think I'm making this assumption? I never said anything of the sort.

I said that people are abrogating their responsibility to do difficult work (figuring out if the ship is going in the right direction or not) by trying to offload this onto metrics, which are theoretically objective and unbiased.

Nowhere did I ever say anything about what people not over-using metrics are doing.

> If you don't have any metrics, then you're only decreasing the pressure to do whatever may be the right thing.

That is not necessarily true at all. Having no metrics at all, just like having no government, means that you must fill the void with something. You must decide for yourself how to act, probably based on a subjective judgement of what the situation is combined with your personal values.


"Having no metrics at all, just like having no government, means that you must fill the void with something."

How can one "fill the void" without measuring anything? Personal values may vary, but I can't see the concept of determining the consequences of your actions as being an optional one of them. It seems to me that if you are definitely rejecting that, then you have no personal values. Even if you say you do.


Please be careful with your terminology here.

When we talk about metrics, especially in a business context, we are talking about numerical quantities that can be tracked and unambiguously measured. A metric that can't be measured, or that nobody agrees on the measurement of, is no good as a metric, in this context. Are we agreed on that part of the definition of a metric?

> determining the consequences of your actions

That is an entirely different thing.

> measuring anything

And that is yet another entirely separate concept. You can weigh a course of action without measuring anything.

> It seems to me that if you are definitely rejecting that, then you have no personal values.

In most situations in life, you must make a decision without any chance to know the results of your actions. Can you know in advance the result of picking a career, taking a job, getting married, moving to another country, or voting for a particular candidate? Of course not. The decisions in life that are repeated, and where a numerical, measurable value is available to guide your decision-making over time (which is what tracking metrics means in a business context), are a vanishingly small minority of the decisions we must make.

Unless you calculate a utility function before crossing the street, or having breakfast, I don't think you really mean to argue for what you appear to be arguing for.


I don't think that we're on the same wavelength and no, we are not at all agreed on what a metric is, but here's a link to a book that might be interesting:

https://www.amazon.com/How-Measure-Anything-Intangibles-Busi...


I don't know your working definition of a metric, but if you think it is coextensive with "determining the consequences of your actions" then I'm sure it's not the right definition in this context. If you think it's necessary to measure before choosing a right course of action in every circumstance, I think you'll find that belief dissolves under a little scrutiny.

From the reviews, the book seems to have some light statistics and introduction to Monte Carlo methods. I don't think it would have much to teach me about statistics or philosophy. It might be useful for someone without a background in statistics who is forced to navigate a business environment where people value metrics above all else, but I'm arguing against such environments.


"I don't know your working definition of a metric, but if you think it is coextensive with "determining the consequences of your actions" then I'm sure it's not the right definition in this context"

You wrote "In most situations in life, you must make a decision without any chance to know the results of your actions". So it seems to me you were clearly expressing an idea of what a metric is, in rejecting the need for metrics, that you now attribute to me and say it's the wrong definition.

Unless I'm missing a subtlety like perhaps consequences are not results?

I agree that you probably wouldn't learn anything from that book.


> So it seems to me you were clearly expressing an idea of what a metric is, in rejecting the need for metrics, that you now attribute to me and say it's the wrong definition.

No. You said a bunch of things, that to me are almost entirely unrelated, and I responded to several of them.

One of them was about determining the consequences of actions and I responded to that because it was part of your argument about having personal values.

Before any of that, you wrote: "How can one "fill the void" without measuring anything? Personal values may vary, but I can't see the concept of determining the consequences of your actions as being an optional one of them."

The implication is that if you reject metrics you still have to measure and determine the consequences of your actions, so metrics are unavoidable. If metrics are not the same thing as "determining the consequences of your actions" to you, then you answered your own question of how to "fill the void": you determine the consequences of your actions without metrics.

The question of whether metrics are a good way to determine the consequences of actions is separate from the question of consequentialism, or whether actions should be chosen by looking at consequences alone. You seem to be assuming both and I'm challenging both.


"No."

You weren't clearly expressing an idea of what a metric is, gotcha. Very helpful.


The main thing is to know, deeply, that the metrics are not the goal, but are merely signals that tend to correlate with the goal. Having a clear link between values, goals, and metrics can be really helpful for maintaining the alignment, even when situations change.


Don't mistake metrics for strategy!


The obvious answer is to learn the metric itself, too. OpenAI does interesting work in this area: "Deep reinforcement learning from human preferences", for example. https://arxiv.org/abs/1706.03741
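The core idea in that paper is to fit a reward model from pairwise human comparisons. A minimal sketch of that idea (a linear Bradley-Terry model on toy data, not the paper's implementation):

    import numpy as np

    rng = np.random.default_rng(2)
    true_w = rng.normal(size=5)    # the "human values" to recover
    X = rng.normal(size=(500, 5))  # items being compared

    # Simulated human labels: prefer i over j when the true reward is higher
    pairs = rng.integers(0, 500, size=(1000, 2))
    prefs = (X[pairs[:, 0]] @ true_w > X[pairs[:, 1]] @ true_w).astype(float)

    # Fit r(x) = w.x by logistic regression on reward differences
    w = np.zeros(5)
    for _ in range(200):
        d = X[pairs[:, 0]] - X[pairs[:, 1]]
        p = 1.0 / (1.0 + np.exp(-(d @ w)))  # P(i preferred | w)
        w -= 0.1 * ((p - prefs)[:, None] * d).mean(axis=0)

    cos = w @ true_w / (np.linalg.norm(w) * np.linalg.norm(true_w))
    print(f"cosine(w, true_w) = {cos:.3f}")  # close to 1

A thousand comparisons recover the "value" direction almost exactly, at least in this linear toy setting.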


I'm not sure this is obvious, if what it means in practice is that a metric which may be wrong, but which is visible, gets replaced by a metric that is invisible. At least in the first case we can reason about the metric and why optimizing for it might go wrong.


That's basically what GANs do, too. The recent results are amazing. https://learning-to-paint.github.io


Stepping back a bit, I worked on productizing ML and found a basic principle that causes consternation.

The metric is not the ROC curve, the metric is the overall value the algorithm generates in aggregate, based on the unique business situation as a function of risk.

When you look at a trading bot, it is not evaluated on its predictions of market moves, but on how much money it makes - a higher order effect of using the method over time vs. others.

You have to look at the symmetry (or distribution) of risk in the problem it's being applied to. (famously described in a ROC vs. Indifference curves blog post)

In the case of cancer detection, ML is immensely valuable for alerting people who might not otherwise have been tested, but terrible as a diagnostic tool when it is substituted for human judgment. Not because it is inaccurate, but because the consequences of it being wrong are fatal to the patient, whereas the benefits of it being right are life-saving. You need to have a problem with asymmetric upside to benefit from ML. That is, the downside to using it can't be serious, because the machine cannot hold risk or accountability, and no sane organization would expose itself to catastrophic liability for a probabilistic algorithm being wrong.
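To put toy numbers on the asymmetry (all payoffs here are invented; only the sign flips matter): the same error rates produce positive expected value in a screening role and negative expected value when substituted for diagnosis.

    def expected_value(tpr, fpr, prevalence, v_tp, c_fp, c_fn):
        p, n = prevalence, 1 - prevalence
        return (p * tpr * v_tp           # value of caught cases
                - n * fpr * c_fp         # cost of false alarms
                - p * (1 - tpr) * c_fn)  # cost of missed cases

    # Screening: a false alarm just triggers a real test (cheap); a miss
    # costs nothing extra relative to the status quo of never testing.
    print(expected_value(0.9, 0.1, 0.01, v_tp=100, c_fp=1, c_fn=0))      # ~ +0.8

    # Substituting for diagnosis: a miss is catastrophic.
    print(expected_value(0.9, 0.1, 0.01, v_tp=100, c_fp=1, c_fn=10000))  # ~ -9.2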

This view pisses off managers, executives and investors because most of the ML products out there hide their downside. Luckily, most customers are smart enough to recognize this - which is why AI companies are having such trouble finding PMF.

Self driving cars are another example of a problem with a catastrophic failure mode. You could argue that if we made self driving cars as safe as flying, that would be acceptable. Except the net benefit/upside of a self driving car is not sufficient to justify the downside of its failure mode.

Self-flying vehicles will be more acceptable much sooner than cars because most people don't fly. The upside benefit anchors people to a higher degree of risk appetite than for something they already do. If I were in the self driving car business, I'd switch to the self flying car business for that reason.

From a management perspective, AI/ML does not produce something you can manage. It's not like there are rational levers and incentives you can adjust to achieve outcomes. It's a curve; you tweak weights and try to keep it producing aggregate net benefit. If you are not directly engaged in that process, it's a lottery with outcomes distributed along its ROC curve. In many ways, it is a substitute for management.

AI/ML is useful for marginal fraud detection, policing of various kinds, and other relationships where you are in a position of power over a distribution of outcomes. It is not something you can be honestly served by.

This is an unpopular view because it bursts the AI/ML bubble, where it has been an excellent source of greater fools for investment, but if you have money in the space, the question to ask is, does the problem this company is solving provide a net benefit that compensates the subjects to this system for its failure mode?

Not how low you can get the MTBF or FP/FN rate; that's the hustle. The real question is whether a given individual user or subject of this system can afford the cost of it being wrong. If you don't have that answer, metrics are useless.


> You could argue that if we made self driving cars as safe as flying, that would be acceptable. Except the net benefit/upside of a self driving car is not sufficient to justify the downside of its failure mode.

Aren't you ignoring the human failure modes here? If self-driving cars were safer than human-driven ones, we would definitely be in the positive, and even more so for "safer than flying". There would simply be no downside (in terms of failure mode/rate) to begin with.

I get your overall argument, but for this point you are ignoring precisely what you are advocating for - you are looking at the solution in isolation, not comparing to the alternatives.


The crux of my argument is that even if self driving cars were safer than human-driven ones, their adoption would still be thinner than expected, because why would I accept potential fatality and give up my agency to a machine for something I can already do fine myself (even if that's just perceived)?

You'd have to convince people they were terrible drivers and that driving was super dangerous to get most people to put their kid in a self driving taxi in traffic. I'd agree there is a cultural change effect in play, as cars were farcically unsafe (and yet still 1000x safer than horses) for the first 60 years of their use, but the business model where people switch to self-driving is optimistic.

That isolated individual decision making will be the thing that scales. It's a question of perception of risk/reward, which is very different from statistical risk modelling.

The AI/ML will be great, the products? Whole different class of problem.


> You'd have to convince people they were terrible drivers and that driving was super dangerous to get most people to put their kid in a self driving taxi in traffic.

I don't think so. All things equal, who wouldn't want to spend their time doing something more fun than navigating traffic? Whether it can be affordable to everyone is a different question, but you wouldn't have to insult anyone's driving skills to offer incentives for a switch.

Also, "self-flying cars" have about 100x more problems that affect adoption (aside from having virtually nothing in common with self-driving ones). Having to constantly generate 10 m/s² of upwards acceleration is really expensive. You can't go to work in a flying car. People are huge wimps about crashing. Basic stuff like this. I mean, I can't/won't stop you leaving the self-driving car business for the self-flying car business, but it seems incredibly ill-advised to me.


> You need to have a problem with asymmetric upside to benefit from ML.

This is very insightful. I have a friend working on making audio searchable by transcription using ML. I immediately thought that was a good idea unlike many other ML ideas, and "asymmetric upside" is a good summary why.


Metrics will always be gamed as long as a) there is an incentive to game them and b) there are no proportionate consequences for gaming said metrics.

To use YouTube video recommendations as an example, (a) is reflected by the current title/thumbnail brinksmanship seen recently as well as the shift to "alt-right" topics, and (b) is reflected through YouTube contorting itself trying to justify not punishing such methods. In the case of recommendation rabbit holes (e.g. alt-right rabbit holes), it's why journalism highlighting the trends is especially important as a correcting force.


The fundamental problem with academic metrics is that if you're solving a _real_ problem (a rarity in academia, BTW), they are merely a proxy for how well you've solved the problem, and often not a good one.

Case in point: say you're building a berry-picking robot using computer vision. As part of that you'll probably use an object detection system which lets your robot see the berries and know where they are. Commonly you will use a combination of losses to optimize the system, and a combination of mean average precision (mAP) metrics to evaluate how good it is. But here's the issue: even the evaluation mAP (let alone the loss) does not tell you how good the robot will be at picking berries. Moreover, there's no point of reference to tell you whether, e.g., a mAP of 80% is "good enough". And the "goodness of the robot", while it can be defined as a real-world metric (e.g. the robot successfully picked 90% of ripe berries and only destroyed 0.5% of the plants), usually can't be formulated as a direct optimization objective. So you end up futzing with the metrics that are easier to work with, hoping and praying that your result is good in the end.
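A toy version of the proxy gap (all numbers invented): two detectors can have identical AP on their ranked detections while one of them quietly destroys plants, because the metric has no notion of what a false positive costs.

    import numpy as np

    def average_precision(ranked_labels):
        # AP for a confidence-ranked detection list (1 = true positive),
        # assuming all ground-truth berries appear somewhere in the list
        labels = np.asarray(ranked_labels, dtype=float)
        precision = np.cumsum(labels) / np.arange(1, len(labels) + 1)
        return float((precision * labels).sum() / labels.sum())

    ranked = [1, 1, 1, 0, 1, 0, 1, 0]  # same TP/FP pattern for both detectors
    print("AP =", round(average_precision(ranked), 3))

    # But detector A's false positives are leaves (harmless), while detector
    # B's are stems (picking one destroys the plant). AP can't see this:
    cost = {"leaf": 0.0, "stem": 1.0}
    print("plants destroyed:",
          sum(cost[t] for t in ["leaf", "leaf", "leaf"]), "vs",
          sum(cost[t] for t in ["stem", "stem", "stem"]))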

If that doesn't sound scary enough, think of autonomous driving systems. :-)


Most of the time, the surrogate metrics align with the unknown real metrics when they are low, e.g. 80%. They can start to diverge at 90%, for example. When I see people report a 0.5% improvement over a 98% accuracy and claim statistical significance, I highly doubt the real significance. The long tail is always longer than we thought. On the other hand, there are no coincidences in the world: if some datapoints happen to be outliers that the metrics miscalculated, it might be that the problem isn't set up right. For example, some information is missing from the data. In-person interviews help not just because people have better judgment; it's because we can find out more details about that sample. Current AI/ML methods lack an active information-seeking process. They don't have the freedom to explore the world beyond the defined state/reward space.
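You can sanity-check such a claim with a two-proportion z-score (treating the two evaluations as independent, which is itself generous; the test set size here is an assumption):

    import math

    n = 10_000             # test set size (assumed)
    p1, p2 = 0.980, 0.985  # baseline vs. "improved" accuracy
    se = math.sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)
    print(f"z = {(p2 - p1) / se:.2f}")  # ~2.7 here; with n = 2000, only ~1.2

And even a formally significant z-score says nothing about whether the extra half percent lands on the long tail that actually matters.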



