WEBVTT
00:07.220 --> 00:14.809
Olgica Milenkovic: Thank you. Sorry, I keep forgetting to press the record button; that's it. So this is the second part of today's program and the final talk.
00:15.380 --> 00:24.380
Olgica Milenkovic: And it will be delivered by Vitaly Feldman, who is right now at Apple. Just a quick introduction to Vitaly and his work:
00:25.130 --> 00:32.570
Olgica Milenkovic: you've probably all heard of him; he's a pretty well-known figure in the foundations of machine learning, and his recent focus is
00:32.960 --> 00:55.550
Olgica Milenkovic: on privacy-preserving optimization, learning, and general data analysis. He got his PhD from Harvard in 2006, where he worked with Leslie Valiant. Notable things to point out are that he received several COLT best paper and best student paper awards, and today he's going to talk on
00:56.780 --> 01:11.600
Olgica Milenkovic: chasing the long tail. I saw equally interesting titles, a short tale about the long tail or something like that. So it sounds like a very interesting talk on what neural networks memorize and why, and we're all looking forward to your presentation, Vitaly, so please go ahead.
01:12.980 --> 01:17.900
Vitaly Feldman: Thank you, and thank you to all the organizers for inviting me.
01:18.470 --> 01:22.850
Vitaly Feldman: I guess I wish I could be in Chicago seeing y'all, but it is what it is.
01:23.870 --> 01:36.050
Vitaly Feldman: So I am going to talk about the role of memorization in machine learning. The talk is based on two papers, one of which is joint work with my colleague Chiyuan Zhang, who is at Google Research.
01:37.370 --> 01:38.060
Vitaly Feldman: So.
01:41.090 --> 01:51.530
Vitaly Feldman: So the starting point of this work is the well known observation that deep learning algorithms tend to fit the training data very well, even when the test error is quite high.
01:52.460 --> 02:00.500
Vitaly Feldman: Of course, so at this point, we got used to this behavior of deep learning, but I was actually very surprised when I first heard about it.
02:01.370 --> 02:18.590
Vitaly Feldman: And the reason I think this is surprising is that, if you think about it, the only way for a learning algorithm to fit the training data so well, even when the true accuracy is relatively low, is to effectively memorize a large fraction of the training labels.
02:19.670 --> 02:33.920
Vitaly Feldman: This seems to contradict the standard intuition about learning, since we think of learning as carefully balancing some notion of capacity and fit to data, so as to avoid overfitting to outliers and mislabeled examples.
02:35.330 --> 02:39.170
Vitaly Feldman: Of course, now we have much better empirical evidence of such behavior.
02:39.740 --> 02:51.230
Vitaly Feldman: And in their famous experiments, Chiyuan Zhang and his colleagues showed that standard deep learning algorithms fit well even completely randomly labeled datasets.
02:52.010 --> 03:01.130
Vitaly Feldman: For example, standard training algorithms fit over 90% of the completely randomly labeled ImageNet dataset, which has over a million examples.
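A toy illustration of this fit-anything behavior, in pure Python, with a one-nearest-neighbor rule standing in for a deep network (the data and labels here are synthetic, not ImageNet; this only sketches the qualitative effect):

```python
import random

random.seed(0)

def nn_predict(train, x):
    # 1-nearest-neighbor: return the label of the closest training point.
    return min(train, key=lambda p: abs(p[0] - x))[1]

# Synthetic 1-D points with completely random labels from 10 classes.
train = [(random.random(), random.randrange(10)) for _ in range(500)]
test = [(random.random(), random.randrange(10)) for _ in range(500)]

# An interpolating method fits 100% of the (random) training labels...
train_acc = sum(nn_predict(train, x) == y for x, y in train) / len(train)
# ...while test accuracy stays near the 10% chance level.
test_acc = sum(nn_predict(train, x) == y for x, y in test) / len(test)
```

The point is only that perfect training fit is compatible with chance-level test accuracy, which is exactly the gap being discussed.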
03:02.570 --> 03:12.920
Vitaly Feldman: Zhang and colleagues also point out that this behavior is not well explained by existing theoretical analyses of generalization, and the goal of this work is to explain this phenomenon.
03:14.120 --> 03:18.890
Vitaly Feldman: But before we can hope to explain something we need to formally define what we're trying to explain.
03:20.000 --> 03:25.130
Vitaly Feldman: Specifically, we would like to formalize the memorization of the labels of training examples.
03:26.840 --> 03:27.290
Vitaly Feldman: willing.
03:28.340 --> 03:34.010
Olgica Milenkovic: Just, sorry, one quick question from the audience: you mean accuracy, not error, right?
03:34.940 --> 03:42.680
Vitaly Feldman: Probably, if I said error. Okay, on the previous slide, yes, you probably got it right; I don't remember what I said.
03:44.630 --> 03:45.770
Vitaly Feldman: Whatever makes sense.
03:47.480 --> 03:47.990
Vitaly Feldman: So.
03:49.250 --> 04:00.020
Vitaly Feldman: Going back to the definition: we'll formalize it using the following definition, which intuitively says that the label of a training example is memorized by the learning algorithm
04:00.380 --> 04:07.820
Vitaly Feldman: if the learning algorithm cannot predict the label of the example based on the rest of the dataset, but it still fits the label.
04:08.600 --> 04:10.250
Vitaly Feldman: Formally... oh, yes?
04:11.480 --> 04:16.010
Olgica Milenkovic: There's nothing on the screen. Oh, I'm sorry for interrupting; I thought something was off with the slide, sorry.
04:16.340 --> 04:16.910
Vitaly Feldman: Go ahead, oh no.
04:17.420 --> 04:19.040
Vitaly Feldman: yeah okay sorry I was just preparing.
04:20.540 --> 04:21.140
Vitaly Feldman: Preparing for the.
04:22.790 --> 04:23.840
Vitaly Feldman: For the more formal definition.
04:26.000 --> 04:35.060
Vitaly Feldman: For a dataset S that consists of n labeled examples, and an index i out of n,
04:36.830 --> 04:48.320
Vitaly Feldman: in other words one example, and a learning algorithm A, we measure the memorization of the i-th example relative to this specific dataset as the difference between two terms. The first term
04:49.670 --> 05:09.230
Vitaly Feldman: measures how well the algorithm fits the i-th example, where the probability is necessary since most learning algorithms are randomized, and the second term measures how well the learning algorithm predicts the label of the i-th example when the example is removed from the dataset.
05:10.370 --> 05:14.540
Vitaly Feldman: In other words, how hard it is to predict the label without observing it.
05:15.950 --> 05:25.700
Vitaly Feldman: So, informally, we'll say that an example is memorized if its memorization value, the difference between these two terms, is larger than some fixed positive constant, such as one half.
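This definition can be estimated directly by training with and without each example. A minimal sketch, with a deterministic one-nearest-neighbor learner standing in for a randomized deep net, so the two probabilities in the definition reduce to 0/1 indicators (the dataset below is a made-up toy set):

```python
def nn_learn(dataset):
    # "Training" a 1-nearest-neighbor model is just storing the dataset.
    def model(x):
        return min(dataset, key=lambda p: abs(p[0] - x))[1]
    return model

def memorization(dataset, i):
    # mem(A, S, i) = Pr[A(S) fits (x_i, y_i)] - Pr[A(S \ i) fits (x_i, y_i)];
    # with a deterministic learner each probability is just 0 or 1.
    x, y = dataset[i]
    fit_with = nn_learn(dataset)(x) == y
    fit_without = nn_learn(dataset[:i] + dataset[i + 1:])(x) == y
    return int(fit_with) - int(fit_without)

# Two tight clusters with consistent labels, plus one isolated point whose
# label cannot be predicted from the rest of the data.
data = [(0.0, 0), (0.1, 0), (1.0, 1), (1.1, 1), (5.0, 0)]
scores = [memorization(data, i) for i in range(len(data))]
# Only the isolated point gets memorization value 1.
```

For a randomized learner one would average these indicators over many training runs; the structure of the computation is the same.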
05:26.990 --> 05:31.940
Vitaly Feldman: This definition also has a nice property that it can be directly related to the generalization gap.
05:33.020 --> 05:43.940
Vitaly Feldman: Specifically, for any distribution P over labeled examples and a dataset S drawn i.i.d. from this distribution, the expectation of the average memorization value
05:44.540 --> 05:57.080
Vitaly Feldman: over all the examples is basically equal to the expected generalization gap, which is just the difference between the true error and the empirical error.
05:58.100 --> 06:09.020
Vitaly Feldman: So any algorithm for which the generalization gap is large, some large constant, is effectively memorizing a significant fraction of the labels, according to this definition of memorization.
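In symbols, this is my reconstruction from the spoken description (S without the i-th example is written S^{\setminus i}; the approximation is because S^{\setminus i} has n-1 rather than n points):

```latex
\mathrm{mem}(A, S, i) \;=\; \Pr_{h \sim A(S)}\!\big[h(x_i) = y_i\big] \;-\; \Pr_{h \sim A(S^{\setminus i})}\!\big[h(x_i) = y_i\big],
\qquad
\mathbb{E}_{S \sim P^n}\!\left[\frac{1}{n}\sum_{i=1}^{n} \mathrm{mem}(A, S, i)\right]
\;\approx\;
\mathbb{E}_{S \sim P^n}\!\big[\mathrm{err}_P(A(S)) - \mathrm{err}_S(A(S))\big].
```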
06:10.970 --> 06:13.670
Vitaly Feldman: And the question we are addressing in this work is
06:14.840 --> 06:22.310
Vitaly Feldman: why we see so much of this memorization, and whether memorization and the resulting large generalization gap are fundamental aspects of learning.
06:22.820 --> 06:36.650
Vitaly Feldman: That is, are they necessary for achieving optimal accuracy? The alternative is that this is just some sort of side effect of the learning algorithms that we're using, and it's not really necessary for learning.
06:39.440 --> 06:46.520
Vitaly Feldman: So let me now briefly explain why some of the standard approaches to understanding generalization do not explain this phenomenon.
06:47.570 --> 06:58.400
Vitaly Feldman: The standard approach, as you probably all know, is based on expressing the generalization error of an algorithm, which gets as input a dataset and outputs,
06:59.480 --> 07:08.300
Vitaly Feldman: say, a model h-hat. We express this generalization error as the sum of two terms: the empirical error and the generalization gap.
07:09.530 --> 07:25.010
Vitaly Feldman: And the worst-case value of the generalization gap is controlled via various kinds of proxies, such as notions of capacity (VC dimension, margins, norms), also stability and information-theoretic notions, and so on and so forth.
07:26.510 --> 07:30.410
Vitaly Feldman: I'm sure most of you know quite a bit about those.
07:32.180 --> 07:45.680
Vitaly Feldman: This type of analysis, where we bound the generalization error in this way, suggests that learning algorithms should be optimizing this sum, so the ideal outcome would be minimizing
07:47.120 --> 07:52.100
Vitaly Feldman: the sum of this proxy and the empirical error of the algorithm.
07:53.480 --> 08:00.890
Vitaly Feldman: Now intuitively it's not hard to see that, for every label that is memorized the algorithm needs to increase the complexity of the model.
08:01.850 --> 08:13.730
Vitaly Feldman: In the worst case, the memorized label can be incorrect, or the example could be some kind of useless outlier, so in the worst case we know the memorization will not be helpful.
08:14.900 --> 08:27.500
Vitaly Feldman: Moreover, for the proxies that we know, memorization of a single label can require increasing the value of the proxy by much more than what we gain in the empirical error.
08:28.070 --> 08:36.950
Vitaly Feldman: So the resulting theoretical analysis effectively contradicts the practice: it suggests that when an algorithm is memorizing, it's doing something suboptimal.
08:39.080 --> 08:43.430
Vitaly Feldman: And, effectively, that these algorithms should not be working like this.
08:44.450 --> 08:49.190
Vitaly Feldman: Of course, at this point, this classical approach is not the only tool that we have for analysis of generalization.
08:51.230 --> 08:56.930
Vitaly Feldman: And, in fact, since the work of Zhang and others, there have been quite a few attempts to address
08:58.160 --> 09:10.190
Vitaly Feldman: the various limitations of the standard approach. One line of work that I think is worth mentioning is understanding generalization properties of interpolating methods. An interpolating method is
09:11.150 --> 09:16.310
Vitaly Feldman: an algorithm that fits all the training labels perfectly, and you've heard a
09:16.970 --> 09:28.970
Vitaly Feldman: talk earlier in this workshop by Peter on such methods. Such algorithms have been studied since the classical work of Cover and Hart on one-nearest-neighbor, which is kind of the prototypical interpolating algorithm.
09:29.540 --> 09:35.870
Vitaly Feldman: But there has been a lot more work on this topic, specifically motivated by this phenomenon.
09:36.530 --> 09:50.180
Vitaly Feldman: At a high level, these works show that interpolation can be harmless; specifically, in certain problem settings, interpolating algorithms achieve asymptotically the same generalization guarantees as the best non-interpolating
09:50.750 --> 10:00.950
Vitaly Feldman: algorithms that we know. So these works show that it's possible to generalize well while interpolating, but they still do not explain why interpolation is common in the first place.
10:02.660 --> 10:17.930
Vitaly Feldman: In this context, a common informal justification for interpolation is that it makes optimization easier; however, as far as I know, there is no theoretical or practical evidence that finding an interpolating solution is actually easier than finding a non-interpolating one.
10:19.340 --> 10:19.820
Vitaly Feldman: Let me.
10:20.990 --> 10:23.030
Vitaly Feldman: also point out that label memorization.
10:25.670 --> 10:36.230
Vitaly Feldman: also happens without interpolation; for example, in the ImageNet experiments that I mentioned before, the training error was about 9%, so it wasn't an interpolating solution.
10:38.330 --> 10:43.610
Vitaly Feldman: Another related line of work that is motivated by
10:44.960 --> 10:53.720
Vitaly Feldman: this phenomenon is work that shows that overparameterization is useful, and there are
10:54.860 --> 11:00.470
Vitaly Feldman: various known benefits to overparameterization; for example, we know that
11:02.000 --> 11:09.800
Vitaly Feldman: it can make optimization easier and also lets the learning algorithm explore a richer set of models. However, just because the
11:10.610 --> 11:21.710
Vitaly Feldman: algorithm can fit all the data does not mean that it should, so the benefit of overparameterization again by itself does not explain the presence of memorization and interpolation.
11:23.930 --> 11:28.940
Vitaly Feldman: So I hope this gives you a bit of background on the question we wanted to answer.
11:29.630 --> 11:38.030
Vitaly Feldman: At a high level, the answer that we give in this work is that label memorization is actually a fundamental aspect of learning from natural data distributions.
11:38.720 --> 11:55.820
Vitaly Feldman: Specifically, distributions that are long-tailed. In other words, we show that memorization is necessary for achieving close-to-optimal generalization when data follows these natural data distributions, and in this work we'll give both theoretical models and empirical evidence for this.
11:57.140 --> 12:05.540
Vitaly Feldman: But let me start with a bit of intuition on why this is the case.
12:07.130 --> 12:07.700
Vitaly Feldman: So.
12:09.710 --> 12:20.930
Vitaly Feldman: So let us start with the following four actual examples of the truck class from the CIFAR-10 training dataset; they're all labeled as truck.
12:21.950 --> 12:34.550
Vitaly Feldman: What's special about these four examples is that none of them would be classified as truck if you were to remove them from the dataset and run standard learning algorithms.
12:35.450 --> 12:41.870
Vitaly Feldman: So, as we discussed, by our definition a learning algorithm would need to memorize their labels to fit
12:42.740 --> 12:53.330
Vitaly Feldman: these examples. The question is whether memorizing these examples is useful, and just looking at them, I imagine the answer is not obvious to you.
12:54.320 --> 13:10.280
Vitaly Feldman: But using the tools that I'll describe later, we were able to measure the usefulness of training examples, and it turns out that memorizing the first training example, the top one, is useful:
13:11.720 --> 13:22.010
Vitaly Feldman: if we were to remove this example from the dataset, that would significantly decrease the accuracy of the resulting model when classifying the example
13:23.060 --> 13:25.430
Vitaly Feldman: from the test set that you see on the right.
13:27.440 --> 13:34.490
Vitaly Feldman: At the same time, and it's not a huge surprise, the second training example from the top
13:35.720 --> 13:53.150
Vitaly Feldman: does not improve accuracy on trucks, so memorizing it is effectively useless. Similarly, the third example is actually useful, since it improves accuracy on the test example on the right that you see here, and the fourth example again appears to be useless.
13:54.770 --> 14:10.550
Vitaly Feldman: So we see that, as a whole, memorization was useful: we memorized some useful examples. But what is even more interesting is that, without access to fresh samples from the distribution, it might be impossible to tell which of those examples will turn out to be useful.
14:11.690 --> 14:26.000
Vitaly Feldman: Basically, they are all very different from all the other examples of trucks, so if everything you know about trucks is what you see in this dataset, you won't be able to tell which of those is actually useful.
14:27.230 --> 14:37.820
Vitaly Feldman: I guess the point here is that, in order to get the benefit of memorizing a useful example, the learning algorithm also needs to memorize the useless ones.
14:39.110 --> 14:50.090
Vitaly Feldman: And this is also exactly why the resulting model ends up having such a large generalization gap: it has memorized a lot of useless examples,
14:51.110 --> 14:58.280
Vitaly Feldman: which make the gap large, just to make sure that it gets the benefit of memorizing the useful ones.
14:59.570 --> 15:07.130
Vitaly Feldman: You can also think of this as an intuition for why our standard approach to generalization does not explain memorization:
15:07.790 --> 15:18.680
Vitaly Feldman: it is based on bounds that estimate the worst-case cost-benefit analysis of memorization, and in the worst case, we know, memorization is useless.
15:20.420 --> 15:31.730
Vitaly Feldman: So let me now briefly explain where I think these hard and atypical examples that we looked at come from.
15:33.410 --> 15:39.980
Vitaly Feldman: This explanation is based on the observation that natural data distributions are not
15:40.790 --> 15:50.720
Vitaly Feldman: the single homogeneous Gaussian distributions that we like to analyze; they are better thought of as consisting of numerous subpopulations or clusters
15:51.170 --> 15:59.960
Vitaly Feldman: that are quite different from each other. For this, let's look, for example, at the examples of the bird class in the CIFAR-10 dataset.
16:00.680 --> 16:11.180
Vitaly Feldman: This class contains images from hundreds of bird species, taken from a variety of angles, at different magnifications, with different backgrounds, and so on and so forth. These
16:12.800 --> 16:27.500
Vitaly Feldman: subpopulations are quite different from each other, so a learning algorithm that only has 5000 examples of birds may be unable to accurately classify a subpopulation without observing any representative from that subpopulation.
16:28.880 --> 16:42.080
Vitaly Feldman: And because there are so many subpopulations, some of them will be quite rare and have frequency on the order of one over N, where N is the size of the dataset.
16:43.730 --> 16:51.110
Vitaly Feldman: These subpopulations are actually useful for improving accuracy, but the dataset will usually include just a single example
16:51.980 --> 17:04.670
Vitaly Feldman: from those subpopulations. At the same time, outliers, which are examples from extremely rare subpopulations, such as things that are not birds at all, also have just one example per subpopulation.
17:05.750 --> 17:18.200
Vitaly Feldman: So for a subpopulation with just a single representative, if you observe just that single representative, it is impossible to tell whether the subpopulation is a useful but rare one, or it is an outlier and so it's not useful.
17:19.550 --> 17:36.020
Vitaly Feldman: In addition, when you only have a single representative from some subpopulation, it may be impossible for the algorithm to distinguish a correct label of that representative from a noisy one, just because it is a unique representative of the entire subpopulation.
17:37.610 --> 17:47.330
Vitaly Feldman: So, as we can see, these kinds of rare examples are the reason why we might need memorization.
17:47.930 --> 17:59.420
Vitaly Feldman: And the only remaining question is whether these kinds of low-frequency subpopulations are important for achieving high accuracy, or perhaps can just be ignored in the learning process; they are, after all, quite rare.
18:00.650 --> 18:20.030
Vitaly Feldman: The answer clearly depends on the distribution of frequencies of the subpopulations, and, informally, it is well known to many practitioners that natural data is long-tailed. What they mean by this is that rare and atypical cases make up a significant fraction of the distribution.
18:21.290 --> 18:32.540
Vitaly Feldman: There are also several works that give empirical evidence of such long-tail behavior; for example, Zhu and others use additional annotations that are given in the SUN
18:33.290 --> 18:48.470
Vitaly Feldman: scene recognition dataset to plot the number of examples for every visibility pattern of the classified objects. So for every object, we have an additional annotation of the direction or angle from which this object is visible, and they plot
18:49.490 --> 18:52.940
Vitaly Feldman: the number of examples per visibility pattern.
18:54.470 --> 18:58.310
Vitaly Feldman: They plot it on a log-log scale,
18:59.540 --> 19:07.700
Vitaly Feldman: in the top right corner of each of the plots, and, as you can see, this plot is close to linear with slope minus one.
19:08.780 --> 19:09.380
Vitaly Feldman: and
19:10.580 --> 19:26.000
Vitaly Feldman: The minus-one slope is exactly the Zipf distribution, which is the prototypical long-tail distribution, famously followed by word frequencies in several languages. In this distribution, the frequency of the k-th most frequent item is proportional to one over k.
19:27.260 --> 19:32.510
Vitaly Feldman: Now, if the total number of subpopulations T is larger than the number of samples N,
19:32.900 --> 19:43.940
Vitaly Feldman: then the fraction of the data distribution made up by subpopulations with frequencies on the order of one over N, these rare-but-useful ones, is actually roughly one over log T.
19:45.230 --> 19:51.890
Vitaly Feldman: So if the number T is not that large, this will be a significant fraction of the entire distribution.
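The claim that the roughly-1/N-frequency part of a Zipf distribution carries about one over log T of the total weight is easy to check numerically. In the sketch below, the factor-of-two window around 1/N and the values of T and N are my own choices, not from the talk:

```python
import math

# Zipf prior over T subpopulations: frequency of the k-th most frequent
# subpopulation is proportional to 1/k (normalized by the harmonic number H_T).
T = 10**6   # total number of subpopulations
N = 10**4   # number of samples

H_T = sum(1.0 / k for k in range(1, T + 1))
freqs = [1.0 / (k * H_T) for k in range(1, T + 1)]

# Total weight of subpopulations whose frequency is "on the order of" 1/N,
# interpreted here as lying within a factor of two of 1/N.
mass = sum(f for f in freqs if 1 / (2 * N) <= f <= 2 / N)

# For this particular window the constant in front of 1/log(T) works out
# to about ln(4).
predicted = math.log(4) / math.log(T)
```

So with a million subpopulations, roughly 10% of the distribution's weight sits on subpopulations that a sample of ten thousand points will typically see at most once.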
19:53.120 --> 19:53.690
Vitaly Feldman: So.
19:55.460 --> 20:02.870
Vitaly Feldman: In particular, this weight is significant enough to be important for achieving high accuracy.
20:05.390 --> 20:13.880
Vitaly Feldman: So now, I hope this intuition is roughly clear; let me give you a rough idea of how it can be formalized.
20:14.750 --> 20:22.010
Vitaly Feldman: And for this purpose i'll describe a fairly simple model that isolates the phenomenon that we are interested in from other modeling issues.
20:22.730 --> 20:33.590
Vitaly Feldman: I'll then maybe briefly explain how the model generalizes to more realistic settings, and you can find more details in the paper. So, in our model,
20:34.940 --> 20:44.720
Vitaly Feldman: we'll have a discrete domain X of size T, where each element in this domain will represent an entire subpopulation in the more general model.
20:47.150 --> 20:56.180
Vitaly Feldman: So we'll describe a learning problem using a meta-distribution over distributions on labeled examples.
20:56.660 --> 21:06.500
Vitaly Feldman: The goal of the learning algorithm will be to minimize the expected generalization error of the model h-hat that it outputs,
21:07.460 --> 21:10.880
Vitaly Feldman: so the goal will be to minimize this error, which is the expected
21:11.840 --> 21:27.950
Vitaly Feldman: generalization error, where the expectation is taken, as usual, with respect to the choice of the dataset from the data distribution and the model that is output, but also with respect to the choice of the data distribution from the meta-distribution over data distributions.
21:29.870 --> 21:32.060
Vitaly Feldman: So, so far, nothing particularly special.
21:33.560 --> 21:39.950
Vitaly Feldman: We're just looking at the expected generalization error instead of the worst case over a set of data distributions, which is more common in standard models.
21:40.460 --> 21:50.030
Vitaly Feldman: The main new element in our model is that, instead of assuming that nothing is known about the data distribution, we'll assume that the frequencies of elements in X have a certain shape.
21:50.840 --> 22:01.910
Vitaly Feldman: In other words, we'll have some prior distribution pi over frequencies, which can be specified as a list of frequencies over all the elements or, equivalently, just a distribution over frequencies.
22:03.860 --> 22:07.340
Vitaly Feldman: In addition, we'll assume that, for every point
22:08.630 --> 22:13.160
Vitaly Feldman: that we observe, the only information the algorithm has about the true frequency of the point
22:13.460 --> 22:22.190
Vitaly Feldman: comes from the number of times it was observed in the dataset; beyond this empirical information, we don't know a priori which element has which of those
22:22.850 --> 22:31.850
Vitaly Feldman: frequencies. To express this, we'll define the following way to generate the distribution over the domain X: for every point in the domain,
22:32.300 --> 22:45.080
Vitaly Feldman: little x, we'll pick its frequency randomly and independently from this prior over frequencies, and then, because we want to have a distribution in the end, we'll normalize everything to sum to one.
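The two-step generative process just described (draw a frequency per point i.i.d. from the prior, then normalize) can be sketched as follows; the particular two-value prior used below is my own illustrative choice, not part of the model:

```python
import random

random.seed(1)

def sample_marginal(domain_size, draw_frequency):
    # For every point in the domain, draw a raw frequency i.i.d. from the
    # prior, then normalize so the weights form a probability distribution.
    raw = [draw_frequency() for _ in range(domain_size)]
    total = sum(raw)
    return [w / total for w in raw]

# Hypothetical prior: frequency 1/n with probability 1/2, otherwise 1/n^2
# (echoing the rare-vs-outlier example discussed later in the talk).
n = 100
weights = sample_marginal(1000, lambda: 1 / n if random.random() < 0.5 else 1 / n**2)
```

Each call to `sample_marginal` produces one marginal distribution D; the meta-distribution is exactly the distribution over such D induced by the random frequency draws.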
22:46.940 --> 22:53.870
Vitaly Feldman: In addition, in this simplest setting, we'll assume that we pick some distribution F over labeling functions.
22:55.370 --> 23:01.130
Vitaly Feldman: In this case, taking a distribution versus the worst case is less important, because the most important
23:01.730 --> 23:09.860
Vitaly Feldman: part is happening with the marginal distribution over the points, but it will be more convenient for us to assume that there is some kind of arbitrary distribution over
23:11.360 --> 23:14.000
Vitaly Feldman: labeling functions.
23:16.100 --> 23:25.940
Vitaly Feldman: We'll define this meta-distribution, for a prior distribution pi over frequencies and a distribution capital F over
23:26.720 --> 23:47.270
Vitaly Feldman: labeling functions, in the following way: we'll pick a marginal distribution D over the points according to the process I have just described, and also, independently, pick a labeling function from the distribution capital F, and then
23:48.650 --> 23:59.090
Vitaly Feldman: define the data distribution to be the one obtained by drawing a point from the marginal distribution D and labeling it by the function f that we have drawn. So this
24:00.230 --> 24:02.750
Vitaly Feldman: defines this kind of a learning problem.
24:03.920 --> 24:12.440
Vitaly Feldman: For this learning problem, we'll prove the following general result, basically for every learning problem expressed in this way.
24:13.490 --> 24:14.600
Vitaly Feldman: We'll show that
24:15.980 --> 24:22.700
Vitaly Feldman: any algorithm that does not fit the examples that appear once in the dataset, points which are unique,
24:24.590 --> 24:29.150
Vitaly Feldman: will be suboptimal. Specifically, a special case of the theorem we prove is the following:
24:30.290 --> 24:31.490
Vitaly Feldman: For every algorithm.
24:32.600 --> 24:47.000
Vitaly Feldman: A, its expected generalization error, the term on the left, is at least the optimal error, where the optimum is just the minimum over all algorithms of this expected generalization error,
24:48.260 --> 24:49.070
Vitaly Feldman: Plus.
24:50.570 --> 24:54.800
Vitaly Feldman: a term that measures the expected number of
24:56.780 --> 24:59.180
Vitaly Feldman: errors that are made on
25:00.500 --> 25:08.060
Vitaly Feldman: examples that appear once in the dataset, which is exactly this errs term. So these are examples
25:09.590 --> 25:14.900
Vitaly Feldman: that appear once in the dataset and that the model output by the algorithm does not fit.
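In symbols, the special case being described can be written roughly as follows (my reconstruction of the slide; here errs_{S,1}(A) counts the examples appearing exactly once in S that the output model does not fit):

```latex
\mathbb{E}\big[\mathrm{err}(A)\big] \;\ge\; \min_{A'} \mathbb{E}\big[\mathrm{err}(A')\big] \;+\; \tau \cdot \mathbb{E}\big[\mathrm{errs}_{S,1}(A)\big],
```

where the factor tau depends only on the number of samples n and the frequency prior pi.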
25:15.650 --> 25:26.810
Vitaly Feldman: So, basically, we show that every time the algorithm doesn't fit such an example, it's suboptimal by a certain factor tau, and this
25:28.250 --> 25:45.770
Vitaly Feldman: factor tau is defined here; it's not so important what exactly this expression is in general. The point is that it depends only on the number of examples and the frequency prior, and it is not hard to show that
25:46.940 --> 25:56.270
Vitaly Feldman: This factor tau is essentially determined by the total weight of the frequencies on the order of one over n in the frequency prior.
25:57.470 --> 26:06.950
Vitaly Feldman: What this means is that for long-tailed distributions this price is high, so all examples that appear once need to be fit to get close to the optimal generalization error.
26:07.820 --> 26:18.170
Vitaly Feldman: And I guess the surprising part here is that this means all such examples, including those whose actual frequency could be essentially zero and which are actually outliers.
26:19.370 --> 26:28.310
Vitaly Feldman: Let me make two quick additional points. First of all, in this result, the models that are output by the algorithm
26:30.680 --> 26:41.570
Vitaly Feldman: and the distribution F over the labeling functions play no role; we are only talking about the prior over frequencies, so the models and the classification functions can be arbitrarily complex.
26:42.410 --> 26:53.480
Vitaly Feldman: This is kind of different from the classical theory, in which, in order to relate the empirical error to the population error, we usually require some kind of bound on the complexity of the models.
26:54.740 --> 27:03.710
Vitaly Feldman: Here no such bound is needed. The second note is that it's actually relatively easy to extend this result to label noise.
27:04.070 --> 27:16.880
Vitaly Feldman: It's again not the most important phenomenon here, but a variant of this result holds as long as the observed labels are the ones which are most likely to be true for these
27:17.960 --> 27:18.830
Vitaly Feldman: unique examples.
27:21.770 --> 27:24.440
Vitaly Feldman: The proof of this is actually not very hard.
27:25.550 --> 27:34.280
Vitaly Feldman: It exactly captures the fact that, if we observe a point once and our prior on frequencies has weight on frequencies on the order of one over N, then
27:34.790 --> 27:49.580
Vitaly Feldman: it is somewhat likely that the point has frequency in such a range, and therefore it needs to be fit to not be suboptimal. Instead of giving you more detail, let me give you a simple special case of this kind of frequency prior
27:51.080 --> 27:55.730
Vitaly Feldman: and what our theorem gives in this special case. In our frequency prior,
27:56.750 --> 28:02.660
Vitaly Feldman: there are just two types of points. The first type is rare but useful
28:03.770 --> 28:12.410
Vitaly Feldman: points, and their frequency is equal to one over N; the second type are outliers, which have frequency one over N squared.
28:12.950 --> 28:23.810
Vitaly Feldman: We'll have N over two elements of the first type and N squared over two elements of the second type, so the total weight of elements of each type is the same, one half.
28:24.800 --> 28:30.080
Vitaly Feldman: This is clearly an artificial frequency distribution, but it makes understanding the result much easier because of the clear
28:30.920 --> 28:43.400
Vitaly Feldman: distinction between the two types of examples. First of all, it's not hard to see that a typical dataset sampled from a typical data distribution chosen according to this frequency
28:43.910 --> 28:54.500
Vitaly Feldman: prior will have about 50% outlier points, and almost all of them will be unique; this is just because half of the distribution's weight is on these outlier points.
28:54.980 --> 29:04.640
Vitaly Feldman: In addition, a bit of math shows that a typical dataset will also have about 30% unique points that are actually useful rare examples.
29:06.140 --> 29:15.980
Vitaly Feldman: The only information that the learning algorithm has about the true frequency of a point is its empirical frequency; as you can see just from the symmetry, there's no way for an algorithm to distinguish between the
29:16.610 --> 29:29.420
Vitaly Feldman: two types of unique points. In particular, every time any learning algorithm does not fit some unique point, there's about a three-eighths chance that it is actually a useful point and not an outlier.
29:30.680 --> 29:36.920
Vitaly Feldman: So the algorithm will be suboptimal by the frequency of that useful point, which is one over n.
29:38.150 --> 29:55.490
Vitaly Feldman: And it's easy to see with just this maybe heuristic argument; and indeed the computation of the constant tau in the theorem that I just gave will show that tau is approximately three eighths over n, which is roughly 0.375 over n.
29:57.200 --> 30:12.020
Vitaly Feldman: And it's also maybe worth noting that there are actually a lot of unique points in this example, about 80%, and a learning algorithm which does not fit any of them, because in principle it cannot memorize them, will be suboptimal by at least 30%.
30:13.700 --> 30:18.500
Vitaly Feldman: Anyway, I hope this quick example gives a bit of intuition about what's happening.
30:20.120 --> 30:22.310
Vitaly Feldman: And let me know if you have any questions at this point.
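The two-type example above can be sketched as a quick Monte Carlo simulation. This is a rough illustration only; the i.i.d. sampling model, the value of n, and the loose checks below are assumptions made for this sketch, not the talk's exact calculation, so the observed fractions only qualitatively match the constants quoted in the talk.

```python
# Rough simulation of the two-type frequency prior: n/2 "useful" elements
# with frequency 1/n and n^2/2 "outlier" elements with frequency 1/n^2,
# each type carrying total weight one half.
import random
from collections import Counter

random.seed(0)
n = 2000  # data set size (illustrative choice)

def sample_point():
    # With probability 1/2 draw one of the n/2 useful elements,
    # otherwise one of the n^2/2 outlier elements.
    if random.random() < 0.5:
        return ("useful", random.randrange(n // 2))
    return ("outlier", random.randrange(n * n // 2))

dataset = [sample_point() for _ in range(n)]
counts = Counter(dataset)
singletons = [p for p in dataset if counts[p] == 1]

frac_singletons = len(singletons) / n
frac_useful_among_singletons = sum(
    1 for (kind, _) in singletons if kind == "useful") / len(singletons)

print(f"fraction of singleton points:     {frac_singletons:.2f}")
print(f"useful fraction among singletons: {frac_useful_among_singletons:.2f}")
```

In this toy run most of the data set consists of unique points, and a substantial fraction of those singletons are the useful rare elements, which is the ambiguity the argument relies on.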
30:27.170 --> 30:31.970
Vitaly Feldman: Anyway, no questions so far. Well, there'll be time for more questions later, so let's...
30:33.920 --> 30:50.060
Rebecca Willett: I have a quick question, yes. So if we go back to, for instance, the CIFAR example that you were talking about, where you were getting potentially really low training error and high test error
30:51.200 --> 31:02.150
Rebecca Willett: because of these heavy tails. Is that roughly in line with this intuition that tau is on the order of 0.375 over n? Like, are these scaling similarly, or...
31:02.810 --> 31:13.040
Vitaly Feldman: Well, this example certainly won't give anything similar. This example is completely artificial, in the sense that of course we don't expect
31:13.400 --> 31:26.510
Vitaly Feldman: the distribution to contain only rare useful points and outliers. To get a better sense of things, one should probably look at something like a Zipf distribution.
31:27.110 --> 31:37.400
Vitaly Feldman: And even then, of course, in any actual setting things are messier in various ways. Here I sort of assumed that
31:38.090 --> 31:44.660
Vitaly Feldman: when you don't see a point, you really don't know anything about it. Maybe more typically,
31:45.530 --> 31:52.820
Vitaly Feldman: even for very hard points you would probably have a better signal, something still somewhat similar to some other example, and things are
31:53.270 --> 32:02.300
Vitaly Feldman: numerically a little more complicated. We'll actually see some empirical experiments later, and we'll try to see
32:03.710 --> 32:05.630
Vitaly Feldman: how they fit into this picture.
32:06.500 --> 32:08.210
Rebecca Willett: Thanks okay.
32:08.510 --> 32:17.210
Vitaly Feldman: So far I have actually only talked about fitting; let me now relate this to the definition of memorization that I gave.
32:18.920 --> 32:32.120
Vitaly Feldman: And for this, let's very briefly recall the definition of memorization. As you may remember, it is the difference in the classification accuracy on an example between the case when the example is included in the data set and
32:33.860 --> 32:37.580
Vitaly Feldman: the case when the example is excluded from the data set.
32:39.350 --> 32:46.400
Vitaly Feldman: Now, it's actually not hard to see that the expected accuracy of an algorithm on an example that was excluded
32:46.820 --> 32:53.840
Vitaly Feldman: is equal to the expected accuracy of the algorithm on the distribution, so it's just the leave-one-out error.
32:54.530 --> 33:05.810
Vitaly Feldman: And this means that if the learning problem is sufficiently hard, then many examples cannot be classified accurately when they are removed from the data set.
33:07.070 --> 33:11.660
Vitaly Feldman: And by our definition, to fit such examples we need to memorize them.
33:12.950 --> 33:17.240
Vitaly Feldman: You can't just predict them; to fit them, you need to memorize them.
33:18.770 --> 33:28.340
Vitaly Feldman: Now, just by using the theorem from the previous slide, or two slides back, we get that memorization is necessary for getting close to optimal accuracy, and this is basically
33:28.700 --> 33:39.620
Vitaly Feldman: the connection: if the problem is sufficiently hard and fitting is necessary, then you actually need to memorize; you need to be kind of eager to memorize
33:41.120 --> 33:42.290
Vitaly Feldman: to get optimal accuracy.
33:44.450 --> 33:49.400
Vitaly Feldman: So now let me also very briefly mention how we
33:50.690 --> 34:00.530
Vitaly Feldman: extend this model to continuous domains, not these discrete domains that consist of individual points with lots of high frequencies.
34:02.540 --> 34:13.010
Vitaly Feldman: And of course, in most learning problems that interest us, like images, the domain is high dimensional and the probability of each point is negligible, so we cannot directly apply the model.
34:14.030 --> 34:19.430
Vitaly Feldman: In fact, fitting or memorizing individual points does not by itself make
34:20.540 --> 34:23.660
Vitaly Feldman: any difference; you have to assume some kind of
34:24.950 --> 34:32.960
Vitaly Feldman: coupling, some assumption about how fitting an individual point affects the model, maybe in the nearby part of the domain.
34:34.580 --> 34:36.170
Vitaly Feldman: So to extend
34:38.300 --> 34:50.510
Vitaly Feldman: our model and results to this setting, we will view the marginal distribution over points as a mixture of distributions, each representing a subpopulation. So more formally, we will
34:52.040 --> 34:53.330
Vitaly Feldman: have these kind of
34:54.890 --> 35:05.300
Vitaly Feldman: subpopulation distributions M one through M T, all of which we will assume, for simplicity, have disjoint supports X one through X T, which are parts of the domain.
35:06.470 --> 35:13.610
Vitaly Feldman: And we will define the process generating the marginal distribution in a very similar way: for every subpopulation we will pick a frequency
35:15.020 --> 35:28.580
Vitaly Feldman: from the frequency prior as before, randomly and independently, and then normalize to one; and then the marginal distribution will be defined as the mixture of the subpopulations with these normalized frequencies.
35:30.380 --> 35:40.460
Vitaly Feldman: And we also assume that the labeling function F is constant over each subpopulation; we think of a subpopulation as something that should share one class.
35:43.970 --> 35:56.840
Vitaly Feldman: And as before, the problem instance is defined by picking a marginal distribution using the process I've just described, and then picking a labeling function independently and randomly from some distribution capital F
35:57.890 --> 36:06.830
Vitaly Feldman: over labeling functions, and then of course sampling the points from the distribution D and labeling them using the function F.
36:08.180 --> 36:14.150
Vitaly Feldman: And for the model defined in this way, our main results can be relatively
36:15.260 --> 36:27.950
Vitaly Feldman: easily extended to this setting, with some additional natural assumptions on the learning algorithm that basically couple the predictions on individual points with the predictions on their subpopulation.
36:29.390 --> 36:35.870
Vitaly Feldman: And so the assumption roughly says that the prediction on a single point is correlated with the prediction on the entire subpopulation.
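The generative process just described can be sketched in a few lines: frequencies drawn independently from a prior, normalized to one, a mixture over subpopulations, and a labeling function constant on each subpopulation. The specific prior, the number of subpopulations, and the trivial single-draw supports below are placeholder assumptions for illustration, not the paper's notation.

```python
# Minimal sketch of the mixture-of-subpopulations model.
import random

random.seed(1)
T = 100  # number of subpopulations (illustrative)

def frequency_prior():
    # Placeholder heavy-tailed prior over a few frequency scales
    # (an assumption for this sketch, not the paper's prior).
    return random.choice([1.0, 0.1, 0.01])

raw = [frequency_prior() for _ in range(T)]
total = sum(raw)
alphas = [a / total for a in raw]                  # normalized mixture weights
labels = [random.randrange(10) for _ in range(T)]  # one class per subpopulation

def sample_labeled_point():
    # Pick a subpopulation by its mixture weight, then a point from its
    # (here trivial) support; the label is constant on the subpopulation.
    k = random.choices(range(T), weights=alphas)[0]
    x = ("subpop", k, random.random())  # stand-in for a draw from M_k
    return x, labels[k]

dataset = [sample_labeled_point() for _ in range(500)]
```

Tagging each point with its subpopulation index keeps the supports disjoint, matching the disjoint-support assumption in the talk.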
36:37.550 --> 36:43.520
Vitaly Feldman: So that's all I have time to say about the theory; let me now describe the empirical results.
36:44.420 --> 36:58.070
Vitaly Feldman: And the goal of this set of experiments is to understand which examples are memorized and how the predictions of the model are affected by the examples that are known to be memorized.
36:59.120 --> 37:08.090
Vitaly Feldman: Unfortunately, it's not entirely straightforward. If you recall our definition of label memorization, to measure it requires that we train a model
37:08.960 --> 37:15.440
Vitaly Feldman: while leaving out a single example; to figure out whether an example is memorized, we have to leave it out.
37:15.830 --> 37:30.770
Vitaly Feldman: Doing this for each of the training examples to measure the memorization values would require training on the order of n models, and with all the randomness it would make the whole thing computationally infeasible even on a simple data set, even with Google's resources.
37:32.450 --> 37:37.730
Vitaly Feldman: So to deal with this issue, we propose a proxy for the memorization value. Specifically, instead of measuring
37:40.430 --> 37:49.670
Vitaly Feldman: memorization relative to the entire data set, we look at the expected memorization value relative to a randomly chosen subset of the data set
37:50.270 --> 37:59.090
Vitaly Feldman: of some fixed size m. And here's the definition: this is basically the same memorization value, but instead of measuring it over the data set we measure it
37:59.510 --> 38:16.760
Vitaly Feldman: in expectation over random subsets of the data set of fixed size, and we'll use random subsets whose size is pretty large, something like 70 or 80% of the original data set size. And the great thing about this definition is that it is
38:18.560 --> 38:31.460
Vitaly Feldman: now computationally feasible; we have a relatively simple method to estimate it. In fact, it only requires training O(1) models, albeit with a large constant.
38:33.860 --> 38:34.430
Vitaly Feldman: So.
38:35.990 --> 38:37.550
Vitaly Feldman: So, and perhaps.
38:39.530 --> 38:42.350
Vitaly Feldman: The nice thing about it is actually that we...
38:44.540 --> 38:50.570
Devdatt Dubhashi: Can you explain why you don't have to train, like, n models? I mean, why is it only a constant number of models?
38:50.960 --> 38:53.840
Devdatt Dubhashi: Yeah, using subsets of 0.7 times n
38:53.960 --> 39:01.820
Vitaly Feldman: data points. Yes, so the reason for this is just a very simple linearity of expectation.
39:02.390 --> 39:02.930
Vitaly Feldman: I can.
39:03.650 --> 39:08.300
Vitaly Feldman: actually have a slide on it, but it's basically because you can
39:10.730 --> 39:11.600
Vitaly Feldman: You can basically.
39:12.920 --> 39:20.990
Vitaly Feldman: decompose the memorization score into two parts, and you can effectively use a single random sample, a
39:21.710 --> 39:30.830
Vitaly Feldman: single set of models trained on random subsets of the data set, and then estimate those two terms separately for every example by just looking
39:31.400 --> 39:46.670
Vitaly Feldman: at all the models in this random set whose training subset includes a certain training example, and all the models whose subset excludes that example; and this will allow you to estimate both
39:47.930 --> 39:50.690
Vitaly Feldman: of those terms separately.
39:52.370 --> 40:01.220
Vitaly Feldman: It's just because, again, we'll have enough samples: when you take random subsets, you will have enough models of each type.
40:02.570 --> 40:12.710
Vitaly Feldman: Maybe we can return to this question, because I do have one slide at the end on this, which doesn't include the proof; but the proof is also simple, it's just linearity of expectation used in the right place.
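The subsampled estimator just described can be sketched with a toy stand-in model: here a 1-nearest-neighbor classifier plays the role of the neural network, and the data, the number of subsets, and the subset size are made up for illustration only.

```python
# Toy sketch of the subsampled memorization estimator: train a fixed pool
# of "models" on random ~70% subsets and, for each training example,
# compare average accuracy on it between models that saw it and models
# that did not.
import math
import random

random.seed(2)

# Toy data: two clean clusters plus one outlier with a unique label (index 40).
train = ([((random.gauss(0, 0.1), random.gauss(0, 0.1)), 0) for _ in range(20)]
         + [((random.gauss(5, 0.1), random.gauss(5, 0.1)), 1) for _ in range(20)]
         + [((10.0, -10.0), 2)])
n = len(train)
m = int(0.7 * n)  # subset size, roughly 70% of the data

def predict_1nn(subset_idx, x):
    # 1-NN "model" trained on the chosen subset of the training data.
    j = min(subset_idx, key=lambda i: math.dist(train[i][0], x))
    return train[j][1]

# One pool of models on random subsets; the same pool serves every example.
subsets = [set(random.sample(range(n), m)) for _ in range(50)]

def mem_estimate(i):
    # Average accuracy on example i over models that saw it, minus the
    # average over models that did not see it.
    x, y = train[i]
    with_i = [predict_1nn(s, x) == y for s in subsets if i in s]
    without_i = [predict_1nn(s, x) == y for s in subsets if i not in s]
    return sum(with_i) / len(with_i) - sum(without_i) / len(without_i)

print("mem(outlier)       =", mem_estimate(40))
print("mem(cluster point) =", mem_estimate(40 - 40))
```

The outlier gets a high memorization estimate (only models that saw it predict its unique label), while a typical cluster point gets an estimate near zero, and the single pool of models is reused for every example, which is why a constant number of models suffices.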
40:14.210 --> 40:17.330
Vitaly Feldman: Anyway, it's still actually somewhat computationally intensive.
40:19.430 --> 40:24.500
Vitaly Feldman: But again, using some Google GPUs we were able
40:25.550 --> 40:35.330
Vitaly Feldman: to do it, and in the end I believe that the results are quite convincing. Here are some examples.
40:36.410 --> 40:41.480
Vitaly Feldman: Here are lots of training examples with large memorization values from the ImageNet
40:42.200 --> 40:51.590
Vitaly Feldman: bobsled class. As you can see, they contain mislabeled examples and outliers, but also correctly labeled but relatively unusual examples of a bobsled.
40:52.460 --> 41:07.370
Vitaly Feldman: And for comparison, here are some examples of bobsleds with zero memorization score, which basically do not require any memorization. As you can see, these are fairly clear, typical examples of a bobsled, so this fits perfectly with the intuition.
41:09.620 --> 41:12.680
Vitaly Feldman: Then, once we have measured these scores, we can
41:14.210 --> 41:22.490
Vitaly Feldman: show that overall the memorized examples are useful for learning, and we do it in the following way: we
41:23.780 --> 41:24.530
Vitaly Feldman: Basically.
41:26.090 --> 41:29.960
Vitaly Feldman: train the model while removing all the memorized examples
41:31.430 --> 41:41.090
Vitaly Feldman: whose score is above a certain memorization threshold. So after you remove the examples, you can train a
41:42.170 --> 41:43.370
Vitaly Feldman: model on.
41:45.470 --> 41:55.790
Vitaly Feldman: the remaining examples, and the orange line here in this plot shows the accuracy that we get after removing this fraction of
41:56.540 --> 42:10.010
Vitaly Feldman: the examples. And the plot below shows the fraction of the data that remains after the memorized examples above each threshold are removed.
42:11.270 --> 42:23.570
Vitaly Feldman: And what's interesting, and what the green line shows, is that when we remove memorized examples the accuracy drops even more than if we were to remove the same number of completely randomly chosen
42:24.890 --> 42:39.110
Vitaly Feldman: examples from the data set. So memorized examples are important for high-accuracy learning, even more important than completely randomly chosen examples, even though a lot of the memorized examples are just outliers; but they also contain
42:40.220 --> 42:41.600
Vitaly Feldman: Useful rare examples.
42:43.190 --> 42:43.640
Vitaly Feldman: So.
42:44.660 --> 42:57.740
Vitaly Feldman: So this is just the overall effect. Our long-tail theory says that examples are memorized to improve predictions on examples from the same rare subpopulations, so we also wanted to test this theory.
42:58.520 --> 43:06.410
Vitaly Feldman: But to test it, we need to understand how each memorized example affects the predictions of the model on other examples.
43:07.430 --> 43:13.790
Vitaly Feldman: And a natural way to do this is to define the following influence value.
43:16.220 --> 43:25.700
Vitaly Feldman: For a data set S, an index i, and an arbitrary labeled example (x, y), the influence of the training example with index i
43:27.350 --> 43:40.550
Vitaly Feldman: on the prediction accuracy on this labeled example (x, y) is just the difference between the accuracy of the algorithm when the example is
43:41.510 --> 43:58.700
Vitaly Feldman: included in the training set and when it is excluded from it. This is the influence of the particular example i, and, as you can see, computing these values would again be computationally infeasible because it is a leave-one-out definition.
43:59.960 --> 44:00.410
Vitaly Feldman: We.
44:01.220 --> 44:08.540
Devdatt Dubhashi: Is there an assumption that it's always monotone? I mean, that if you leave one out, the accuracy is always going to be less? Because you can imagine that this could
44:08.960 --> 44:10.190
Devdatt Dubhashi: actually increase, no?
44:11.060 --> 44:20.540
Vitaly Feldman: No, we're not assuming that it's monotone; this is just how we define it. As you can see, it can be negative, and yes, for example, a mislabeled example could
44:21.830 --> 44:36.410
Vitaly Feldman: actually have a negative influence, and sometimes examples do have negative influence; but a correctly labeled example from the same subpopulation will typically have a positive influence.
44:37.850 --> 44:53.840
Vitaly Feldman: So this is the influence; it's a natural way to define it, and it would be great to be able to compute it, but again it is not computationally feasible as defined, and we deal with it using the same trick of defining a proxy that measures the influence
44:55.070 --> 45:00.170
Vitaly Feldman: relative to random subsets of the data set, and as before this
45:02.120 --> 45:05.690
Vitaly Feldman: proxy can be computed efficiently.
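The influence proxy admits the same kind of sketch, again with a 1-nearest-neighbor stand-in for the network and made-up toy data: the influence of training example i on a test example (x, y) is estimated as the accuracy gap on (x, y) between models whose random subset included i and models whose subset excluded it.

```python
# Toy sketch of the subsampled influence estimate.
import math
import random

random.seed(3)

# Two common clusters (labels 0 and 1) plus one rare label-0 training
# example far from both clusters (index 40).
train = ([((random.gauss(0, 0.1), random.gauss(0, 0.1)), 0) for _ in range(20)]
         + [((random.gauss(5, 0.1), random.gauss(5, 0.1)), 1) for _ in range(20)]
         + [((10.0, 10.0), 0)])
n = len(train)
subsets = [set(random.sample(range(n), int(0.7 * n))) for _ in range(50)]

def predict_1nn(subset_idx, x):
    # 1-NN "model" trained on the chosen subset of the training data.
    j = min(subset_idx, key=lambda i: math.dist(train[i][0], x))
    return train[j][1]

def influence(i, x, y):
    with_i = [predict_1nn(s, x) == y for s in subsets if i in s]
    without_i = [predict_1nn(s, x) == y for s in subsets if i not in s]
    return sum(with_i) / len(with_i) - sum(without_i) / len(without_i)

# A test point from the same rare "subpopulation" as training example 40,
# and an easy test point near the label-0 cluster.
rare_test, easy_test = ((10.2, 10.1), 0), ((0.05, -0.02), 0)
print("influence on rare test point:", influence(40, *rare_test))
print("influence on easy test point:", influence(40, *easy_test))
```

The rare training example has influence close to one on the test point from its own subpopulation and essentially none on the easy test point, mirroring the high-influence pairs described next.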
45:06.980 --> 45:14.810
Vitaly Feldman: And using this approach, we estimated the influences of all the memorized training examples on all test examples on several data sets.
45:15.290 --> 45:22.130
Vitaly Feldman: And the upshot of the results is that a large part of the utility of memorized examples comes from these kind of high-influence pairs.
45:22.730 --> 45:28.130
Vitaly Feldman: And what are these high-influence pairs? These are pairs in which a training example has a large influence
45:28.940 --> 45:42.260
Vitaly Feldman: on just a single test example and essentially no effect on the accuracy on all other test examples; and this is exactly the effect of memorizing a representative of a rare but useful subpopulation in our model.
45:43.220 --> 45:54.440
Vitaly Feldman: Overall, we find that for about 3% of examples in the ImageNet test set there is a single memorized training example that has a large influence on the accuracy
45:54.920 --> 46:08.390
Vitaly Feldman: of that test example. In this plot we show how large the influences are; they are typically not huge, for most examples in the range 0.2 to 0.4, but
46:09.950 --> 46:12.500
Vitaly Feldman: it is still a significant influence on the accuracy.
46:14.690 --> 46:21.980
Vitaly Feldman: But what I find even more compelling is not just this kind of overall summary but actually looking at these high-influence pairs.
46:22.820 --> 46:38.690
Vitaly Feldman: Here are some examples from the ImageNet data set, with the training example on the left and the test example on the right. As you can see, these are all unusual examples of their class, which is described on the very left here; this is the class.
46:39.950 --> 46:46.490
Vitaly Feldman: But they are very similar to each other; these are rare subpopulations. And here are some
46:48.260 --> 46:51.680
Vitaly Feldman: examples of the same type from the CIFAR-100 data set.
46:52.730 --> 46:59.330
Vitaly Feldman: So again, these examples fit very well with the intuition and the theory that I've described.
47:01.400 --> 47:08.810
Vitaly Feldman: Let me also briefly mention that these examples are representative of high-influence pairs; they were not cherry-picked for this presentation.
47:09.380 --> 47:17.960
Vitaly Feldman: You can actually see many more such examples and download all the influence scores on a website that we have set up; I'll also show it on the next slide.
47:20.120 --> 47:31.700
Vitaly Feldman: So let me summarize. This work shows that label memorization is necessary for achieving close to optimal generalization whenever the data distribution has a long tail.
47:32.990 --> 47:39.530
Vitaly Feldman: The need for label memorization explains why we often see large generalization gaps in models that
47:40.610 --> 47:53.870
Vitaly Feldman: actually end up interpolating the data. And I think it's worth noting that this result is not specific to deep learning, and suggests that memorization is an inherent part of learning more generally, when the distributions
47:55.040 --> 47:57.320
Vitaly Feldman: are these kind of long-tailed distributions.
47:58.820 --> 48:09.050
Vitaly Feldman: If you want to learn more, you can find the theoretical part in this paper listed here, the paper from last year; the experiments
48:10.550 --> 48:20.060
Vitaly Feldman: appeared at NeurIPS last year, and, as I mentioned, we have set up a website which gives additional examples of these memorization and high-influence
48:20.600 --> 48:33.470
Vitaly Feldman: pairs. It also allows you to download all the memorization and influence estimates that we have computed, so that you can do experiments with these values without having to train all of these models yourselves.
48:34.790 --> 48:43.460
Vitaly Feldman: Let me also mention a follow-up work that I did with Gavin Brown, Mark Bun, Adam Smith, and Kunal Talwar.
48:45.200 --> 48:52.520
Vitaly Feldman: In this work, we address the empirical observation that in some cases deep learning models memorize entire data points, not just labels.
48:53.060 --> 49:04.520
Vitaly Feldman: And we explain theoretically why even this type of memorization can be necessary for certain types of data distributions; again, necessary to achieve close to optimal generalization.
49:06.110 --> 49:18.410
Vitaly Feldman: In terms of future work, I think that one of the most important problems in learning theory is finding more faithful models of data, and I believe that modeling long-tailed distributions is one very natural direction
49:19.460 --> 49:28.670
Vitaly Feldman: for this type of work. Our work provides some tools, I think, to get started in this direction, but there is, of course, a lot of work that needs to be done.
49:29.420 --> 49:37.760
Vitaly Feldman: Another important direction for future work is understanding the implications of these results for settings where memorization is limited, for example differentially private
49:38.420 --> 49:51.470
Vitaly Feldman: learning or model compression. It has actually been observed empirically that limiting memorization has different effects on subgroups with different frequencies, and our model kind of explains why this happens.
49:52.610 --> 49:55.940
Vitaly Feldman: But we still need to understand how to best address these effects.
49:57.050 --> 50:00.830
Vitaly Feldman: Finally, I think that the estimators of memorization and influence that we give
50:02.090 --> 50:06.020
Vitaly Feldman: will have other applications, and I think it's worth exploring those applications.
50:07.070 --> 50:11.390
Vitaly Feldman: That's all that I have to say; thank you for bearing with me, maybe a bit over time.
50:13.370 --> 50:26.180
Olgica Milenkovic: Thank you for a really interesting talk, Vitaly. I think we still have some time for questions before we go to the breakout room. Anyone wants to ask any additional questions?
50:28.460 --> 50:39.830
Devdatt Dubhashi: So surely a conclusion would be that, as you mentioned right at the end, if you take a compressed model, it's not going to have high
50:40.910 --> 50:43.670
Devdatt Dubhashi: performance on long-tailed distributions.
50:45.140 --> 50:48.050
Vitaly Feldman: Yes, so compression will
50:49.160 --> 50:49.970
Vitaly Feldman: Inevitably.
50:51.230 --> 51:02.420
Vitaly Feldman: limit memorization and therefore potentially reduce the accuracy. What I was trying to say in this last bullet is that
51:03.740 --> 51:07.730
Vitaly Feldman: this kind of reduction in accuracy will affect
51:09.260 --> 51:10.070
Vitaly Feldman: Different.
51:11.510 --> 51:28.130
Vitaly Feldman: subgroups differently. So you have, sort of, majority and minority subgroups, and limiting memorization affects how well you perform on the tail of the distribution; so for different subgroups
51:29.150 --> 51:43.010
Vitaly Feldman: the tail might have different weights. Because, again, if you have few examples, the relative part of the tail that has low frequency, on the order of one over n,
51:44.060 --> 51:54.350
Vitaly Feldman: will have different weight, which means the effect that limiting memorization will have will be different for different subgroups and will depend on the frequency.
51:55.430 --> 52:06.650
Vitaly Feldman: And, as someone said at one point, even if the overall effect is very small, it can affect some very small subgroup disproportionately.
52:10.400 --> 52:11.480
hope that makes sense.
52:17.030 --> 52:17.690
Olgica Milenkovic: Sorry go ahead.
52:18.740 --> 52:26.930
Devdatt Dubhashi: And just to follow up: there have been empirical studies where you compress the model; you know, you can get 90% compression, cut the model down to
52:27.470 --> 52:39.800
Devdatt Dubhashi: 10% of the original size, and still not lose overall accuracy. Then your prediction would be that this highly compressed model will lose out on particular subpopulations?
52:39.860 --> 52:54.200
Vitaly Feldman: Okay, it depends on where you start from. Certainly some models may be compressed without any loss; you may be getting essentially the same model as you had before. It again depends on how much, sort of,
52:55.400 --> 53:06.290
Vitaly Feldman: extra capacity the model had. It may be possible to compress without any loss, but beyond a certain point
53:06.800 --> 53:16.820
Vitaly Feldman: compression, we know, has to start limiting the capacity of the model, and in particular its ability to memorize
53:18.080 --> 53:26.090
Vitaly Feldman: examples. I'm not saying that about that particular example; in general, it might be possible to compress things without losing anything.
53:35.570 --> 53:36.950
Olgica Milenkovic: Any other questions?
53:39.560 --> 53:47.570
Rebecca Willett: Could you say a couple of words about connections between the analyses you presented here and analyses using stability theory?
53:49.310 --> 53:50.540
Vitaly Feldman: So.
53:52.760 --> 53:56.960
Vitaly Feldman: Maybe let's clarify: what do you mean specifically by stability theory?
53:57.980 --> 53:58.940
Vitaly Feldman: Just the Sir.
53:59.240 --> 54:05.720
Vitaly Feldman: Do you mean, like, uniform stability, the generalization bounds based on stability, or...
54:05.750 --> 54:06.560
Rebecca Willett: Something yeah.
54:07.670 --> 54:17.060
Vitaly Feldman: So I kind of broadly covered all these methods under the classical theory, because they still split
54:18.110 --> 54:24.020
Vitaly Feldman: the generalization error into these two terms, the generalization gap and the empirical error, and try to
54:25.760 --> 54:33.260
Vitaly Feldman: bound the generalization gap using stability. And basically
54:35.990 --> 54:38.540
Vitaly Feldman: these methods cannot explain, again, this kind of
54:40.460 --> 54:56.180
Vitaly Feldman: eagerness to fit and memorize, because in order to be able to memorize and fit you also need to be kind of unstable; you're actually reacting to every single example. So these theories also do not explain
54:57.980 --> 54:59.300
This kind of memorization.
55:00.980 --> 55:12.710
Rebecca Willett: I mean, certainly, I agree with that. I guess I was just thinking, more broadly speaking: there are lots of different definitions of stability, but I think of it as somehow corresponding to
55:13.160 --> 55:22.700
Rebecca Willett: sort of an average of the influence measure that you presented, in that you look at what kind of predictor you get with a particular sample and then without it.
55:24.320 --> 55:33.740
Rebecca Willett: And so on that level, I was curious whether you could think of stability as more of an average-case
55:34.520 --> 55:46.790
Rebecca Willett: measure, whereas the things that you're doing are not doing that averaging and are therefore better able to handle what's going on in the tail. I mean, is that a reasonable way of thinking about it?
55:49.640 --> 56:09.470
Vitaly Feldman: I'm not sure I understand the key point. There are some connections: the memorization value is actually kind of like stability; it's very similar to leave-one-out stability. I don't know if you were in any way referring to that fact, or to how...
56:11.390 --> 56:16.970
Rebecca Willett: So, yeah, sorry, maybe my question isn't super clear, and I'm happy to discuss it offline
56:17.840 --> 56:18.890
Rebecca Willett: rather than in front of the whole audience here.
56:18.890 --> 56:21.530
Vitaly Feldman: Yeah, yeah, it would probably be easier
56:22.910 --> 56:24.350
Vitaly Feldman: to do it offline afterwards.
56:26.750 --> 56:29.660
Olgica Milenkovic: yeah, can I ask just one really quick question.
56:31.490 --> 56:37.730
Olgica Milenkovic: When you described the error formula, you mentioned unique points, which was really the interesting part.
56:38.210 --> 56:47.750
Olgica Milenkovic: And the error that comes from the unique points in the Zipf distribution really shows that you get a large penalty if you don't memorize
56:48.230 --> 56:48.950
Olgica Milenkovic: unique points.
56:49.130 --> 56:55.790
Olgica Milenkovic: What about other types of distributions? If you don't have heavy tails, or with different types of distributions, would
56:56.090 --> 57:10.640
Olgica Milenkovic: the frequency of the points play a role? Would there be some kind of extension that says points that appear twice or three times have that kind of penalty associated with them? So is there a spectrum
57:11.750 --> 57:19.400
Olgica Milenkovic: for the different frequencies that you can present here, rather than just those that appear only once, if you had different types of distributions?
57:20.630 --> 57:21.410
Vitaly Feldman: Would you... oh.
57:21.830 --> 57:22.580
Olgica Milenkovic: yeah okay sure.
57:23.120 --> 57:33.710
Vitaly Feldman: Yeah, great question. So first of all, in terms of just defining things, yes, there is a version of the theorem which in fact handles this; the bound
57:35.210 --> 57:47.960
Vitaly Feldman: has an extra parameter: it's a sum of terms, one for every frequency, for points that occur once, twice, and so on; and for each of those, whenever you don't fit, you will pay
57:49.550 --> 57:52.520
Vitaly Feldman: some penalty, and that penalty is a function of the frequency.
57:54.770 --> 58:08.990
Vitaly Feldman: There's some kind of adjustment; basically, instead of the square you get the frequency plus one in the denominator. It's just a slight adjustment of this formula. In general, we know that
58:10.580 --> 58:18.110
Vitaly Feldman: if you don't have these kind of frequencies on the order of one over n, then
58:19.130 --> 58:21.140
Vitaly Feldman: it's okay. So if you...
58:25.850 --> 58:35.780
Vitaly Feldman: I mean, the point there is that if you don't have any frequencies that are on the order of one over n, you don't run into this situation where you
58:36.590 --> 58:49.040
Vitaly Feldman: cannot tell what the true frequency of the point is. So if something occurs, let's say, even 10 times, you'll definitely know that it's not an outlier.
58:49.880 --> 58:55.580
Vitaly Feldman: It's just that the probability of something being an outlier decays quickly, and then at that point, sort of,
58:57.740 --> 59:10.130
Vitaly Feldman: yeah, the algorithm no longer needs to memorize; you can basically not memorize and still achieve optimal accuracy, because at that point, if you don't have these kind of ambiguous points, you
59:13.160 --> 59:13.940
Vitaly Feldman: will be able to.
59:15.410 --> 59:18.680
Vitaly Feldman: not memorize outliers, because you can identify those.
59:19.010 --> 59:20.810
Olgica Milenkovic: yeah that makes sense yeah thanks.
59:22.760 --> 59:24.620
Olgica Milenkovic: So do we have any other questions.
59:27.320 --> 59:32.780
Olgica Milenkovic: If not, let us thank Vitaly again for a really interesting talk, and we can move now to the
59:34.490 --> 59:43.280
Olgica Milenkovic: chat room or coffee room, whatever we call it, and feel free to start asking questions and organize yourselves in the IMSI
59:43.730 --> 59:58.130
Olgica Milenkovic: Teamflow room, where Vitaly will take any further questions or engage in a discussion. Thank you all; it's great to see you all today, and thanks for being with us in the midweek session. Thank you, bye.
59:58.970 --> 59:59.600
Vitaly Feldman: Thank you bye bye.
59:59.780 --> 01:00:01.040
Olgica Milenkovic: I'll see you there, people.