WEBVTT
1
00:00:03.689 --> 00:00:13.110
Matthew Kahle: Welcome back, everyone. We're very happy to have our next talk, by Bei Wang from the University of Utah, and she'll tell us about sketching merge trees. Thanks, Bei.
2
00:00:16.529 --> 00:00:17.520
Matthew Kahle: Maybe I'll mute myself.
3
00:00:17.580 --> 00:00:18.060
Bei Wang: Thank you.
4
00:00:18.690 --> 00:00:27.450
Bei Wang: So this is joint work with my student Mingzhe Li and also my former PhD student Sourabh Palande, who is right now a postdoc in Michigan.
5
00:00:29.970 --> 00:00:30.450
Bei Wang: So.
6
00:00:31.500 --> 00:00:43.890
Bei Wang: Let me get some acknowledgements out of the way: I want to thank our sponsors. This idea started more than two years ago, when I attended a seminar on sketching.
7
00:00:45.000 --> 00:00:52.560
Bei Wang: There is currently an arXiv version of this paper, and we're going to update it soon, hopefully within a week.
8
00:00:53.580 --> 00:01:03.090
Bei Wang: So this work is very much at the intersection of topological data analysis and data visualization, where most of the motivation is coming from data visualization.
9
00:01:03.780 --> 00:01:13.410
Bei Wang: But at a high level, the one takeaway message is: we can combine the idea of matrix sketching with topological descriptors such as merge trees.
10
00:01:14.760 --> 00:01:22.770
Bei Wang: And I'm very excited about this line of work, because I think it's a nice combination of randomized linear algebra and topology.
11
00:01:23.730 --> 00:01:33.690
Bei Wang: And finally, I'm looking for postdocs, so here's a quick advertisement: positions start in the fall of this year, so if you're interested, please send me an email.
12
00:01:34.590 --> 00:01:39.780
Bei Wang: Okay, so let's get to it. As I mentioned, the motivation is really coming from
13
00:01:40.320 --> 00:01:50.250
Bei Wang: data visualization, specifically from ensembles from scientific simulation. Here are three instances of a simulation of what is called a corner flow.
14
00:01:50.790 --> 00:02:01.350
Bei Wang: What happens is that this area is a wall, this area is another wall, and you have some sort of flow that is injected from the lower left corner.
15
00:02:01.770 --> 00:02:09.840
Bei Wang: And what we're seeing, as time goes on, are some of the instances from this simulation, so you're looking at a flow simulation.
16
00:02:10.320 --> 00:02:21.000
Bei Wang: And of course, the specific topological structure I'm going to use is called a merge tree, but there are many other types of descriptors you can use, for example things like Morse-Smale complexes, Reeb graphs, and so on.
17
00:02:21.300 --> 00:02:27.510
Bei Wang: And in some sense, I would even consider the persistence diagram to be a type of topological descriptor.
18
00:02:28.140 --> 00:02:40.020
Bei Wang: The idea of an ensemble is that you have two variations. One is a time-varying ensemble: in this case, as time goes on, you see some patterns in the flow, and you want to capture those patterns.
19
00:02:40.740 --> 00:02:46.200
Bei Wang: there's other similar another version of simulate ensemble is where you'll have multiple simulation.
20
00:02:46.590 --> 00:02:58.290
Bei Wang: where the underlying parameters of the simulation actually change. Each time you change a parameter a little bit, you get a slightly different simulation. For example, if I want to simulate wind flow in the Salt Lake Valley,
21
00:02:58.590 --> 00:03:08.460
Bei Wang: each time I change a parameter I get a slightly different simulation. So this is also applicable in the setting where your ensembles come from multiple different parameter values.
22
00:03:09.240 --> 00:03:11.700
Bei Wang: So in this particular case, if I have these
23
00:03:12.120 --> 00:03:24.420
Bei Wang: multiple instances of a simulation, one way to study them is to say: okay, I'm going to try to capture some interesting topological changes in there. A very simple way of doing this is to look at a scalar field
24
00:03:24.720 --> 00:03:40.200
Bei Wang: associated with that a wizard wizard wizard sample Member in this particular case, the scanner field i'm using is the magnitude of underlying flow so and then the points that i'm joining are the critical points in that scanner field so red means local maximum.
25
00:03:41.340 --> 00:03:50.460
Bei Wang: white means saddle, and blue means local minimum. And the idea of the whole sketching is: if you give me this large collection,
26
00:03:51.060 --> 00:03:55.620
Bei Wang: I would like to find a more compressed representation, which I will make precise very soon.
27
00:03:56.070 --> 00:04:05.310
Bei Wang: So on one hand I want some sort of compression, a smaller set of representatives; let's say I have 100 instances, and I only want to represent them with, I don't know, 15 instances.
28
00:04:05.850 --> 00:04:21.000
Bei Wang: The second goal is that if you give me an ensemble like this, I would also like to find representatives. What are the interesting topological changes that are happening? Can I find the representative ones, and can I also find outliers?
29
00:04:22.020 --> 00:04:29.850
Bei Wang: So that's the motivation. The question we're asking is: if you give me a set of merge trees, can I find a basis? I will make this precise very soon.
30
00:04:30.240 --> 00:04:37.440
Bei Wang: Given a set of merge trees, we want to find a basis set such that every tree in the original input
31
00:04:37.770 --> 00:04:52.050
Bei Wang: can be approximately reconstructed as a linear combination of the merge tree in my basis set okay so here I just give you an example where there's three merge trees, so they arise from those three scale or field okay.
32
00:04:52.740 --> 00:04:58.950
Bei Wang: So what is a merge tree? If you have not seen this before, a merge tree is a basic type of topological descriptor
33
00:04:59.340 --> 00:05:12.030
Bei Wang: that records the connectivity among the sublevel sets of a function. It is related to Morse theory, where Morse theory characterizes a scalar field by topological changes in the sublevel sets
34
00:05:12.780 --> 00:05:22.530
Bei Wang: and isolated critical points so in this particular case on the Left corner, this is my scary field which is basic very simple skill or field of mixing and mixing of dowsing.
35
00:05:22.770 --> 00:05:35.220
Bei Wang: The middle picture just visualizes the graph of this function, where there are multiple basins surrounding those local minima right here, and then on the right-hand side is my merge tree.
36
00:05:36.540 --> 00:05:40.500
Bei Wang: So what is the mathematical definition of my merge tree?
37
00:05:40.920 --> 00:05:51.330
Bei Wang: For the time being, all the data in our tests so far are two-dimensional scalar fields, but you can imagine this definition can be extended to higher dimensions. In this case, let's say I have
38
00:05:51.660 --> 00:05:55.440
Bei Wang: a two-dimensional domain with a scalar function f defined on it,
39
00:05:56.430 --> 00:06:04.470
Bei Wang: and the merge tree studies the connectivity among the sublevel sets of this function. So, for example, in this case, this is a two-dimensional function f
40
00:06:04.860 --> 00:06:12.630
Bei Wang: defined on this two-dimensional domain, and the sublevel set at a is all the points in my domain that have function value
41
00:06:13.560 --> 00:06:25.320
Bei Wang: smaller than or equal to some threshold a. The merge tree definition basically goes back to an equivalence relation among points in the domain: two points are considered equivalent
42
00:06:25.770 --> 00:06:31.980
Bei Wang: if they have the same function value and they belong to the same connected component of the sublevel set.
43
00:06:32.790 --> 00:06:42.630
Bei Wang: And then the merge tree is basically the quotient space obtained by gluing together all the points that are equivalent. So the way to think about it is: for all the points identified
44
00:06:43.410 --> 00:06:52.290
Bei Wang: to be equivalent, I shrink them to one single representative. Let's give an example. At this particular threshold here (this is my threshold a),
45
00:06:53.220 --> 00:07:09.150
Bei Wang: at this level, if I'm looking at the graph of my function, I'm looking at its sublevel set: this is one piece of the sublevel set, and this is another piece of the sublevel set. Okay, so in this particular case, if I have two points,
46
00:07:11.940 --> 00:07:27.360
Bei Wang: x and y, I decide that x is equivalent to y if they are at the same height value and if they belong to the same connected component. So actually, maybe I should put them right here: x and y at this level.
47
00:07:29.640 --> 00:07:35.640
Bei Wang: Okay, so if you look at the corresponding merge tree, the corresponding merge tree has two components
48
00:07:36.390 --> 00:07:44.910
Bei Wang: that's right here they're responding to to to connect component in a sub level so so the way to construct emerge tree is you start with our small.
49
00:07:45.360 --> 00:07:54.420
Bei Wang: high value and you slowly increase that function value and every instance you're looking at how many connected components, do I have in the sub lavazza.
50
00:07:54.810 --> 00:08:00.810
Bei Wang: And then that's what you get on the right-hand side; it captures the relations between the critical points. In this case,
51
00:08:01.140 --> 00:08:11.970
Bei Wang: the local minima of my function are the leaves of my merge tree, and the sublevel sets merge at saddle points, so this is a saddle, this is another saddle, and so on.
52
00:08:12.270 --> 00:08:20.280
Bei Wang: So that's a particular type of topological descriptor that really captures the connectivity, or the topology, among the sublevel sets of the function.
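[Editor's note] The sublevel-set sweep just described can be sketched with union-find. This is a minimal illustration, not the speaker's code, and it assumes a function with distinct values sampled on a 1D path graph; on a 2D mesh the same sweep would run over the mesh's adjacency.

```python
def merge_tree_1d(f):
    """Sweep vertices of a path graph in increasing function value, tracking
    sublevel-set components with union-find. Returns the leaves of the merge
    tree (local minima) and its internal merge nodes (saddles)."""
    n = len(f)
    parent = list(range(n))

    def find(i):  # union-find root with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    active = [False] * n
    leaves, saddles = [], []
    for i in sorted(range(n), key=lambda k: f[k]):  # sweep from low to high
        active[i] = True
        comps = {find(j) for j in (i - 1, i + 1) if 0 <= j < n and active[j]}
        if not comps:
            leaves.append(i)      # no active neighbor: a new component is born
        elif len(comps) == 2:
            saddles.append(i)     # two components join: a merge (saddle) node
        for r in comps:
            parent[r] = i         # union everything into the sweep vertex
    return leaves, saddles
```

For example, for f = [3, 1, 4, 0, 2] the minima at indices 3 and 1 appear as leaves, and their two components merge at index 2.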
53
00:08:22.140 --> 00:08:31.770
Bei Wang: So let's go back to the question I want to ask: if you give me a set of merge trees, can I find a basis? This might seem difficult to start with, but
54
00:08:32.280 --> 00:08:40.680
Bei Wang: it is something we already know how to do with PCA, because if you give me a set T, which is now a set of high-dimensional vectors,
55
00:08:41.280 --> 00:08:52.770
Bei Wang: we know how to find a basis set such that every high-dimensional vector is approximately reconstructed from a linear combination of that basis set. Specifically, let's give the example of PCA.
56
00:08:53.970 --> 00:09:02.910
Bei Wang: The way PCA works is that I have a data matrix where each column is a high-dimensional vector, and they are of dimension d.
57
00:09:03.780 --> 00:09:20.400
Bei Wang: Then what we do is find some sort of approximation of this data matrix A by a matrix A-hat, where A-hat is factored into two matrices. One is what I call the basis B;
58
00:09:21.060 --> 00:09:29.550
Bei Wang: it's a it's a matrix to where every single column is what I will call later on a basis vector and then there is a coefficient.
59
00:09:30.240 --> 00:09:44.490
Bei Wang: So what happens in PCA is: you are given a data matrix A, and you find the approximation A-hat that minimizes the Frobenius norm of the difference, such that A-hat can be factored into B times Y.
60
00:09:45.330 --> 00:09:53.760
Bei Wang: Okay, and then every single high-dimensional vector can be represented as a linear combination of those basis vectors.
61
00:09:54.030 --> 00:10:05.520
Bei Wang: So if you say, well, I'm going to run PCA with k equal to two, that means I'm going to try to find the two dominant directions onto which everything else can be projected.
62
00:10:05.760 --> 00:10:12.660
Bei Wang: And the coefficients are exactly the coefficients with respect to each of those basis vectors. Essentially, this is why
63
00:10:13.050 --> 00:10:19.800
Bei Wang: I think matrix sketching is useful, is it really gives you a more of a compress representation of your high dimensional point guard.
64
00:10:20.100 --> 00:10:32.820
Bei Wang: Because if you say you started with capital N number of data points but towards the end you can represent almost all a well pretty well for everybody, what a lot of them were using small K for K is much smaller than.
65
00:10:33.450 --> 00:10:45.240
Bei Wang: Right well, of course, another way to think about PCA is that those those are those vector be Ashley form or soccer no basis for subspace and the data just projecting to their success.
66
00:10:46.200 --> 00:10:54.690
Bei Wang: So we know how to do this for high-dimensional vectors. The question is: what if my data are no longer high-dimensional vectors? My data are actually
67
00:10:55.080 --> 00:11:01.200
Bei Wang: a bunch of complicated topological descriptors in this case a bunch of merge trees, what can we go from there.
68
00:11:02.010 --> 00:11:16.050
Bei Wang: Well, first of all, let me talk about something called sketchability. I like to use the word sketchable to say: in the previous scenario, if my input high-dimensional vectors can be expressed as linear combinations of a smaller set,
69
00:11:16.380 --> 00:11:24.900
Bei Wang: I will say that set of vectors is sketchable. So really my question boils down to: are merge trees sketchable? And the answer is yes.
70
00:11:25.230 --> 00:11:31.320
Bei Wang: So there are multiple ways to do matrix sketching; there's actually a vast literature there.
71
00:11:31.770 --> 00:11:40.980
Bei Wang: PC you can consider a PC as well into the one we're going to talk a lot about today is called column software selection it just to say in order to formulate to be.
72
00:11:41.490 --> 00:12:01.530
Bei Wang: i'm going to choose be among a subset of existing column from a that's what I mean by columns of substance election so then really rough idea is you want to select columns from the data matrix so that those columns are as orthogonal as possible, that will be one way to do matrix getting.
73
00:12:03.420 --> 00:12:15.720
Bei Wang: So this is our pipeline. We start with a bunch of merge trees. If I have a way to convert the information contained in each of the merge trees into a high-dimensional vector,
74
00:12:16.980 --> 00:12:29.550
Bei Wang: Then I can just formulated my data set of March trees into a matrix and then I just apply my classic matrix sketching techniques, you have the entire library of various techniques, you can do.
75
00:12:30.030 --> 00:12:37.710
Bei Wang: And once you have obtained the sketched matrix, you can turn each of the approximations back into a tree. For example, now we know that A-hat
76
00:12:38.100 --> 00:12:57.120
Bei Wang: His approximation of a so I can take each of those column of a hat and converted to a tree and I close, that to be the sketched merge tree and then of course the basis vectors are going to be comforted back to a tree as well, and I call that the basis tree the basis.
77
00:12:58.830 --> 00:13:08.760
Bei Wang: So how do we do that? At a high level, there are a few technical ingredients that I'll go into in detail if I have time, but the main idea is really in steps one and two
78
00:13:09.300 --> 00:13:22.980
Bei Wang: For this pipeline is we're going to represent each of those mail merge tree as a metric measure network that's the first step, a metric measure network it's also been connected to what's called or more ostroff.
79
00:13:23.640 --> 00:13:34.320
Bei Wang: Was world was Washington framework and then, once i'm kind of mapping into a metric measure network, I also want to map them to a column vector.
80
00:13:34.650 --> 00:13:48.240
Bei Wang: into the data matrix so really at the core of our technique is a step one to how do I convert emerged tree into a high dimensional vector such that it's preserved is underlying structure and they have to be the same size.
81
00:13:49.410 --> 00:13:55.500
Bei Wang: And then, after I apply matrix sketching, the following part of the framework tries to convert
82
00:13:55.770 --> 00:14:07.380
Bei Wang: A high dimensional vector back to a tree, this is done through what's called spanning tree, so I can apply a spanning tree to this high dimensional vector and then there's two types of spanning tree.
83
00:14:07.680 --> 00:14:22.590
Bei Wang: There is what's called minimum spanning tree, which is something people familiar with and there's also other type of spanning tree called low stretch spanning tree, which is try to find a spanning tree that introduced the minimal amount of metric distortion so.
84
00:14:23.670 --> 00:14:32.010
Bei Wang: And i'm giving my talk sort of upside down because before I go into technical detail actually want to show you the result, how well does this framework work.
85
00:14:32.850 --> 00:14:43.020
Bei Wang: So, here it is example, so in this case I have 12 instances on the left, each one of them is a scale or field Okay, and is a picture on the right.
86
00:14:43.440 --> 00:14:53.250
Bei Wang: Is my corresponding merge tree they're responding to each of those instances, so now what I have is, I have this collection of mercury giving me as input.
87
00:14:53.790 --> 00:14:58.770
Bei Wang: And my question is: if I apply the matrix sketching type of techniques to this,
88
00:14:59.400 --> 00:15:07.710
Bei Wang: What do I get as the basis and in this particular case i'm going to set K equal to too many that i'm going to choose, only two basis.
89
00:15:08.040 --> 00:15:14.520
Bei Wang: OK and i'm going to use column selection so, which means that amount of basis baxter i'm going to choose a model original.
90
00:15:14.940 --> 00:15:21.420
Bei Wang: amount of original trees, so what you see here is that what is in close by this Green Box.
91
00:15:22.200 --> 00:15:33.870
Bei Wang: is in fact the chosen basis and, if you look at the structure of it it's actually capture is a good representation of the underlying data right, and this is sort of like a high level picture.
92
00:15:34.290 --> 00:15:38.850
Bei Wang: You know, because if you look at all those merge trees, they are very simple, this is a very simple sensitive example.
93
00:15:39.600 --> 00:15:44.940
Bei Wang: What I showed here is to say that well if I look at all the structures, what as a representative structures.
94
00:15:45.510 --> 00:15:57.780
Bei Wang: there's what structure that is have to needs and then there's a structure that has three months and almost everybody else looks very similar to those two structures, so this those are the basis trees that is children.
95
00:15:58.590 --> 00:16:07.650
Bei Wang: If I go back to the coefficient matrices remember, I have a coefficient matrices what coefficient matrix is telling me if I have a basis tree one and.
96
00:16:08.160 --> 00:16:18.870
Bei Wang: Zero and basis tree one is told me, I mean, of course, in this case, yellow means high value and blue means low value it told me, what is the coefficient associated with each of the other tree.
97
00:16:19.290 --> 00:16:22.710
Bei Wang: With respect to the basis tree, so you can see here.
98
00:16:23.280 --> 00:16:39.210
Bei Wang: trees three it's not a basis tree, but if i'm using if i'm looking at it trees three is right here, this is a coefficient it says that most of the ways is to what the first basis, and that means that this is a sketch the result of this of this mercury.
99
00:16:40.320 --> 00:16:44.880
Bei Wang: And then, for tree three, this is the original tree in the blue box,
100
00:16:45.330 --> 00:16:56.760
Bei Wang: And this is a sketch tree in red box because i'm using matrix sketching, so there is going to be some approximation going on, so you see that there's a little bit difference between the structures between those two trees.
101
00:16:57.480 --> 00:17:07.110
Bei Wang: And finally, if you look at tree nine, which is here (tree nine is right here, this particular column), it says that most of the coefficient weight is on the second basis.
102
00:17:07.380 --> 00:17:17.370
Bei Wang: And this is how the tree looks before and after sketching. So remember, I want to approximate every single input tree as a linear combination of the basis, and this is how it looks.
103
00:17:18.420 --> 00:17:25.050
Bei Wang: Okay, so this is great. This means that I can choose two bases to represent the entire collection of
104
00:17:25.530 --> 00:17:31.770
Bei Wang: trees and where everybody else is just linear combination of those two trees, so that is using column selection.
105
00:17:32.400 --> 00:17:37.440
Bei Wang: there's another way to do matrix sketching which is not call him selection, which is called.
106
00:17:37.860 --> 00:17:46.110
Bei Wang: An mf, but what is empathy is now matrix factorization So what is really mean is that i'm going to factor out my matrix.
107
00:17:46.440 --> 00:17:57.810
Bei Wang: And where my basis is something brand new it's not coming from the original input column there's those are coming from you know some matrix factorization and I can again turn it back to a tree and if i'm using.
108
00:17:58.230 --> 00:18:08.040
Bei Wang: other type of matrix sketching in this particular case um it's it's it's matrix factorization I will also get I also specify only want to basis.
109
00:18:08.340 --> 00:18:21.990
Bei Wang: Those other trees that the opposite return and What it does is, if you look at those structures, even though those two bases are brand new trees that matrix sketching is trying to find from matrix factorization.
110
00:18:22.530 --> 00:18:31.620
Bei Wang: is actually looks very similar to the trees that I use some other different methods, for example, in this case, this is by selecting this two trees from the original import.
111
00:18:31.950 --> 00:18:34.710
Bei Wang: Right, if you look at the structure is to leave and three leaf.
112
00:18:35.310 --> 00:18:51.390
Bei Wang: And in this case again I have to leave and streaming so is this is actually right quite interesting I sense that you have a lot of flexibility, how you choose the matrix sketching techniques and it's going to try to find again are good representation patients of the underlying data.
113
00:18:52.650 --> 00:18:58.440
Bei Wang: How does this work for real world data set if you have, I have three scientific a simulation data set.
114
00:18:59.280 --> 00:19:07.140
Bei Wang: The first one is again a flow data it's called heated cylinder so you put some heat and then there are some flow going from top to up.
115
00:19:07.590 --> 00:19:16.470
Bei Wang: This is a corner flow that goes this way, and then, this one is a piece of assimilation in the ocean, which is called Red Sea simulation so again in the flow simulation.
116
00:19:16.920 --> 00:19:28.110
Bei Wang: So for each of those data said this one has 31 instances in my ensemble this one has 100 instances in my assemble and this one has 60 instances in my sample.
117
00:19:28.470 --> 00:19:36.600
Bei Wang: So I want to like I would like to take this each one of those give me a set of merge trees, I would like to look at what the basis trees looks like.
118
00:19:38.040 --> 00:19:48.210
Bei Wang: So, again, there are three types of matrix sketching we do one is an mf which is non negative matrix factorization one is called column selection.
119
00:19:48.750 --> 00:19:53.370
Bei Wang: On there's two types of column selection The first one is called least squares in.
120
00:19:53.970 --> 00:20:02.070
Bei Wang: The square sample which basically sample the collar based on the square of their norm and you try to find you know, without replacement.
121
00:20:02.310 --> 00:20:08.880
Bei Wang: that's the first one, the second one is called iterative feature selection, where you start with a number of random selection.
122
00:20:09.210 --> 00:20:21.030
Bei Wang: And you kind of perfecting them, based on the residual so both of them are sort of random selection kind of scenario where you just sample the column essentially based on them some form of importance.
123
00:20:22.080 --> 00:20:31.590
Bei Wang: And you know, on a very high levels those two matrix kitchen techniques that if i'm using rss or if s you're basically for me basis from existing columns.
124
00:20:31.920 --> 00:20:36.330
Bei Wang: And then, from an mf you're just creating new columns based on matrix factorization.
125
00:20:37.110 --> 00:20:52.830
Bei Wang: That, and we have to use non negative matrix factorization is because in a few slides, you will know, the way we convert a birch tree to a vector is by looking at pairwise distances and that we wanted that Paris distances to be positive, or non negative.
126
00:20:54.210 --> 00:20:58.500
Bei Wang: Okay, so, in practice, what happened is that you.
127
00:20:59.160 --> 00:21:11.820
Bei Wang: You know there's always this awesome question about how you choose K, what is the right number of faces the way you do it is you kind of plot this sketching arrow right there's arrow coming from matrix sketching or you basic find the elbow method.
128
00:21:12.600 --> 00:21:19.200
Bei Wang: So look at the convergence plot and find the place where the optimal one you want the there's also.
129
00:21:19.860 --> 00:21:32.010
Bei Wang: sort of this convergence kind of slow down, so in this case, for the first date I said you get three basis, the second one get 15 the third one, no matter what you do you don't get perfect result you get.
130
00:21:32.550 --> 00:21:40.980
Bei Wang: 30 basis okay so so for the first data said, which is this heated flow, this is how the matrix it looks like.
131
00:21:42.420 --> 00:21:49.740
Bei Wang: This is one you want to focus on this is my coefficient matrix what is this thing is that i'm going to choose three basis.
132
00:21:50.130 --> 00:22:00.900
Bei Wang: And the coefficient is showing me those kind of block structure which basically says that using three basis, I can pretty much represent my underline collection very well.
133
00:22:01.770 --> 00:22:13.230
Bei Wang: But if you just use it to you see this area where you have to do this, like you know balanced combination it just means that was to you haven't got to the better scheduled result yet so needs three.
134
00:22:13.860 --> 00:22:20.880
Bei Wang: So, how does it looks like well i'm going to skip this slide, this is a basis what if choose is going to choose structures.
135
00:22:21.300 --> 00:22:25.440
Bei Wang: Why here here's the first tree as a basis is a second one, they serve one.
136
00:22:25.950 --> 00:22:31.170
Bei Wang: And once they are good representatives is because we look at the structure of those three merge trees.
137
00:22:31.440 --> 00:22:35.580
Bei Wang: there's minor structural differences and that's what this sketching is picking out.
138
00:22:35.820 --> 00:22:47.790
Bei Wang: is trying to find representatives of underlying topology in this particular case is say well there's a basis here there's a basis here, and you can look at here there's two new nodes that showed up in the sketch tree.
139
00:22:48.060 --> 00:22:58.440
Bei Wang: And you know if you look at if you zoom into the actual scare field this corresponding to this pair of new critical point that show up in that particular time instance, so the takeaway.
140
00:22:59.010 --> 00:23:06.690
Bei Wang: picture here is that if I choose three basis, it has a good representation of on the line up topological changes up the entire collection.
141
00:23:07.470 --> 00:23:18.420
Bei Wang: But then the main question you want to ask is for every single input tree, how does is approximation looks like remember every single input tree it's not a linear combination of those three basis tree.
142
00:23:18.930 --> 00:23:28.050
Bei Wang: And this is again, the picture is showing the top is the original tree and the bottom is a sketch tree you see that 47 is almost non distinguishable.
143
00:23:28.380 --> 00:23:42.150
Bei Wang: 438 there may be a little bit arrow here, for example, right around this area, you know the sketching result is not perfect right, so you saw approximation, but on a larger scale is approximate overall structure same thing was true.
144
00:23:43.050 --> 00:23:57.660
Bei Wang: You know, you see, roughly there's almost the same except maybe around here there's a little bit changes in functionality Okay, so we can do this for all the real world data saw which i'm not going to have time for so i'm going to fast forward.
145
00:23:58.830 --> 00:24:03.390
Bei Wang: And there's a bunch of real world data side that is very exciting, but let me go back to the main.
146
00:24:03.870 --> 00:24:16.320
Bei Wang: framework, how do we do this in order to do the sketching you need to convert every single birch tree to be the same size, want to be a high dimensional vector they have to have the same size but most importantly.
147
00:24:16.860 --> 00:24:28.020
Bei Wang: There has to be a meaningful structure correspondences among the dimension of this high dimensional vector okay so to do that the really there's three key components in the whole framework.
148
00:24:28.500 --> 00:24:32.070
Bei Wang: Is that we want to represent each mercury as a measure network.
149
00:24:32.460 --> 00:24:39.480
Bei Wang: That preserve these underlying structure and the second thing is that we would like to align every single emerge tree to their average.
150
00:24:39.660 --> 00:24:48.690
Bei Wang: So in this case, I can guarantee, they have the same size, because if I computer average tree among all of them, and every single input tree is mapped to this average tree.
151
00:24:49.500 --> 00:24:53.460
Bei Wang: Then you know they they can be mapped up the same size, which is the size of the average tree.
152
00:24:54.270 --> 00:25:01.230
Bei Wang: And finally, you can linear rise, the weight matrices associated with each of the smarts into a high dimensional vector and that's what we do.
153
00:25:01.860 --> 00:25:11.280
Bei Wang: And at the core of this is basically the measure network and the coupling between them, so this is a high level picture if I have to merge tree q1 q2.
154
00:25:11.730 --> 00:25:16.020
Bei Wang: They are different sizes one has eight notes and one has six nodes.
155
00:25:16.350 --> 00:25:25.980
Bei Wang: In order for them to be mapped to the same size, they all map to something called their average tree, so this is a tea hot is actually our average tree, you can see that.
156
00:25:26.250 --> 00:25:36.270
Bei Wang: Is it what's smart about Jihad is to had actually duplicate some of those nodes and you know and then there's a coupling between those those.
157
00:25:36.780 --> 00:25:47.670
Bei Wang: inputs to this average tree, so that you know when he arrives a Why make them into a vector they are all of the same size of the average trees vector representation so.
158
00:25:49.770 --> 00:26:04.800
Bei Wang: How do we preserve structure, if I take a merge tree, I would like to preserve underlying structure i'm going to represent it as a measure network, so it has a set of vectors and has a weight in is a probability over the nodes.
159
00:26:06.120 --> 00:26:18.180
Bei Wang: And then the most important part is that this weight has to represent the structure of the tree, so the weight w matrix has to encode information in the function that has come, where this mercury is coming from.
160
00:26:18.600 --> 00:26:28.740
Bei Wang: So this is actually quite easy in our framework what we did is we say for any two points in my mercury their weight between the edge between the.
161
00:26:29.250 --> 00:26:41.970
Bei Wang: Either absolute value of their functionality difference and for any two nodes that is far away from each other, their weight is the shortest path distance so that's how I encode emerge tree into a way to matrix.
162
00:26:43.260 --> 00:27:02.460
Bei Wang: So the second part, which I probably won't have much time for: this is what the weight matrix looks like, the weight-matrix encoding of the merge tree. I won't have time to go into detail on how we align them, but it is actually coming from prior work.
163
00:27:04.410 --> 00:27:12.840
Bei Wang: There is also work in 2016, and in particular Samir Chowdhury and Tom Needham have amazing work in 2020 that talks about
164
00:27:13.350 --> 00:27:19.590
Bei Wang: the Gromov-Wasserstein distance for measure networks. The very high-level idea is: you give me two of those
165
00:27:19.950 --> 00:27:26.610
Bei Wang: networks, and I can do a probabilistic matching between them by searching through the compact set
166
00:27:27.000 --> 00:27:35.100
Bei Wang: of couplings between the probability measures defined on the networks. Oh, let me mention the probability measure that I use: for every single node in the tree,
167
00:27:35.430 --> 00:27:39.810
Bei Wang: it has probability one over the size of the tree, so every node has the same probability.
168
00:27:40.200 --> 00:27:52.830
Bei Wang: But the idea is that you try to find the best coupling that minimizes a cost, where the cost has to do with how the edges match one another. This is an example of how two merge trees are matched to one another.
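The cost being minimized can be sketched concretely. This is a minimal illustration, not the paper's implementation: it evaluates the Gromov-Wasserstein distortion of one given coupling (here the trivial independent coupling, with hypothetical random weight matrices); the GW distance itself is the minimum of this quantity over all couplings with the right marginals:

```python
import numpy as np

# Two hypothetical weight matrices (measure networks) of different sizes.
rng = np.random.default_rng(1)
W1 = rng.random((4, 4)); W1 = (W1 + W1.T) / 2
W2 = rng.random((3, 3)); W2 = (W2 + W2.T) / 2

# Uniform probability measures: every node gets 1/n.
p, q = np.full(4, 1 / 4), np.full(3, 1 / 3)

def gw_distortion(W1, W2, C):
    """Distortion of coupling C: sum over pairs (i,k) of the first network
    and (j,l) of the second of (W1[i,k] - W2[j,l])^2 * C[i,j] * C[k,l]."""
    diff = W1[:, None, :, None] - W2[None, :, None, :]
    return float(np.einsum('ijkl,ij,kl->', diff ** 2, C, C))

# The trivial independent coupling p q^T has the correct marginals; a GW
# solver would search for the coupling minimizing this distortion.
C = np.outer(p, q)
assert np.allclose(C.sum(axis=1), p) and np.allclose(C.sum(axis=0), q)
print(gw_distortion(W1, W2, C) >= 0)  # True
```

In practice one would hand this objective to an optimal-transport solver rather than evaluating a single coupling.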
169
00:27:53.520 --> 00:28:00.600
Bei Wang: For example, the rows are the vertices from the first tree, and the columns are the vertices from the second tree.
170
00:28:01.680 --> 00:28:11.820
Bei Wang: And if you look at the yellow (yellow means high probability), what it says is that node zero, which is the root of the first tree, is matched with high probability to node
171
00:28:12.510 --> 00:28:23.760
Bei Wang: five from the second tree, corresponding to this particular entry. And in this case, node four, which is this column, node four from the first tree, is mapped with
172
00:28:24.660 --> 00:28:33.780
Bei Wang: two different probabilities to two nodes in the second tree: it is mapped with slightly higher probability to node one and slightly lower probability to node zero.
173
00:28:34.140 --> 00:28:48.600
Bei Wang: So this is what the GW framework does: it finds correspondences between the merge trees, and once you have these correspondences, this is exactly how we establish correspondences between different dimensions of the merge trees.
174
00:28:49.050 --> 00:29:00.540
Bei Wang: So what is really at the core of getting the sketching framework to work is to convert my merge trees to high-dimensional vectors of the same size, such that
175
00:29:00.990 --> 00:29:08.850
Bei Wang: the dimensions have correspondences, and these correspondences come exactly from finding the probabilistic matching between the corresponding nodes.
176
00:29:09.900 --> 00:29:11.700
Bei Wang: So this is the mathematical framework.
177
00:29:12.660 --> 00:29:24.420
Bei Wang: I don't have time to go into detail, but when you actually look at the coupling, you are looking at couplings between those two probability distributions and finding the one that minimizes the distortion.
178
00:29:25.050 --> 00:29:34.080
Bei Wang: A lot of this framework goes back again to the previous work by those folks on finding probabilistic matchings.
179
00:29:34.770 --> 00:29:43.110
Bei Wang: But the last piece, to make them all the same size: once you have this GW distance, which finds the best possible matching minimizing the cost,
180
00:29:43.350 --> 00:29:49.470
Bei Wang: you can define what is called the Fréchet mean. This is the average tree, the one that minimizes the distances to everybody.
181
00:29:49.740 --> 00:29:57.780
Bei Wang: And this Fréchet mean is exactly the mean tree that we use to make sure every merge tree is mapped to something of the same size. So that is
182
00:29:58.200 --> 00:30:06.390
Bei Wang: the high-level picture, and in this particular case, you can see that if I match the first tree to its average:
183
00:30:06.810 --> 00:30:20.760
Bei Wang: the first tree is of size eight and the average is of size 12; the second tree is of size six and the average tree is of size 12. So through this probabilistic matching, I am going to turn every input tree into one of size 12.
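This blow-up step can be sketched as follows. The coupling matrix and function values here are hypothetical, and picking the argmax of each column is one simple reading of "use the probabilistic matching to duplicate nodes", not necessarily the exact rule the paper uses:

```python
import numpy as np

# Hypothetical coupling between an input tree with 4 nodes and an average
# tree with 6 nodes (rows: input-tree nodes, columns: average-tree nodes).
C = np.array([
    [0.25, 0.00, 0.00, 0.00, 0.00, 0.00],
    [0.00, 0.15, 0.10, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.25, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.00, 0.10, 0.15],
])
f = np.array([5.0, 3.0, 4.0, 1.0])  # function values on the input tree

# For each average-tree node, pick the input node it is matched to with
# the highest probability; input nodes get duplicated, which is how a
# tree of size 4 becomes a vector of size 6.
match = C.argmax(axis=0)
blown_up = f[match]
print(blown_up)  # [5. 3. 3. 4. 1. 1.]
```

Every input tree processed this way yields a vector of the average tree's size, so all trees end up with corresponding dimensions.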
184
00:30:21.300 --> 00:30:24.300
Bei Wang: That makes everybody have the same dimension.
185
00:30:25.920 --> 00:30:33.300
Bei Wang: That is it for my talk. I know I ran through quite a bit of information, but there are a lot of interesting components in this.
186
00:30:33.720 --> 00:30:42.450
Bei Wang: There is some ongoing work where, of course, you are not constrained to one type of topological descriptor; you can do this for Reeb graphs, contour trees, Morse-Smale graphs.
187
00:30:43.170 --> 00:30:57.660
Bei Wang: But the most important part, which I think is exciting, is that you can use techniques coming from randomized linear algebra, in this case matrix sketching, and combine them to study the variability of the underlying topology among a collection.
188
00:30:58.860 --> 00:30:59.820
Bei Wang: that's end of my talk.
189
00:31:04.500 --> 00:31:04.890
Matthew Kahle: Thanks.
190
00:31:09.870 --> 00:31:12.960
Matthew Kahle: Erin Chambers had her hand up. Erin?
191
00:31:13.980 --> 00:31:18.690
Erin Chambers: Hello, very nice talk. I wonder, and maybe I just need to go read the paper
192
00:31:19.260 --> 00:31:31.170
Erin Chambers: to understand this: how do you lift it to something like a Reeb graph? I assume the tree is going to be easier than a graph with cycles; what types of ideas do you have that might work on the more complex objects?
193
00:31:31.920 --> 00:31:42.180
Bei Wang: Yeah, so there are a few hacks we can do. First of all, if you have a Reeb graph, if you have a...
194
00:31:45.360 --> 00:31:56.280
Bei Wang: Let's see, this is a graph, right? The first hack is that if I take the merge tree of the Reeb graph, it will give me one version of the merge tree.
195
00:31:56.820 --> 00:32:09.600
Bei Wang: If you take the merge tree of the Reeb graph in the other direction, it will give me another one. So if I just want to use my existing framework, I can represent each Reeb graph by a collection of two merge trees. Okay, so that is one option.
196
00:32:09.660 --> 00:32:25.590
Bei Wang: It is not very elegant. The second thing you can do is the same thing we did for the merge tree: remember, for the merge tree I said the weight between x and y is their function-value difference.
197
00:32:26.820 --> 00:32:35.040
Bei Wang: Right, this is for me to try to preserve the underlying structure, and this generalizes exactly to Reeb graphs
198
00:32:35.610 --> 00:32:47.160
Bei Wang: without any problem. And in fact, my student, actually, was not doing merge trees but Morse-Smale graphs, which are also a graph structure, so for both of those
199
00:32:47.370 --> 00:32:49.860
Bei Wang: you can encode the function values the same way as before.
200
00:32:50.580 --> 00:33:02.460
Bei Wang: The only tricky thing, perhaps, is in practice. To be fair, right now, if I just replace the underlying merge tree by a Reeb graph in the current framework, it will give me some results.
201
00:33:02.880 --> 00:33:14.460
Bei Wang: They are not going to be great, but it will give me some results. So the question is, how do I perfect it? If I want to perfect it, I might want to include a little bit more information in my
202
00:33:15.000 --> 00:33:21.690
Bei Wang: coupling. This is really the key point here. I didn't go into detail, but we find the best coupling;
203
00:33:22.260 --> 00:33:27.420
Bei Wang: it is really just looking at the difference between the weights; there is a loss function between the weights of the two edges.
204
00:33:27.780 --> 00:33:38.640
Bei Wang: So if I want to include more information, for example, especially in a Reeb graph, where there is information about those loops and so on and so forth, one way to do it more elegantly is to encode it
205
00:33:39.150 --> 00:33:52.650
Bei Wang: into the weights: how do I infer the weight between two things? For example, if I am trying to match a branch to a loop, maybe that is a huge penalty; but if I am matching a loop to a loop,
206
00:33:52.740 --> 00:33:54.090
Bei Wang: maybe that is less of a penalty.
207
00:33:54.090 --> 00:34:00.120
Bei Wang: There are more elegant ways of encoding it; again, if you go back to how we construct this weight
208
00:34:00.900 --> 00:34:12.090
Bei Wang: matrix and this loss function: it is not just weight differences but maybe also differences over the underlying homology. So I can encode a bit of homological difference, where the homology matches loop to loop
209
00:34:12.870 --> 00:34:19.200
Bei Wang: but not loop to edge. Right, so again, you can fold it into the optimization.
210
00:34:19.920 --> 00:34:32.070
Bei Wang: And then you get a much better thing. But there is an additional thing about Reeb graphs that is not completely solved: if I give you pairwise distances between nodes in a merge tree,
211
00:34:32.670 --> 00:34:44.370
Bei Wang: I can convert the matrix of pairwise distances back to a tree by looking at the minimum spanning tree, but that is not true for Reeb graphs.
212
00:34:45.690 --> 00:35:03.210
Bei Wang: So again, you need to do a little bit more to say: if I give you pairwise distances, assuming the pairwise distances are coming from a Reeb graph, how do I reconstruct a Reeb graph out of it? That is actually one place where it also requires a bit more
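The merge-tree direction of this remark can be sketched in a few lines: for a distance matrix that comes from a tree metric (the example matrix below is hypothetical, a path with edge lengths 2 and 1), the minimum spanning tree of the complete distance graph recovers the tree's edges:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

# Hypothetical pairwise-distance matrix coming from a tree metric:
# the underlying tree is the path 0 - 1 - 2 with edge lengths 2 and 1.
D = np.array([
    [0.0, 2.0, 3.0],
    [2.0, 0.0, 1.0],
    [3.0, 1.0, 0.0],
])

# The minimum spanning tree of the complete distance graph recovers the
# tree's edges; for a Reeb graph this fails, since a spanning tree
# necessarily breaks every loop.
mst = minimum_spanning_tree(D).toarray()
edges = sorted({(int(min(i, j)), int(max(i, j)))
                for i, j in zip(*np.nonzero(mst))})
print(edges)  # [(0, 1), (1, 2)]
```

The shortcut edge 0-2 of length 3 is correctly discarded, which is exactly the property that has no analogue once cycles are present.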
213
00:35:04.230 --> 00:35:04.710
Bei Wang: effort.
214
00:35:06.990 --> 00:35:07.860
Erin Chambers: got it thanks.
215
00:35:09.360 --> 00:35:13.710
Matthew Kahle: Thanks, Bei. We have a couple more questions; Daniel Scott has his hand up.
216
00:35:15.420 --> 00:35:25.500
Daniel Scott: Hi, that was super interesting, thank you. I was just wondering if you could give us, and sorry if I just missed this, a little more intuition about the
217
00:35:25.980 --> 00:35:33.900
Daniel Scott: nature of the reconstruction errors that one gets out of this. In the case where you have just the matrix
218
00:35:34.890 --> 00:35:48.900
Daniel Scott: and you go to PCA, say, and then you reconstruct the matrix, we have a strong sense of what types of errors we are going to get. But in terms of translating these things back to trees, can you tell us a bit more about what the errors in the trees are going to be?
219
00:35:49.650 --> 00:35:50.670
Bei Wang: Yeah, so that is a really good
220
00:35:50.670 --> 00:35:52.380
Bei Wang: question; this is my slide for it.
221
00:35:52.650 --> 00:36:02.460
Bei Wang: So if you actually look at matrix sketching, this is an error that you are mostly familiar with: I have the matrix A, I have A-hat, and I minimize the Frobenius norm.
222
00:36:02.910 --> 00:36:18.660
Bei Wang: This is what I call the global sketch error. Now for each respective column: I have a_i, which is the original column, and a_i-hat, which is the sketched column, and you can also measure what is called the column-wise sketch error.
223
00:36:19.800 --> 00:36:31.080
Bei Wang: And if I sum over the column-wise errors, that is my global error. So that is matrix sketching, but you are right, there is a problem here, because all the existing matrix sketching work
224
00:36:31.470 --> 00:36:35.910
Bei Wang: does not guarantee that each individual tree is well sketched.
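The global versus column-wise error distinction can be made concrete. This is a minimal sketch with hypothetical random data, using a rank-k truncated SVD as one standard stand-in for a sketching method (the talk's actual sketching techniques may differ):

```python
import numpy as np

# Hypothetical data matrix A: each column is a vectorized merge tree.
rng = np.random.default_rng(2)
A = rng.random((100, 20))

# A rank-k approximation via truncated SVD, one common form of sketch.
k = 5
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Global sketch error: squared Frobenius norm of A - A_hat.
global_err = np.linalg.norm(A - A_hat, 'fro') ** 2
# Column-wise sketch errors: one per tree. They sum to the global error,
# so a small global error bounds the sum but not any individual column.
col_err = np.linalg.norm(A - A_hat, axis=0) ** 2
print(np.isclose(col_err.sum(), global_err))  # True
```

Since only the sum is controlled, one badly sketched column (tree) can hide inside a small global error, which is exactly the gap being described.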
225
00:36:36.690 --> 00:36:46.710
Bei Wang: It only guarantees that the sum is minimized. So this is actually where everything I am showing you is experimental results; we know it works.
226
00:36:47.370 --> 00:36:57.090
Bei Wang: I think there is a really big gap for the individual sketch error. At least right now, if I just use matrix sketching out of the box,
227
00:36:57.600 --> 00:37:15.780
Bei Wang: I have no way to guarantee individual sketching results. But what I know is that not all of them are bad, because remember, the global sketch error is bounded, so I know most of them are good. For example, if you look at the experimental results right here,
228
00:37:17.040 --> 00:37:22.950
Bei Wang: this is what this extra information is showing. This thing here, this column,
229
00:37:24.240 --> 00:37:33.000
Bei Wang: shows what the individual sketching looks like. Each column is a particular tree, and you look at, for example here, this is yellow, so that means
230
00:37:33.420 --> 00:37:37.980
Bei Wang: everybody is pretty well sketched, but this particular tree is not.
231
00:37:38.400 --> 00:37:52.470
Bei Wang: It is not well sketched, because if you look at its coefficients, they are sort of balanced among all the basis trees, so overall it is not a very good sketch. The other error that we look at is
232
00:37:55.410 --> 00:38:05.700
Bei Wang: using the GW framework. Those are complementary ways to measure the error: the first two come from the column-wise and global sketch errors,
233
00:38:06.150 --> 00:38:17.340
Bei Wang: and the second one is element-wise GW: GW is basically measuring the distances between the sketched trees and the original trees through the Gromov-Wasserstein distance.
234
00:38:17.640 --> 00:38:30.420
Bei Wang: This is the one that measures the distances between their topological structures, like how well they resemble each other. We can again report it, and we can also report the global one. So going back to this picture:
235
00:38:32.070 --> 00:38:44.580
Bei Wang: this is the dataset from the corner flow. The first row is the individual measurement based on topological differences, and the second one is individual differences based on matrix differences.
236
00:38:44.910 --> 00:38:51.030
Bei Wang: And you can see that typically, for the well-sketched ones, both of them are very small.
237
00:38:51.720 --> 00:38:58.350
Bei Wang: And for the not-so-well-sketched ones, for example this one, both of those errors are pretty high.
238
00:38:58.890 --> 00:39:13.650
Bei Wang: But establishing a theoretical understanding of those two is pretty open; there is a gap here. Let's put it this way:
239
00:39:14.190 --> 00:39:23.670
Bei Wang: globally, if my global sketch error is good, it does not guarantee that the individual sketch error is good; and even if the individual
240
00:39:24.060 --> 00:39:30.780
Bei Wang: sketch error is good, it does not necessarily guarantee that my topological structure is well preserved.
241
00:39:31.080 --> 00:39:46.710
Bei Wang: But the hope is as follows: if I can encode enough information in my matrix or vector representation of my merge tree, because I am encoding the topology and geometry in the high-dimensional vector, then
242
00:39:48.120 --> 00:40:01.320
Bei Wang: the hope is that the sketching actually gives me a reasonable tree structure, even with real-world data. This is what it looks like; you see it is actually very promising, because this is the original tree, and it is actually quite complicated.
243
00:40:01.860 --> 00:40:11.220
Bei Wang: It has all those branches and leaves and so on; this is before and this is after. You see the only difference in the actual topology is around here:
244
00:40:13.170 --> 00:40:20.640
Bei Wang: the relative heights of those nodes are slightly different from each other, but structure-wise it is actually quite good.
245
00:40:21.540 --> 00:40:31.470
Bei Wang: And I will say that in this particular example, the overall structure still looks the same,
246
00:40:31.800 --> 00:40:37.890
Bei Wang: but I have those little extra nodes and maybe some little branches; those are in some sense artifacts.
247
00:40:38.070 --> 00:40:46.740
Bei Wang: Because I have to match everything to be the same size, I need to duplicate nodes, and when I am sketching, I get a little bit of extra noise in it.
248
00:40:47.220 --> 00:40:51.270
Bei Wang: So what I can say right now is that it works really well in practice.
249
00:40:51.840 --> 00:40:58.410
Bei Wang: As for how to guarantee the individual sketch error and individual topology preservation, I think there is still a pretty big gap.
250
00:40:58.680 --> 00:41:13.740
Bei Wang: Because even back in matrix sketching, I don't think there is much literature that does matrix sketching such that every single column is guaranteed to be well sketched; at least, I am not aware of work in that space.
251
00:41:14.970 --> 00:41:15.450
Daniel Scott: Thank you.
252
00:41:16.650 --> 00:41:25.680
Matthew Kahle: Okay, in the interest of time, we're going to have just one more question today. Facundo has had his hand up for a few minutes. Facundo?
253
00:41:26.310 --> 00:41:30.240
Facundo Memoli: Hi Bei, I have a question and a comment.
254
00:41:30.690 --> 00:41:34.770
Facundo Memoli: So there was a question by Erin earlier
255
00:41:35.370 --> 00:41:46.590
Facundo Memoli: about whether, or how, you would be able to accommodate Reeb graphs. One thought in that direction is to simply encode the function values, if that is something that
256
00:41:47.640 --> 00:41:52.740
Facundo Memoli: particularly matters here, because if you only put the metric as the weight,
257
00:41:54.300 --> 00:41:56.100
Facundo Memoli: you are kind of forgetting the actual
258
00:41:56.250 --> 00:42:01.680
Facundo Memoli: function values. That is one idea; you had this loss function, and
259
00:42:02.040 --> 00:42:06.300
Facundo Memoli: one could combine the metric part with a function-related part.
260
00:42:07.740 --> 00:42:20.250
Bei Wang: Yes, yes, actually, one thing my student was looking at recently is that he has basically these terms: he has the weight difference,
261
00:42:21.300 --> 00:42:25.890
Bei Wang: over i, j, k, l, the weight difference with some parameter alpha,
262
00:42:26.130 --> 00:42:26.910
Bei Wang: and he has...
263
00:42:26.940 --> 00:42:29.970
Facundo Memoli: The function-value differences. Okay, excellent.
264
00:42:30.090 --> 00:42:36.210
Bei Wang: ...as the second term. And you can imagine, of course, this is not the only way, but is this what you mean?
265
00:42:36.930 --> 00:42:38.370
Bei Wang: he's doing a linear combination.
266
00:42:38.670 --> 00:42:42.780
Facundo Memoli: Right, perfect. And the second thing is more of a question.
267
00:42:43.770 --> 00:42:53.310
Facundo Memoli: Suppose you were working directly with ultrametric spaces; I mean the matrices you obtain out of trees, if you don't put in any intermediate nodes, if you only consider leaves.
268
00:42:54.000 --> 00:43:02.100
Facundo Memoli: If all the leaves are the same distance from the root, you get an ultrametric space. Now, ultrametric spaces are an example of Euclidean metric spaces: they can be embedded
269
00:43:02.820 --> 00:43:14.220
Facundo Memoli: directly into some Euclidean space of high dimension. So in that case, the matrix-factorization type of ideas would have some provable estimate
270
00:43:15.270 --> 00:43:32.910
Facundo Memoli: for how much you distort the metric, simply because the spaces were, to begin with, already embeddable into Euclidean space, so you are just throwing away coordinates. So an easy suggestion is to see if, in that setting, one would be able to prove what the error bounds are.
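The defining property of the ultrametric spaces mentioned here, the strong triangle inequality d(x, z) <= max(d(x, y), d(y, z)), is easy to check directly. A minimal sketch with a hypothetical leaf-to-leaf distance matrix (leaves 0 and 1 merging low, both merging with leaf 2 higher up):

```python
import numpy as np
from itertools import permutations

def is_ultrametric(D, tol=1e-9):
    """Check the strong triangle inequality d(i,k) <= max(d(i,j), d(j,k))
    for all triples of distinct points."""
    n = len(D)
    return all(D[i, k] <= max(D[i, j], D[j, k]) + tol
               for i, j, k in permutations(range(n), 3))

# Hypothetical leaf-to-leaf distances from a merge tree: leaves 0 and 1
# merge at a low height, and both merge with leaf 2 at a higher one.
D = np.array([
    [0.0, 1.0, 3.0],
    [1.0, 0.0, 3.0],
    [3.0, 3.0, 0.0],
])
print(is_ultrametric(D))  # True
```

Matrices passing this check are the kind of input the suggested experiment would run through the sketching pipeline.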
271
00:43:33.690 --> 00:43:42.480
Bei Wang: Yeah, that would be really nice. So basically, you are saying that I can just take the ultrametric matrix of the merge tree.
272
00:43:43.350 --> 00:43:47.970
Facundo Memoli: I am not suggesting representing merge trees as ultrametrics; I am saying that as a separate experiment, simply to
273
00:43:48.900 --> 00:43:58.470
Facundo Memoli: run ultrametric spaces through your pipeline to see how much error you are making, simply because it is one case in which one would be able to write down some bound.
274
00:43:59.340 --> 00:44:09.600
Bei Wang: Yeah, I think that would be cool; I would really like to follow up a bit more. My only thing is that, even if we do that in the matrix sketching framework,
275
00:44:10.020 --> 00:44:19.140
Bei Wang: your error bound is usually the global sketch error, so again, we might have to say something about what is happening to the individual
276
00:44:20.040 --> 00:44:21.090
Facundo Memoli: ultrametric space.
277
00:44:21.870 --> 00:44:31.860
Bei Wang: So if I just run matrix sketching as a black box, the existing matrix sketching framework we have does not give me an individual
278
00:44:33.090 --> 00:44:36.300
Bei Wang: metric-space error bound, so there is no individual...
279
00:44:36.360 --> 00:44:37.770
Facundo Memoli: column that's what I mean.
280
00:44:39.540 --> 00:44:46.800
Bei Wang: So I still think there is a gap, because, like I said, even if I can bound the overall global sketch error,
281
00:44:47.250 --> 00:44:57.030
Bei Wang: you cannot easily say the individual error is going to be bounded by that, because the individual errors only sum together to the global error. But if you really want to guarantee the individual
282
00:44:57.630 --> 00:45:03.480
Bei Wang: sketch error, then I think there is essentially a nontrivial amount of work that needs to be done there.
283
00:45:06.720 --> 00:45:09.570
Bei Wang: Yeah, great. I would certainly like to talk more about this.
284
00:45:11.130 --> 00:45:17.430
Matthew Kahle: Okay, thanks again, Bei, for that very nice talk, and thanks everyone for the questions. Let's thank Bei again.
285
00:45:18.570 --> 00:45:27.480
Matthew Kahle: And I apologize, there are a couple more questions in the chat, Bei, if you want to take a look on the way out, and maybe you'll be able to
286
00:45:28.380 --> 00:45:38.550
Matthew Kahle: connect with these people later. I think it is even fine if people continue in the Zoom call after I stop recording, but I just wanted to call an official end today,
287
00:45:39.000 --> 00:45:40.860
Matthew Kahle: Given how late it is.