
Deqing Sun
Brown University

Stefan Roth
TU Darmstadt

Michael J. Black
Brown University
Abstract
The accuracy of optical flow estimation algorithms has been improving steadily, as evidenced by results on the Middlebury optical flow benchmark. The typical formulation, however, has changed little since the work of Horn and Schunck. We attempt to uncover what has made recent advances possible through a thorough analysis of how the objective function, the optimization method, and modern implementation practices influence accuracy. We discover that "classical" flow formulations perform surprisingly well when combined with modern optimization and implementation techniques. Moreover, we find that while median filtering of intermediate flow fields during optimization is a key to recent performance gains, it leads to higher energy solutions. To understand the principles behind this phenomenon, we derive a new objective that formalizes the median filtering heuristic. This objective includes a non-local term that robustly integrates flow estimates over large spatial neighborhoods. By modifying this new term to include information about flow and image boundaries we develop a method that ranks at the top of the Middlebury benchmark.
1. Introduction
The field of optical flow estimation is making steady progress, as evidenced by the increasing accuracy of current methods on the Middlebury optical flow benchmark [6]. After nearly 30 years of research, these methods have obtained an impressive level of reliability and accuracy [33,34,35,40]. But what has led to this progress? The majority of today's methods strongly resemble the original formulation of Horn and Schunck (HS) [18]. They combine a data term that assumes constancy of some image property with a spatial term that models how the flow is expected to vary across the image. An objective function combining these two terms is then optimized. Given that this basic structure is unchanged since HS, what has enabled the performance gains of modern approaches?
The paper has three parts. In the first, we perform an extensive study of current optical flow methods and models. The most accurate methods on the Middlebury flow dataset make different choices about how to model the objective function, how to approximate this model to make it computationally tractable, and how to optimize it. Since most published methods change all of these properties at once, it can be difficult to know which choices are most important. To address this, we define a baseline algorithm that is "classical", in that it is a direct descendant of the original HS formulation, and then systematically vary the model and method using different techniques from the literature. The results are surprising. We find that only a small number of key choices produce statistically significant improvements and that they can be combined into a very simple method that achieves accuracies near the state of the art. More importantly, our analysis reveals what makes current flow methods work so well.
Part two examines the principles behind this success. We find that one algorithmic choice produces the most significant improvements: applying a median filter to intermediate flow values during incremental estimation and warping [33,34]. While this heuristic improves the accuracy of the recovered flow fields, it actually increases the energy of the objective function. This suggests that what is being optimized is actually a new and different objective. Using observations about median filtering and L1 energy minimization from Li and Osher [23], we formulate a new non-local term that is added to the original, classical objective. This new term goes beyond standard local (pairwise) smoothness to robustly integrate information over large spatial neighborhoods. We show that minimizing this new energy approximates the original optimization with the heuristic median filtering step. Note, however, that the new objective falls outside our definition of classical methods.
Finally, once the median filtering heuristic is formulated as a non-local term in the objective, we immediately recognize how to modify and improve it. In part three we show how information about image structure and flow boundaries can be incorporated into a weighted version of the non-local term to prevent over-smoothing across boundaries. By incorporating structure from the image, this weighted version does not suffer from some of the errors produced by median filtering. At the time of publication (March 2010), the resulting approach is ranked 1st in both angular and end-point errors in the Middlebury evaluation.
In summary, the contributions of this paper are to (1) analyze current flow models and methods to understand which design choices matter; (2) formulate and compare several classical objectives descended from HS using modern methods; (3) formalize one of the key heuristics and derive a new objective function that includes a non-local term; (4) modify this new objective to produce a state-of-the-art method. In doing this, we provide a "recipe" for others studying optical flow that can guide their design choices. Finally, to enable comparison and further innovation, we provide a public MATLAB implementation [1].
2. Previous Work
It is important to separately analyze the contributions of the objective function that defines the problem (the model) and the optimization algorithm and implementation used to minimize it (the method). The HS formulation, for example, has long been thought to be highly inaccurate. Barron et al. [7] reported an average angular error (AAE) of ~30 degrees on the "Yosemite" sequence. This confounds the objective function with the particular optimization method proposed by Horn and Schunck.¹ When optimized with today's methods, the HS objective achieves surprisingly competitive results despite the expected over-smoothing and sensitivity to outliers.

¹They noted that the correct way to optimize their objective is by solving a system of linear equations, as is common today. This was impractical on the computers of the day, so they used a heuristic method.
Models: The global formulation of optical flow introduced by Horn and Schunck [18] relies on both brightness constancy and spatial smoothness assumptions, but suffers from the fact that the quadratic formulation is not robust to outliers. Black and Anandan [10] addressed this by replacing the quadratic error function with a robust formulation. Subsequently, many different robust functions have been explored [12,22,31] and it remains unclear which is best. We refer to all these spatially-discrete formulations derived from HS as "classical." We systematically explore variations in the formulation and optimization of these approaches. The surprise is that the classical model, appropriately implemented, remains very competitive.
There are many formulations beyond the classical ones that we do not consider here. Significant ones use oriented smoothness [25,31,33,40], rigidity constraints [32,33], or image segmentation [9,21,41,37]. While they deserve similar careful consideration, we expect many of our conclusions to carry forward. Note that one can select among a set of models for a given sequence [4], instead of finding a "best" model for all the sequences.
Methods: Many of the implementation details that are thought to be important date back to the early days of optical flow. Current best practices include coarse-to-fine estimation to deal with large motions [8,13], texture decomposition [32,34] or high-order filter constancy [3,12,16,22,40] to reduce the influence of lighting changes, bicubic interpolation-based warping [22,34], temporal averaging of image derivatives [17,34], graduated non-convexity [11] to minimize non-convex energies [10,31], and median filtering after each incremental estimation step to remove outliers [34].
This median filtering heuristic is of particular interest, as it makes non-robust methods more robust and improves the accuracy of all methods we tested. The effect on the objective function and the underlying reason for its success have not previously been analyzed. Least median squares estimation can be used to robustly reject outliers in flow estimation [5], but previous work has focused on the data term.
Related to median filtering, and our new non-local term, is the use of bilateral filtering to prevent smoothing across motion boundaries [36]. The approach separates a variational method into two filtering update stages, and replaces the original anisotropic diffusion process with multi-cue driven bilateral filtering. As with median filtering, the bilateral filtering step changes the original energy function.
Models that are formulated with an L1 robust penalty are often coupled with specialized total variation (TV) optimization methods [39]. Here we focus on generic optimization methods that can apply to any model and find that they perform as well as reported results for specialized methods.
Despite recent algorithmic advances, there is a lack of publicly available, easy to use, and accurate flow estimation software. The GPU4Vision project [2] has made a substantial effort to change this and provides executable files for several accurate methods [32,33,34,35]. The dependence on the GPU and the lack of source code are limitations. We hope that our public MATLAB code will not only help in understanding the "secrets" of optical flow, but also let others exploit optical flow as a useful tool in computer vision and related fields.
3. Classical Models
We write the "classical" optical flow objective function in its spatially discrete form as

E(u, v) = \sum_{i,j} \{ \rho_D(I_1(i,j) - I_2(i + u_{i,j}, j + v_{i,j}))
        + \lambda [ \rho_S(u_{i,j} - u_{i+1,j}) + \rho_S(u_{i,j} - u_{i,j+1})
        + \rho_S(v_{i,j} - v_{i+1,j}) + \rho_S(v_{i,j} - v_{i,j+1}) ] \},    (1)

where u and v are the horizontal and vertical components of the optical flow field to be estimated from images I_1 and I_2, \lambda is a regularization parameter, and \rho_D and \rho_S are the data and spatial penalty functions. We consider three different penalty functions: (1) the quadratic HS penalty \rho(x) = x^2; (2) the Charbonnier penalty \rho(x) = \sqrt{x^2 + \epsilon^2} [13], a differentiable variant of the L1 norm and the most robust convex function; and (3) the Lorentzian \rho(x) = \log(1 + x^2/(2\sigma^2)), which is a non-convex robust penalty used in [10]. Note that this classical model is related to a standard pairwise Markov random field (MRF) based on a 4-neighborhood.
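To make the discrete objective concrete, the following minimal NumPy sketch evaluates Eq. (1) for a given flow field. It is an illustration, not the released implementation [1]; the spline-based warping via SciPy's map_coordinates and the default parameter values are assumptions.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def charbonnier(x, eps=0.001):
    # Differentiable variant of the L1 norm: rho(x) = sqrt(x^2 + eps^2).
    return np.sqrt(x**2 + eps**2)

def lorentzian(x, sigma=0.03):
    # Non-convex robust penalty: rho(x) = log(1 + x^2 / (2 sigma^2)).
    return np.log1p(x**2 / (2.0 * sigma**2))

def classical_energy(u, v, I1, I2, lam=5.0, rho_d=charbonnier, rho_s=charbonnier):
    """Evaluate the spatially discrete objective of Eq. (1)."""
    H, W = I1.shape
    jj, ii = np.meshgrid(np.arange(W), np.arange(H))
    # Data term: penalize the difference between I1 and the warped I2.
    I2_warped = map_coordinates(I2, [ii + v, jj + u], order=3, mode='nearest')
    data = rho_d(I1 - I2_warped).sum()
    # Spatial term: penalize flow differences over the 4-neighborhood.
    smooth = (rho_s(np.diff(u, axis=0)).sum() + rho_s(np.diff(u, axis=1)).sum() +
              rho_s(np.diff(v, axis=0)).sum() + rho_s(np.diff(v, axis=1)).sum())
    return data + lam * smooth
```

Swapping rho_d and rho_s among the quadratic (lambda x: x**2), Charbonnier, and Lorentzian penalties reproduces the three model variants (HS, Classic-C, Classic-L) compared in Table 1.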
In the remainder of this section we define a baseline method using several techniques from the literature. This is not the "best" method, but it includes modern techniques and will be used for comparison. We only briefly describe the main choices, which are explored in more detail in the following section and the cited references, especially [30].
Quantitative results are presented throughout the remainder of the text. In all cases we report the average end-point error (EPE) on the Middlebury training and test sets, depending on the experiment. Given the extensive nature of the evaluation, only average results are presented in the main body, while the details for each individual sequence are given in [30].
3.1. Baseline methods
To gain robustness against lighting changes, we follow [34] and apply the Rudin-Osher-Fatemi (ROF) structure-texture decomposition method [28] to pre-process the input sequences and linearly combine the texture and structure components (in the proportion 20:1). The parameters are set according to [34].
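The following sketch shows the shape of this pre-processing step, with scikit-image's TV denoiser standing in for the exact ROF solver of [28]; the tv_weight value is an illustrative assumption, not the parameter setting of [34].

```python
import numpy as np
from skimage.restoration import denoise_tv_chambolle

def structure_texture_input(image, tv_weight=0.125):
    """Pre-process one frame as in Sec. 3.1: split into a structure part
    (TV-denoised) and a texture part (residual), then blend them 20:1.
    denoise_tv_chambolle is a stand-in for the ROF solver of [28]."""
    image = image.astype(np.float64)
    structure = denoise_tv_chambolle(image, weight=tv_weight)
    texture = image - structure
    # Linearly combine texture and structure in the proportion 20:1.
    return texture + structure / 20.0
```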
Optimization is performed using a standard incremental multi-resolution technique (e.g. [10,13]) to estimate flow fields with large displacements. The optical flow estimated at a coarse level is used to warp the second image toward the first at the next finer level, and a flow increment is calculated between the first image and the warped second image. The standard deviation of the Gaussian anti-aliasing filter is set to 1/\sqrt{2d}, where d denotes the downsampling factor. Each level is recursively downsampled from its nearest lower level. In building the pyramid, the downsampling factor is not critical, as pointed out in the next section; here we use the settings in [31], which uses a factor of 0.8 in the final stages of the optimization. We adaptively determine the number of pyramid levels so that the top level has a width or height of around 20 to 30 pixels. At each pyramid level, we perform 10 warping steps to compute the flow increment.
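A pyramid construction consistent with this description might look as follows; the min_size cutoff and the cubic resampling call are assumptions made for illustration.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def build_pyramid(image, factor=0.5, min_size=25):
    """Recursively downsample `image` by `factor` until the coarsest level
    has a width or height of roughly 20-30 pixels (min_size is illustrative)."""
    pyramid = [image]
    while min(pyramid[-1].shape) * factor >= min_size:
        d = factor
        # Gaussian anti-aliasing filter with std 1/sqrt(2d), as in Sec. 3.1.
        sigma = 1.0 / np.sqrt(2.0 * d)
        smoothed = gaussian_filter(pyramid[-1], sigma)
        pyramid.append(zoom(smoothed, d, order=3))  # cubic resampling (assumed)
    return pyramid[::-1]  # coarsest level first
```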
At each warping step, we linearize the data term, which involves computing terms of the type \frac{\partial}{\partial x} I_2(i + u^k_{i,j}, j + v^k_{i,j}), where \partial/\partial x denotes the partial derivative in the horizontal direction, and u^k and v^k denote the current flow estimate at iteration k. As suggested in [34], we compute the derivatives of the second image using the 5-point derivative filter \frac{1}{12}[-1\ 8\ 0\ -8\ 1], and warp the second image and its derivatives toward the first using the current flow estimate by bicubic interpolation. We then compute the spatial derivatives of the first image, average them with the warped derivatives of the second image (c.f. [17]), and use this average in place of \partial I_2 / \partial x. For pixels moving out of the image boundaries, we set both their corresponding temporal and spatial derivatives to zero. After each warping step, we apply a 5×5 median filter to the newly computed flow field to remove outliers [34].
                          Avg. Rank   Avg. EPE
Classic-C                   14.9       0.408
HS                          24.6       0.501
Classic-L                   19.8       0.530
HS [31]                     35.1       0.872
BA (Classic-L) [31]         30.9       0.746
Adaptive [33]               11.5       0.401
Complementary OF [40]       10.1       0.485

Table 1. Models. Average rank and end-point error (EPE) on the Middlebury test set using different penalty functions. Two current methods are included for comparison.
For the Charbonnier (Classic-C) and Lorentzian (Classic-L) penalty functions, we use a graduated non-convexity (GNC) scheme [11], as described in [31], that linearly combines a quadratic objective with a robust objective in varying proportions, from fully quadratic to fully robust. Unlike [31], a single regularization weight \lambda is used for both the quadratic and the robust objective functions.
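A hedged sketch of this GNC blending; the staging schedule is left to the caller and is not the exact schedule of [31].

```python
import numpy as np

def lorentzian(x, sigma=0.03):
    # Non-convex robust spatial penalty from Sec. 3.
    return np.log1p(x**2 / (2.0 * sigma**2))

def gnc_penalty(x, alpha, robust=lorentzian):
    # Linear GNC blend: fully quadratic at alpha = 0, fully robust at alpha = 1.
    return (1.0 - alpha) * x**2 + alpha * robust(x)
```

A typical schedule solves the flow problem to convergence for a few values of alpha (e.g. 0, 0.5, 1), warm-starting each stage from the previous solution.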
3.2. Baseline results
The regularization parameter \lambda is selected among a set of candidate values to achieve the best average end-point error (EPE) on the Middlebury training set. For the Charbonnier penalty function, the candidate set is [1, 3, 5, 8, 10], and 5 is optimal. The Charbonnier penalty uses \epsilon = 0.001 for both the data and the spatial term in Eq. (1). The Lorentzian uses \sigma = 1.5 for the data term and \sigma = 0.03 for the spatial term. These parameters are fixed throughout the experiments, except where mentioned.
Table 1 summarizes the EPE results of the basic model with three different penalty functions on the Middlebury test set, along with the two top performers at the time of publication (considering only published papers). The classic formulations with two non-quadratic penalty functions (Classic-C and Classic-L) achieve competitive results despite their simplicity. The baseline optimization of HS and BA (Classic-L) results in significantly better accuracy than previously reported for these models [31]. Note that the analysis also holds for the training set (Table 2).
At the time of publication, Classic-C ranks 13th in average EPE and 15th in AAE on the Middlebury benchmark despite its simplicity, and it serves as the baseline below. It is worth noting that the spatially discrete MRF formulation taken here is competitive with variational methods such as [33]. Moreover, our baseline implementation of HS has a lower average EPE than many more sophisticated methods.

                          Avg. EPE   significance   p-value
Classic-C                   0.298        —             —
HS                          0.384        1           0.0078
Classic-L                   0.319        1           0.0078
Classic-C-brightness        0.288        0           0.9453
HS-brightness               0.387        1           0.0078
Classic-L-brightness        0.325        0           0.2969
Gradient                    0.305        0           0.4609

Table 2. Pre-Processing. Average end-point error (EPE) on the Middlebury training set for the baseline method (Classic-C) using different pre-processing techniques. Significance is always with respect to Classic-C.
4. Secrets Explored
We evaluate a range of variations from the baseline approach that have appeared in the literature, in order to illuminate which may be of importance. This analysis is performed on the Middlebury training set by changing only one property at a time. Statistical significance is determined using a Wilcoxon signed rank test between each modified method and the baseline Classic-C; a p-value less than 0.05 indicates a significant difference.
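For concreteness, the significance test is a single SciPy call; epe_baseline and epe_variant below are hypothetical arrays holding one EPE value per training sequence.

```python
from scipy.stats import wilcoxon

def significantly_different(epe_baseline, epe_variant, alpha=0.05):
    """Paired Wilcoxon signed rank test over per-sequence EPE values;
    a p-value below alpha indicates a significant difference."""
    _, p_value = wilcoxon(epe_baseline, epe_variant)
    return p_value < alpha, p_value
```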
Pre-Processing. For each method, we optimize the regularization parameter \lambda for the training sequences and report the results in Table 2. The baseline uses a non-linear pre-filtering of the images to reduce the influence of illumination changes [34]. Table 2 shows the effect of removing this and using a standard brightness constancy model (*-brightness). Classic-C-brightness actually achieves lower EPE on the training set than Classic-C, but significantly lower accuracy on the test set: Classic-C-brightness = 0.726, HS-brightness = 0.759, and Classic-L-brightness = 0.603; see Table 1 for comparison. This disparity suggests that overfitting is more severe for the brightness constancy assumption. Gradient only imposes constancy of the gradient vector at each pixel as proposed in [12] (i.e. it robustly penalizes the Euclidean distance between image gradients) and has similar performance on both the training and test sets (c.f. Table 8). See [30] for results of more alternatives.

Secrets: Some form of image filtering is useful, but simple derivative constancy is nearly as good as the more sophisticated texture decomposition method.
Coarse-to-fine estimation and GNC. We vary the number of warping steps per pyramid level and find that 3 warping steps give similar results to using 10 (Table 3). For the GNC scheme, [31] uses a downsampling factor of 0.8 for non-convex optimization. A downsampling factor of 0.5 (Down-0.5), however, has nearly identical performance. Removing the GNC step for the Charbonnier penalty function (w/o GNC) results in higher EPE on most sequences and higher energy on all sequences (Table 4). This suggests that the GNC method is helpful even for the convex Charbonnier penalty function due to the nonlinearity of the data term.
                                   Avg. EPE   significance   p-value
Classic-C                            0.298        —             —
3 warping steps                      0.304        0           0.9688
Down-0.5                             0.298        0           1.0000
w/o GNC                              0.354        0           0.1094
Bilinear                             0.302        0           0.1016
w/o TAVG                             0.306        0           0.1562
Central derivative filter            0.300        0           0.7266
7-point derivative filter [13]       0.302        0           0.3125
Bicubic-II                           0.290        1           0.0391
GC-0.45 (\lambda = 3)                0.292        1           0.0156
GC-0.25 (\lambda = 0.7)              0.298        0           1.0000
MF 3×3                               0.305        0           0.1016
MF 7×7                               0.305        0           0.5625
2× MF                                0.300        0           1.0000
5× MF                                0.305        0           0.6875
w/o MF                               0.352        1           0.0078
Classic++                            0.285        1           0.0078

Table 3. Model and Methods. Average end-point error (EPE) on the Middlebury training set for the baseline method (Classic-C) using different algorithm and modeling choices.

Figure 1. Different penalty functions for the spatial terms: Charbonnier (\epsilon = 0.001), generalized Charbonnier (a = 0.45 and a = 0.25), and Lorentzian (\sigma = 0.03).
Secrets: The downsampling factor does not matter when using a convex penalty; a standard factor of 0.5 is fine. Some form of GNC is useful even for a convex robust penalty like the Charbonnier because of the nonlinear data term.

Interpolation method and derivatives. We find that bicubic interpolation is more accurate than bilinear (Table 3, Bilinear), as already reported in previous work [34]. Removing temporal averaging of the gradients (w/o TAVG), using central difference filters, or using a 7-point derivative filter [13] all reduce accuracy compared to the baseline, but not significantly. The MATLAB built-in function interp2 is based on a cubic convolution approximation [20]. The spline-based interpolation scheme [26] is consistently better (Bicubic-II). See [30] for more discussion.

Secrets: Use spline-based bicubic interpolation with a 5-point derivative filter. Temporal averaging of the derivatives is probably worthwhile for a small computational expense.

Penalty functions. We find that the convex Charbonnier penalty performs better than the more robust, non-convex Lorentzian on both the training and test sets. One reason might be that non-convex functions are more difficult to optimize, causing the optimization scheme to find a poor local optimum.
Figure 2. Estimated flow fields on the "RubberWhale" sequence using Classic-C with and without (w/o MF) the median filtering step; color coding as in [6]. (a) With median filtering (w/ MF), energy 502,387; (b) without median filtering (w/o MF), energy 449,290. The median filtering step helps reach a solution free from outliers, but with a higher energy.
We investigate a generalized Charbonnier penalty function \rho(x) = (x^2 + \epsilon^2)^a that is equal to the Charbonnier penalty when a = 0.5 and non-convex when a < 0.5 (see Figure 1). We optimize the regularization parameter \lambda again. We find that a slightly non-convex penalty with a = 0.45 (GC-0.45) performs consistently better than the Charbonnier penalty, whereas more non-convex penalties (GC-0.25 with a = 0.25) show no improvement.
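For reference, the generalized Charbonnier penalty and its derivative (as needed by gradient-based or IRLS solvers) are compact; the derivative expression is standard calculus, not code from [1].

```python
import numpy as np

def generalized_charbonnier(x, a=0.45, eps=0.001):
    # rho(x) = (x^2 + eps^2)^a; a = 0.5 recovers the Charbonnier penalty,
    # while a < 0.5 (e.g. GC-0.45) is slightly non-convex.
    return (x**2 + eps**2) ** a

def generalized_charbonnier_deriv(x, a=0.45, eps=0.001):
    # rho'(x) = 2 a x (x^2 + eps^2)^(a - 1).
    return 2.0 * a * x * (x**2 + eps**2) ** (a - 1.0)
```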
Secrets: The less-robust Charbonnier is preferable to the Lorentzian, and a slightly non-convex penalty function (GC-0.45) is better still.
Median filtering. The baseline 5×5 median filter (MF 5×5) is better than both MF 3×3 [34] and MF 7×7, but the difference is not significant (Table 3). When we perform 5×5 median filtering twice (2× MF) or five times (5× MF) per warping step, the results are worse. Finally, removing the median filtering step (w/o MF) makes the computed flow significantly less accurate, with larger outliers, as shown in Table 3 and Figure 2.
Secrets: Median filtering the intermediate flow results once after every warping iteration is the single most important secret; 5×5 is a good filter size.
4.1. Best Practices
Combining the analysis above into a single approach means modifying the baseline to use the slightly non-convex generalized Charbonnier penalty and the spline-based bicubic interpolation. This leads to a statistically significant improvement over the baseline (Table 3, Classic++). This method is directly descended from HS and BA, yet updated with the current best optimization practices known to us. This simple method ranks 9th in EPE and 12th in AAE on the Middlebury test set.
5. Models Underlying Median Filtering
Our analysis reveals the practical importance of median filtering during optimization to denoise the flow field. We ask whether there is a principle underlying this heuristic.
One interesting observation is that flow fields obtained with median filtering have substantially higher energy than those without (Table 4 and Figure 2). If the median filter is helping to optimize the objective, it should lead to lower energies. Higher energies and more accurate estimates suggest that incorporating median filtering changes the objective function being optimized.
The insight that follows from this is that the median filtering heuristic is related to the minimization of an objective function that differs from the classical one. In particular, the optimization of Eq. (1), with interleaved median filtering, approximately minimizes

E_A(u, v, \hat{u}, \hat{v}) = \sum_{i,j} \{ \rho_D(I_1(i,j) - I_2(i + u_{i,j}, j + v_{i,j}))
        + \lambda [ \rho_S(u_{i,j} - u_{i+1,j}) + \rho_S(u_{i,j} - u_{i,j+1})
        + \rho_S(v_{i,j} - v_{i+1,j}) + \rho_S(v_{i,j} - v_{i,j+1}) ] \}
        + \lambda_2 (\|u - \hat{u}\|^2 + \|v - \hat{v}\|^2)
        + \sum_{i,j} \sum_{(i',j') \in N_{i,j}} \lambda_3 (|\hat{u}_{i,j} - \hat{u}_{i',j'}| + |\hat{v}_{i,j} - \hat{v}_{i',j'}|),    (2)

where \hat{u} and \hat{v} denote an auxiliary flow field, N_{i,j} is the set of neighbors of pixel (i,j) in a possibly large area, and \lambda_2 and \lambda_3 are scalar weights. The term in braces is the same as the flow energy from Eq. (1), while the last term is new. This non-local term [14,15] imposes a particular smoothness assumption within a specified region of the auxiliary flow field \hat{u}, \hat{v}.² Here we take this term to be a 5×5 rectangular region to match the size of the median filter in Classic-C. A third (coupling) term encourages \hat{u}, \hat{v} and u, v to be the same (c.f. [33,39]).

²Bruhn et al. [13] also integrated information over a local region in a global method, but did so for the data term.
The connection to median filtering (as a denoising method) derives from the fact that there is a direct relationship between the median and L1 minimization. Consider a simplified version of Eq. (2) with just the coupling and non-local terms:

E(\hat{u}) = \lambda_2 \|u - \hat{u}\|^2 + \sum_{i,j} \sum_{(i',j') \in N_{i,j}} \lambda_3 |\hat{u}_{i,j} - \hat{u}_{i',j'}|.    (3)

While minimizing this is similar to median filtering u, there are two differences. First, the non-local term minimizes the L1 distance between the central value and all flow values in its neighborhood except itself. Second, Eq. (3) incorporates information about the data term through the coupling term; median filtering the flow ignores the data term.
               Venus   Dimetrodon   Hydrangea   RubberWhale   Grove2   Grove3   Urban2   Urban3
Classic-C      0.5     0.748        0.866       0.502         1.816    2.317    1.126    1.424
w/o GNC        0.593   0.750        0.870       0.506         1.845    2.518    1.142    1.465
w/o MF         0.517   0.701        0.668       0.449         1.418    1.830    1.066    1.395

Table 4. Eq. (1) energy (×10⁶) for the optical flow fields computed on the Middlebury training set. Note that Classic-C uses graduated non-convexity (GNC), which reduces the energy, and median filtering, which increases it.
The formal connection between Eq. (3) and median filtering³ is provided by Li and Osher [23], who show that minimizing Eq. (3) is related to a different median computation:

\hat{u}^{(k+1)}_{i,j} = \mathrm{median}(\mathrm{Neighbors}^{(k)} \cup \mathrm{Data}),    (4)

where \mathrm{Neighbors}^{(k)} = \{\hat{u}^{(k)}_{i',j'}\} for (i',j') \in N_{i,j}, \hat{u}^{(0)} = u, and

\mathrm{Data} = \{u_{i,j},\ u_{i,j} \pm \frac{\lambda_3}{\lambda_2},\ u_{i,j} \pm \frac{2\lambda_3}{\lambda_2},\ \ldots,\ u_{i,j} \pm \frac{|N_{i,j}| \lambda_3}{2 \lambda_2}\},

where |N_{i,j}| denotes the (even) number of neighbors of (i,j). Note that the set of "data" values is balanced, with an equal number of elements on either side of the value u_{i,j}, and that information about the data term is included through u_{i,j}. Repeated application of Eq. (4) converges rapidly [23]. Observe that, as \lambda_3/\lambda_2 increases, the weighted data values on either side of u_{i,j} move away from the values of Neighbors and cancel each other out. As this happens, Eq. (4) approximates the median at the first iteration:

\hat{u}^{(1)}_{i,j} \approx \mathrm{median}(\mathrm{Neighbors}^{(0)} \cup \{u_{i,j}\}).    (5)

³Hsiao et al. [19] established the connection in a slightly different way.
Eq. (2) thus combines the original objective with an approximation to the median, the influence of which is controlled by \lambda_3/\lambda_2. Note that in practice the weight \lambda_2 on the coupling term is usually small or is steadily increased from small values [34,39]. We optimize the new objective (2) by alternately minimizing
E_O(u, v) = \sum_{i,j} \rho_D(I_1(i,j) - I_2(i + u_{i,j}, j + v_{i,j}))
        + \lambda [ \rho_S(u_{i,j} - u_{i+1,j}) + \rho_S(u_{i,j} - u_{i,j+1})
        + \rho_S(v_{i,j} - v_{i+1,j}) + \rho_S(v_{i,j} - v_{i,j+1}) ]
        + \lambda_2 (\|u - \hat{u}\|^2 + \|v - \hat{v}\|^2)    (6)

and

E_M(\hat{u}, \hat{v}) = \lambda_2 (\|u - \hat{u}\|^2 + \|v - \hat{v}\|^2)
        + \sum_{i,j} \sum_{(i',j') \in N_{i,j}} \lambda_3 (|\hat{u}_{i,j} - \hat{u}_{i',j'}| + |\hat{v}_{i,j} - \hat{v}_{i',j'}|).    (7)
Note that an alternative formulation would drop the coupling term and impose the non-local term directly on u and v. We find that optimization of the coupled set of equations is superior in terms of EPE performance.
The alternating optimization strategy first holds \hat{u}, \hat{v} fixed and minimizes Eq. (6) w.r.t. u, v. Then, with u, v fixed, we minimize Eq. (7) w.r.t. \hat{u}, \hat{v}. Note that Eqs. (3) and (7) can be minimized by repeated application of Eq. (4); we use this approach with 5 iterations.
              Avg. EPE   significance   p-value
Classic-C       0.298        —             —
Classic-C-A     0.305        0           0.8125

Table 5. Average end-point error (EPE) on the Middlebury training set for the new model with alternating optimization (Classic-C-A).
We perform 10 steps of alternating optimization at every pyramid level and change \lambda_2 logarithmically from 10^{-4} to 10^2. During the first and second GNC stages, we set u, v to be \hat{u}, \hat{v} after every warping step (this step helps reach solutions with lower energy and EPE [30]). In the end, we take \hat{u}, \hat{v} as the final flow field estimate. The other parameters are \lambda = 5, \lambda_3 = 1.

Alternately optimizing this new objective function (Classic-C-A) leads to similar results as the baseline Classic-C (Table 5). We also compare the energy of these solutions under the new objective and find that the alternating optimization produces the lowest energy solutions, as shown in Table 6. To do so, we set both the flow field u, v and the auxiliary flow field \hat{u}, \hat{v} to be the same in Eq. (2).

In summary, we show that the heuristic median filtering step in Classic-C can now be viewed as energy minimization of a new objective with a non-local term. The explicit formulation emphasizes the value of robustly integrating information over large neighborhoods and enables the improved model described below.
6. Improved Model
By formalizing the median filtering heuristic as an explicit objective function, we can find ways to improve it. While median filtering in a large neighborhood has advantages, as we have seen, it also has problems. A neighborhood centered on a corner or thin structure is dominated by the surround, and computing the median results in over-smoothing, as illustrated in Figure 3(a).
Examining the non-local term suggests a solution. For a given pixel, if we know which other pixels in the area belong to the same surface, we can weight them more highly. The modification to the objective function is achieved by introducing a weight into the non-local term [14,15]:

\sum_{i,j} \sum_{(i',j') \in N_{i,j}} w_{i,j,i',j'} (|\hat{u}_{i,j} - \hat{u}_{i',j'}| + |\hat{v}_{i,j} - \hat{v}_{i',j'}|),    (8)

where w_{i,j,i',j'} represents how likely pixel (i',j') is to belong to the same surface as (i,j).
                   Venus   Dimetrodon   Hydrangea   RubberWhale   Grove2   Grove3   Urban2   Urban3
Classic-C          0.817   0.903        1.202       0.674         2.166    3.144    1.954    2.153
Classic-C w/o MF   0.886   0.945        1.299       0.725         2.315    3.513    2.234    2.712
Classic-C-A        0.784   0.8          1.139       0.666         2.0      2.976    1.922    2.049

Table 6. Eq. (2) energy (×10⁶) for the computed flow fields on the Middlebury training set. The alternating optimization strategy (Classic-C-A) produces the lowest energy solutions.
Figure 3. Median filtering over-smoothes the rifle in the "Army" sequence, while the proposed weighted non-local term preserves the detail. Results of (a) Classic++ and (b) Classic+NL.
Of course, we do not know w_{i,j,i',j'}, but we can approximate it. We draw ideas from [29,36,38] to define the weights according to the spatial distance, the color-value distance, and the occlusion state of the pixels:

w_{i,j,i',j'} \propto \exp\{ -\frac{|i - i'|^2 + |j - j'|^2}{2\sigma_1^2} - \frac{|I(i,j) - I(i',j')|^2}{2\sigma_2^2} \} \frac{o(i',j')}{o(i,j)},    (9)

where the occlusion variable o(i,j) is calculated using Eq. (22) in [29], I(i,j) is the color vector in the Lab space, and \sigma_1 = 7, \sigma_2 = 7. Examples of such weights are shown for several 15×15 neighborhoods in Figure 4; bright values indicate higher weights. Note the neighborhood labeled (d), corresponding to the rifle. Since pixels on the rifle are in the minority, an unweighted median would oversmooth; the weighted term instead robustly estimates the motion using values on the rifle. A closely related piece of work is [27], which uses the intervening contour to define affinities among neighboring pixels for the local Lucas and Kanade [24] method. However, it only uses this scheme to estimate motion for sparse points and then interpolates the dense flow field.
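The spatial and color terms of the weights in Eq. (9) can be computed as below; the occlusion ratio o(i',j')/o(i,j), which requires Eq. (22) of [29], is deliberately omitted, so this is only a partial sketch.

```python
import numpy as np

def nonlocal_weights(lab, i, j, radius=7, sigma1=7.0, sigma2=7.0):
    """Weights of Eq. (9) in a (2*radius+1)^2 window around (i, j), using the
    spatial and Lab color-distance terms only. `lab` is an HxWx3 Lab image;
    radius=7 gives the 15x15 neighborhood used in the boundary regions."""
    H, W, _ = lab.shape
    r0, r1 = max(i - radius, 0), min(i + radius + 1, H)
    c0, c1 = max(j - radius, 0), min(j + radius + 1, W)
    ii, jj = np.mgrid[r0:r1, c0:c1]
    spatial = ((ii - i) ** 2 + (jj - j) ** 2) / (2.0 * sigma1 ** 2)
    color = (np.sum((lab[r0:r1, c0:c1] - lab[i, j]) ** 2, axis=-1)
             / (2.0 * sigma2 ** 2))
    return np.exp(-spatial - color)
```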
We approximately solve Eq. (9) for \hat{u}, \hat{v} as the following weighted median problem:

\min_{\hat{u}_{i,j}} \sum_{(i',j') \in N_{i,j} \cup \{i,j\}} w_{i,j,i',j'} |\hat{u}_{i,j} - u_{i',j'}|,    (10)

using formula (3.13) in [23] for all the pixels (Classic+NL-Full). Note that if all the weights are equal, the solution is just the median. In practice, we can adopt a fast version (Classic+NL) without performance loss. Given a current estimate of the flow, we detect motion boundaries using a Sobel edge detector and dilate these edges with a 5×5 mask to obtain flow boundary regions. In these regions we use the weighting in Eq. (9) in a 15×15 neighborhood. In the non-boundary regions, we use equal weights in a 5×5 neighborhood to compute the median.
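A standard weighted-median computation can stand in for formula (3.13) of [23] when solving Eq. (10); the tie-handling (taking the lower weighted median) is an assumption.

```python
import numpy as np

def weighted_median(values, weights):
    """Solve min_x sum_k w_k |x - values_k| (Eq. 10): sort the values and
    return the first one at which the cumulative weight reaches half the
    total weight."""
    order = np.argsort(values)
    values, weights = np.asarray(values)[order], np.asarray(weights)[order]
    cumulative = np.cumsum(weights)
    return values[np.searchsorted(cumulative, 0.5 * cumulative[-1])]
```

At a boundary pixel, `values` would hold the flow values in the 15×15 window (including the center) and `weights` the corresponding weights from Eq. (9); with equal weights this reduces to the plain median used in non-boundary regions.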
Figure 4. Neighbor weights of the proposed weighted non-local term at different positions (a)-(e) in the "Army" sequence.
                  Avg. EPE   significance   p-value
Classic+NL          0.221        —             —
Classic+NL-Full     0.222        0           0.8203

Table 7. Average end-point error (EPE) on the Middlebury training set for the fast and full versions of the improved model.
                    Avg. Rank   Avg. EPE
Classic++             13.4       0.406
Classic++Gradient     15.1       0.430
Classic+NL             6.2       0.319
Classic+NL-Full        6.6       0.316

Table 8. Average end-point error (EPE) on the Middlebury test set for the Classic++ model with two different preprocessing techniques and its improved model.
Tables 7 and 8 show that the weighted non-local term (Classic+NL) improves the accuracy on both the training and the test sets. Note that the fine detail of the rifle is preserved in Figure 3(b). At the time of publication, Classic+NL ranks 1st in both AAE and EPE in the Middlebury evaluation and has the lowest average AAE and EPE among all listed algorithms. The running time on the test "Urban" sequence is about 70 minutes for Classic+NL-Full and about 16 minutes for Classic+NL in MATLAB.
7. Conclusions
Implemented using modern practices, classical optical flow formulations produce competitive results on the Middlebury training and test sets. To understand the "secrets" that help such basic formulations work well, we quantitatively studied various aspects of flow approaches from the literature, including their implementation details. Among the good practices, we found that using median filtering to denoise the flow after every warping step is key to improving accuracy, but that it increases the energy of the final result. Exploiting connections between median filtering and L1-based denoising, we showed that algorithms relying on a median filtering step are approximately optimizing a different objective that regularizes flow over a large spatial neighborhood. This principle enabled us to design and optimize improved models that weight the neighbors adaptively in an extended image region. At the time of publication (March 2010), the resulting algorithm ranks 1st in both angular and end-point errors in the Middlebury evaluation. The MATLAB code is publicly available [1].
How far can 2-frame classical methods be pushed? Our sense is that they are likely to improve incrementally for several years to come, but that the big gains will come from methods that go beyond the classical formulation to reason more explicitly about surfaces and boundaries and how they move over time.
Acknowledgments. DS and MJB were supported by a gift from Intel Corp. and NSF CRCNS award IIS-0904875. We thank the reviewers for constructive comments, especially the connection between our original "area" term and non-local regularization; P. Yadollahpour for his early work on implementing the HS method; S. Zuffi for suggesting the color version of the non-local term; T. Brox, A. Wedel, and M. Werlberger for clarifying details about their papers; and D. Scharstein for maintaining the online optical flow benchmark.
References
[1] http://www.cs.brown.edu/people/dqsun/
[2] http://gpu4vision.icg.tugraz.at/
[3] E. H. Adelson, C. H. Anderson, J. R. Bergen, P. J. Burt, and J. M. Ogden. Pyramid methods in image processing. RCA Engineer, 29(6):33-41, 1984.
[4] O. M. Aodha, G. Brostow, and M. Pollefeys. Segmenting video into classes of algorithm-suitability. In CVPR, 2010.
[5] A. Bab-Hadiashar and D. Suter. Robust optic flow computation. IJCV, 29(1):59-77, 1998.
[6] S. Baker, D. Scharstein, J. Lewis, S. Roth, M. J. Black, and R. Szeliski. A database and evaluation methodology for optical flow. In ICCV, 2007.
[7] J. Barron, D. Fleet, S. Beauchemin, and T. Burkitt. Performance of optical flow techniques. IJCV, 92:236-242, 1994.
[8] J. Bergen, P. Anandan, K. Hanna, and R. Hingorani. Hierarchical model-based motion estimation. In ECCV, pages 237-252, 1992.
[9] M. Black and A. Jepson. Estimating optical-flow in segmented images using variable-order parametric models with local deformations. PAMI, 18(10):972-986, 1996.
[10] M. J. Black and P. Anandan. The robust estimation of multiple motions: Parametric and piecewise-smooth flow fields. CVIU, 63:75-104, 1996.
[11] A. Blake and A. Zisserman. Visual Reconstruction. The MIT Press, Cambridge, Massachusetts, 1987.
[12] T. Brox, A. Bruhn, N. Papenberg, and J. Weickert. High accuracy optical flow estimation based on a theory for warping. In ECCV, pages 25-36, 2004.
[13] A. Bruhn, J. Weickert, and C. Schnörr. Lucas/Kanade meets Horn/Schunck: combining local and global optic flow methods. IJCV, 61(3):211-231, 2005.
[14] A. Buades, B. Coll, and J. Morel. A non-local algorithm for image denoising. In CVPR, 2005.
[15] G. Gilboa and S. Osher. Nonlocal operators with applications to image processing. Multiscale Model. Simul., 7:1005-1028, 2008.
[16] F. Glazer, G. Reynolds, and P. Anandan. Scene matching by hierarchical correlation. In CVPR, pages 432-441, 1983.
[17] B. Horn. Robot Vision. MIT Press, 1986.
[18] B. Horn and B. Schunck. Determining optical flow. Artificial Intelligence, 16:185-203, Aug. 1981.
[19] I. Hsiao, A. Rangarajan, and G. Gindi. A new convex edge-preserving median prior with applications to tomography. IEEE TMI, 22(5):580-585, 2003.
[20] R. G. Keys. Cubic convolution interpolation for digital image processing. IEEE TASSP, 29:1153-1160, 1981.
[21] C. Lei and Y.-H. Yang. Optical flow estimation on coarse-to-fine region-trees using discrete optimization. In ICCV, 2009.
[22] V. Lempitsky, S. Roth, and C. Rother. FusionFlow: Discrete-continuous optimization for optical flow estimation. In CVPR, 2008.
[23] Y. Li and S. Osher. A new median formula with applications to PDE based denoising. Commun. Math. Sci., 7(3):741-753, 2009.
[24] B. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In IJCAI, pages 674-679, 1981.
[25] H.-H. Nagel and W. Enkelmann. An investigation of smoothness constraints for the estimation of displacement vector fields from image sequences. PAMI, 8(5):565-593, 1986.
[26] W. H. Press, W. T. Vetterling, S. A. Teukolsky, and B. P. Flannery. Numerical Recipes in C++: The Art of Scientific Computing. Cambridge University Press, New York, NY, USA, 2002.
[27] X. Ren. Local grouping for optical flow. In CVPR, 2008.
[28] L. I. Rudin, S. Osher, and E. Fatemi. Nonlinear total variation based noise removal algorithms. Phys. D, 60(1-4):259-268, 1992.
[29] P. Sand and S. Teller. Particle video: Long-range motion estimation using point trajectories. IJCV, 80(1):72-91, 2008.
[30] D. Sun, S. Roth, and M. J. Black. A quantitative analysis of current practices in optical flow estimation and the principles behind them. Technical report, Brown-CS-10-03, 2010.
[31] D. Sun, S. Roth, J. Lewis, and M. J. Black. Learning optical flow. In ECCV, volume 3, pages 83-97, 2008.
[32] A. Wedel, T. Pock, J. Braun, U. Franke, and D. Cremers. Duality TV-L1 flow with fundamental matrix prior. In IVCNZ, 2008.
[33] A. Wedel, T. Pock, and D. Cremers. Structure- and motion-adaptive regularization for high accuracy optic flow. In ICCV, 2009.
[34] A. Wedel, T. Pock, C. Zach, D. Cremers, and H. Bischof. An improved algorithm for TV-L1 optical flow. In Dagstuhl Motion Workshop, 2008.
[35] M. Werlberger, W. Trobin, T. Pock, A. Wedel, D. Cremers, and H. Bischof. Anisotropic Huber-L1 optical flow. In BMVC, 2009.
[36] J. Xiao, H. Cheng, H. Sawhney, C. Rao, and M. Isnardi. Bilateral filtering-based optical flow estimation with occlusion detection. In ECCV, pages I:211-224, 2006.
[37] L. Xu, J. Chen, and J. Jia. A segmentation based variational model for accurate optical flow estimation. In ECCV, pages 671-684, 2008.
[38] K. Yoon and I. Kweon. Adaptive support-weight approach for correspondence search. PAMI, 28(4):650-656, 2006.
[39] C. Zach, T. Pock, and H. Bischof. A duality based approach for realtime TV-L1 optical flow. In DAGM, 2007.
[40] H. Zimmer, A. Bruhn, J. Weickert, L. Valgaerts, A. Salgado, B. Rosenhahn, and H.-P. Seidel. Complementary optic flow. In EMMCVPR, 2009.
[41] C. Zitnick, N. Jojic, and S. B. Kang. Consistent segmentation for optical flow estimation. In ICCV, pages 1308-1315, 2005.
