
Introduction

Finding corresponding points in image pairs or image sequences is a central problem in computer vision. Most classical methods assume brightness constancy and perform best when tracking high-contrast regions that lie on a single surface. However, many images have visually important features that violate this assumption. Developing methods to track corresponding points that lie on occluding boundaries is necessary if one is to track complicated objects with multiple articulated surfaces, such as the human face.
 
 
 
Figure 1:  Correspondence is difficult when a uniform surface moves across different background patterns. Consider the correspondence of window A with windows B or C; traditional robust methods equate the match between A:B and A:C, since the ``outlier'' regions in each are equally different. 
[Figure 1 panels: fig1a.ps, fig1b.ps]
  
 
In recent years, robust estimation methods have been applied to image correspondence, and have been shown to considerably improve performance in cases of occlusion. Black and Anandan pioneered robust optic flow using redescending error norms that substantially discount the effect of outliers [1]. Shizawa and Mase derived methods for transparent local flow estimation [2]. Bhat and Nayar have advocated the use of rank statistics for robust correspondence [4]; Zabih and Woodfill use ordering statistics combined with spatial structure in the CENSUS transform [5]. Several authors have explored methods of finding image ``layers'' to pool motion information over arbitrarily shaped regions of support and to iteratively refine parameter estimates [6,8,7], but these methods generally assume models of global object motion to define coherence.
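To make the behavior of a redescending error norm concrete, the sketch below compares a window match under a standard sum-of-squared-differences cost and under a Geman-McClure-style saturating cost. The function names and the choice of norm are illustrative, not taken from the implementations cited above.

import numpy as np

def ssd_cost(a, b):
    # Standard sum-of-squared-differences: every pixel contributes fully,
    # so mismatched background ("outlier") pixels can dominate the score.
    d = a.astype(float) - b.astype(float)
    return np.sum(d * d)

def robust_cost(a, b, sigma=10.0):
    # Geman-McClure redescending norm: each residual's contribution
    # saturates toward 1, so grossly mismatched pixels are discounted
    # rather than dominating the match.
    d = a.astype(float) - b.astype(float)
    r2 = d * d
    return np.sum(r2 / (r2 + sigma * sigma))

Note that the saturation is also what produces the ambiguity of Figure 1: once the background pixels' residuals saturate, windows B and C each incur roughly the same capped penalty against A, so the robust cost cannot distinguish between them.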
 
 
 
Figure 2:  Finding local correspondences in regions with occlusion is a difficult challenge. (a,e) and (c,g) are images taken before and after a user's expression changes; (b,f) and (d,h) are enlarged views of corresponding points, with a cross drawn to indicate the center point of the window. Traditional correspondence methods have difficulty at points such as these, where there is little foreground texture, substantial occlusion, and variable sign of contrast at the occlusion boundary. 
[Eight image panels (a)-(h): eye and mouth image pairs with enlarged views of the marked points]
 
 
However, these methods make a critical assumption: that there will be sufficient contrast in the foreground (``inlier'') portion of an analysis window to localize the correspondence match. This is often not true, due either to a uniform foreground surface or to low-resolution video sampling. This problem is illustrated in Figure 1, which shows a foreground region with zero contrast in front of two different background regions; note that the sign of contrast changes at the occlusion boundary between the two frames. An example in real imagery is shown in Figure 2; the marked locations pose a considerable challenge for existing robust correspondence methods, since any window large enough to include substantial foreground contrast will include a very large percentage of outliers.

Most robust and non-robust correspondence methods fail when there is no coherent foreground contrast. Transparent-motion analysis [2,3,9,10] can potentially detect motion in these difficult cases, but has not, to date, been able to provide precise spatial localization of corresponding points. Smoothing methods such as regularization or parametric motion constraints (affine [11,12,13] or learned from examples [14]) can provide approximate localization when good motion estimates are available in nearby image regions, but this is not always the case. If a corpus of training images is available, techniques for feature or appearance modeling can solve these problems, cf. [18,19].

For many detailed image analysis/synthesis tasks, finding precise correspondences such as those shown in these figures is extremely important. Image compositing [15], automatic morphing [16], and video resynthesis [17] all require accurate correspondence, and slight flaws can yield perceptually significant errors. To obtain good results, authors of these methods have relied on extreme redundancy of measurement, human-assisted tracking, substantial smoothing, or domain-specific feature-appearance models.

In this paper, we describe a new method that can solve the correspondence tasks illustrated in Figures 1 and 2 using purely local image analysis, without prior training, and without smoothing or pooling of motion estimates. Our approach defines an image transform that characterizes the local structure of an image in a manner insensitive to points in an occluded region (i.e., outliers), but sensitive to the shape of the occlusion boundary itself. In essence, our method is to perform matching on a redundant, local representation of image homogeneity. In this paper we show examples where color is the attribute analyzed for homogeneity, but our method is applicable to other local image characteristics (such as texture, range data, or simply image intensity). While we show only sparse tracking results, our method can readily yield dense correspondences, assuming sufficient image contrast.
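As a rough illustration of matching on a local representation of homogeneity, consider accumulating, along each of several rays leaving a window's center pixel, the running product of per-pixel color similarities to the center color. Pixels beyond an occlusion boundary then contribute values near zero regardless of what the background looks like, while the position of the drop-off along each ray records the shape of the boundary. The sketch below is only a schematic of this idea, with illustrative names and parameters; the transform actually used is defined in the next section.

import numpy as np

def radial_homogeneity(img, cy, cx, n_rays=8, length=10, sigma=20.0):
    # Schematic sketch: along each ray from the center pixel, accumulate
    # the running product of similarities between each sample's color and
    # the center color. The profile stays near 1 over a homogeneous
    # foreground and falls toward 0 past an occlusion boundary, so the
    # occluded pixels' actual colors are ignored while the boundary's
    # position along each ray is preserved.
    # Assumes (cy, cx) lies at least `length` pixels inside the image.
    center = img[cy, cx].astype(float)
    out = np.zeros((n_rays, length))
    for k in range(n_rays):
        theta = 2.0 * np.pi * k / n_rays
        running = 1.0
        for t in range(1, length + 1):
            y = int(round(cy + t * np.sin(theta)))
            x = int(round(cx + t * np.cos(theta)))
            d2 = np.sum((img[y, x].astype(float) - center) ** 2)
            running *= np.exp(-d2 / (2.0 * sigma ** 2))  # similarity in (0, 1]
            out[k, t - 1] = running
    return out  # compare these profiles, rather than raw pixels, to match windows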

