{"title": "Illumination and View Position in 3D Visual Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 404, "page_last": 411, "abstract": null, "full_text": "Illumination and View Position in 3D Visual Recognition \n\nAmnon Shashua \n\nM.I.T. Artificial Intelligence Lab., NE43-737 \n\nand Department of Brain and Cognitive Science \n\nCambridge, MA 02139 \n\nAbstract \n\nIt is shown that both changes in viewing position and illumination conditions \ncan be compensated for, prior to recognition, using combinations \nof images taken from different viewing positions and different illumination \nconditions. It is also shown that, in agreement with psychophysical \nfindings, the computation requires at least a sign-bit image as input; \ncontours alone are not sufficient. \n\n1 Introduction \n\nThe task of visual recognition is natural and effortless for biological systems, yet \nthe problem of recognition has proven very difficult to analyze from \na computational point of view. The fundamental reason is that novel images of \nfamiliar objects are often not sufficiently similar to previously seen images of that \nobject. Assuming a rigid and isolated object in the scene, there are two major \nsources of this variability: geometric and photometric. The geometric source of \nvariability comes from changes of view position. A 3D object can be viewed from a \nvariety of directions, each resulting in a different 2D projection. The difference is \nsignificant, even for modest changes in viewing position, and can be demonstrated \nby superimposing those projections (see Fig. 4, first row, second image). Much \nattention has been given to this problem in the visual recognition literature ([9], \nand references therein), and recent results show that one can compensate for changes \nin viewing position by generating novel views from a small number of model views \nof the object [10, 4, 8]. 
\n\nFigure 1: A 'Mooney' image. See text for details. \n\nThe photometric source of variability comes from changing illumination conditions \n(positions and distribution of light sources in the scene). This has the effect of \nchanging the brightness distribution in the image, and the location of shadows \nand specular reflections. The traditional approach to this problem is based on the \nnotion of edge detection. The idea is that discontinuities in image brightness remain \nstable under changes of illumination conditions. This invariance is not complete, \nand furthermore it is an open question whether this kind of contour information is \nsufficient, or even relevant, for purposes of visual recognition. \n\nConsider the image in Fig. 1, adopted from Mooney's Closure Faces Test [6]. Most \nobservers show no difficulty in interpreting the shape of the object from the right-hand \nimage, but cannot identify the object when presented with only the contours. \nAlso, many of the contours are shadow contours and therefore critically rely on the \ndirection of the light source. In Fig. 2 four frontal images of a doll from four different \nillumination conditions are shown together with their intensity step edges. The \nchange in the contour image is significant and is not limited to shadow contours: \nsome object edges appear or disappear as a result of the change in brightness \ndistribution. Also shown in Fig. 4 is a sign-bit image, obtained by taking the sign of \nthe intensity image after convolution with a Difference of Gaussians. As with the Mooney image, it is \nconsiderably more difficult to interpret the image of a complex object with only the \nzero-crossing (or level-crossing) contours than when the sign-bits are added. 
\n\nIt seems, therefore, that a successful recognition scheme should be able to cope \nwith changes in illumination conditions, as well as changes in viewing positions, by \nworking with a richer source of information than just contours (for a different point \nof view, see [1]). The minimal information that seems to be sufficient, at least for \ncoping with the photometric problem, is the sign-bit image. \n\nThe approach to visual recognition in this study is in line with the 'alignment' \napproach [9] and is also inspired by the work of Ullman and Basri [10] who show that \nthe geometric source of variability can be handled by matching the novel projection \nto a linear combination of a small number of previously seen projections of that \nobject. A recognition scheme that can handle both the geometric and photometric \nsources of variability is suggested by introducing three new results: (i) any image of a \nsurface with a linear reflectance function (including Lambertian and Phong's model \nwithout point specularities) can be expressed as a linear combination of a fixed \nset of three images of that surface taken under different illumination conditions, \n(ii) from a computational standpoint, the coefficients are better recovered using the \nsign-bit image rather than the contour image, and (iii) one can compensate for both \nchanges in viewing position and illumination conditions by using combinations of \nimages taken from different viewing positions and different illumination conditions. \n\n2 Linear Combination of Images \n\nWe start by assuming that view position is fixed and the only parameters that are \nallowed to change are the positions and distribution of light sources. The more general \nresult that includes changes in viewing positions will be discussed in section 4. 
\n\nProposition 1 All possible images of a surface, with a linear reflectance function, \ngenerated by all possible illumination conditions (positions and distribution of light \nsources) are spanned by a linear combination of images of the surface taken from \nindependent illumination conditions. \n\nProof: Follows directly from the general result that if f_j(x), x \u2208 R^k, j = 1, ..., k, \nare k linear functions, which are also linearly independent, then for any linear \nfunction f(x), we have that f(x) = \u03a3_j a_j f_j(x), for some constants a_j. \u25a1 \nThe simplest case for which this result holds is the Lambertian reflectance model \nunder a point light source (observed independently by Yael Moses, personal \ncommunication). Let r be an object point projecting to p. Let n_r represent the normal \nand albedo at r (direction and magnitude), and s represent the light source and \nits intensity. The brightness at p under the Lambertian model is I(p) = n_r \u00b7 s, \nand because s is fixed for all points p and can be written as s = \u03a3_j a_j s_j, we have \nI(p) = a_1 I_1(p) + a_2 I_2(p) + a_3 I_3(p) \nwhere I_j(p) is the brightness under light source s_j and where s_1, s_2, s_3 are linearly \nindependent. This result generalizes, in a straightforward manner, to the case of \nmultiple light sources as well. \n\nThe Lambertian model is suitable for matte surfaces, i.e. surfaces that diffusely \nreflect incoming light rays. One can add a 'shininess' component to account for \nthe fact that for non-ideal Lambertian surfaces, more light is reflected in a direction \nmaking an equal angle of incidence with reflectance. In Phong's model of \nreflectance [7] this takes the form of (n_r \u00b7 h)^c where h is the bisector of s and \nthe viewer's direction v. The power constant c controls the degree of sharpness of \nthe point specularity, therefore outside that region one can use a linear version of \nPhong's model by replacing the power constant with a multiplicative constant, to \nget the following function: I(p) = n_r \u00b7 
[s + \u03c1(v + s)]. As before, the bracketed vector \nis fixed for all image points and therefore the linear combination result holds. \n\nThe linear combination result suggests therefore that changes in illumination can \nbe compensated for, prior to recognition, by selecting three points (that are visible \nto s, s_1, s_2, s_3) to solve for a_1, a_2, a_3 and then matching the novel image I with \nI' = \u03a3_j a_j I_j. The two images should match along all points p whose object points r are \nvisible to s_1, s_2, s_3 (even if n_r \u00b7 s < 0, i.e. p is attached-shadowed); approximately \nmatch along points for which n_r \u00b7 s_j < 0, for some j (I_j(p) is truncated to zero, \ngeometrically s is projected onto the subspace spanned by the remaining basis light \nsources) and not match along points that are cast-shadowed in I (n_r \u00b7 s > 0 but \nr is not visible to s because of self occlusion). Coping with cast-shadows is an \nimportant task, but is not in the scope of this paper. \n\nFigure 2: Linear combination of model images taken from the same viewing position \nand under different illumination conditions. Row 1,2: Three model images taken under \na varying point light source, and the input image, and their brightness edges. Row 3: \nThe image generated by the linear combination of the model images, its edges, and the \ndifference edge image between the input and generated image. \n\nThe linear combination result also implies that, for the purposes of recognition, one \ndoes not need to recover shape or light source direction in order to compensate for \nchanges in brightness distribution and attached shadows. Experimental results, on \na non-ideal Lambertian surface, are shown in Fig. 2. \n\n3 Coefficients from Contours and Sign-bits \n\nMooney pictures, such as in Fig. 
1, demonstrate that humans can cope well with \nsituations of varying illumination by using only limited information from the input \nimage, namely the sign-bits, yet are not able to do so from contours alone. This \nobservation can be predicted from a computational standpoint, as shown below. \n\nFigure 3: Compensating for both changes in view and illumination. Row 1: Three model \nimages, one of which is taken from a different viewing direction (23\u00b0 apart), and the input \nimage from a novel viewing direction (in between the model images) and illumination \ncondition. Row 2: difference image between the edges of the input image (shown separately \nin Fig. 4) and the edges of the view transformed first model image (first row, left-hand), \nthe final generated image (linear combination of the three transformed model images), its \nedges, and the difference image between edges of input and generated image. \n\nProposition 2 The coefficients that span an image I by the basis of three other \nimages, as described in proposition 1, can be solved, up to a common scale factor, \nfrom just the contours of I, i.e. its zero-crossings or level-crossings. \n\nProof: Let a_j be the coefficients that span I by the basis images I_j, j = 1, 2, 3, i.e. \nI = \u03a3_j a_j I_j. Let \u00ce, \u00ce_j be the result of applying a Difference of Gaussians (DOG) \noperator, with the same scale, on images I, I_j, j = 1, 2, 3. Since DOG is a linear \noperator we have that \u00ce = \u03a3_j a_j \u00ce_j. Since \u00ce(p) = 0 along zero-crossing points p of \nI, then by taking any three zero-crossing points, which are not on a cast-shadow \nborder, we get a homogeneous set of equations from which a_j can be solved up to \na common scale factor. \n\nSimilarly, let k be an unknown threshold applied to I. Therefore, along level-crossings \nof I we have k = \u03a3_j a_j I_j, hence 4 level-crossing points, that are visible to all \nfour light sources, are sufficient to solve for a_j and k. 
\u25a1 \nThis result is in accordance with what is known from the image compression literature \non reconstructing an image, up to a scale factor, from contours alone [2]. In both \ncases, here and in image compression, this result may be difficult to apply in practice \nbecause the contours are required to be given at sub-pixel accuracy. One can relax \nthe accuracy requirement by using the gradients along the contours, a technique \nthat works well in practice. Nevertheless, neither gradients nor contours at sub-pixel \naccuracy are provided by Mooney pictures, which leaves us with the sign-bits \nas the source of information for solving for the coefficients. \n\nFigure 4: Compensating for changes in viewing position and illumination from a single \nview (model images are all from a single viewing position). Model images are the same \nas in Fig. 2, input image the same as in Fig. 3. Row 1: edges of input image, overlay \nof input edge image and edges of first model image, overlay with edges of the 2D affine \ntransformed first model image, sign-bit input image with marked 'example' locations (16 \nof them). Row 2: linear combination image of the 2D affine transformed model images, \nthe final generated image, its edges, overlay with edges of the input image. \n\nProposition 3 Solving for the coefficients from the sign-bit image of I is equivalent \nto solving for a separating hyperplane in 3D in which image points serve as \n'examples'. \nProof: Let z(p) = (I_1, I_2, I_3)^T be a vector function and w = (a_1, a_2, a_3)^T be the \nunknown weight vector. Given the sign-bit image \u00ce of I, we have that for every \npoint p, excluding zero-crossings, the scalar product w^T z(p) is either positive or \nnegative. In this respect, one can consider points in \u00ce as 'examples' in 3D space \nand the coefficients a_j as a vector normal to the separating hyperplane. 
\u25a1 \nA similar result can be obtained for the case of a thresholded image. The separating \nhyperplane in that case is defined in 4D, rather than 3D. Many schemes for finding a \nseparating hyperplane have been described in Neural Network literature (see [5] for \nreview) and in Discriminant Analysis literature ([3], for example). Experimental \nresults in the next section show that 10-20 points, distributed over the \nentire object, are sufficient to produce results that are indistinguishable from those \nobtained from an exact solution. \n\nBy using the sign-bits instead of the zero-crossing contours we are trading a unique \n(up to a scale factor), but unstable, solution for an approximate, but stable, one. \nAlso, by taking the sample points relatively far away from the contours (in order to \nminimize the chance of error) the scheme can tolerate a certain degree of misalignment \nbetween the basis images and the novel image. This property will be used \nin one of the schemes, described below, for combining changes of viewing positions \nand illumination conditions. \n\n4 Changing Illumination and Viewing Positions \n\nIn this section, the recognition scheme is generalized to cope with both changes in \nillumination and viewing positions. Namely, given a set of images of an object as \na model and an input image viewed from a novel viewing position and taken under \na novel illumination condition we would like to generate an image, from the model, \nthat is similar to the input image. \n\nProposition 4 Any set of three images, satisfying conditions of proposition 1, of \nan object can be used to compensate for both changes in view and illumination. \n\nProof: Any change in viewing position will induce both a change in the location \nof points in the image, and a change in their brightness (because of change in \nviewing angle and change in angle between light source and surface normal). 
From \nproposition 1, the change in brightness can be compensated for provided all the \nimages are in alignment. What remains, therefore, is to bring the model images \nand the input image into alignment. \nCase 1: If each of the three model images is viewed from a different position, then \nthe remaining proof follows directly from the result of Ullman and Basri [10] who \nshow that any view of an object with smooth boundaries, undergoing any affine \ntransformation in space, is spanned by three views of the object. \n\nCase 2: If only two of the model images are viewed from different positions, then \ngiven full correspondence between all points in the two model views and 4 corresponding \npoints with the input image, we can transform all three model images \nto align with the input image in the following way. The 4 corresponding points \nbetween the input image and one of the model images define three corresponding \nvectors (taking one of the corresponding points, say o, as an origin) from which a 2D \naffine transformation, matrix A and vector w, can be recovered. The result, proved \nin [8], is that for every point p' in the input image that is in correspondence with p \nin the model image we have that p' = [Ap + o' - Ao] + \u03b1_p w. The parameter \u03b1_p is \ninvariant to any affine transformation in space, therefore is also invariant to changes \nin viewing position. One can, therefore, recover \u03b1_p from the known correspondence \nbetween two model images and use that to predict the location p'. It can be shown \nthat this scheme also provides a good approximation in the case of objects with \nsmooth boundaries (like an egg or a human head, for details see [8]). \n\nCase 3: All three model images are from the same viewing position. The model \nimages are first brought into 'rough alignment' (term adopted from [10]) with the \ninput image by applying the transformation Ap + o' - Ao + w to all points p in each \nmodel image. 
The remaining displacement between the transformed model images \nand the input image is (\u03b1_p - 1)w which can be shown to be bounded by the depth \nvariation of the surface [8]. (In case the object is not sufficiently flat, more than \n4 points may be used to define local transformations via a triangulation of those \npoints). The linear combination coefficients are then recovered using the sign-bit \nscheme described in the previous section. The three transformed images are then \nlinearly combined to create a new image that is compensated for illumination but \nis still displaced from the input image. The displacement can be recovered by using \na brightness correlation scheme along the direction w to find \u03b1_p - 1 for each point \np (for details, see [8]). \u25a1 \nExperimental results of the last two schemes are shown in Figs. 3 and 4. The four \ncorresponding points, required for view compensation, were chosen manually along \nthe tip of eyes, eye-brow and mouth of the doll. The full correspondence that is \nrequired between the third model view and the other two in scheme 2 above was established \nby first taking two pictures of the third view, one from a novel illumination \ncondition and the other from a similar illumination condition to one of the other \nmodel images. Correspondence was then determined by using the scheme described \nin [8]. The extra picture was then discarded. The sample points for the linear \ncombination were chosen automatically by selecting 10 points in smooth brightness \nregions. The sample points using the sign-bit scheme were chosen manually. \n\n5 Summary \n\nIt has been shown that the effects of photometry and geometry in visual recognition \ncan be decoupled and compensated for prior to recognition. 
Three new results were \nshown: (i) photometric effects can be compensated for using a linear combination \nof images, (ii) from a computational standpoint, contours alone are not sufficient \nfor recognition, and (iii) geometrical effects can be compensated for from any set of \nthree images, from different illuminations, of the object. \n\nAcknowledgments \n\nI thank Shimon Ullman for his advice and support. Thanks to Ronen Basri, Tomaso \nPoggio, Whitman Richards and Daphna Weinshall for many discussions. A.S. is \nsupported by NSF grant IRI-8900267. \n\nReferences \n\n[1] Cavanagh, P. Proc. 19th ECVP, Andrei, G. (Ed.), 1990. \n[2] Curtis, S.R. and Oppenheim, A.V. in Whitman, R. and Ullman, S. (eds.) Image Understanding \n1989. pp. 92-110, Ablex, NJ 1990. \n[3] Duda, R.O. and Hart, P.E. Pattern Classification and Scene Analysis. NY, Wiley 1973. \n[4] Edelman, S. and Poggio, T. Massachusetts Institute of Technology, A.I. Memo 1181, \n1990. \n[5] Lippmann, R.P. IEEE ASSP Magazine, pp. 4-22, 1987. \n[6] Mooney, C.M. Can. J. Psychol. 11:219-226, 1957. \n[7] Phong, B.T. Comm. ACM, 18, 6:311-317, 1975. \n[8] Shashua, A. Massachusetts Institute of Technology, A.I. Memo 1927, 1991. \n[9] Ullman, S. Cognition, 32:193-254, 1989. \n[10] Ullman, S. and Basri, R. Massachusetts Institute of Technology, A.I. Memo 1052, 1989. \n\n", "award": [], "sourceid": 463, "authors": [{"given_name": "Amnon", "family_name": "Shashua", "institution": null}]}