## Posts Tagged ‘object recognition’

## A quick peek behind the curtain: Position detection, “Where are you?” (Part 3)

Hi everyone, and Happy new year to you from 3DCalifornia!

First of all, we would like to thank you all for this tremendous year 2010 we had, and wish you the best for 2011! And because we love 3D, we did a small demo for you, using our partner’s technology D’Fusion, and it is available here. Feel free to try it and tell us what you think of it!

### Let’s open our eyes

So we were on this series of articles about position detection, and this episode is supposed to be showing how computer vision can be used for that. Here we go!

First things first, what is computer vision? We explained briefly in one of the previous episodes what light was and what colors were. Our eyes can perceive lights and colors, and that’s mainly what they do. Then they send the information to the brain where lights and colors (low level information) become distances, objects, faces (known, unknown), words, etc (high level information).

Computer vision is the domain that studies the algorithm a computer needs to see, and to see high level information. To recognize that 2 objects are identical in 2 different images, to recognize a bunch of pixels as a tree or a bike, to recognize a face in an image and to know that it’s someone’s face in particular. And as this is the subject of this article, to determine the position of an object inside an image.

### Two birds with one stone

So, why should we use computer vision for position detection? Because in most cases, we already have all the hardware up and running: we’re trying to do augmented reality, and for that, we need to add virtual objects to a live video stream. So in most cases, we already have a camera. This single piece of hardware will be allowing us to do both reality sampling and position detection. No need for expensive magnetic device, no need for infrared lights. Just a camera and a computer.

A computer vision algorithm can be more or less complicated, but it will usually rely on one thing: the value of the pixels. We have a pixel. Its color, or brightness can be quantified. So if we replace all the pixels by their numerical value, we now have a grid of numbers, a matrix, and mathematics are really good at taking information out of matrices, so that’s all. We throw some maths at our digital image, and we get the info.

So what’s the battle plan? We have an incoming video stream and we want to output its coordinates, as fast as possible. There are multiple solutions, and most of them will differ by 3 aspects: Assumptions, Learning and Running. In order to find your favorite teapot, you first have to know what a teapot is, what your favorite teapot looks like and then you need to look everywhere and figure out if you can see it. The steps are the same here.

### “The least questioned assumptions are often the most questionable”

The quote is from Paul Broca. The question is “what do we know about the object we want to search”. For example, if it is a building, we know that we will probably see a lot of horizontal and vertical lines. If we’re looking for an old augmented reality marker, we will see a thick black square with black and white squares inside. And with Total Immersion’s technology, Markerless Tracking (MLT), only few assumptions are necessary, only considering we will be seeing an image that has no symmetry and that has some contrast in it. So first, we decide what kind of assumptions we keep. The larger the assumptions, the easier computation will be, but the more constraints it will create.

### “The moment you stop learning, you stop leading”

This quote of Rick Warren will sure be a good introduction. Learning is the phase where we give the algorithm a way to differentiate any image that fits the assumptions and the image we’re specifically looking for. With MLT, for example, it is the step where we give the algorithm a clear view of the target we’ll be tracking. In this phase, which is usually not real time, we will extract features (middle level information) from images such as borders, corners, interest points or keypoints.

Keypoints are points that have a special property (usually a mathematical property), and this property is chosen to be stable, which means that when the object moves within your video stream, these properties will stay with the objects and still be visible. During the learning phase, you learn how to position the keypoints from one another. And now you’re ready for the race.

### Running… For president?

So this is it. Now, we have an user in front of the camera, and we need to get the position of a target he has in his hands. So what do we do? We use the algorithm we prepared. If we learned the positions on the corners, we’ll be analyzing the image, searching for corners. If we learned the keypoints, we’ll look for the keypoints. It’s easy to determine that the image we’re looking for is indeed in the video stream. The hard part is to determine where. That’s where some clever filtering and modeling algorithms like RANSAC take place. RANSAC (for RANdom SAmple Consensus) take some data as an input (let’s say the points in the image A), and a model (let’s say a line) from which we want to find the parameters (parameters for a line will be height and slope for example), then will look for the points that fits the model the best, and will completely ignore the points that don’t fit it. It will then output the good points, and their good model (the blue line).

In an exact similar way, given all the keypoints on the image and a model (the keypoints from the learning, the unknown parameters will be their position and orientation), and RANSAC gives us the proper model (position and orientation) that fit our object’s keypoint, ignoring the background keypoints.

### Conclusion

Pfew, that was quite a trip, wasn’t it? We did it! We now have the position and orientation of our object in the video stream. And, even better: now we know the position in this image, it will be even easier to find it in the next image, because we know it can’t have moved that much. We can make new assumptions, and new assumptions mean it’s easier to compute, which means it runs faster!

So that was a few selected ideas of how mathematics and their applications in computer vision can really make your life easier when you’re trying to augment reality. And this concludes this three-parted Quick peek behind the curtain article. I hope you liked it! If you want me to tackle a specific subject on Augmented and Virtual technologies next time, feel free to drop a comment, a mail, or anything.