Just last week, the internet went into a frenzy over Microsoft’s How-Old robot, which can more or less predict a person’s age from a single image. While it wasn’t perfect, people were obsessed with how close (or far off) their results were. I knew something interesting was going on under the hood, so I spoke with Dan Becker, one of our data science instructors, about what technology Microsoft might be using to conjure up this magic.
It’s all in the Pixels
We all know that images are composed of pixels. So how does a computer understand how these tiny units of data relate to each other, and use high-level abstractions to draw conclusions about something as complicated as age?
“A computer represents pixels as numbers. For example, a darker pixel might be a higher number, and a lighter pixel might be a lower number,” said Dan.
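To make that concrete, here’s a minimal sketch of what “pixels as numbers” looks like in code. The image here is made up (a tiny 3×3 grid), and the convention of higher numbers meaning darker pixels follows Dan’s example:

```python
import numpy as np

# A tiny 3x3 grayscale "image": each pixel is just a number.
# Following Dan's example, higher values mean darker pixels.
image = np.array([
    [  0, 128, 255],
    [ 64, 200,  32],
    [255,   0, 128],
])

darkest = image.max()   # the darkest pixel in the image
lightest = image.min()  # the lightest pixel in the image
print(darkest, lightest)
```

Everything a computer does with an image starts from a grid of numbers like this one.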
When folks first started trying to get computers to understand images, they would hand-pick specific patterns for computers to look for – for example, lines intersecting at 90 degrees, parallel lines, or shapes that resemble curves. But now data science has advanced enough for computers to look at layers and interactions that build upon each other. According to Dan, here’s what a computer might do now:
- Look for places where there are curved lines
- If it notices two curved lines interacting in a specific way, it might look for more details (for example, it might conclude that the pattern is an eye)
- Once it recognizes all of the interesting features, it can then build on top of everything it knows to determine the overall structure (in this case, a face)
When a computer first analyzes an image, it has no idea what it’s looking at. People used to have computers make the jump from pixels to faces all in one go, but now a computer breaks it down step by step, pattern by pattern.
Using deep learning and representation learning to build many intermediate steps between “here are some pixels” and “this is a face” makes it possible for a computer to recognize all sorts of patterns and interesting features, such as curved lines, eyes, mouths, and even the possible age of the person in the photo.
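The layered idea above can be sketched as a toy feed-forward network. This is not Microsoft’s actual model – the weights here are random rather than learned, and the layer sizes are invented for illustration – but it shows the shape of the computation: each layer turns one representation into a slightly more abstract one:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    # A common nonlinearity: keep positive values, zero out the rest
    return np.maximum(0, x)

# "Here are some pixels": an 8x8 image, flattened to 64 numbers
pixels = rng.random(64)

# Random (untrained) weights standing in for learned layers
w1 = rng.standard_normal((32, 64))  # pixels -> edge-like features
w2 = rng.standard_normal((16, 32))  # edges  -> part-like features (eyes, mouths, ...)
w3 = rng.standard_normal((1, 16))   # parts  -> a single "face" score

edges = relu(w1 @ pixels)   # first intermediate representation
parts = relu(w2 @ edges)    # second, more abstract representation
face_score = w3 @ parts     # "this is a face": one summary number

print(face_score.shape)
```

In a real system, training adjusts those weight matrices so the intermediate representations actually correspond to useful features instead of noise.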
“We no longer need to make the jump from pixels to recognizing a face in a single step,” said Dan.
Age is nothing but a (very complicated) number
So how is Microsoft’s app able to make the jump from lines and faces to an actual age?
According to Dan, “their team likely has a huge repository of images to work with that fall into two categories: ones where they know the actual age of the person, and ones where they don’t, such as images from a Bing search.”
Using something called neural networks, they’re able to build many levels of abstraction that go beyond just seeing a face – and a computer can get closer and closer to being able to reach a conclusion about something like age. Since they’re checking this guess against a huge collection of images, they can tweak the final step of the system many times to get increasingly accurate results.
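Here’s a toy version of that “tweak the final step” loop. The setup is invented – we pretend earlier layers have already reduced each image to 5 high-level features, and we only fit the final step that maps features to an age – but the repeated guess-check-adjust cycle (gradient descent) is the real idea:

```python
import numpy as np

rng = np.random.default_rng(1)

# Pretend each image has already been reduced to 5 high-level
# features by earlier layers of the network.
features = rng.random((100, 5))            # 100 "images" with known ages
true_w = np.array([10.0, 20.0, 5.0, 15.0, 30.0])
ages = features @ true_w                   # the actual ages we check against

# Start with a blank final step and repeatedly tweak it to
# shrink the error of the age guesses.
w = np.zeros(5)
for _ in range(2000):
    guesses = features @ w
    error = guesses - ages
    w -= 0.1 * (features.T @ error) / len(ages)  # nudge weights downhill

mean_error = np.abs(features @ w - ages).mean()
print(mean_error)  # average error shrinks toward zero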
Additionally, most programs like this use something called a convolutional neural network. Instead of doing pixel-by-pixel analysis across the whole image, an algorithm will start out with smaller snippets of the image and analyze those first. For example, it might just look at patches made up of 10×10 pixels and see how the pixels within each patch interact with each other.
“If you didn’t look at one patch at a time, the computer might be making comparisons that aren’t relevant. The interesting interactions that happen in an image are from pixels that are closer together,” said Dan. “This type of neural network makes it easier to analyze images in a way that reflects how images are perceived by humans.”
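A minimal sketch of that patch idea: slide a 10×10 window over a toy image and collect the patches, so later analysis only compares pixels that are close together. (A real convolutional network would also learn filters to apply to each patch and usually uses overlapping windows; this just shows the carving-up step.)

```python
import numpy as np

rng = np.random.default_rng(2)
image = rng.random((40, 40))  # a toy 40x40 grayscale image

patch_size = 10
patches = []
# Slide a 10x10 window across the image, one patch at a time,
# so analysis stays local to nearby pixels.
for row in range(0, image.shape[0] - patch_size + 1, patch_size):
    for col in range(0, image.shape[1] - patch_size + 1, patch_size):
        patches.append(image[row:row + patch_size, col:col + patch_size])

print(len(patches))      # 16 non-overlapping patches from a 40x40 image
print(patches[0].shape)  # each patch is 10x10
```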
How one little thing can screw everything up
“One of the most fascinating things in this field of data science is known as the adversarial pixel,” said Dan. “Neural networks are very good at figuring out what an image is – say a boat, a cat, or a house – but many of them can be tricked by changes in just a few pixels that a human wouldn’t even notice.”
For example, if you put two images that appeared identical to a human – but had a few pixels of difference between them – a neural network might determine that one image is a bird and the other is a fish.
According to Dan, “One pixel may be enough to change one layer of the network, and this can cascade down into affecting the final outcome in a big way.”
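A toy illustration of that cascade: the “classifier” below is just a weighted sum with a threshold, far simpler than a real neural network, and the weights and labels are invented. But it shows the mechanism – when one input carries a large weight, a change a human would never notice can flip the final answer:

```python
import numpy as np

# A toy linear "classifier": score > 0.5 means "bird", otherwise "fish".
# One input carries a large weight, so a tiny change there has an
# outsized effect on the final decision.
weights = np.array([0.1, -0.2, 0.3, 5.0])

image = np.array([0.5, 0.5, 0.5, 0.10])
tweaked = image.copy()
tweaked[3] = 0.05  # change a single "pixel" by an imperceptible amount

def label(x):
    return "bird" if weights @ x > 0.5 else "fish"

print(label(image), label(tweaked))
```

In a deep network the effect is the same but multiplied: a small change in one layer can be amplified by every layer after it.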
While this is a fascinating concept, it doesn’t affect most of the real-world applications of this kind of technology, so there hasn’t been a whole lot of research that’s gone into it.
While this is a basic overview, it’s more or less how modern image analysis and apps like how-old.net work. Want to hear more about the data science and programming that powers your favorite applications? Let us know what you want to read about in the comments below or on Twitter.