Computer-based artificial intelligence (AI) has been around since the 1940s, but the current innovation boom around everything from virtual personal assistants and visual search engines to real-time translation and driverless cars has led to new milestones in the field. And ever since IBM’s Deep Blue beat Russian chess champion Garry Kasparov in 1997, machine-versus-human milestones inevitably bring up the question of whether AI can do things better than humans (it’s the inevitable fear around Ray Kurzweil’s singularity).
As image recognition experiments have shown, computers can identify hundreds of breeds of cats and dogs faster and more accurately than humans, but does that mean that machines are better than us at recognizing what’s in a picture? As with most comparisons of this sort, at least for now, the answer is a little bit yes and plenty of no.
Less than a decade ago, image recognition was a relatively sleepy subset of computer vision and AI, found mostly in photo organization apps, search engines and assembly line inspection. It ran on a mix of keywords attached to pictures and engineer-programmed algorithms. As far as the average user was concerned, it worked as advertised: Searching for donuts under “Images” in Google delivered page after page of doughy pastry-filled pictures. But getting those results was enabled only by laborious human intervention in the form of manually inputting said identifying keyword tags for each and every picture and feeding a definition of the properties of said donut into an algorithm. It wasn’t something that could easily scale.
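To see why that approach was hard to scale, here is a minimal sketch of keyword-based image search. The filenames, tags and function names are all hypothetical, invented for illustration; this is not how any particular search engine was built.

```python
from collections import defaultdict

# Hypothetical hand-entered tags: every filename and keyword is made up.
MANUAL_TAGS = {
    "img_001.jpg": ["donut", "pastry", "glazed"],
    "img_002.jpg": ["cat", "sofa"],
    "img_003.jpg": ["donut", "coffee"],
}

def build_index(tags_by_image):
    """Map each keyword to the set of images a person tagged with it."""
    index = defaultdict(set)
    for image, tags in tags_by_image.items():
        for tag in tags:
            index[tag.lower()].add(image)
    return index

def search(index, query):
    """Return the images whose human-supplied tags include the query word."""
    return sorted(index.get(query.lower(), set()))

index = build_index(MANUAL_TAGS)
print(search(index, "donut"))  # ['img_001.jpg', 'img_003.jpg']
print(search(index, "bagel"))  # [] because nobody typed in that tag
```

The search never looks at a single pixel. It can only find what a person has already described, which is why every new picture meant more manual work.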
More recently, however, advances using an AI training technology known as deep learning are making it possible for computers to find, analyze and categorize images without the need for additional human programming. Loosely based on human brain processes, deep learning implements large artificial neural networks (hierarchical layers of interconnected nodes) whose connections are adjusted as new information comes in, enabling computers to effectively teach themselves.
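One way to picture a network "teaching itself" is a toy example with two small layers of nodes. The sketch below is an illustration only: the task, the layer sizes, the learning rate and all names are assumptions, and real image recognition networks are vastly larger. Nothing in the code states the rule being learned; the connection weights are simply nudged after each example to reduce the error.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is repeatable

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Toy task (an assumption for illustration): output 1 if either input is 1.
DATA = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
HIDDEN = 3

# Layer 1 connects the 2 inputs to each hidden node; the third weight is a bias.
w1 = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(HIDDEN)]
# Layer 2 connects the hidden nodes to one output node; the last weight is a bias.
w2 = [random.uniform(-1, 1) for _ in range(HIDDEN + 1)]

def forward(x):
    """Pass one example through both layers; return hidden values and output."""
    hidden = [sigmoid(w[0] * x[0] + w[1] * x[1] + w[2]) for w in w1]
    out = sigmoid(sum(w2[j] * hidden[j] for j in range(HIDDEN)) + w2[HIDDEN])
    return hidden, out

def loss():
    """Mean squared error over the toy data set."""
    return sum((forward(x)[1] - y) ** 2 for x, y in DATA) / len(DATA)

def train_epoch(lr=0.5):
    """Nudge every weight a little after each example (gradient descent)."""
    for x, y in DATA:
        hidden, out = forward(x)
        # Error signal at the output node (constant factor folded into lr).
        d_out = (out - y) * out * (1 - out)
        for j in range(HIDDEN):
            # Error signal passed back to hidden node j, using the old weight.
            d_hidden = d_out * w2[j] * hidden[j] * (1 - hidden[j])
            w2[j] -= lr * d_out * hidden[j]
            w1[j][0] -= lr * d_hidden * x[0]
            w1[j][1] -= lr * d_hidden * x[1]
            w1[j][2] -= lr * d_hidden
        w2[HIDDEN] -= lr * d_out

loss_before = loss()
for _ in range(2000):
    train_epoch()
loss_after = loss()
print(f"error before training: {loss_before:.3f}")
print(f"error after training:  {loss_after:.3f}")
```

The same loop works unchanged if the examples in DATA are swapped for a different task, which is the sense in which the learning comes from data rather than from hand-written rules.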
As with human brains, artificial neural networks enable computers to get smarter the more data they process. And, when you’re running these deep learning techniques on supercomputers such as Baidu’s Minwa, which has 72 processors and 144 graphics processors (GPUs), you can input a phenomenal amount of data. Considering that more than three billion images are shared across the internet every day — Google Photos alone saw uploads of 50 billion photos in its first four months of existence — it’s safe to say that the amount of data available for training these days is phenomenal. So, is all this computing power and data making machines better than humans at image recognition?
There’s no doubt that recent advances in computer vision have been impressive . . . and rapid. As recently as 2011, humans beat computers by a wide margin when identifying images, in a test featuring approximately 50,000 images that needed to be categorized into one of 10 categories (“dogs,” “trucks” and others). Researchers at Stanford University developed software to take the test: It was correct about 80 percent of the time, whereas the human opponent, Stanford PhD candidate and researcher Andrej Karpathy, scored 94 percent.
Then, in 2012, a team at the Google X research lab approached the task a different way, by feeding 10 million randomly selected thumbnail images from YouTube videos into an artificial neural network with more than 1 billion connections spread over 16,000 CPUs. After this three-day training period was over, the researchers gave the machine 20,000 randomly selected images with no identifying information. The computer looked for the most recurring images and accurately identified ones that contained faces 81.7 percent of the time, human body parts 76.7 percent of the time, and cats 74.8 percent of the time.
At the 2014 ImageNet Large Scale Visual Recognition Challenge (ILSVRC), Google came in first place with a convolutional neural network approach that resulted in just a 6.6 percent error rate, almost half the previous year’s rate of 11.7 percent. The accomplishment was not simply correctly identifying images containing dogs, but correctly identifying around 200 different dog breeds in images, something that only the most computer-savvy canine experts might be able to accomplish in a speedy fashion. Once again, Karpathy, a dedicated human labeler who trained on 500 images and identified 1,500 images, beat the computer with a 5.1 percent error rate.
This record lasted until February 2015, when Microsoft announced it had beaten the human record with a 4.94 percent error rate. Then, in December of that year, Microsoft beat its own record with a 3.5 percent classification error rate at the most recent ImageNet challenge.
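For context on what these percentages measure: ImageNet classification results are usually reported as a top-5 error rate, meaning a guess counts as correct if the true label appears anywhere among the system’s five best guesses. The sketch below shows that calculation on made-up predictions; the labels and the function name are illustrative assumptions, not the challenge’s official scoring code.

```python
def top_k_error(ranked_guesses, true_labels, k=5):
    """Fraction of images whose true label is missing from the top-k guesses."""
    misses = sum(
        1
        for guesses, truth in zip(ranked_guesses, true_labels)
        if truth not in guesses[:k]
    )
    return misses / len(true_labels)

# Made-up ranked predictions for four images (best guess first).
guesses = [
    ["beagle", "basset", "foxhound", "harrier", "bloodhound"],
    ["tabby", "tiger cat", "lynx", "persian", "siamese"],
    ["teapot", "kettle", "vase", "pitcher", "jug"],
    ["pug", "boxer", "bulldog", "mastiff", "terrier"],
]
truth = ["harrier", "tabby", "elephant", "boxer"]

print(top_k_error(guesses, truth, k=5))  # 0.25: only the elephant is missed
print(top_k_error(guesses, truth, k=1))  # 0.75: only "tabby" is a first guess

# The 2014 result relative to the previous year's, using the figures above.
print(round(6.6 / 11.7, 2))  # 0.56, a bit more than half
```

Scoring on the best five guesses rather than the single best one is more forgiving, which matters when the labels include many near-identical dog breeds.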
Deep learning algorithms are also helping computers beat humans in other visual formats. Last year, a team of researchers at Queen Mary University of London developed a program called Sketch-a-Net, which identifies objects in sketches. The program correctly identified 74.9 percent of the sketches it analyzed, while the humans participating in the study correctly identified the objects only 73.1 percent of the time. That margin is not especially impressive, but as in the previous example with dog breeds, the computer did far better on fine-grained distinctions: It correctly identified which type of bird was drawn 42.5 percent of the time, compared with 24.8 percent for the people in the study, roughly 1.7 times their accuracy.
These numbers are impressive, but they don’t tell the whole story. “Even the smartest machines are still blind,” said computer vision expert Fei-Fei Li at a 2015 TED Talk on image recognition. Yes, convolutional neural networks and deep learning have helped improve accuracy rates in computer vision (they’ve even enabled machines to write surprisingly accurate captions for images), but machines still stumble in plenty of situations, especially when more context, backstory, or proportional relationships are required. Computers struggle when, say, only part of an object is in the picture, a scenario known as occlusion, and may have trouble telling the difference between an elephant’s head and trunk and a teapot. Similarly, they stumble when distinguishing between a statue of a man on a horse and a real man on a horse, or mistake a toothbrush being held by a baby for a baseball bat. And let’s not forget, we’re just talking about identification of basic everyday objects (cats, dogs, and so on) in images.
Computers still aren’t able to identify some pictures that seem simple to humans, such as a pattern of yellow and black stripes, which computers seem to think is a school bus. This technology is, unsurprisingly, still in its infancy. After all, it took the human brain 540 million years to evolve into its highly capable current form.
What computers are better at is sorting through vast amounts of data and processing it quickly, which comes in handy when, say, a radiologist needs to narrow down a list of x-rays with potential medical maladies or a marketer wants to find all the images relevant to their brand on social media. The things a computer is identifying may still be basic (a cavity, a logo), but it’s identifying them from a much larger pool of pictures, and it’s doing so quickly, without getting bored as a human might.
Humans still get nuance better, and can probably tell you more about a given picture thanks to basic common sense. For everyday tasks, humans still have significantly better visual capabilities than computers.
That said, the promise of image recognition and computer vision at large is massive, especially when seen as part of the larger AI pie. Computers may not have common sense, but they do have direct access to real-time big data, sensors, GPS, cameras and the internet, to name just a few technologies. From robot disaster relief and large-object avoidance in cars to high-tech criminal investigations and augmented reality (AR) gaming leaps and bounds beyond Pokémon GO, computer vision’s future may well lie in things that humans simply can’t (or won’t) do. One thing we can be certain of is this: It won’t take 540 million years to get there.