ISWYDS exploring object detection using Darknet and YOLOv4 @Design Museum Gent

Object Stories and AI meet. Part II

Olivier Van D'huynslager
5 min read · Nov 8, 2020

As we explained in the earlier post, convolutional neural networks are helpful when it comes to classifying single objects. In practice, they allow us to feed an image into the network and, if all goes well, it outputs the corresponding label or class (for instance the object number of the object it depicts). Aside from the tons of data (images) you actually need for training such a CNN, developing one is a fairly straightforward process when using the TensorFlow APIs [read here]. However, what we set out to achieve was not only recognizing objects in a given series of photographs or videos but also localizing them, and even doing so for multiple objects at a time. This is what we'll be looking at in this second part. Exit CNN, enter R-CNN.
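
To give an idea of how low that barrier is, here is a minimal sketch of such a single-label classifier using the Keras API. The folder name, image size and layer sizes are illustrative assumptions, not our actual setup:

import tensorflow as tf

NUM_CLASSES = 37      # illustrative: one class per object number
IMG_SIZE = (224, 224)

# Hypothetical folder of labelled object photos, one sub-folder per object number.
train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    "data/crops", image_size=IMG_SIZE, batch_size=32)

model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 255, input_shape=IMG_SIZE + (3,)),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(NUM_CLASSES),   # one logit per class
])

model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"])
model.fit(train_ds, epochs=10)

A network like this answers "which single object is this?"; it says nothing about where the object sits in the frame, which is exactly the gap we need to close next.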

Regions with Convolutional Neural Networks, or R-CNN, combine rectangular region proposals [basically bounding boxes] with convolutional neural network (CNN) features. However, developing our very own model proved to be more challenging than we expected:

  • Problem 1: big data; deep learning needs tons of raw data. Answer 1: I gathered 3000+ images of 37 different objects (let's call those classes from here on out). Not nearly as much as I wanted, but it will have to do for now (there are approx. 300 objects on display at Object Stories, and the goal is to get them all).
  • Problem 2: deep learning relies on heavy computing power (GPUs and TPUs), and, being a Mac user, Apple does not like Nvidia, which also happens to be the producer of the GPUs we need. And even if it did, buying the necessary processors would be very expensive! Answer 2: via a Google Cloud VM we can harness Google's GPUs and TPUs to do our own training. This is not local, and it does come with a price tag, but it works; at least as long as you're not building something that competes with their own R&D. So if you're trying to build a self-driving car, you can forget about it. (By the way, you could train on a CPU, but it won't take you far.)
  • Problem 3: it still takes a long time to train our model. In this case, it took a whole day, and for this the only answer is: find something else to do in the meantime.

As mentioned above, in order to train our model we will have to annotate all our images with classified regions. This is a tedious and time-consuming task, but it needs to be done. There are several tools to help us do so, labelImg being one of them. This tool allows you to load in a directory of images, annotate them, and spit out an XML file (Pascal VOC) containing our metadata.

labelImg is a great open-source tool to generate our bounding boxes.

After repeating these steps for all our images we landed on 3000+ images featuring 37 classes (some images containing over 15 classes).
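
One practical note for anyone reproducing this: labelImg writes Pascal VOC XML, while Darknet expects a .txt file per image with normalised "class x_center y_center width height" lines (labelImg can also export that format directly). A rough conversion sketch, with a placeholder class list standing in for our 37 object numbers:

import glob
import os
import xml.etree.ElementTree as ET

# Placeholder class list: one object number per class, order defines the class id.
CLASSES = ["object-001", "object-002"]  # ...extend to all 37 classes

for xml_path in glob.glob("annotations/*.xml"):
    root = ET.parse(xml_path).getroot()
    img_w = float(root.find("size/width").text)
    img_h = float(root.find("size/height").text)
    lines = []
    for obj in root.findall("object"):
        cls_id = CLASSES.index(obj.find("name").text)
        box = obj.find("bndbox")
        xmin, ymin = float(box.find("xmin").text), float(box.find("ymin").text)
        xmax, ymax = float(box.find("xmax").text), float(box.find("ymax").text)
        # YOLO wants the box centre and size, relative to the image dimensions.
        lines.append(f"{cls_id} {(xmin + xmax) / 2 / img_w:.6f} "
                     f"{(ymin + ymax) / 2 / img_h:.6f} "
                     f"{(xmax - xmin) / img_w:.6f} {(ymax - ymin) / img_h:.6f}")
    with open(os.path.splitext(xml_path)[0] + ".txt", "w") as f:
        f.write("\n".join(lines))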

Picking your guns.

When it comes to object detection there are many options to pick from, but in our case we will be using Darknet, an open-source neural network framework written in C and CUDA, to train our algorithm of choice: YOLOv4. You Only Look Once, or YOLO, is a state-of-the-art, real-time object detection system that makes R-CNN look stale: it is extremely fast, more than 1000x faster than R-CNN and 100x faster than Fast R-CNN. Another good thing about YOLO is that it's public domain, and based on the license we can do whatever we want with it...🧐

                        YOLO LICENSE
Version 2, July 29 2016

THIS SOFTWARE LICENSE IS PROVIDED "ALL CAPS" SO THAT YOU KNOW IT IS SUPER SERIOUS AND YOU DON'T MESS AROUND WITH COPYRIGHT LAW BECAUSE YOU WILL GET IN TROUBLE HERE ARE SOME OTHER BUZZWORDS COMMONLY IN THESE THINGS WARRANTIES LIABILITY CONTRACT TORT LIABLE CLAIMS RESTRICTION MERCHANTABILITY. NOW HERE'S THE REAL LICENSE:

0. Darknet is public domain.
1. Do whatever you want with it.
2. Stop emailing me about it!

That being said: after configuring our system and installing all the needed dependencies, we can take her for a test ride using some of the pre-trained models that come out of the box and see what she has to offer.
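
For anyone who prefers to poke at it from Python rather than through the darknet binary, OpenCV's dnn module (4.4 and up) can load the stock YOLOv4 config and COCO weights as well. A minimal sketch, assuming the release files sit next to the script:

import cv2

# Stock files from the YOLOv4 release; the paths are assumptions.
net = cv2.dnn.readNetFromDarknet("yolov4.cfg", "yolov4.weights")
model = cv2.dnn_DetectionModel(net)
model.setInputParams(size=(416, 416), scale=1 / 255, swapRB=True)

with open("coco.names") as f:        # the 80 COCO class labels
    labels = [line.strip() for line in f]

image = cv2.imread("objectstories_test.jpg")  # hypothetical test photo
class_ids, confidences, boxes = model.detect(image, confThreshold=0.5, nmsThreshold=0.4)

for class_id, confidence, box in zip(class_ids, confidences, boxes):
    x, y, w, h = box
    print(f"{labels[int(class_id)]}: {float(confidence):.2f}")
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imwrite("detections.jpg", image)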

As we can tell from the results [pictures below], YOLOv4 comes with some pre-trained classes such as chair, vase, and of course the (uncanny) human. Let's not make use of that last one, shall we? Although it missed out on some of the more obscure-looking vases, we will be using these pre-trained weights as building blocks to create our very own. The goal here is to output new classes based on the object numbers of the objects depicted.

Fly me to the moon.

As mentioned above, we will be using Google's virtual computing power, with a single Tesla T4, to train our very own weights. After loading everything in, it took over 20 hours to reach an average loss below 1. It's recommended to go below 0.5, but since this is a test run encompassing only a subset of the 300 objects we want to include, we will stop training there until we have the full set at our disposal.
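
For anyone retracing this step, the Darknet side of training mostly comes down to a handful of bookkeeping files (image lists, class names, an obj.data file) plus a copy of the YOLOv4 cfg adjusted for your own class count; the usual recipe from the darknet README is classes=37 in each [yolo] layer and filters=(37+5)*3=126 in the convolutional layer right before it. A sketch of the bookkeeping part, with assumed paths:

import glob
import random

# Assumed layout: images and their YOLO .txt annotations both live in data/obj/.
images = sorted(glob.glob("data/obj/*.jpg"))
random.seed(42)
random.shuffle(images)
split = int(0.9 * len(images))   # 90/10 train/validation split

with open("data/train.txt", "w") as f:
    f.write("\n".join(images[:split]))
with open("data/valid.txt", "w") as f:
    f.write("\n".join(images[split:]))

# obj.data tells darknet where everything lives and where to back up weights;
# obj.names (not shown) lists the 37 object numbers, one per line, in class-id order.
with open("data/obj.data", "w") as f:
    f.write("classes = 37\n"
            "train = data/train.txt\n"
            "valid = data/valid.txt\n"
            "names = data/obj.names\n"
            "backup = backup/\n")

Training itself is then kicked off with darknet's detector train command, pointed at obj.data, the adjusted cfg and the pre-trained convolutional weights, after which the average loss printed to the console is the number to keep an eye on.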

To test our model we will feed in raw video footage taken in Design Museum Gent of our permanent exhibition Object Stories. YOLO will then process each frame only once (as seen on the left), return a confidence score for each object it detects in that frame, and do so for the entire video. When finished, it will stitch all frames back together, and if everything goes right and our model is a good fit, we will finally see the first results of a long process. [Fingers crossed.]
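
This isn't necessarily how the clip itself was produced (Darknet ships its own demo mode for video), but the frame-by-frame loop described above is easy to reproduce in Python with OpenCV's dnn module once the custom weights exist; all file names below are assumptions:

import cv2

# Our custom-trained network; cfg, weights and names paths are placeholders.
net = cv2.dnn.readNetFromDarknet("yolov4-objectstories.cfg",
                                 "yolov4-objectstories_best.weights")
model = cv2.dnn_DetectionModel(net)
model.setInputParams(size=(416, 416), scale=1 / 255, swapRB=True)

with open("obj.names") as f:
    labels = [line.strip() for line in f]

cap = cv2.VideoCapture("object_stories_walkthrough.mp4")
fps = cap.get(cv2.CAP_PROP_FPS)
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
out = cv2.VideoWriter("detections.mp4", cv2.VideoWriter_fourcc(*"mp4v"),
                      fps, (width, height))

while True:
    ok, frame = cap.read()
    if not ok:
        break
    # One pass of the network per frame: class ids, confidences and boxes.
    class_ids, confidences, boxes = model.detect(frame, confThreshold=0.4,
                                                 nmsThreshold=0.4)
    for class_id, confidence, box in zip(class_ids, confidences, boxes):
        x, y, w, h = box
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
        cv2.putText(frame, f"{labels[int(class_id)]} {float(confidence):.2f}",
                    (x, max(y - 5, 10)), cv2.FONT_HERSHEY_SIMPLEX, 0.6,
                    (0, 255, 0), 2)
    out.write(frame)   # stitch the annotated frames back into a video

cap.release()
out.release()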


Olivier Van D'huynslager

Digital Strategist @ Design Museum Gent | Strategic content manager @CoGhent | overall Culture Geek — interested in AI and its value for museums.