Computer Vision for Datacenter Auditing

I am going to start a series of posts on random ideas I have had but have not had time to fully implement. The first in this series is an idea I worked on about 3 years ago (November 2019) for auditing a datacenter and mapping each system's physical location to its logical one on the network.

The core of the idea is to use cameras in a datacenter (these could be the existing security cameras) to see the servers in each rack, then use that data to map out the datacenter and save the administrators the time of manually performing these actions. The process begins by training a machine vision model on what a server looks like. Most of the time at work I am working with Dell servers, so I thought that was a good starting point. To keep the model generic enough, I only attempted to train it on what a 1U server looks like versus what a 2U server looks like.

At this point I needed A LOT of photos of servers with different lighting and angles. I took a bunch myself of different racks I had as a seed set, then I turned to the web. Where could I get a large assortment of photos of Dell servers in different configurations and lighting? The homelab section on Reddit! People post their home setups there all the time. I went through and downloaded several hundred photos of different people’s setups. Another place to get photos was eBay, where a lot of sellers put up photos of servers in different settings; the downside is that a lot of sellers reuse the same photos again and again. I don’t know that the internet has figured out yet what the copyright rules are for using photos found online to train a model.
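One way to deal with the reused eBay photos is to drop exact duplicates before labeling. The following is a minimal sketch, not what I actually ran: it assumes the downloads sit in a single folder, and the function names are made up for illustration. It only catches byte-for-byte copies; a resized or re-compressed copy of the same photo would slip through.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        # Read in 1 MiB chunks so large photos are not loaded all at once.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(folder: Path) -> Dict[str, List[Path]]:
    """Group files in `folder` by digest; keep only groups with more than one file."""
    groups: Dict[str, List[Path]] = {}
    for path in sorted(folder.iterdir()):
        if path.is_file():
            groups.setdefault(file_digest(path), []).append(path)
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}
```

Each group returned by find_duplicates lists files with identical content, so keeping the first path in each group and deleting the rest leaves one copy of every photo.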

I researched a bunch of different techniques and was playing around with OpenCV, but then found a tutorial that seemed to be in line with what I was looking to do. (This one is also good, and covers very similar material.) I also researched different image processing models and played around with several.

Now that I had the photos, I downloaded LabelImg (tzutalin/labelImg on GitHub), a graphical image annotation tool for labeling object bounding boxes in images. With this tool you go to each photo, select the item you are trying to teach the model, and label it. This took a while; it’s fully manual work. A lot of the photos from the web had multiple servers in them, and each one needed to be selected. This proved to be one of the more time-consuming parts of the project. I also had to manipulate the photos so that the rectangular bounding boxes could fit the servers, even when the photos were taken at weird angles.

I had to pick some of the photos to be the training set, then other photos to be the testing set. With everything marked and the metadata ready, I converted the final metadata from XML to CSV using the xml_to_csv.py script provided in the example repos above. That was then fed into TensorFlow. The only system I had to start with, other than a laptop, was a CentOS 7 server. This proved to be very annoying because some dependencies, such as protobuf, were not available in new enough versions and had to be custom compiled.
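For a rough picture of that conversion step, here is a minimal sketch. It is not the xml_to_csv.py script itself. It assumes LabelImg saved Pascal VOC style XML, uses the column names that kind of script commonly emits, and the file name, box coordinates, 80/20 split, and function names are all made up for illustration.

```python
import csv
import io
import random
import xml.etree.ElementTree as ET

# Assumed column layout; check it against whatever the training pipeline expects.
COLUMNS = ["filename", "width", "height", "class", "xmin", "ymin", "xmax", "ymax"]

# Hypothetical annotation for one photo containing a 1U and a 2U server.
EXAMPLE = """<annotation>
  <filename>rack01.jpg</filename>
  <size><width>1920</width><height>1080</height><depth>3</depth></size>
  <object><name>1U</name>
    <bndbox><xmin>100</xmin><ymin>200</ymin><xmax>900</xmax><ymax>260</ymax></bndbox>
  </object>
  <object><name>2U</name>
    <bndbox><xmin>100</xmin><ymin>270</ymin><xmax>900</xmax><ymax>390</ymax></bndbox>
  </object>
</annotation>"""


def voc_rows(xml_text):
    """Yield one CSV row per labeled box in a Pascal VOC annotation."""
    root = ET.fromstring(xml_text)
    filename = root.findtext("filename")
    width = int(root.findtext("size/width"))
    height = int(root.findtext("size/height"))
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        coords = [int(box.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")]
        yield [filename, width, height, obj.findtext("name")] + coords


def split_by_image(filenames, test_fraction=0.2, seed=0):
    """Split image names so every box from one photo lands in the same set."""
    names = sorted(set(filenames))
    random.Random(seed).shuffle(names)
    n_test = max(1, round(len(names) * test_fraction))
    return names[n_test:], names[:n_test]  # (train, test)


def rows_to_csv(rows):
    """Serialize rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(COLUMNS)
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    print(rows_to_csv(voc_rows(EXAMPLE)))
```

Splitting by photo rather than by box matters here because many of the web photos contain several servers; splitting by box would let the same photo appear in both the training and the testing set.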

It was time to let the model run for a while and see what it could learn. I learned several important things in this process. First, if you have GPUs, make sure you have a TensorFlow build that is compiled and ready to use them. The speed difference with and without them is kind of crazy. More RAM and more GPU memory also help a lot in speeding up the process. At first I was playing with this on just a laptop, and that one didn’t have the GPU drivers for CUDA. Training the model was taking DAYS. Later I switched to using GPUs I had in a server, and this greatly sped up each iteration cycle.
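A quick way to check whether TensorFlow can actually see a GPU before starting a long training run is sketched below. It assumes TensorFlow 2.1 or newer, where tf.config.list_physical_devices is available; older releases expose this differently. The helper name is made up for illustration.

```python
def visible_gpus():
    """Return the names of GPUs TensorFlow can see, or None if TensorFlow is missing."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    # Assumes TensorFlow 2.1+; an empty list means training will run on CPU only.
    return [device.name for device in tf.config.list_physical_devices("GPU")]


if __name__ == "__main__":
    gpus = visible_gpus()
    if gpus is None:
        print("TensorFlow is not installed")
    elif not gpus:
        print("TensorFlow is running on CPU only")
    else:
        print("TensorFlow sees:", ", ".join(gpus))
```

An empty list on a machine that does have a GPU usually points at a CPU-only TensorFlow build or missing CUDA drivers, which is the situation the laptop was in.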

Off the bat it was able to recognize a decent percentage of the servers in the photos I presented it! I do think a lot of the photos I tested it on showed fairly ideal conditions, with good lighting and camera angles, so it may have looked better than it would perform in the real world. To improve the model I can always find more photos and train it on more images. At this point I was able to get the model to recognize about 80% of the servers in the racks I showed it. Another factor that could help in the future is the evolution of cameras. A lot of places are replacing 720p/1080p cameras with 4K cameras, and the more resolution the system has to work with, the better.
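A figure like "about 80% of the servers" can be pinned down as recall: the fraction of hand-labeled boxes that the model matched with a detection. The sketch below is one common way to compute it, not necessarily how I measured it at the time. Boxes are (xmin, ymin, xmax, ymax) tuples, and the 0.5 overlap threshold is an assumption borrowed from the usual object detection convention.

```python
def iou(a, b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    overlap_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    overlap_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = overlap_w * overlap_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0


def recall(truth, detections, threshold=0.5):
    """Fraction of labeled boxes matched by at least one not-yet-used detection."""
    unused = list(detections)
    found = 0
    for box in truth:
        # Greedy matching: take the best remaining detection for this label.
        best = max(unused, key=lambda d: iou(box, d), default=None)
        if best is not None and iou(box, best) >= threshold:
            unused.remove(best)
            found += 1
    return found / len(truth) if truth else 0.0


if __name__ == "__main__":
    # Five labeled servers stacked in a rack; the model found four of them.
    labeled = [(0, i * 50, 400, i * 50 + 40) for i in range(5)]
    detected = [(5, i * 50 + 2, 395, i * 50 + 38) for i in range(4)]
    print(recall(labeled, detected))
```

Recall alone says nothing about false positives, so a fuller evaluation would also track how many detections did not correspond to any labeled server.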

The next step I wanted to take was to start matching physical location to logical location. The idea is that I can find the regions in a photo or video where servers are, and each server, through its iDRAC/IPMI, allows me to blink its front chassis lights. So, one host at a time, I will have automation send the command to blink the front chassis lights, and perhaps some lights on the HDDs, then scan for which region in the photo has started to blink!
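The blink-and-scan loop could look roughly like the sketch below. It is an illustration under assumptions, not the prototype itself: it builds (but does not run) an ipmitool chassis identify command, treats video frames as plain 2D lists of grayscale values, and picks the server region whose average brightness swings the most across frames. The function names, region names, and the min_swing threshold are made up.

```python
def identify_command(bmc_host, user, seconds=15):
    """Build (not run) an ipmitool command that blinks the chassis identify LED.

    -E makes ipmitool read the password from the IPMI_PASSWORD environment variable.
    """
    return ["ipmitool", "-I", "lanplus", "-H", bmc_host, "-U", user, "-E",
            "chassis", "identify", str(seconds)]


def region_mean(frame, box):
    """Average grayscale value inside an (xmin, ymin, xmax, ymax) box."""
    xmin, ymin, xmax, ymax = box
    pixels = [value for row in frame[ymin:ymax] for value in row[xmin:xmax]]
    return sum(pixels) / len(pixels)


def blinking_region(frames, regions, min_swing=10.0):
    """Return the name of the region whose brightness changes most, or None."""
    if not regions:
        return None
    swings = {}
    for name, box in regions.items():
        levels = [region_mean(frame, box) for frame in frames]
        swings[name] = max(levels) - min(levels)
    best = max(swings, key=swings.get)
    # Ignore small changes such as sensor noise (threshold is an assumption).
    return best if swings[best] >= min_swing else None


if __name__ == "__main__":
    # Two tiny 4x4 frames: an LED pixel in the lower server turns on, then off.
    led_off = [[20] * 4 for _ in range(4)]
    led_on = [row[:] for row in led_off]
    led_on[3][1] = 255
    regions = {"upper_server": (0, 0, 4, 2), "lower_server": (0, 2, 4, 4)}
    print(identify_command("bmc.example.internal", "root"))
    print(blinking_region([led_off, led_on, led_off], regions))
```

In practice the command would be run for one host, frames would be captured while its LED blinks, and the winning region would be recorded against that host before moving on to the next one. Averaging over a whole server region dilutes a small LED, so a real version would probably need to look at a tighter area or at per-pixel differences.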

This is the idea I have slowly worked on for the last little while. I have prototypes of most of it working, but have not had a lot of time to put into it. The hope would be that we could use existing cameras to get the footage we need to map the datacenters we already have, then perhaps in the future port this system to something like HoloLens or an Apple/Meta AR system. Once we have that mapping, we can start to draw out the physical servers and their locations in the world/racks on a webpage, and make it easier for people working in a datacenter to find the boxes they need. Hopefully one day this will allow people to click a server on a webpage and then connect to its controller, without a human painstakingly going to each box and doing this mapping. Of course, all of this is solved by a team labeling each server, but where is the fun in that?

Homelab: Overview

I am starting a series about my homelab and how it is all laid out. I have written this article a few times, with months in between. Each time the setup changes, but we seem to be at a stable-ish point where I will start this series. Since I wrote this whole article and now a while later am editing it, I will mark with italics and underline when present me is filling in. I think it will give a neat split of growth in the last year or so I have been working on this. Or it will make it illegible, we will see. My home setup gives me a good chance to test out different operating systems and configs in a domain environment before using that tech elsewhere like at work.

Hypervisor

Starting off with virtualization technology, I settled a while ago on Microsoft Hyper-V instead of ESXi. The main reasons are that I already had Windows Server, and that Hyper-V supports Dynamic Memory, which lets me allocate a range of memory to a VM. When something like an AD controller is idling, it doesn’t need much memory, but when it starts up it may; Dynamic Memory allows me to take that into account. I will say one place where this choice has bitten me later is file storage, but that will be a later post.

The setup is technically “router on a stick”, where the Sophos XG firewall functions as the router and the rest of the devices hang off of that. The Sophos XG machine is an old Dell Optiplex 990 (almost 10 years old!) with an Intel quad-port NIC in it. That way it can do hardware offloading for most of the traffic. I intend to do posts on networking, hypervisors, file storage, the domain, and more, so I will not get too far into the weeds on the particulars right now.

The file storage is a FreeNAS box, recently upgraded to seven 3TB HDDs. I have had this box for a little over 6 years (I just looked it up in November 2020, and one of the drives has 55257 hours, or 6.3 years, of run time on it); it is older but has worked well for me so far.

The network backbone is a new switch I really like, and I was able to get 2 of them off eBay; they were broken but I was able to repair them (more on that later as well). They are Brocade, now Ruckus, ICX7150-C12P switches: 12 1Gb/s PoE ports, 2 additional 1Gb/s uplink ports, and 2 1/10Gb/s SFP/SFP+ ports. These switches can run at layer 3, but I have the layer 2 firmware on them currently. They have a fiber connection between them. Before that I was using 2 Unifi APs in a bridge, but that didn’t work fantastically because A. I am in NY, B. they were only 2×2 802.11ac Wave 1, and C. I am in NY. I custom ordered a white 50m fiber cable (so the significant other would not get mad) to go around the wall of the apartment.

With SSDs in the hypervisor boxes (I call them HV# for short) and iSCSI storage for VMs as well, which VMs are on which host doesn’t particularly matter. Flash forward 6 months or so since that first sentence was written: I still use the NAS for backups, but the hypervisors are now running Storage Spaces Direct and doing shared storage. This allows the hypervisors to move VMs around during patching, or to pause them during a system update if they are less critical. The Intel NUC and small Dell Inspiron are much less powerful than the mid-tower hypervisors, so they usually run only 1 or 2 things. The NUC runs the primary, older domain controller, and that is it. It is an older NUC that I got about 7 years ago, so it’s not that fast. The “servers” in the hypervisor failover cluster are a Lenovo and 2 Dell Optiplex 5050s. I like these Dells because they go for about $200 on eBay, while having an Intel Core i5-7600, support for 64GB of RAM, and expansion slots for things like 10Gb SFP+ cards. These machines also idle at about 30 watts, which makes the power bill more reasonable.

Some of the services I run include:

  • 2 Domain Controllers (Server 2016, and 2019)
    • Including Routing and Access service for RADIUS and 802.1x on Wi-Fi and wired
  • Windows Admin Center Server (Windows Server 2019)
  • Windows Bastion (This box does Windows Management) (Server 2019)
  • Veeam Server (Server 2019)
  • Unifi Controller/Unifi Video for security camera (Ubuntu)
  • 3 Elasticsearch boxes for ELK (CentOS 8)
  • Linux Bastion (CentOS 8)
  • Foreman Server (CentOS 8)
  • LibreNMS (This I grew to really like) (CentOS 8)
  • Nessus Server (CentOS 8)
  • Jira Server (CentOS 8)

That is the general overview. I will spend the next while diving into each bit and discussing how it is configured and what I learned along the way.