Immediately I started thinking of how this could turn into an interview question: 3 servers (out of many) inexplicably start going dark every ~7 days. One time it took 6 days and 21 hours, another was 6 days and 23 hours. But always down, always “on time”, continuously happening for over a month so far. How would you investigate this to figure out what’s happening?
The first time it struck, it registered on the radar as an odd event in the recesses of my mind. Why those 3? Did Dev update something? Was it patch day? Full moon maybe? They ended up bouncing the servers and called it a day. I think I can speak for everyone and say we’ve all seen odd things happen once in a while—maybe we lucked out.
A week passes, enough time for me to forget about it. Then it happens again—the subconscious sysadmin spidey-sense kicked in but due to the time gap I struggled to connect the dots— huh…that is weird. This seems oddly familiar but I can’t figure out why. Servers were bounced again.
Third time around: ok, enough. We gotta get to the bottom of this. What is really going on?
“Once is happenstance. Twice is coincidence. Three times is enemy action.”
Ian Fleming
Here is what we observed on the affected machines:
- We can connect with our remote desktop tool, and the machine acts normally. Not sluggish or “off” in any way that we can tell
- You can’t browse the internet or any file shares
- The machines are pingable but anything served up normally is not available and the connection fails
- A reboot “fixes” the problem
Maybe a better question to ask is — what changed? This is something I have learned over time: rarely, if ever, do things just happen on computers without any reason. Something has changed, something is different. It is simply up to you to find out what it is and where the evidence lies. Also: logs are one of the great gifts to the computing world—they will tell a story if you spend the time to read them. Throw a little Occam’s Razor in there for good measure.
Everything had been happily chugging along up to this point, yet here we are. Ergo, something changed.
First step: preserve the crime scene and gather evidence. I exported the Application/System/Security event logs and combed over them—standard procedure and business as usual. There’s a saying: Everyone has a testing environment. Some people are lucky enough to have a totally separate environment to run production in. Luckily in our case, we have both. Starting now, everything I will show below was run on an affected machine in our dev environment.
Side note: I have a Powershell script I made that creates a folder, uses wevtutil to export Windows Event Logs, Compress-Archive to zip it into a file, then emails me the result. Yes, any decent RMM will have that data available, but I find this script very convenient – this way I can drag and drop the .evtx file directly from email into Event Log Explorer or the native viewer to play with it locally.
So, what did we see? A whole lotta nothing, that’s what, at least in the beginning. One thing that I think discourages a lot of techs when reviewing logs is the sheer number of entries, often reaching thousands or tens of thousands—it all looks the same and eyes grow weary. You have to use the data points you do have in order to find the ones you do not. Find a way to whittle things down. Ask yourself “what is this telling me?”, then “what is this not telling me?”. Looking at it from both angles will help identify the gaps.
Armed with the timestamp of the outage (via our netmon system), events are revealed that correlate: 4227, 4231, and 5719. Are they the cause, or the effect? How do they fit into this?
Before now I had never seen a machine error out with these entries. Thankfully, they seemed direct and to the point which isn’t always the case (I’m looking at you, SQL: “Warning: Failure to calculate super-latch promotion threshold.” Good luck with that one. I’m not a DB admin and don’t plan to be anytime soon).
With the logs in hand and certain words like “ports” and “outgoing connections” in mind, I was pretty convinced I could catch the offending process using Nirsoft’s CurrPorts. My grizzled-veteran colleague was confident that Wireshark would reveal all. The game was afoot and we felt confident at least one of us would be victorious. The trap was set and all we had to do was wait.
…which just led to more waiting because we didn’t find any answers here. The CurrPorts log file was suspiciously normal-looking. Even without pulling it into Excel I could tell it didn’t hold any secrets—it showed a very normal progression of connections being made and terminated, in roughly equal numbers. The pivot table showed a near perfect 1:1 ratio of started to ended connections over a 5 day span:
Note: the names have been changed to protect the innocent. I’ve subbed out certain things for self-explanatory, generic values: e.g. DnsFilterService, RMMSvc/RMMtray = our RMM tool, EDRSvc = our EDR, AVSvc = Anti-virus, etc etc. Same with IP addresses.
The review of the Wireshark logs showed the same; nothing really stood out in my own analysis (admittedly, not the sharpest tool in the shed) or my colleague’s (much smarter) examination.
"Hindsight is 20/20"
Ancient Proverb
Looking back, I got caught up in the chase. I was so locked in on getting my logging right and catching this that I forgot one of my tenets: start with simple, and move upwards and outwards from there. I hadn’t seen those event log entries before, and up until now I hadn’t thrown them into Google. Lo and behold—it gave me about 30k results that all seem to mention the same thing: port exhaustion.
A 14 year old Microsoft blog post immediately stood out. Makes me wonder what their threshold is for deciding when they stop hosting them; I have seen some like this which are ancient in web-years while still hosted, others are lost in the ether and only found with archive.org or similar tools.
In the post, the author states:
Port exhaustion can cause all kinds of problems for your servers. Here’s a list of some symptoms:
– Users won’t be able to connect to file shares on a remote server
– DNS name registration might fail
– Authentication might fail
– Trust operations might fail between domain controllers
– Replication might fail between domain controllers
– MMC consoles won’t work or won’t be able to connect to remote servers.
That’s just a sample of the most common symptoms that we see. But here’s the big one: You reboot the server(s) involved, and the problem goes away – temporarily. A few hours or a few days later, it comes back.
Oh.
The author continues on and gives a simple breakdown of what dynamic ports are and the ranges used by different versions of Windows. In the end, he states that the way to find the perpetrator is to run a simple command:
netstat -anob
Cool. We’re close to the end of this trek. Reading the info on netstat, that seems like it’ll work:
Wrong again. Reading the netstat -anob report gives me the same feeling as the Currports log: underwhelmed. The output shows exactly what you’d expect on a machine running normally. Back to Google—in the same lineup of search results is another (much newer) Microsoft article – “Troubleshoot port exhaustion issues”
Like the older article, it defines 2 types of ports:
– Ephemeral ports, which are dynamic ports, are the set of ports that every machine by default will have them to make an outbound connection.
– Well-known ports are the defined port for a particular application or service. For example, file server service is on port 445, HTTPS is 443, HTTP is 80, and RPC is 135. Custom application will also have their defined port numbers.
And a command to find out how many are available on a machine. This is for IPv4:
netsh int ipv4 show dynamicport tcp
Which, on Windows Server 2019, gives us 16,384:
This seems low to me. I run it on my Windows 10 machine and get a much larger number: 64,511. Interesting, and worth noting—as an admin, there are random bits of info that you come across that might save your ass or win a bet, this seems like one of them.
The article goes on to list common event log entries seen during port exhaustion, plus a familiar phrase: Reboot of the server will resolve the issue temporarily, but you would see all the symptoms come back after a period of time.
About 2/3 of the way down, it mentions netstat, but with a different switch added:
netstat -anobq
Interesting. Running -anobq doesn’t seem to change much in the output. Running /? I find that the -q switch “Displays all connections, listening ports, and bound nonlistening TCP ports. Bound nonlistening ports may or may not be associated with an active connection.” I tried running it with just -q and now I see a ton of entries just like this:
An alternative Powershell command is also listed:
Get-NetTCPConnection | Group-Object -Property State, OwningProcess | Select -Property Count, Name, @{Name="ProcessName";Expression={(Get-Process -PID ($_.Name.Split(',')[-1].Trim(' '))).Name}}, Group | Sort Count -Descending
It’s not DNS. There’s no way it’s DNS. It was DNS.
DNS Haiku
For kicks, I checked to see when our DNS filtering program last auto-updated. The last update was near spot-on 7 days before the servers first went down—close enough to be highly certain that the newly upgraded version of this program was the catalyst that put things in motion, resulting in the issue we’re seeing.
Continuing on with the Microsoft port exhaustion article: Some are normal, but large numbers of them aren’t (hundreds to thousands). Close the process in question. If that restores outbound connectivity, then you’ve further proven that the app is the cause. Contact the vendor of that app.
We restart the DNS filtering service, and within a minute or so everything is working again—the ports are released and the count begins anew. Hooray. But I can’t let it go just yet, I am still curious—what is actually happening with this program? I want to find out more.
Let’s recap: I have the command for finding the process with the most bound ports, and the command for listing the total number of bound ports. There had to be more I could find out.
Kind of like drawing, I feel like when I try to script things, I can see exactly what I want in my mind but between there and the actual product is discombobulation. Enter ChatGPT—after a little back and forth, I had myself a script that combined the 2 commands and provided me with that oh-so-wonderful gift: logs. At this point I was anxious to see the results so I turned the script on and left it running every 5 min via Task Scheduler until the machine went down again 2 days later.
- 1 – the log file showing the output of the Get-NetTcp command shown above. This I mainly used as assurance I was collecting the right data, and as a debug tool if needed
- 2 – the .csv file listing just the top process with the most bound ports, plus the number of ports (2nd column)
- 3 – the Excel graph showing the number of bound ports by the top process over ~2 days. Remember, the server has just over 16k available ephemeral ports, so that flat line is when it kicked the bucket
- 4 – the count and graph of ports in the bound state per each 5 min period that the script ran, plus the calculated ports/min
The .csv and the graph gave me some interesting insight: these ports were getting caught up at a very linear pace with a rough calculated rate of 1.75/min. The linear-ness of it caught my eye—in general, when you are tracking an event (ex. failed logins) and you see something metronome-like in its timing, a machine is likely causing it, not a human. That being said, there was a variance in the number bound during each 5 min period—the majority showed 8-9 but there were times it went up to 20.
Now I was locked onto the offending process and had good data, but what else was there to find?
This particular DNS filtering program does have logs, but they are limited in a way I hadn’t really seen before: they are dumped into a rolling group of 10 files; when the active log file hits 1024KB, it appends a number between 0-9 then creates a new log file, keeping a total of 11 files at any one time:
As you can see in the picture, that gives me roughly half a day’s worth of entries. [If you need to combine files like these, here’s a script that will create one file from a folder full of separate files.] As logs go, they are unfortunately bland—there looks to be some sort of DNS check every minute showing the set DNS servers for that machine (shown below), then every ~30min it looks to check in with the mothership and confirm that it has a valid license, sends telemetry, and pulls the latest config.
Using the app developer’s KB, I did find a way to turn on trace logging. This was good and bad; good in that usually more detail is better, bad in that this reduced the window that the 10 log files covered to just a few hours. I wasn’t sure if that would be a deal-breaker so I tried to find a way to hedge against missing something. Cue this next Powershell script, which dynamically tails a log file and outputs everything to a new file. Using this script I was able to capture every single line that passed through the root .log file, even when that file was later renamed and later destroyed. In other words, bottled at the source.
I felt like I was on the right track, but I wanted more detail. The Get-NetTCP command from earlier was good — it gave me a snapshot in time of the specific ports bound and the total number. Wanting to take it further, I came up with another Powershell script.
With the script running every minute, I could now check in on the machine and see exactly which ports were bound at that point in time. Even better, I now had a log to show when a certain port was added to the list. The screenshot below shows the first 13 times the script was run—you can see in the alternating highlighted groups that every minute, another 2-4 ports are bound. It still didn’t tell me what the DNS filter program was doing, but it lined up with everything I knew up to this point.
I knew the DNS filter program is chatty and reaching out to various things constantly — but was it happening in every attempted connection or just at certain points? Was it happening after a certain set of events preceding it? I wasn’t sure and couldn’t tell. So how could I catch it? Enter Sysmon.
I have a more in-depth post on how much I like Sysmon so I will be pushing on. If I am trying to catch something in Windows, it’s like the Eye of Sauron. As a starting point, I like using this modified config from Github. It has a good foundation and if I want to tweak something — like excluding our EDR/RMM netconns or process creations in this case — I can do so easily. I installed the service using the config and kicked it off. By default you get an event log size of 64MB so I immediately raised that to 1GB — we had the space and I didn’t want to miss anything. Turns out that was a good guess as I found that 1 GB’s worth of Sysmon got me ~2 weeks of data. And this is just for a dev machine, not in prod, YMMV.
Going through the Sysmon event IDs, I decided to focus on two in particular: event ID 3 for network connection detected, and 22 for DNS queries.
One thing that Sysmon does not capture is the actual event of a port being bound. After some digging, I found that you can use the Local Security Policy >> Advanced Audit Policy Configuration >> Object Access \ Audit Filtering Platform Connection:
Using this, I could capture event ID 5158: The Windows Filtering Platform has permitted a bind to a local port. To round everything out, I also decided to focus on event ID 5156: The Windows Filtering Platform has permitted a connection. Like with Sysmon’s named event log, be aware that you’ll likely need to increase your Security event log size if you’re going for tracking events over a decent timeframe. As a reference, setting the size to 800MB got me about 2 week’s worth.
Now, with the data from these 4 events, I believe I had the makings of a perfect trap and all would soon be revealed:
I read somewhere long ago that when you’re working with large amounts of data like this, you want to put your filtering as close to the source as possible. With that in mind, I created a custom filter view in Event Log Viewer to aid in exporting later:
As you can see in the earlier screenshot, the data shown in each log entry varies, and is not uniform. After capturing the events, I needed a way to modify the data so that I could read everything, all in one sheet.
This next part took some time and effort: I exported the filtered events to an Excel sheet, exported each event ID to it’s own .csv, used the “Text to Columns” feature to break the log entry data into separate columns, then culled the data to only the necessary columns, excluding everything else. Even with this, it resulted in 4 different .csv files with different headers:
I’ve been in this spot before. I knew what I wanted, but I just needed a little help with taking the ideas in my head and putting it all together. Some more back and forth with my artificial friend and I had a script that would read a set of .csv files, pull out all the header values and combine them, then go through all the .csv files again adding the new headers and inserting a null value where needed. Once that is done, it creates one final .csv file with all the data from all the files. The resulting file had the data from the 4 original event IDs, in one file, under one set of headers, which I could transform and pivot like I wanted:
Now why did I go through all of this work? Because I wanted to see if I could find the full ‘lifecycle’ of a port on this system. I wanted to know if there was a specific action that our DNS filtering program was taking which then resulted in the port being bound forevermore, or if the program itself just had the touch of death and it didn’t matter. And the only real way I could think of doing that is by tracking every single connection, every port bind, and putting it all on a timeline. Using the results from the previous Get-NetTCP logging script, I didn’t have to wait for all the ports to be bound and for the machine to go down—I could now record data for a few hours then hone in on certain port numbers, knowing that they were locked out and worth looking at for more info.
I created a pivot table from the data above – Source Port > Datestamp > Application Name > Action > Destination IP > Destination Port. Good, but messy and hard to read:
Using a few different options, I was able to clean it up and make it readable:
- Right-click on the Hours field, then click Ungroup
- Set the label filter to show only ports 49152 and higher
- On each level underneath, right-click and select Field Settings, then select Subtotals:None and Show item labels in tabular form
Behold! Feast your eyes on the journey of port 62716:
Notice how the port is used by multiple programs, and connects to multiple IP addresses. The localhost entry shows our DNS filter app in action—it intercepts all outbound DNS requests and redirects them to a listener on 127.0.0.1. If it deems the request as acceptable, it forwards it to the set resolvers for that system. Port 62716 is also used by our RMM software to connect back to the mothership. Simply put, this is what I expected ‘normal’ behavior to look like for an ephemeral port over time.
Now let’s look at the history of multiple ports that were showing as bound by our DNS filter program and not released—a pattern emerges: a net connection is made to 169.254.169.254:80 and the port is then locked up until the service is restarted. This, my friends, is the exact behavior I’ve been looking to track down.
Now, seeing 169.254.x.x isn’t a new thing to me. Ever had DHCP go wrong? If so you’ll likely know what I mean—that range is used in APIPA – Automatic Private IP Addressing, where your machine chooses an IP address from that range if it does not get assigned one.
Also known as Link local addressing, there are some special rules that go with the territory:
- In computer networking, a link-local address is a unicast network address that is valid only for communications within the subnetwork that the host is connected to.
- Link-local addresses are not guaranteed to be unique beyond their network segment. Therefore, routers do not forward packets with link-local source or destination addresses.
A little time on Google and I had the connection: 169.254.169.254 is a special address used by cloud providers to host information about that instance, accessible only from that instance. AWS | Azure
From a browser on the machine, you can hit up that address and drill down through the versions and pull various values (pic is an example from AWS):
Side note: Not only that, but I found that AWS hosts an NTP service on port :123—pretty cool if you are setting up isolated networking. When a machine is provisioned, their NTP server is automatically registered:
Back to it. More things started falling into place now. Looking through Wireshark captures, I found that the machine was making a GET request to 169.254.169.254:80 for /metadata/instance?api-version=2021-02-01 and was receiving a 404 error in return. This matches with how Azure provides info via their Instance Metadata Service. The machines having problems were not hosted in Azure, and therein lied the problem.
Best I could tell, our DNS filtering app was making this request to an Azure-specific link-local address, but it received an error in response. Something in that chain of events resulted in the port not being released for re-use. Multiple that by 1.75 ports/min, over roughly 7 days, and eventually all 16k ephemeral ports would be ensnared, causing the machine to lose its ability to reach the outside world and then in turn bringing the house down.
Don’t get me wrong, I felt relieved. Still, I wanted to know the boundaries of this behavior—is it just relegated to virtual machines? The answer is yes; I ran the tests on multiple physical machines and there were no attempts to that IP showing in Wireshark.
What about an on-prem VM? I spun one up to check. The attempts were there alright, just without getting any response back. I had one last nagging thought regarding the original DNS filter app logs (without trace logging enabled). Did I miss something all that time ago? I changed the Wireshark capture to timestamp mode and opened the app logs back up.
Sure as sunrise, there it was, there the whole time. The vendor confirmed that it checks to see if it’s hosted in Azure and tries to pull data from the metadata service. If no response it waits 1 second, then 2, then 4, then 8, before finally calling it quits and declaring, and I quote:
AzureTestUisng [sic] Metadata not detected.
On to the next one…
You must be logged in to post a comment.