Agent keeps going inactive / disconnecting

OutagesIO_Support

Hi,

Thanks for using OutagesIO.

First, let me show how the statuses are determined.
Trace basically means the last time the network heard from the agent.

ACTIVE = up to 19 sec (from last trace)
INACTIVE = from 20 sec to 29 min 59 sec (from last trace)
DISCONNECTED = from 30 min to 29 days 23 hours 59 min 59 sec (from last trace)
ABANDONED = from 30 days on (from last trace)
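For anyone who wants to reason about these thresholds in a script, here is a minimal sketch of the same mapping. The function name `agent_status` is made up for illustration and is not part of any OutagesIO API. How fractions of a second between 19 and 20 are handled is an assumption; the sketch treats anything under 20 seconds as ACTIVE.

```python
from datetime import timedelta


def agent_status(since_last_trace: timedelta) -> str:
    """Map the time since the last trace to a status (hypothetical helper)."""
    seconds = since_last_trace.total_seconds()
    if seconds < 20:                 # up to 19 sec
        return "ACTIVE"
    if seconds < 30 * 60:            # 20 sec to 29 min 59 sec
        return "INACTIVE"
    if seconds < 30 * 24 * 3600:     # 30 min to 29 days 23 h 59 min 59 sec
        return "DISCONNECTED"
    return "ABANDONED"               # 30 days on


print(agent_status(timedelta(minutes=5)))
```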

As a starting point, we'll try to narrow down if we're looking at a network, server, or software problem.

If you haven't done this, can you try this?
When the agent status is showing inactive/disconnected, and you log into your server, does it actually still have Internet access?

Meaning, can you still browse to sites? More importantly, can you ping tpw.outagesio.com and www.foxymon.com, which are some of the receivers of agent data in our networks? If you can, watch the agent status on OutagesIO while running the tests to confirm that it has not gone back to green.

If you can't reach anything, then it might be a network, Internet, or server connectivity problem.
If there is connectivity, then this will tell us that the problem is likely with the server agent service.
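If you'd rather script the check, the test above could be sketched roughly like this. The helper names are made up for illustration. It assumes the system `ping` command is on the PATH and that an exit code of 0 means a reply came back, which is not perfectly reliable on every platform.

```python
import platform
import subprocess

# The two receivers named above.
RECEIVERS = ["tpw.outagesio.com", "www.foxymon.com"]


def ping_command(host, count=4):
    """Build the ping command; Windows uses -n for the count, others use -c."""
    flag = "-n" if platform.system() == "Windows" else "-c"
    return ["ping", flag, str(count), host]


def can_ping(host, count=4):
    """Return True if ping exits with 0 (assumed to mean replies were received)."""
    result = subprocess.run(ping_command(host, count), capture_output=True)
    return result.returncode == 0


def next_place_to_look(ping_results):
    """Apply the rule above: nothing reachable points at connectivity,
    anything reachable points at the agent service on the server."""
    if not any(ping_results.values()):
        return "network, Internet, or server connectivity"
    return "server agent service"


# Example use while the agent shows inactive/disconnected:
#   results = {host: can_ping(host) for host in RECEIVERS}
#   print(results, "->", next_place_to_look(results))
```

A single host that doesn't answer is not treated as "no connectivity" here, since a receiver may simply not respond to ICMP.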

Like most troubleshooting, we have to try and eliminate what we can to get to the problem.

I hope this is a reasonable and logical starting place for you because if we eliminate the server, then we'll know it's something else from the interface out.

mshafrin

Thank you for the reply......

Yes, I can log into the server, and have full Internet connectivity.

I was able to successfully ping tpw.outagesio.com, but received a "request timed out" error when attempting to ping foxymon.com (100% loss).

The agent status still shows as "red / disconnected" during and after the ping tests from the server.

OutagesIO_Support

Ok thanks for trying that. We just want to confirm these things.

Yes, I realized afterward that foxymon.com doesn't respond to ICMP. I've had that changed so that we can have an additional testing point.

As you mentioned, restarting the service seems to be taking care of the issue which implies that the server is having a problem with the service. While this is rare, it does happen as we've seen lots of posts to that effect.

There could be so many reasons for that which makes it hard to diagnose but in this case, I think I already see what's going on. When the agent was installed on the server, was it done so with full administrator access?

While I can see pings, I'm not seeing any hops coming in (Historical/Hops). This means that the agent is not able to communicate properly with our network and so it's not able to collect or send all of the information you would need to know what's happening with the connection.

It's possible that the installation didn't fully enter the firewall settings it needs to allow the agent to work correctly.

This post has information on how the firewall should look.
https://support.outagesio.com/topic/103/creating-agents-troubleshooting-install-problems

If the firewall settings are correct, and restarting the service is the only remedy, I can tell you from experience in these forums that this will be hard to diagnose. You would have to look at the server app logs to get some idea of what is happening. Most likely, the service is getting shut down by the server due to some other software running on it. That software would need to exclude monitoring the agent.

Another option is to use a Linux server if you have one. I also see that you've ordered a hardware agent, which should eliminate this issue unless something else upstream on the network is blocking the agent after a while. That's unlikely, since restarting the service fixes the problem.

BTW, we noticed that 'Back online' notifications contained the wrong information. That will be fixed sometime today so the message is correct.

mshafrin

I installed the agent using my admin account. I have also double-checked the Windows firewall, and traffic for Echo Networks is allowed for both public and private. I also whitelisted tpw.outagesio.com and foxymon.com in my Barracuda gateway, as well as my Cisco ASA / firewall.

OutagesIO_Support

Glad you were able to confirm all these things, though that leaves us with something on the server shutting down that service or preventing it from working after a while.

There are several posts that have dozens of comments in them trying to solve this kind of problem.
We've spent months testing on many servers and see that sometimes, combinations of software on one machine work fine while on another, it just doesn't seem to want to work properly.

You can try looking at the app logs to see if there's something in there. My guess is that something on the server is causing the agent service to shut down. When you turn it back on, it's fine for a while, then it gets shut down again. I doubt it's crashing considering the amount of testing we do, but that's always a possibility too. Only the logs, or extended logging, will tell.

OutagesIO_Support

The fact that it's not sending hops alone also means that something is preventing those from being sent or received. Are you able to run a tracert to the same domains from that server?

mshafrin

Good morning.....
I ran tracert from the server running the software agent, and it seems all "hops" after the first one, to the router this server is plugged into (192.168.1.1), time out. I am honestly stumped.

OutagesIO_Support

Well, at least we have a lead, right?
Now, do you have another server or device you could ping from that is connected to the same network?
Something that's connected to the same switch or router that this server is connected to.
I mention that because sometimes there are multiple switches and/or routers.

If you can test pings / tracert from another server/device on the same network, that might give you another lead. If you see all responses, then it leads us back to the server the agent is running on. If there are no responses, it leads us to the switch or router that is upstream possibly blocking ICMP at some level or another.

Properly secured equipment often limits the amount of ICMP traffic to lower the chance of being scanned and other hack attempts or simply to lower resource usage.

Maybe something upstream is simply blocking ICMP when it thinks there's too much traffic coming from something inside or outside the network.

ICMP limiting is a common thing and that could cause you to see results then none after what it deems too much traffic.
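To illustrate what "results then none" can look like, here is a toy token-bucket limiter. This is only a sketch of the general idea. It is not how any particular router, firewall, or provider implements ICMP limiting, and the class name and numbers are made up for the example.

```python
class IcmpRateLimiter:
    """Toy token bucket: allows a short burst, then only `rate_per_sec` replies."""

    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec
        self.capacity = float(burst)
        self.tokens = float(burst)   # start with a full bucket
        self.last = 0.0              # time of the previous request, in seconds

    def allow(self, now: float) -> bool:
        # Refill tokens for the time that has passed, capped at the burst size.
        elapsed = now - self.last
        self.last = now
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True              # this echo request would get a reply
        return False                 # this one would be dropped (seen as a timeout)


limiter = IcmpRateLimiter(rate_per_sec=1, burst=3)
# Five echo requests arriving at the same instant: three answered, two dropped.
print([limiter.allow(now=0.0) for _ in range(5)])
```

After a quiet period the bucket refills and replies come back, which is the kind of pattern that can make a problem look intermittent.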

mshafrin

I was able to ping both URLs from another machine on the same switch, but tracert would give the same "time out" result as the server. As far as I know, ICMP isn't being blocked by the firewall, but I will check that.

OutagesIO_Support

What's interesting and maybe just a coincidence is restarting the agent service seems to fix the problem temporarily. That implies that whatever is upstream is allowing some ICMP traffic and then blocking it.

FYI, the agent uses standard ports 80/443 and ICMP. While ICMP is a small part of actually detecting an outage, it's an important part: it's part of the heartbeat that tells us whether the agent is still communicating, and the agent also uses it to send updated hop changes and other tests.

OutagesIO_Support

If your office has a managed services arrangement with an outside company, it might be worth asking them if they have any ICMP (and/or other) limiters put in place. If it's not them and you can't find anything in the building, then it could also be the provider.

OutagesIO_Support

Hi, as an update, the 'Back online' email is correct now.

mshafrin

Good afternoon.....

I believe I've got everything resolved. So, Cisco disables / blocks ICMP traffic by default to any outside interface. Once I allowed ICMP traffic on my firewall / ASA, my "hops" started showing in the history, and tracert to foxymon.com and tpw.outagesio.com (as well as to any other URL) was no longer giving a "time out".

Thank you to everyone for the help!

OutagesIO_Support

Hi,

That's great to hear and I now see hops coming in which means everything should work now.
It also means your hardware agent should function perfectly once you receive it.

mshafrin

I actually just submitted another question regarding the hardware agent.... it is not showing pings or hops, but is connected to the same switch as the server hosting the software agent that is now working fine.

OutagesIO_Support

That's a bit humorous. We were just sending an email to that agent owner about that then noticed it's the same address.

I've asked our dev to look into this one because now I'm stumped and need more input on why it would see a local outage without ICMP.

mshafrin

The "outage" may have been my fault... I switched to a different port on the switch hoping it would possibly correct the no ping / no hops issue lol.

OutagesIO_Support

As far as I understand it, that should have created only an Inactive, not an outage. Give us a few minutes to look into this (ID 130432).

mshafrin

I don’t know if this means anything, but the date and time shown for all the events are incorrect. It’s showing 8-16-23 at 11am, but it’s almost 3:30pm here by me in NJ

SBK

Hi,

First, I need to confirm with you that the agent was activated by you on 2023-08-21 11:25:03 UTC, i.e. today at 7:25 am New Jersey time.
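For reference, the UTC-to-local conversion can be checked with a short sketch like the one below. It assumes New Jersey is on Eastern Daylight Time (UTC-4) on that date and hard-codes the offset rather than looking it up in a time zone database.

```python
from datetime import datetime, timedelta, timezone

# Assumption: New Jersey is on EDT (UTC-4) in August.
EDT = timezone(timedelta(hours=-4), "EDT")

activated_utc = datetime(2023, 8, 21, 11, 25, 3, tzinfo=timezone.utc)
activated_local = activated_utc.astimezone(EDT)

print(activated_local.strftime("%Y-%m-%d %H:%M:%S %Z"))  # 2023-08-21 07:25:03 EDT
```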
