Field Office | Video 06 | Templates

2022-11-14T00:00:00-08:00

Marty the OT Guy walks you through how to diagnose the following network problems, and how to do it in the lab:
00:00 Introduction
03:17 1. Network retransmission
11:07 2. Throughput rates
16:29 3. Down network links over time
Thanks, Marty!

Transcript

0:02 [Music]
0:12 Marty the OT Guy here
0:15 back today to talk about some things
0:16 you can do with Nozomi Networks
0:19 products that can help operational
0:21 technology teams diagnose network
0:23 problems
0:25 we'll start off with a funny
0:27 story
0:29 um we were dealing with a remote site
0:32 here in New Zealand right down in the
0:35 South Island wonderful South Island for
0:36 anyone who's been there Lord of the
0:39 Rings and all those things and we had a
0:42 wind farm uh the primary comms
0:46 connection was via satellite
0:49 and over time
0:51 we found out the hard way of course
0:53 we didn't find out while this
0:55 was happening but over time the
0:58 satellite Communications were dropping
1:00 off and it was progressively getting
1:02 worse and worse and we couldn't really
1:04 understand what was going on
1:06 um and it's a long way to get to this
1:08 place it's very remote
1:09 we're flying from one end of the
1:11 country to the other we then have I
1:15 think it was a nearly two-hour drive
1:17 4x4 off-roading to get to the
1:20 site and
1:22 yeah we couldn't work out what was going
1:24 on with these comms issues and
1:26 what we really needed was a way to
1:28 diagnose the problem
1:31 and of course as you do you work out
1:33 how to do it when you've got the tools
1:35 after the fact but what would
1:37 happen was the satellite
1:39 communications would misbehave and it
1:41 would go in and out or it'd be
1:43 shaky for short periods of time and
1:45 things like that and progressively over
1:47 time it got worse and worse
1:48 so
1:50 when we did the site analysis and we
1:52 finally got to have a look what we found
1:55 out was some fencing was broken and some
1:58 local cows had managed to work out that
2:02 a really good way to get a back massage
2:04 was rubbing up against the
2:08 transmission arm and LNB on a 1.2 meter
2:12 satellite dish so whenever the cow
2:14 needed a back scratch he would wander up
2:17 sidle up against the satellite dish and
2:20 get jiggy and have a good rock around
2:22 and then as he went back to the
2:25 herd and told some of the other
2:26 guys
2:27 progressively they came along
2:29 and they're all rubbing up against
2:31 the satellite dish and
2:33 in the first instances we
2:35 could only assume that while the cow was
2:37 rubbing
2:38 of course he's rocking the whole
2:40 satellite dish which is knocking the
2:41 alignment out and then progressively as
2:43 they used the actual dish itself it
2:46 slowly rotated the dish and knocked it
2:48 out
2:50 of alignment so yeah we worked this out one
2:53 day when we were actually there on site
2:55 and the cow walked up and did it right
2:57 in front of us that was our
2:59 diagnostic tool but over time it would
3:02 have been so much easier if we could
3:04 have worked it out on the fly so yeah
3:07 today's session let's share three ideas
3:09 about
3:12 how you can diagnose network problems
3:14 using the Nozomi Networks tools
3:16 [Music]
3:22 okay so the first thing I want to talk
3:24 about is monitoring network
3:27 retransmission so when you've got a
3:31 Nozomi Networks appliance a Guardian
3:32 appliance in your network and you are
3:35 monitoring links and sessions and
3:37 throughput and all the good things that
3:39 come with that you have the ability to
3:41 look for retransmissions how do we pick
3:43 up retransmissions because it's on
3:45 the network you're going to see it
3:47 if you took a packet capture you're
3:49 going to see retransmission data inside
3:52 that packet capture
3:54 so we can set up alerts
3:57 and things called assertions an
3:58 assertion is sort of like a question
4:00 where you say
4:02 if A occurs in B time
4:07 then create an alert or you can just
4:10 have an assertion that if A occurs
4:12 in B time
4:14 just assert let us know it doesn't have
4:16 to alert
4:17 so that's really how you do it you
4:19 would set up a monitoring query that
4:23 would monitor the network link that
4:25 you're looking for and when the
4:26 retransmission rate
4:29 exceeds a given set point
4:31 yeah you can create an alert from
4:33 that so that's kind of cool and very
4:35 useful for monitoring
4:37 important
4:38 you know critical links primary links
4:40 or something like that and it reminds me
4:42 of another story and we did this
4:44 with a customer recently where
4:47 during a trial session during a proof of
4:49 concept the customer commented
4:52 that the communications to this one
4:54 particular remote site
4:57 were always a problem and we looked
5:00 at it and we were able to help
5:02 them diagnose really quickly just by
5:04 picking up that retransmission rate
5:07 and showing how bad it actually was
5:10 we were able to diagnose really quickly
5:12 that the microwave link they had in
5:14 place so microwave dish at one end microwave
5:16 dish at the other the microwave link they
5:18 were using the dishes were slightly
5:20 out of alignment so they sent a
5:23 technician out to site who
5:26 rotated the dish slightly got it back in
5:28 alignment and all of a sudden all the
5:30 retransmission went away
5:32 so yeah the proof in the pudding
5:35 there was it was really easy for them to
5:37 be able to
5:42 not go you know we've got this problem I
5:44 don't know what to do with it it was
5:45 we've got this problem we've proven it
5:47 we can fix it it's a really fast
5:49 turnaround on that one so that was cool
5:51 that was a good one
5:59 so let's look at the equipment we're
6:01 using to do this
6:03 experiment in the lab
6:05 so I've got some cheap CCTV cameras
6:08 connected to my lab network they go
6:10 through a variety of infrastructure and
6:12 end up at a network video recorder and
6:15 along the way my Nozomi Networks
6:17 Guardian appliance is monitoring the
6:19 traffic
6:21 taking a look at the query you can see
6:24 here a screenshot from the Guardian
6:26 appliance I've entered the query you can
6:28 see the results there's some fairly high
6:30 retransmission rates in there this is
6:32 deliberate it's the way my lab is set up
6:34 purposely for this experiment but let's
6:37 dive deeper into how this query works
6:40 and we've presented it here in what I
6:43 like to call Vantage format and the
6:45 difference between our Vantage SaaS
6:48 platform and the Nozomi Networks
6:51 Guardian appliance
6:53 is the way that the queries are
6:56 presented visually so in the Guardian
6:59 appliance you see it all as one big long
7:01 string whereas Vantage format which is
7:04 my term breaks it down into
7:08 separate lines which makes it a little
7:09 easier to understand so let's take a
7:11 look at what's happening on the first
7:12 line here we're referencing the links
7:14 table
7:15 the vertical line at the end the pipe
7:18 then sends that on to the next command
7:19 so second line we're saying where
7:22 TCP retransmission dot percent is greater
7:25 than 10
7:26 that's all it's self-explanatory it's
7:29 really simple so we're pulling a
7:31 data set from the links table of all
7:35 records where retransmission percentage
7:38 is greater than 10. we pipe that through
7:40 to the third line and we do a select
7:42 command because I want to display this
7:43 data in a meaningful format and I've
7:45 deliberately used some features here
7:48 just to make it a bit easier to read you
7:51 don't have to do it this way so starting
7:54 from the left we select from then you
7:57 see the right arrow which is a hyphen
7:59 sign and a greater than and we're taking
8:01 lowercase from to From with a capital
8:04 the right hand side of that expression
8:07 becomes the new name for the column on
8:10 the report
8:11 so reading through it we select from we
8:13 select to we select TCP retransmission
8:17 percent and we've renamed those to From
8:19 and To with a capital F and T and
8:23 Retrans percent as a more readable
8:25 result and you can see in the screenshot
8:27 below that's our report that's come out
8:30 nice and simple
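
The three steps described above (start from the links table, keep rows whose TCP retransmission percentage is above 10, then select and rename three columns) can be sketched in plain Python. This only illustrates the logic and is not Nozomi Networks query syntax; the sample records, the field names, and the helper name high_retransmission_report are assumptions made for the example.

```python
# Hypothetical link records; addresses, field names, and values are invented.
LINKS = [
    {"from": "10.0.0.11", "to": "10.0.0.200", "retrans_percent": 14.2},
    {"from": "10.0.0.12", "to": "10.0.0.200", "retrans_percent": 3.1},
    {"from": "10.0.0.13", "to": "10.0.0.200", "retrans_percent": 27.5},
]


def high_retransmission_report(links, threshold=10.0):
    """Keep links above the threshold and rename the columns for display."""
    return [
        {"From": link["from"], "To": link["to"], "Retrans %": link["retrans_percent"]}
        for link in links
        if link["retrans_percent"] > threshold
    ]


for row in high_retransmission_report(LINKS):
    print(row)
```

Changing the threshold argument plays the same role as changing the number in the where step.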
8:32 so let's look at this in a slightly
8:35 different way what if we don't want a
8:37 table as an output let's look at a
8:39 network graph query why can't we create
8:41 a network graph so
8:43 this is using similar inputs to get
8:48 a graph output it could be easier to
8:50 read it might make life simpler for the
8:52 users or for the engineers
8:54 that need to work on this
8:58 so let's break down the query again here
9:01 in Vantage format this time it's a bit
9:04 more complicated because to use the
9:06 network graph feature we have to be
9:09 referencing the nodes table so we have
9:11 to do a join the first line we select
9:14 the nodes table we pipe that to a join
9:16 command which is the second line the
9:18 second line is joining the nodes table
9:21 and the links table using the ip
9:26 column from the nodes table and the to
9:29 column from the links table as the
9:32 common call it a key if you like like a
9:35 database key so in the first two lines we're
9:38 saying take the nodes table take the
9:40 links table and
9:44 my resulting data set needs to be all
9:47 of the entries where ip and to have the
9:51 same IP address in them we then pipe
9:54 that into a where command
9:55 so we're saying where joined link ip to
9:59 that is a new column that's created
10:02 as part of the resulting data set
10:05 joined link ip to TCP retransmission
10:08 percent greater than five this time I
10:11 went for five percent obviously you
10:13 change that to suit the threshold that's
10:17 relevant in your environment
10:19 finally we pipe that through to the
10:21 graph command
10:23 in order to use this graph command
10:25 ideally we
10:29 set three parameters so the first one
10:31 we're setting is node label which is IP
10:34 address the second one we're setting is
10:37 the node perspective and we're saying
10:39 use the roles so I want to see each
10:42 individual device I want to know is it a
10:45 producer is it a consumer a server the
10:48 the role it plays within the network and
10:50 the third parameter we're setting is
10:52 link perspective TCP retransmitted bytes
10:56 so that means that the arrows showing
11:00 the links will change color depending on
11:02 the level of retransmission
11:05 present on each leg
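
The join described above (match each node's ip against each link's to, then keep the noisy links and hand them to a graph) can be sketched in Python as follows. It is a rough illustration of the data flow only, under the assumption that a graph edge needs a source, a destination, the destination's role, and the retransmitted byte count used for colouring. All names and sample values are hypothetical.

```python
# Hypothetical sample data; addresses, roles, and field names are invented.
NODES = [
    {"ip": "10.0.0.200", "roles": "consumer"},
    {"ip": "10.0.0.11", "roles": "producer"},
]
LINKS = [
    {"from": "10.0.0.11", "to": "10.0.0.200", "retrans_percent": 8.0, "retrans_bytes": 52_000},
    {"from": "10.0.0.200", "to": "10.0.0.11", "retrans_percent": 1.0, "retrans_bytes": 900},
]


def join_nodes_links(nodes, links):
    """Pair each link with the node whose ip equals the link's 'to' address."""
    by_ip = {node["ip"]: node for node in nodes}
    return [
        {"node": by_ip[link["to"]], "link": link}
        for link in links
        if link["to"] in by_ip
    ]


def graph_edges(nodes, links, threshold=5.0):
    """Return (from, to, destination role, retransmitted bytes) for noisy links."""
    return [
        (j["link"]["from"], j["node"]["ip"], j["node"]["roles"], j["link"]["retrans_bytes"])
        for j in join_nodes_links(nodes, links)
        if j["link"]["retrans_percent"] > threshold
    ]


print(graph_edges(NODES, LINKS))
```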
11:12 the second one we're going to move on to
11:15 is about network transmission rates
11:18 throughput rates so we had a customer
11:21 reasonably recently who
11:24 needed to be able to monitor the average
11:26 throughput over time for given
11:29 communication
11:32 links so in their case they had
11:35 radio connections between sites and
11:39 they needed to know that the
11:42 communication throughput stayed within
11:45 certain limits over a seven day period
11:48 or something like that they had to
11:49 report on that as part of their
11:52 performance metrics
11:54 so they asked us to design and Implement
11:56 a feature which allowed them to do that
11:58 so we're now able to look at
12:01 network throughput for any given link
12:04 and alert or assert based on a high or
12:08 low level so for instance if you've got
12:10 a radio link from point A to point B
12:14 and you know that it runs at a constant
12:17 50 megabits per second or fairly
12:20 constant 50 megabits per second you can
12:23 put in some alerting figures for
12:25 argument's sake let's say at 40 and at
12:28 60 so we have 40 and 60 as our lower
12:31 and upper limits we have 50 as
12:34 our normal operation level and if the
12:37 network traffic peaks above 60 or
12:41 drops below 40 for a specified period of
12:45 time we can raise an alert based on that
12:47 why is that important well
12:51 if you're losing network throughput if
12:54 it's dropping off it may be indicative
12:56 of devices that are failing in the field
12:59 you might have something that's given up
13:01 and it's starting to drop away or the
13:03 throughput drops away because the device
13:05 has failed it may not be a critical
13:07 device
13:09 and perhaps you don't detect it or
13:14 some other part of
13:15 the network that you can't
13:18 necessarily pick up through a critical
13:20 failure if you have a spike in traffic
13:23 that could indicate that someone's added
13:25 something new to your network
13:27 some behavior is changing the throughput
13:30 maybe it's re-transmissions again maybe
13:32 there's a whole lot of messages going
13:34 backwards and forwards just for that and
13:36 it's taking up more and more Network
13:38 throughput
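
The 40/50/60 megabit example above amounts to a band check that only fires when the link stays outside the band for a sustained period. A minimal Python sketch of that idea follows, assuming throughput is sampled at regular intervals and that "a specified period of time" means a number of consecutive samples. The function name and sample values are hypothetical.

```python
def out_of_band_alert(samples_mbps, low=40.0, high=60.0, sustained=3):
    """Return True if `sustained` consecutive samples fall outside [low, high]."""
    run = 0
    for value in samples_mbps:
        # Extend the run while outside the band, reset it when back inside.
        run = run + 1 if (value < low or value > high) else 0
        if run >= sustained:
            return True
    return False


# Three low samples in a row: long enough to raise an alert.
print(out_of_band_alert([50, 51, 38, 37, 36, 49]))
# Only brief excursions above and below the band: no alert.
print(out_of_band_alert([50, 62, 49, 38, 51, 50]))
```

Requiring a sustained run rather than a single sample is one way to avoid alerting on momentary peaks; the right run length depends on the link.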
13:39 [Music]
13:44 okay lab session two
13:47 for this experiment we're using a PLC an
13:50 HMI and SCADA workstation and an
13:52 engineering workstation within my lab
13:54 environment my Nozomi Networks Guardian
13:56 is monitoring the traffic
14:01 so let's take a look at this query
14:03 this one's a little different from the
14:04 last one we used
14:06 because I want the output of the
14:09 query to be an assertion
14:11 think of an assertion as
14:14 a true false yes no on off red green
14:18 binary output so in this case we've got
14:22 it showing up it's giving us a green
14:25 line a green box there to say that this
14:28 assertion is okay
14:30 and what we're doing here is we're
14:32 looking at
14:34 link traffic between two nodes and
14:37 saying hey if it's not transmitting
14:39 enough traffic or if it's transmitting
14:41 too much traffic
14:44 across a 15 minute period then can
14:46 you please let me know about
14:48 that
14:49 let's break the query down
14:52 so start with the links table pipe that
14:56 through to the first where command on
14:57 line two where we're selecting the one
15:00 end of the link which in this
15:03 case is .50.35 which is the SCADA
15:05 workstation
15:07 pipe that to the third line we're
15:09 selecting the to end the other end of
15:11 the link which is .50.60 that's my
15:14 small PLC
15:16 pipe that through to the fourth line
15:19 we're looking at transferred last 15
15:21 minute bytes how much traffic went
15:24 through over the last 15 minutes and
15:26 we're saying here if it was less than a
15:28 megabyte then I want to know about it
15:30 and pipe that through to the fifth line
15:32 the fifth line being the same condition
15:36 however we're saying if it's greater
15:39 than 100 meg this time so really what
15:41 I'm doing is I'm saying is my PLC
15:43 transmitting any traffic at all and
15:46 because this is just for a lab example
15:48 I've deliberately used really wide
15:51 um parameters there you could narrow
15:54 that down you could say I've got a link
15:56 uh between this
15:58 station and that station and I know that
16:02 the traffic should be 500 kilobits plus
16:05 or minus 50 kilobits and you could
16:08 change the parameters in here and really
16:10 narrow that up
16:12 and what we're saying here that final
16:14 part on line one two three four five
16:15 right at the very end
16:17 um we pipe that through and say assert
16:20 empty so we're saying if the traffic is
16:25 inside the parameters we want I want a
16:27 green box
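
The "assert empty" idea above can be read as: build the set of rows that violate the limits, and show green only when that set is empty. The Python sketch below illustrates that reading. It assumes the two limits are alternatives (too little or too much traffic), uses placeholder labels instead of the lab addresses, and every name in it is hypothetical.

```python
ONE_MB = 1_000_000
HUNDRED_MB = 100_000_000

# Hypothetical link records; the labels stand in for the lab devices.
LINKS = [
    {"from": "scada-ws", "to": "plc", "last_15m_bytes": 4_200_000},
    {"from": "eng-ws", "to": "plc", "last_15m_bytes": 12_000},
]


def violations(links, src, dst, low=ONE_MB, high=HUNDRED_MB):
    """Rows for the src -> dst link whose 15-minute byte count is outside [low, high]."""
    return [
        link
        for link in links
        if link["from"] == src
        and link["to"] == dst
        and (link["last_15m_bytes"] < low or link["last_15m_bytes"] > high)
    ]


def assertion_ok(rows):
    """Green when nothing matched, red when at least one row matched."""
    return not rows


rows = violations(LINKS, "scada-ws", "plc")
print("green" if assertion_ok(rows) else "red")
```

Narrowing low and high is the equivalent of tightening the parameters for a link whose normal traffic level is well known.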
16:34 have you ever had those times in your
16:36 career when you submit some compliance
16:39 reporting or some trends for analysis
16:43 and someone comes back to you and says
16:46 can you please explain this Gap in the
16:48 trend for let's say two days there's a
16:51 gap in the trend and you went I'm sorry
16:53 what Gap in the trend or you maybe
16:55 didn't know that that data was missing
16:57 you found out the hard way
17:00 how do we solve that what can we do
17:02 about that well the Nozomi
17:06 Networks product is able to monitor when
17:10 a device last communicated so it's sort
17:14 of like throughput
17:15 um but instead of looking at volume
17:16 we're now looking at time so we can set
17:19 up alerts and assertions so that if a
17:22 device doesn't communicate for a given
17:24 period of time we trigger an alert let's
17:27 say you have a 12-hour compliance
17:30 reporting requirement you you need to
17:33 maintain your reporting and trending and
17:36 you're not allowed to have any outages
17:37 greater than 12 hours it'd be a really
17:40 good idea to detect a a loss of
17:44 communication a link drop or yeah or
17:48 last last time that a device was
17:50 communicating
17:52 um maybe at eight hours maybe you pick
17:54 it up at four hours and that gives you
17:57 plenty of time to do diagnostics and
17:59 hopefully fix the problem problem before
18:01 it becomes a compliance issue
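
The early-warning idea above (a 12-hour compliance limit with a warning at four or eight hours) can be sketched as a small classifier on the time since a device last communicated. This is a hedged illustration only; the function name, the status labels, and the example timestamps are assumptions.

```python
from datetime import datetime, timedelta


def silence_status(last_seen, now, warn_after=timedelta(hours=4), limit=timedelta(hours=12)):
    """Classify how long a device has been silent relative to a compliance limit."""
    silent_for = now - last_seen
    if silent_for >= limit:
        return "breach"
    if silent_for >= warn_after:
        return "warning"
    return "ok"


now = datetime(2022, 11, 14, 12, 0)
# Last seen one hour ago, then six hours ago.
print(silence_status(datetime(2022, 11, 14, 11, 0), now))
print(silence_status(datetime(2022, 11, 14, 6, 0), now))
```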
18:10 okay lab session three same as before
18:14 we're using the PLC the SCADA HMI
18:17 workstation and the engineering
18:19 workstation with the Nozomi Networks
18:20 Guardian monitoring the traffic so this
18:23 query again we're using an
18:25 assertion but this one is misbehaving
18:28 this time you can see it's got a red
18:29 line so it's a bit different
18:31 let's break it down again starting with
18:34 the links table
18:35 but this time I'm excluding some
18:38 traffic I don't want the
18:42 0.0.0.0
18:45 address in my Nozomi Networks
18:48 Guardian configuration
18:51 is used as a catch-all for all of my
18:53 public IP addresses or anything that's
18:56 not within my lab so we're saying
18:59 I'm not interested in links that
19:01 are trying to reach the internet or
19:02 trying to get to my home network or in
19:04 any way related to my management or
19:06 anything like that
19:09 let's look at that third line where
19:11 to exclude
19:13 224.0 okay so I want to exclude 224.0
19:17 addresses which you know they're
19:18 multicast addresses I don't want to know
19:20 about them
19:21 next line down where to exclude ff I
19:24 don't want IPv6 in
19:26 the result from this query
19:29 and then
19:30 next line down so we're now on one two
19:32 three four five where to is not equal
19:36 to my public catch-all again so I don't
19:38 want traffic going to or from there
19:42 now we get into the interesting stuff so
19:45 the next line down we're saying where
19:47 the last activity happened greater than
19:49 two hours ago so if I know that I have a
19:52 PLC that should communicate roughly
19:54 every hour
19:56 um then I can set that to be greater
19:58 than two hours we haven't had a response
20:01 something might be wrong
20:02 we come down again and we say where the
20:04 transferred average packet has dropped
20:07 below
20:08 a thousand so now we're saying
20:11 it's been greater than two hours and our
20:12 average traffic is much much lower than
20:14 we expect
20:16 we maybe have some problems here
20:18 so I put a select line in there
20:22 this is sort of a demonstration of how
20:25 not to make a query in a way we've got
20:27 select in there we're saying hey give
20:29 me from to transferred average packet
20:33 bytes but then we're piping through to
20:35 an assertion which means we're not
20:37 seeing the output of the select command
20:40 so yeah that's a deliberate mistake
20:42 well you can
20:45 believe it's deliberate or not I don't
20:46 mind but it's a deliberate mistake to
20:48 show you what would happen or how it
20:50 may not result the way you expect the
20:53 query still works properly
20:56 but what we don't get is the
21:00 data table coming out from the select
21:03 command
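
The filter chain walked through above (drop the catch-all address, drop multicast destinations, then keep links that have been silent for more than two hours and whose average packet size is under a thousand bytes) can be sketched in Python with the standard ipaddress module. The sketch assumes that 0.0.0.0 is the catch-all on either end and that the 224.0 and ff exclusions are both multicast filters on the destination. All record fields and sample addresses are hypothetical.

```python
import ipaddress
from datetime import datetime, timedelta

CATCH_ALL = "0.0.0.0"  # assumed stand-in for "anything outside the lab"
NOW = datetime(2022, 11, 14, 12, 0)

# Hypothetical link records; addresses and field names are invented.
LINKS = [
    {"from": "10.0.0.5", "to": "10.0.0.6", "last_activity": NOW - timedelta(hours=3), "avg_packet_bytes": 120},
    {"from": "10.0.0.5", "to": "224.0.0.251", "last_activity": NOW - timedelta(hours=5), "avg_packet_bytes": 80},
    {"from": "10.0.0.5", "to": "ff02::fb", "last_activity": NOW - timedelta(hours=5), "avg_packet_bytes": 80},
    {"from": "10.0.0.5", "to": "0.0.0.0", "last_activity": NOW - timedelta(hours=9), "avg_packet_bytes": 60},
]


def wanted(link):
    """Exclude the catch-all address and IPv4/IPv6 multicast destinations."""
    if CATCH_ALL in (link["from"], link["to"]):
        return False
    return not ipaddress.ip_address(link["to"]).is_multicast


def stale_small_links(links, now, max_silence=timedelta(hours=2), min_avg_bytes=1000):
    """Links of interest that have gone quiet and carry unusually small packets."""
    return [
        link
        for link in links
        if wanted(link)
        and now - link["last_activity"] > max_silence
        and link["avg_packet_bytes"] < min_avg_bytes
    ]


rows = stale_small_links(LINKS, NOW)
print("red" if rows else "green", len(rows))
```

As in the transcript, the assertion only reports whether any rows matched, so a column selection placed before it has no visible effect.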
21:05 okay so this next lab session we're
21:07 using the same configuration as before
21:10 but what we're doing differently is we're
21:12 going to use the link state rather than
21:14 the links table itself so it's a
21:17 slightly different query
21:19 and let's take a look at it
21:21 the first thing we do is we start off
21:23 with the link events table
21:26 we pipe that through to our first where
21:28 command so we're saying where the event
21:30 includes down so now we're looking at
21:33 links that have gone down the link's not there
21:35 anymore
21:37 pipe through to the next one where hours
21:40 ago time less than 24 so we're saying what
21:44 has happened in the last 24 hours have
21:46 any of these links gone down
21:50 pipe through again and we've got a big
21:52 long select so we're sort of combining
21:55 everything that we've covered in these
21:56 labs here so we're selecting ID source
21:58 but we're renaming that to From
22:01 we're selecting port source renaming
22:04 that to From Port we're selecting ID
22:06 destination and calling that To we're
22:09 selecting port destination and calling
22:12 that To Port and we're selecting time
22:15 and calling it Time and same with
22:17 protocol selecting protocol calling it
22:19 Protocol with a capital P and you can
22:21 see there the resulting data set and
22:23 table that we've got out of that
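
The link events query above (keep events that include "down", keep only the last 24 hours, then select and rename six columns) can be sketched in Python like this. It is an illustration of the filtering and renaming only; the source field names such as id_src and port_src, and the sample events, are assumptions rather than the product's actual schema.

```python
from datetime import datetime, timedelta

NOW = datetime(2022, 11, 14, 12, 0)

# Assumed source-field -> report-column mapping, mirroring the renames described.
RENAMES = {
    "id_src": "From",
    "port_src": "From Port",
    "id_dst": "To",
    "port_dst": "To Port",
    "time": "Time",
    "protocol": "Protocol",
}

# Hypothetical link events; all values are invented.
EVENTS = [
    {"id_src": "10.0.0.5", "port_src": 50123, "id_dst": "10.0.0.6", "port_dst": 502,
     "protocol": "modbus", "event": "down", "time": NOW - timedelta(hours=3)},
    {"id_src": "10.0.0.5", "port_src": 50123, "id_dst": "10.0.0.6", "port_dst": 502,
     "protocol": "modbus", "event": "up", "time": NOW - timedelta(hours=1)},
    {"id_src": "10.0.0.7", "port_src": 50999, "id_dst": "10.0.0.6", "port_dst": 502,
     "protocol": "modbus", "event": "down", "time": NOW - timedelta(hours=30)},
]


def recent_down_events(events, now, window=timedelta(hours=24)):
    """Down events inside the window, with columns renamed for the report."""
    return [
        {new: event[old] for old, new in RENAMES.items()}
        for event in events
        if "down" in event["event"] and now - event["time"] < window
    ]


for row in recent_down_events(EVENTS, NOW):
    print(row)
```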
22:25 so that's my three tips for
22:27 this session if you're coming into
22:31 OT from the IT world some of this might
22:34 seem a little unusual or a little
22:36 different if
22:38 you're working in a large campus
22:40 how often a device communicates or
22:43 even in some cases how much bandwidth
22:45 it's using isn't often that important
22:49 it might not matter if a given
22:51 endpoint goes from 50 megabits a second
22:54 to greater than 60 who cares it could
22:57 be a use case thing but in the OT world
23:00 that could be indicative of some
23:02 problems in your network or things
23:04 that could go on to affect production or
23:07 compliance so it's pretty important that
23:10 we keep an eye on those kind of things
23:17 so that brings us to the end
23:19 we'll look forward to catching up
23:20 with you next time and I'm really
23:23 interested can we get some feedback
23:25 from you people out there would you find
23:28 it interesting to learn about some core
23:31 OT concepts such as you know design or
23:37 digital and analog inputs and outputs
23:40 versus
23:41 communication devices you know a
23:44 bit more technical detail on how OT
23:46 systems actually work rather than just
23:49 how we're using it within the Nozomi
23:52 world the two things fit together really
23:54 well we could have a lot of fun
23:55 discussing some of those things if
23:57 you're interested in knowing more about
23:59 that let us know we'll see you next time
24:02 [Music]
