Visualise and Analyse Your Data Centre Fabric with HAWK

Hello my friend,

Some time ago on LinkedIn we announced that we were working on a tool that would allow you to model and analyse your network. As one of our primary focuses is data centres, we started there. Although it is at an early stage, we are happy and proud to introduce HAWK: Highly-efficient Automated Weapon Kit. For now, this is a collection of tools for network management and analysis, but later we will probably put it under the joint hood of some front-end, who knows…


No part of this blogpost may be reproduced, stored in a
retrieval system, or transmitted in any form or by any
means, electronic, mechanical or photocopying, recording,
or otherwise, for commercial purposes without the
prior permission of the author.

Where is the border between network automation and software development?

In order to automate any network operation, you need to write a script, even a simple one. On the other hand, any script is a program, that is, software. This means that creating scripts for network automation is a form of software development. And it is fun. It is not always easy, but it is definitely the way to go. Don’t listen to the noise of those who say that you need a degree in computer science to be able to do that. We can do it, so you can do it as well. Learn with us how to build working automation solutions for your infrastructure (networks, servers, virtual machines, containers and much more).

At our trainings, advanced network automation and automation with Nornir (the 2nd step after advanced network automation), we give you detailed knowledge of all the relevant technologies:

I recently participated in a training on Nornir with Anton Karneliuk. Highly recommended for all those who want to develop their automation skills.

Juan Pablo Azar Ricciardi @ Network Engineer en Veepee

Moreover, we put all the mentioned technologies in the context of real use cases, which our team has solved and is solving in various projects in service provider, enterprise and data centre networks and systems across Europe and the USA. That gives you the opportunity to ask questions to understand the solutions in depth and to discuss your own projects. On top of that, each technology comes with online demos, and you do the labs afterwards to master your skills. Such a mixture creates a unique learning environment, which all students value so much. Join us and unleash your potential.

Start your automation training today.

Why have we developed the HAWK?

All the things we develop and publish at Karneliuk.com are tools we use ourselves in the daily job of our team or for our customers. It means that we create network automation tools to solve real-world problems. However, we believe that our problems aren’t unique, which allows you to use what we have created to improve the efficiency of your network operations as well.

So, what is the HAWK?

The HAWK is an acronym for the Highly-efficient Automated Weapon Kit. We strongly believe that the networking space is a war zone, and to win there you need reliable and efficient tools. Also, obviously, you need to know how to use them 🙂 So, the HAWK is such a tool. Or, to be precise, a collection of tools.

First HAWK’s tool: Data Centre Topology Analyzer and What-if Simulator

We started with the first practical use case. Some time ago one of our colleagues was performing maintenance in the network, with the aim of rolling out a new software version on the data centre switches. Although the network is highly redundant, the engineer hadn’t checked the actual status of the BGP connections. That was a mistake: unfortunately, some of the BGP sessions were not operational, so once he started the upgrade, two parts of the data centre were not able to communicate with each other. That was definitely not good at all.

So we started thinking about how we could make sure that we don’t repeat this mistake…

One of our key approaches at Karneliuk.com is to create solutions that address the root cause of problems, rather than dealing with the consequences. As such, our thoughts went in the direction of analysing the real network topology, so that we:

With these ideas in mind, we started working on the HAWK/topology-analyzer logic.

How does the HAWK/topology-analyzer work?

Although the requirements aren’t necessarily the most complicated, there were a few points that we were not sure how to handle at the very beginning, so we started by working out the tool’s logic.

#1. What is the inventory?

The very first question is where to take the data about your devices from. Originally we started with a local YAML or JSON file, but quickly figured out that this approach doesn’t scale, as those files need to be populated manually.

Learn how to use the various data types, such as XML, JSON, and YAML at our network automation training.

Although we still plan to add a local inventory for companies with a small infrastructure, we focused on the integration with NetBox, using it as the primary data source. The approach is simple: in the configuration file you define the name of the data centre from NetBox, the NetBox URL, and the roles of the devices you would like to poll from NetBox. As each company may have its own role names, we created a simple mapping table, which maps HAWK’s internal roles (leaf, spine, border, and aggregate) to lists of the names the customer uses for those device roles in NetBox. That removes the need to modify the internal logic of the script.

The HAWK was created with Cumulus Linux (now NVIDIA) in mind, as that is something we work with on a daily basis in a high-scale live environment. On the one hand, this affects some particular parts of the script, but we have put a lot of effort into generalisation, and this work is still ongoing.

From NetBox we collect the information about the devices. What is important is that the device:
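To make the mapping idea more concrete, here is a minimal sketch of how such a table could be used. It is not the HAWK's actual code: the function names, the device dictionaries and the extra NetBox role names (tor-switch, border-leaf, firewall) are invented for illustration.

```python
# Hypothetical mapping: internal role -> role names used in NetBox.
# Only the four internal role names come from the article; the rest is invented.
MAPPING = {
    "leaf": ["leaf", "tor-switch"],
    "spine": ["spine"],
    "border": ["border-leaf"],
    "aggregate": ["aggregate"],
}


def invert_mapping(mapping):
    """Turn {internal_role: [netbox_role, ...]} into {netbox_role: internal_role}."""
    inverted = {}
    for internal_role, netbox_roles in mapping.items():
        for netbox_role in netbox_roles or []:
            inverted[netbox_role] = internal_role
    return inverted


def classify(devices, mapping):
    """Assign the internal role to each device; devices with unmapped roles are skipped."""
    lookup = invert_mapping(mapping)
    return {
        device["name"]: lookup[device["role"]]
        for device in devices
        if device["role"] in lookup
    }


devices = [
    {"name": "leaf11", "role": "tor-switch"},
    {"name": "spine1", "role": "spine"},
    {"name": "fw1", "role": "firewall"},
]
print(classify(devices, MAPPING))
```

With such a lookup, supporting a new naming scheme only requires a change in the configuration, not in the code.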

#2. How is the data collected?

Instead of analysing the NetBox information (although we can extend the capabilities of the HAWK for that as well), the HAWK collects the information about the interfaces, their IP addresses and the BGP sessions from the devices. The good thing about Cumulus Linux is that it makes it possible to collect certain information directly in JSON format, which simplifies its parsing. As there might potentially be hundreds (or thousands) of switches, we collect the information asynchronously using the AsyncSSH library. This approach allows us to reach the best possible performance with Python.
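The asynchronous collection pattern can be sketched as follows. This is an illustration under assumptions, not the HAWK's implementation: fetch_json is a stand-in that returns canned JSON instead of opening an SSH session, the command string is just an example, and the concurrency limit of 50 is arbitrary. With AsyncSSH, the body of fetch_json would typically open a session with asyncssh.connect(...) and execute the command with conn.run(...).

```python
import asyncio
import json


async def fetch_json(host, command):
    """Stand-in for an SSH call: returns canned JSON instead of contacting a device."""
    await asyncio.sleep(0.01)  # simulate network latency
    return json.dumps({"host": host, "command": command, "peers": {}})


async def collect(hosts, command, limit=50, timeout=20):
    """Poll all hosts concurrently, with at most `limit` sessions in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def worker(host):
        async with semaphore:
            try:
                raw = await asyncio.wait_for(fetch_json(host, command), timeout)
                return host, json.loads(raw)
            except (asyncio.TimeoutError, OSError, ValueError) as error:
                # One unreachable or misbehaving switch must not stop the whole run.
                return host, {"error": str(error)}

    return dict(await asyncio.gather(*(worker(host) for host in hosts)))


hosts = ["192.168.100.181", "192.168.100.182", "192.168.100.185"]
results = asyncio.run(collect(hosts, "net show bgp summary json"))
print(sorted(results))
```

The semaphore keeps the number of simultaneous sessions bounded, which matters once the inventory grows to hundreds of switches.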

#3. How is the collected data analysed?

We love math. That’s why we use math to analyse all the collected data.

First of all, we create the network graph using the same approaches we explained earlier in the hyper-scaler data centre series. The graph is built using the real status of the BGP sessions between the devices. Therefore, if some sessions aren’t operational, they won’t be used for the path computation.

For building the graph we use the NetworkX library, which you can learn at our training.
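The idea of building the graph only from operational sessions can be sketched as below. To keep the example free of dependencies, it uses a plain adjacency map from the standard library rather than NetworkX, and the session tuples are invented sample data, so treat it as an illustration of the principle only.

```python
from collections import defaultdict

# Invented sample data: (local device, remote device, BGP session state)
sessions = [
    ("leaf11", "spine1", "Established"),
    ("leaf11", "spine2", "Idle"),
    ("leaf12", "spine1", "Established"),
    ("leaf12", "spine2", "Established"),
]


def build_graph(sessions):
    """Return an undirected adjacency map that contains only operational sessions."""
    graph = defaultdict(set)
    for local, remote, state in sessions:
        # Register both nodes even if none of their sessions is up.
        graph[local]
        graph[remote]
        if state == "Established":
            graph[local].add(remote)
            graph[remote].add(local)
    return dict(graph)


graph = build_graph(sessions)
print(sorted(graph["leaf11"]))
```

In this sample, leaf11 ends up with a single neighbour, because its session to spine2 is not established and therefore never becomes an edge.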

The math graph gives us two important capabilities:

#4. How does the what-if analysis work?

When we do the modelling, we create temporary copies of the graphs, so that:

Possible combinations    Do we test that?
spine1                   YES
spine2                   YES
spine3                   YES
spine1, spine2           YES
spine1, spine3           YES
spine2, spine1           NO (duplicate of the 4th one)
spine2, spine3           YES
spine3, spine1           NO (duplicate of the 5th one)
spine3, spine2           NO (duplicate of the 7th one)
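The table above lists unordered combinations, which is what itertools.combinations from the Python standard library produces. The sketch below shows one possible way to enumerate the failure scenarios without duplicates; the function name is our own and it is not necessarily how the HAWK does it internally.

```python
from itertools import combinations


def failure_scenarios(nodes, max_failed):
    """Return every unordered set of 1..max_failed nodes, without duplicates."""
    return [
        combo
        for size in range(1, max_failed + 1)
        for combo in combinations(nodes, size)
    ]


for scenario in failure_scenarios(["spine1", "spine2", "spine3"], 2):
    print(", ".join(scenario))
```

For three spines and up to two failures this yields six scenarios, matching the six YES rows of the table.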

Once the analysis is done, you get an output showing whether there is any connectivity issue based on the defined rules. The output is delivered to the CLI and as an HTML report.

Where can I get the HAWK?

Fair question. By this time you might be bored with our explanations and want to see the HAWK. This project, like many of our other projects, is freely available on GitHub:

  1. Go to https://github.com/karneliuk-com.
  2. Clone the repo using git clone https://github.com/karneliuk-com.
  3. Start using (read the next part).

How can I use the HAWK/topology-analyzer?

#1. Inventory

As said earlier, we use NetBox as the inventory, hence you need to make sure you can access it. Here is a simple example of our topology.

In our training Automation with Nornir you can learn how to use NetBox for network provisioning.

As of now, the HAWK works only with Cumulus Linux: it will pull all the devices from NetBox, but will connect to and pull actual data only from the Cumulus-based switches. As you may have noticed, leaf11 doesn’t have a primary IP set, which is a potential issue. However, the HAWK checks whether there is an eth0 interface with any IP address assigned, so that the device can still be polled. Also, pay attention to the name of the data centre: it is called “NRN”, which corresponds to the slug nrn.
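The fallback from the primary IP to eth0 can be illustrated with a short sketch. The device dictionaries below are a simplified, hypothetical structure (not the NetBox API schema), and the function name is invented.

```python
def management_address(device):
    """Return the primary IP if set, else the first address on eth0, else None."""
    if device.get("primary_ip"):
        return device["primary_ip"].split("/")[0]
    for interface in device.get("interfaces", []):
        if interface["name"] == "eth0" and interface.get("addresses"):
            return interface["addresses"][0].split("/")[0]
    return None


# Hypothetical records: leaf11 has no primary IP, but eth0 carries an address.
leaf11 = {
    "name": "leaf11",
    "primary_ip": None,
    "interfaces": [{"name": "eth0", "addresses": ["192.168.100.185/24"]}],
}
spine1 = {"name": "spine1", "primary_ip": "192.168.100.181/24", "interfaces": []}

print(management_address(leaf11))
print(management_address(spine1))
```

A device for which the function returns None has no usable management address and would have to be skipped or reported.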

#2. Getting the HAWK ready

The preparation is very easy. First of all, make sure you have Python 3.7+ (earlier versions will probably work as well, but we haven’t tested that). Then install all the packages from the requirements.txt file as follows:


$ pip install -r requirements.txt

Once the modules are installed, modify the config.yml file:


$ cat config.yml
---
# Inventory details for HAWK
inventory:
  type: netbox
  parameters:
    url: http://192.168.1.70:8000
    token: 0123456789abcdef0123456789abcdef01234567
    site: nrn

# Logging details for HAWK
logging:
  enabled: True
  parameters:
    path: ./log/execution.log

# Output details for HAWK
output:
  type: local
  parameters:
    path: ./output

# Cache details for HAWK
cache:
  enabled: True
  parameters:
    path: ./.cache/raw_results.json
    path2: ./.cache/inventory.json

# Templates for HAWK
templates:
  parameters:
    path: ./templates

# Credentials type
credentials:
  type: any

# Device mapping table for builder
mapping:
  data_centre:
    leaf:
      - leaf
    spine:
      - spine
    border:
      - border
    aggregate:
      - aggregate
    dci:
      - dci-gw
  service_provider:
  enterprise:

# Command data
commands:
  path: ./bin/orders/all.json

# SSH parameters
ssh:
  timeout: 20
...

There are two important points (you can leave the other fields at their defaults):

  1. Provide accurate details for the NetBox connectivity (url and token). If you don’t want to put the token in a clear-text file, put it in the NB_TOKEN environment variable.
  2. Within the mapping part, map the NetBox roles to HAWK’s internal roles: the key name in mapping.data_centre.leaf is used internally in the HAWK, and the associated list [“leaf”] holds the name of the role within NetBox. If you have more than one leaf type, just list all of them.

In our Network Automation Training you can learn a lot of details about YAML and other data encodings.

The last point is to provide the credentials. You have two options:

  1. Set them as the environment variables HAWK_USER and HAWK_PASS (optionally also NB_TOKEN), so that you don’t need to provide them when you run the script.
  2. Provide them in the CLI when you are asked after you have launched the HAWK.

Last but not least, make sure that your switches are reachable:
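The two options above can be sketched as a simple lookup: environment variables first, interactive prompt as a fallback. The variable names HAWK_USER, HAWK_PASS and NB_TOKEN come from the article; the function names and the precedence of NB_TOKEN over the token in the configuration file are our assumptions.

```python
import os
from getpass import getpass


def resolve_credentials(env=os.environ, ask_user=input, ask_secret=getpass):
    """Prefer HAWK_USER/HAWK_PASS from the environment, otherwise prompt."""
    username = env.get("HAWK_USER") or ask_user("Username > ")
    password = env.get("HAWK_PASS") or ask_secret("Password > ")
    return username, password


def resolve_token(config, env=os.environ):
    """Assumed precedence: NB_TOKEN wins over the token in the configuration file."""
    return env.get("NB_TOKEN") or config["inventory"]["parameters"].get("token")


# Demo with a fake environment, so nothing is read from the real one.
fake_env = {"HAWK_USER": "cumulus", "HAWK_PASS": "secret", "NB_TOKEN": "abc"}
config = {"inventory": {"parameters": {"token": "from-file"}}}

print(resolve_credentials(env=fake_env)[0])
print(resolve_token(config, env=fake_env))
```

Passing the environment and the prompt functions as parameters keeps the logic easy to test without touching the real environment or the terminal.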


$ fping -4 -g 192.168.100.0/24 -q -s -a
192.168.100.1
192.168.100.181
192.168.100.182
192.168.100.185
192.168.100.186
192.168.100.187
192.168.100.188
192.168.100.189
192.168.100.190

Learn more about fping.

#3. Visualising the BGP topology

If all of the above was done properly, you can now use the HAWK to visualise your data centre as simply as:


$ ./topology_analyzer.py -d nrn
Please, provide the credentials for the network functions and NetBox token:
Username > cumulus
Password >

After a few seconds (depending on how big your DC is), the tool completes its job and you will have a new directory, output, containing a sub-directory named after the execution time:


$ ls output/
2021-04-07_21:49:02.855797
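A directory name of this shape can be produced with the standard datetime module. The sketch below is only an illustration of the naming scheme; the function name is invented, and the HAWK may build the name differently.

```python
from datetime import datetime
from pathlib import Path


def run_directory(base, moment):
    """Build a per-run path whose last component is a timestamp."""
    stamp = moment.isoformat(sep="_", timespec="microseconds")
    return Path(base) / stamp


print(run_directory("output", datetime(2021, 4, 7, 21, 49, 2, 855797)))
```

Keep in mind that colons are not allowed in directory names on Windows, so this naming scheme suits Linux and macOS hosts.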

Within that directory, you will find an HTML document with your topology. It is generated using pyvis, so it is interactive and you can explore it:

You can move the elements as much as you want to get a better view (you can also zoom the topology in and out). Once you hover over a node, you will see its details.

In case the BGP session is not active (e.g., the port is operationally down or the session is administratively shut down somewhere), the session is visualised in red with the corresponding state:

#4. What-if analysis

To run the analysis, you need to specify that operation and tell which nodes you would like to bring down. For example, you may want to analyse the failure of up to 2 spines (keeping in mind the example you see above, where one BGP session is not working):


$ ./topology_analyzer.py -d nrn -o analyze -ft spine -f 2
Please, provide the credentials for the network functions and NetBox token:
Username > cumulus
Password >

=======================================================================================================================================================================================================
Running the failure analysis for:    nrn
Amount of failed nodes up to:        2
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Results:
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Failed nodes:                        spine1
Connectivity check:                  PASS
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Failed nodes:                        spine2, spine1
Connectivity check:                  FAIL
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Failed nodes:                        spine2
Connectivity check:                  FAIL
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Elapsed time:                        0:00:00.000345
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
=======================================================================================================================================================================================================

Directly in the CLI you get a response telling whether the connectivity check has passed. What is the check, by the way? By default, the HAWK checks the connectivity between leafs and from leafs to exits, but that is configurable.

As with the drawing, a new sub-directory is created, where you have the detailed analysis:

From the detailed report you can see where the connectivity issues are.

You can also specify a particular node whose failure you would like to analyse:
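A what-if check of this kind can be sketched with a plain breadth-first search over a temporary copy of the graph. Everything below is an illustration: the topology is invented, the function names are ours, and the sketch simply leaves failed leafs out of the check, which is a simplification and may differ from how the HAWK treats them.

```python
from collections import deque

# Invented topology: adjacency map built from operational BGP sessions only.
GRAPH = {
    "leaf11": {"spine1"},
    "leaf12": {"spine1", "spine2"},
    "border1": {"spine1", "spine2"},
    "spine1": {"leaf11", "leaf12", "border1"},
    "spine2": {"leaf12", "border1"},
}
LEAFS = ["leaf11", "leaf12"]
EXITS = ["border1"]


def without(graph, failed):
    """Return a temporary copy of the graph with the failed nodes removed."""
    return {
        node: neighbours - set(failed)
        for node, neighbours in graph.items()
        if node not in failed
    }


def reachable(graph, start):
    """Breadth-first search: all nodes reachable from start."""
    seen, queue = {start}, deque([start])
    while queue:
        for neighbour in graph[queue.popleft()]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


def check(graph, failed, leafs, exits):
    """PASS if every surviving leaf reaches all surviving leafs and exits."""
    trial = without(graph, failed)
    alive_leafs = [leaf for leaf in leafs if leaf in trial]
    targets = set(alive_leafs) | {node for node in exits if node in trial}
    ok = all(targets <= reachable(trial, leaf) for leaf in alive_leafs)
    return "PASS" if ok else "FAIL"


for failed in (["spine2"], ["spine1"]):
    print(", ".join(failed), check(GRAPH, failed, LEAFS, EXITS))
```

In this invented topology leaf11 has a working session only to spine1, so losing spine2 is harmless, while losing spine1 isolates leaf11. The original graph is never modified, so each scenario starts from the same state.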


$ ./topology_analyzer.py -d nrn -o analyze -fn leaf11
Please, provide the credentials for the network functions and NetBox token:
Username > cumulus
Password >

=======================================================================================================================================================================================================
Running the failure analysis for:    nrn
Amount of failed nodes up to:        1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Results:
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Failed nodes:                        leaf11
Connectivity check:                  FAIL
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Elapsed time:                        0:00:00.000138
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
=======================================================================================================================================================================================================

Obviously, if leaf11 fails, it won’t be able to communicate :-), but this is still an important test:

#5. Which keys to use?

That’s quite simple, just ask the tool:


$ ./topology_analyzer.py -h
usage: OpenConfig Network Topology Grapher [-h] [-s] [-l] [-d DATACENTRE] [-f FAILED_NODES] [-ft FAILED_NODE_TYPES] [-fn FAILED_NODE_NAMES] [-ct CHECKED_NODE_TYPES] [-cn CHECKED_NODE_NAMES]
                                           [-o OPERATION] [-t TOPOLOGY]

This tool is polling the info from devices using gNMI using OpenConfig YANG modules and builds topologies.

optional arguments:
  -h, --help            show this help message and exit
  -s, --save            Cache the collected information.
  -l, --local           Use locally stored cache
  -d DATACENTRE, --datacentre DATACENTRE
                        Choose data centre
  -f FAILED_NODES, --failed_nodes FAILED_NODES
                        Number of failed nodes
  -ft FAILED_NODE_TYPES, --failed_node_types FAILED_NODE_TYPES
                        Type of the failed nodes to analyse. Allowed: leaf, spine, aggregate, border
  -fn FAILED_NODE_NAMES, --failed_node_names FAILED_NODE_NAMES
                        Name of the specific nodes, which shall be failed.
  -ct CHECKED_NODE_TYPES, --checked_node_types CHECKED_NODE_TYPES
                        Type of the nodes, which connections shall be checked during simulation. Allowed: leaf, spine, aggregate, border
  -cn CHECKED_NODE_NAMES, --checked_node_names CHECKED_NODE_NAMES
                        Number of specific nodes, which connections shall be checked during simulation.
  -o OPERATION, --operation OPERATION
                        Provide operation type. Allowed: analyze, draw
  -t TOPOLOGY, --topology TOPOLOGY
                        Provide topology type. Allowed: bgp-ipv6, bgp-evpn, lldp, bfd, bgp-ipv4

Some of those keys aren’t working yet (e.g., at the moment only the BGP topologies work), but other visualisations will be available soon.

Can I see how it works?

Subscribe to our YouTube channel. Very soon we will post there quite a few video demos of the HAWK operation.

Lessons Learned

For our team, there were three important insights:

Conclusion

Automation is not only about the configuration of devices or producing basic reports. It is also about modelling, testing and triggering automated tasks. In the near future we’ll create some videos showing the usage of the HAWK, so that you will have a better understanding of how to use it. After that we plan to add some more drivers (e.g., OpenConfig/gNMI). Stay connected, stay tuned. Take care and goodbye.

P.S.

If you have further questions or you need help with your networks, we are happy to assist you, just send us a message. Also don’t forget to share the article on your social media, if you like it.

BR,

Anton Karneliuk
