Symptoms

I manage a two-node VSX ClusterXL environment that hosts three firewalls (virtual systems).
A few days ago (or possibly earlier, without my noticing), one of them started to suffer from poor performance, outages and timeouts.

Failing the VS over to the other node fixed the problem for a while, and then it came back. That pattern points to a possible resource problem. I analyzed connections, CPU, memory, the ARP table and logs, but I didn’t see anything unusual, until I finally checked the output of the “fw ctl pstat” command:

[Expert@vsx1:5]# fw ctl pstat

Virtual System Capacity Summary:
Physical memory used:  19% (9373 MB out of 48154 MB) - below watermark
Kernel   memory used:   4% (1950 MB out of 48154 MB) - below watermark
Virtual  memory used:  99% (4014 MB out of 4014 MB) - above watermark
Used: 3470 MB by FW, 544 MB by zeco
Concurrent Connections: 45% (159336 out of 349900) - below watermark
Aggressive Aging is enabled, in detect mode
...

Once it runs out of “virtual memory”, the virtual system cannot process all the connections, which causes the problems described above. It also explains why failing over to the other node was only a temporary fix: it worked until the memory filled up again.
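
To spot this condition without reading the whole output, the percentage can be pulled out of the “Virtual memory used” line. This is only a minimal sketch: the function names virt_mem_pct and virt_mem_alarm and the default threshold of 90 are my own assumptions, and it only parses text in the format shown above.

```shell
#!/bin/sh
# virt_mem_pct: hypothetical helper. Reads "fw ctl pstat" output on stdin
# and prints the percentage from the "Virtual memory used" line.
virt_mem_pct() {
    sed -n 's/^Virtual *memory used: *\([0-9][0-9]*\)%.*/\1/p' | head -n 1
}

# virt_mem_alarm: succeeds (exit 0) when usage is at or above a threshold.
# The default threshold of 90 is an arbitrary assumption.
virt_mem_alarm() {
    threshold="${1:-90}"
    pct="$(virt_mem_pct)"
    [ -n "$pct" ] && [ "$pct" -ge "$threshold" ]
}
```

From the context of the affected VS, it would be used as “fw ctl pstat | virt_mem_pct”, which prints 99 for the output above.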

Solution for R77.30

Searching the knowledge base about the virtual memory, I found the following SK:
SK117914 – Kernel memory is 100% used when Application Control is enabled

It explains that the cause is an unlimited number of non-compliant HTTP responses, and it fixes the issue by configuring the following kernel parameters:

  • ws_ignore_http_resp_wo_req_error=0
  • ws_max_sessions_per_conn=100

Although my version is R80.10, I tried applying this solution anyway, without success (as expected, haha).

Solution for R80.10

Another SK I found:
SK106297 – “Virtual memory used” section of “fw ctl pstat” output on VSX is limited to 4014 MB

“FWK processes are 32-bit. Therefore, they are only able to work with up to 4GB of memory. 64-bit support for FWK processes in VSX Gateways is now available from Check Point R80.10.”

This can also be confirmed in the R80.10 – Release Notes:

“VSX Enhancements
– 64-bit support for VSX Gateways, increasing concurrent connections capacity.”

It seemed that my VSX was running in 32-bit mode, and that I could make the fwk processes run in 64-bit mode so that they could use a larger amount of memory.

Documentation

Below this section you can find the concrete procedure and commands I used. I based the procedure on the following links:
– R80.10 – Configuring 64-bit virtual system support
– Setting Gaia kernel edition from 32-bit to 64-bit

And in case you have a ClusterXL cluster:
– R77.X R80.X – Connectivity Upgrade Procedure

Procedure for a standalone VSX environment

During a maintenance window run the following command from the VS0 context:

vsx:0> vs_bits 64

“The VSX gateway will automatically run cpstop;cpstart.”

Once it has been applied, check the virtual system bit mode:

vsx:0> vs_bits -stat
All VSs are at 64 bits
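
If you want to script this check, a minimal sketch could look like the following. The function name all_vs_64bit is hypothetical, and it only looks for the exact success line shown above; I have not captured what “vs_bits -stat” prints in other states, so anything else is simply treated as “not confirmed”.

```shell
#!/bin/sh
# all_vs_64bit: hypothetical helper. Reads "vs_bits -stat" output on stdin
# and succeeds only if it contains the success line shown above.
all_vs_64bit() {
    grep -q '^All VSs are at 64 bits'
}
```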

And confirm that the virtual memory assigned to the virtual systems has increased significantly:

[Expert@vsx:5]# fw ctl pstat
Virtual System Capacity Summary:
Physical memory used:  15% (7585 MB out of 48154 MB) - below watermark
Kernel   memory used:   3% (1796 MB out of 48154 MB) - below watermark
Virtual  memory used:   2% (1563 MB out of 62922 MB) - below watermark
Used: 957 MB by FW, 544 MB by zeco
Concurrent Connections: 0% (8 out of 349900) - below watermark
Aggressive Aging is enabled, in detect mode
...
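
The number to compare before and after the change is the limit in the “Virtual memory used” line (4014 MB in 32-bit mode, 62922 MB here). As a rough sketch, it can be extracted like this; virt_mem_limit_mb and above_32bit_limit are names I made up, and the 4014 value is taken from the outputs and the SK quoted in this post.

```shell
#!/bin/sh
# virt_mem_limit_mb: hypothetical helper. Reads "fw ctl pstat" output on
# stdin and prints the "out of N MB" limit of the "Virtual memory used" line.
virt_mem_limit_mb() {
    sed -n 's/^Virtual *memory used:.*out of \([0-9][0-9]*\) MB.*/\1/p' | head -n 1
}

# above_32bit_limit: succeeds when the limit is larger than the 4014 MB
# ceiling seen in 32-bit mode.
above_32bit_limit() {
    limit="$(virt_mem_limit_mb)"
    [ -n "$limit" ] && [ "$limit" -gt 4014 ]
}
```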

Procedure for a 2 node clusterXL HA VSX environment

If you have more than 2 nodes, the steps are very close to the ones explained here. In any case, refer to the R77.X R80.X – Connectivity Upgrade Procedure.

This procedure should keep the connections up. Even so, carry out the following steps during a maintenance window:

Failover all instances to one of the nodes.

By running the command “clusterXL_admin down”, you can force a given VS or instance to fail over to the other node.

For example, if VS2 is active on the second node (vsx2), this command forces it to fail over to the first:

vsx2:2> clusterXL_admin down

Run as needed in your environment and ensure one of the nodes has no active instances:

[Expert@vsx2:0]# cphaprob state

Cluster Mode:   Virtual System Load Sharing

Number     Unique Address  Assigned Load   State

1 (local)  10.1.1.12       100%            Active
2          10.1.1.13       0%              Standby

Local member is in current state since Wed Nov 14 11:49:47 2018

Cluster name: VSX_CLUSTER

Virtual Devices Status on each Cluster Member
=============================================

 ID    | Weight| VSX1      | VSX2
       |       |           | [local]
-------+-------+-----------+-----------
 1     | 10    | Active    | Standby
 2     | 10    | Active    | Down
 5     | 10    | Active    | Standby
---------------+-----------+-----------
 Active        | 3         | 0
 Weight        | 30        | 0
 Weight (%)    | 100       | 0

Legend:  Init - Initializing, Active! - Active Attention
         Down! - ClusterXL Inactive or Virtual System is Down

Configure 64 bits in the standby node

Run the following command from the VS0 context of the standby node:

[Expert@vsx2:0]# vs_bits 64

Switching to 64 bits will restart all VSs, causing downtime. Note that if you are connected via ssh the connection will be lost.
Are you sure you wish to proceed? (y/n) [n] 
y

Switching to 64 bits

“The VSX gateway will automatically run cpstop;cpstart.”

Once it has been applied, check the virtual system bit mode:

vsx2:0> vs_bits -stat
All VSs are at 64 bits

And confirm that the virtual memory assigned to the virtual systems has increased significantly:

[Expert@vsx2:5]# fw ctl pstat
Virtual System Capacity Summary:
Physical memory used:  15% (7585 MB out of 48154 MB) - below watermark
Kernel   memory used:   3% (1796 MB out of 48154 MB) - below watermark
Virtual  memory used:   2% (1563 MB out of 62922 MB) - below watermark
Used: 957 MB by FW, 544 MB by zeco
Concurrent Connections: 0% (8 out of 349900) - below watermark
Aggressive Aging is enabled, in detect mode
...

Failover instances to the updated node

At this point the node that has been switched to 64 bits and the one that has not “don’t like each other”.
To fail over in an orderly manner while keeping the connections, the “cphacu” command should be used. Run the following command on the node that has not yet been switched to 64 bits:

[Expert@vsx1:0]# cphacu stat

Connectivity upgrade status: Not enabled since member is Active
===============================================================

The local member is now Active and handling the traffic
=======================================================
...

In this state we cannot force a failover yet: the connections would not be kept.
To force a connection sync, run “cphacu start” from the updated node:

[Expert@vsx2:0]# cphacu start

Starting Connectivity Upgrade...

Dynamic routes synchronization started...
=========================================
Finished Dynamic routes synchronization.
Note: It may take a few seconds for the routing table to get updated.


Performing Full Sync
====================
Performing Full Sync on VSID 0. This may take several minutes (depending on the number of connections); please wait...
Performing Full Sync on VSID 1. This may take several minutes (depending on the number of connections); please wait...
Performing Full Sync on VSID 2. This may take several minutes (depending on the number of connections); please wait...
Performing Full Sync on VSID 5. This may take several minutes (depending on the number of connections); please wait...

=========================================================================
Full Sync ended (Delta Sync is enabled)
For delayed connections (Templates) to be synchronized it is recommended to turn off SecureXL
on the old member before doing a failover. Run: 'fwaccel off -a' on the old member, on VS 0.
Please note: turning SecureXL off might slow down existing connections.
=========================================================================

Connectivity upgrade status: Enabled, ready for failover
========================================================

The peer member is handling the traffic
=======================================
Version of the local member: 3123
Version of the peer member : 3122

Connections table
=================
VS      HOST                  NAME                    ID #VALS #PEAK #SLINKS
0       localhost             connections           8158   196   241     333
1       localhost             connections           8158 17292 17424   61775
2       localhost             connections           8158 42155 42634  125856
5       localhost             connections           8158 166988 167123  667700

Note the “Connectivity upgrade status: Enabled, ready for failover”. Check the status again on the other node:

[Expert@vsx1:0]# cphacu stat

Connectivity upgrade status: Disabled
=====================================

The peer member is handling the traffic
=======================================
...
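
Since the failover should only be forced once the updated node reports that it is ready, that status line is worth checking explicitly. A minimal sketch, where cu_ready is a hypothetical function name and the matched string is the one printed by the updated node above:

```shell
#!/bin/sh
# cu_ready: hypothetical helper. Reads "cphacu stat" (or "cphacu start")
# output on stdin and succeeds only if the member reports that the
# connectivity upgrade is enabled and ready for failover.
cu_ready() {
    grep -q 'Connectivity upgrade status: Enabled, ready for failover'
}
```

It is meant to be fed the output taken on the updated node, for example “cphacu stat | cu_ready && echo ready”.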

Now stop SecureXL on the non-updated node (this is required to synchronize delayed connections):

[Expert@vsx1:0]# fwaccel off -a

Finally, force the failover by stopping the Check Point services:

[Expert@vsx1:0]# cpstop
cpwd_admin:
Process DASERVICE ctx=0 terminated
Stopping SmartView Monitor daemon ...
SmartView Monitor daemon is not running
Stopping SmartView Monitor kernel ...
SmartView Monitor kernel stopped
Stopping sessions database
...

The instances are now active on the updated node, and the connections have been kept.

Configure 64 bits in the non-updated node

Run the “vs_bits” command in context 0:

[Expert@vsx1:0]# vs_bits 64
Switching to 64 bits will restart all VSs, causing downtime. Note that if you are connected via ssh the connection will be lost.
Are you sure you wish to proceed? (y/n) [n]
y
Switching to 64 bits
Done. log may be found at /var/log/vs_bits.log

Once the process finishes, the system is in a stable state again.

[Expert@vsx1:0]# cphaprob state
Cluster Mode:   Virtual System Load Sharing

Number     Unique Address  Assigned Load   State

1 (local)  10.1.1.12       100%            Active
2          10.1.1.13       0%              Standby

Local member is in current state since Wed Nov 14 15:17:13 2018

Cluster name: VSX_CLUSTER

Virtual Devices Status on each Cluster Member
=============================================

 ID    | Weight| VSX1      | VSX2
       |       | [local]   |
-------+-------+-----------+-----------
 1     | 10    | Active    | Standby
 2     | 10    | Standby   | Active
 5     | 10    | Active    | Standby
---------------+-----------+-----------
 Active        | 2         | 1
 Weight        | 20        | 10
 Weight (%)    | 66        | 34

Legend:  Init - Initializing, Active! - Active Attention
         Down! - ClusterXL Inactive or Virtual System is Down

And the virtual memory is not full anymore!

[Expert@vsx1:0]# fw ctl pstat

Virtual System Capacity Summary:
Physical memory used: 19% (9253 MB out of 48154 MB) - below watermark
Kernel memory used: 3% (1891 MB out of 48154 MB) - below watermark
Virtual memory used: 5% (3280 MB out of 62922 MB) - below watermark
...

In conclusion, I cannot understand why 64-bit mode is not enabled by default when upgrading to R80.10. In any case, this behaviour changes in R80.20, where 64-bit mode is the only one available:
SK140332 – The “vs_bits” command does not work in R80.20