Collector Group Configurations

For Distributed SL1 Systems, a collector group is a group of Data Collectors. Data Collectors retrieve data from managed devices and applications. This collection occurs during initial discovery, during nightly updates, and in response to policies and Dynamic Applications defined for each managed device. The collected data is used to trigger events, display data in SL1, and generate graphs and reports.

Grouping multiple Data Collectors allows you to:

Create a load-balanced collection system, where you can manage more devices without loss of performance. At any given time, the Data Collector with the lightest load handles the next discovered device.
Create a redundant, high-availability system that minimizes downtime should a failure occur. If a Data Collector fails, another Data Collector is available to handle collection until the problem is solved.

Collector Groups

In a Distributed SL1 System, the Data Collectors and Message Collectors are organized as collector groups. Each monitored device is aligned with a collector group.

A Distributed SL1 system must have one or more collector groups configured. The Data Collectors included in a collector group must have the same hardware configuration.

A Distributed SL1 system could include collector groups configured using each of the possible configurations. For example, suppose an enterprise has a main data center that contains most of the devices monitored by the SL1 system. Suppose the enterprise also has a second data center where only a few devices are monitored by the SL1 system. The SL1 system might have two collector groups:

In the main data center, a collector group configured with high availability that contains multiple Data Collectors and Message Collectors.
In the second data center, a collector group that contains a single Data Collector that is also responsible for message collection.

Traditional and PhoneHome Collectors

SL1 supports two methods for communication between a Database Server (an SL1 Central Database or an SL1 Data Engine) and the SL1 Collectors:

Traditional
PhoneHome

In the Traditional method, the SL1 services on the Database Server initiate a new connection to the MariaDB port on the collector to read and write data. The connection request traverses the network, including the Internet if necessary, eventually reaching the collector. For this approach to work, the collector administrator must allow ingress communication from the Database Server on TCP port 7707, which is the MariaDB port on the collector. The communication is encrypted using SSL whenever possible.
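Because the Traditional method depends on the Database Server being able to open a TCP connection to port 7707 on the collector, a basic reachability check run from the Database Server side can help confirm the firewall rule. The following is a minimal sketch using only the Python standard library. The function name and the placeholder address are illustrative assumptions and are not part of SL1. A successful TCP connect shows only that the port is open; it does not verify MariaDB authentication or SSL.

```python
import socket


def tcp_port_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration; it is not an SL1 utility.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False


# Example call (192.0.2.10 is a documentation placeholder address):
# tcp_port_reachable("192.0.2.10", 7707)
```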

The benefit of the Traditional method is that communication to the Database Server is extremely limited, so the Database Server remains as secure as possible.

In the PhoneHome method, the collectors initiate an outbound connection to the Database Server over SSH. The connection requests originate from edge to core via TCP, using port 7705 by default.

After authenticating, the client forwards the local MariaDB port onto the Database Server using a loopback remote IP address. A corresponding SL1 appliance is added using the loopback IP. When the SL1 services on the database try to make a connection to the collector's MariaDB, they connect locally to the loopback IP address, in contrast to reaching out to the collector's IP or DNS name. The communication is encrypted.

The benefits of this method are that no ingress firewall rules need to be added, as the collector initiates an outbound connection, and no new TCP ports are opened on the network that contains the Data Collectors.

While you do not need to add any ingress firewall rules, a best practice is to add an egress firewall rule that allows SSH traffic from the collector on the server's port to either all available destination addresses on the Database Server or to the specific address on the Database Server that you know the collector will be able to reach. Starting with SL1 12.1.0, custom firewall rules must use the rich rules syntax and must be added to /etc/siteconfig/firewalld-rich-rules.siteconfig.

The PhoneHome configuration uses public key/private key authentication to maintain the security of the Database Server. Each Data Collector is aligned with an SSH account on the Database Server and uses SSH to communicate with the Database Server. Each SSH account on the Database Server is highly restricted, has no login access, and cannot access a shell or execute commands on the Database Server.

Using a Data Collector for Message Collection

To use a Data Collector for message collection, the Data Collector must be in a collector group that contains no other Data Collectors or Message Collectors.

NOTE: When a Data Collector is used for message collection, the Data Collector can handle fewer inbound messages than a dedicated Message Collector.
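The rule above can be expressed as a small validation check. This is a sketch under stated assumptions: the CollectorGroup class, its field names, and the error messages are hypothetical and do not correspond to an SL1 API or database schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CollectorGroup:
    """Hypothetical in-memory model of a collector group (not an SL1 object)."""
    name: str
    data_collectors: List[str] = field(default_factory=list)
    message_collectors: List[str] = field(default_factory=list)
    dc_collects_messages: bool = False  # Data Collector also handles messages


def message_collection_errors(group: CollectorGroup) -> List[str]:
    """Return rule violations for a group whose Data Collector collects messages."""
    errors = []
    if group.dc_collects_messages:
        if len(group.data_collectors) != 1:
            errors.append("group must contain exactly one Data Collector")
        if group.message_collectors:
            errors.append("group must not contain Message Collectors")
    return errors
```

For example, a group with one Data Collector and no Message Collectors returns an empty list, while adding a second Data Collector returns one violation.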

Using Multiple Data Collectors in a Collector Group

A collector group can include multiple Data Collectors to maximize the number of managed devices. In this configuration, the collector group is not configured for high availability.

In this configuration:

All Data Collectors in the collector group must have the same hardware configuration.
If you need to collect syslog and trap messages from the devices aligned with the collector group, you must include a Message Collector in the collector group. For a description of how a Message Collector can be added to a collector group, see the Using Message Collectors in a Collector Group section.
SL1 evenly distributes the devices monitored by a collector group among the Data Collectors in the collector group. Devices are distributed based on the amount of time it takes to collect data for the Dynamic Applications aligned to each device.
Component devices are distributed differently than physical devices; a component device is always aligned to the same Data Collector as its root device.
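To illustrate what distributing devices by collection time can look like, the sketch below uses a simple greedy strategy: devices are taken from slowest to fastest, and each one is assigned to the Data Collector with the smallest accumulated collection time. This is an assumption made for illustration. The source does not describe the algorithm SL1 actually uses, and the function and parameter names are hypothetical. Component devices, which follow their root device, are left out of this sketch.

```python
import heapq
from typing import Dict, List


def balance_by_collection_time(
    device_seconds: Dict[str, float], collectors: List[str]
) -> Dict[str, str]:
    """Greedily assign each device to the currently least-loaded collector.

    device_seconds maps a device name to the time (in seconds) needed to
    collect its Dynamic Application data. Illustrative only.
    """
    if not collectors:
        raise ValueError("at least one Data Collector is required")
    # Min-heap of (accumulated seconds, collector name).
    heap = [(0.0, name) for name in collectors]
    heapq.heapify(heap)
    assignment: Dict[str, str] = {}
    # Place the most expensive devices first; ties are broken by name.
    for device, seconds in sorted(
        device_seconds.items(), key=lambda kv: (-kv[1], kv[0])
    ):
        load, collector = heapq.heappop(heap)
        assignment[device] = collector
        heapq.heappush(heap, (load + seconds, collector))
    return assignment
```

With devices taking 30, 20, and 10 seconds and two Data Collectors, the 30-second device lands on one collector and the other two share the second, so each collector carries 30 seconds of work.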

NOTE: If you merge a component device with a physical device, the SL1 system allows data for the merged component device and data from the physical device to be collected on different Data Collectors. Data that was aligned with the component device is always collected on the Data Collector for its root device. If necessary, data aligned with the physical device can be collected on a different Data Collector.

How Collector Groups Handle Component Devices

Collector groups handle component devices differently than physical devices.

For physical devices (as opposed to component devices), after the SL1 system creates the device ID, the SL1 system distributes devices, round-robin, among the Data Collectors in the specified collector group.

Each component device must use the same Data Collector as its root device (the physical device that manages the component devices). SL1 therefore cannot distribute component devices among the Data Collectors in the specified collector group.
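The difference between the two device types can be sketched as follows: physical devices are handed out round-robin, and each component device inherits the Data Collector of its root device. The function name and data shapes are assumptions made for illustration and do not reflect SL1 internals.

```python
from itertools import cycle
from typing import Dict, List


def assign_collectors(
    physical_devices: List[str],
    component_roots: Dict[str, str],
    collectors: List[str],
) -> Dict[str, str]:
    """Round-robin physical devices; pin components to their root's collector.

    component_roots maps a component device name to its root device name.
    Illustrative only.
    """
    if not collectors:
        raise ValueError("at least one Data Collector is required")
    rotation = cycle(collectors)
    # Physical devices take the next collector in the rotation, in order.
    assignment = {device: next(rotation) for device in physical_devices}
    # Components are never distributed; they copy the root device's collector.
    for component, root in component_roots.items():
        assignment[component] = assignment[root]
    return assignment
```

For example, with three root devices and two Data Collectors, the first and third root devices share a collector, and every component under the first root device is assigned to that same collector.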


High Availability for Data Collectors

To configure a collector group for high availability, the collector group must include multiple Data Collectors.

In this configuration:

All Data Collectors in the collector group must have the same hardware configuration.
If you need to collect syslog and trap messages from the devices monitored by a high availability collector group, you must include a Message Collector in the collector group. For a description of how a Message Collector can be added to a collector group, see the Using Message Collectors in a Collector Group section.
On the Collector Group Management page (System > Settings > Collector Groups), each collector group that is configured for high availability must have the Collector Failover field set to On (Maximize Reliability) and a value in the Collectors Available for Failover field that specifies the minimum number of Data Collectors that must be available (i.e., with a status of "Available [0]") before a Data Collector failover can occur. For example:

For collector groups with only two Data Collectors, the Collectors Available for Failover field will contain the value "1 collector".
For collector groups with more than two Data Collectors, the Collectors Available for Failover field will contain values from a minimum of one-half of the total number of Data Collectors up to a maximum of one less than the total number of Data Collectors. For example, for a collector group with eight Data Collectors, the possible values in the Collectors Available for Failover field would be 4, 5, 6, and 7.
SL1 will never automatically increase the maximum number of Data Collectors that can fail in a collector group. For example, suppose you have a collector group with three Data Collectors and the Collectors Available for Failover field is set to "2". If you add a fourth Data Collector to the collector group, SL1 will automatically set the Collectors Available for Failover field to "3" to maintain the maximum number of Data Collectors that can fail as "one". However, you can override this automatic setting by manually changing the value in the Collectors Available for Failover field.
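The arithmetic behind the Collectors Available for Failover field can be worked through in a short sketch. Two assumptions are made here: for an odd number of Data Collectors, one-half is rounded up (the source gives only an even example), and the function names are hypothetical rather than part of SL1.

```python
import math
from typing import List


def failover_choices(total: int) -> List[int]:
    """Possible Collectors Available for Failover values for a group size."""
    if total < 2:
        raise ValueError("high availability requires multiple Data Collectors")
    if total == 2:
        return [1]
    # From one-half of the total (rounded up, an assumption) to total - 1.
    return list(range(math.ceil(total / 2), total))


def adjusted_after_growth(old_total: int, old_value: int, new_total: int) -> int:
    """New field value that keeps the maximum number of failures unchanged."""
    max_failures = old_total - old_value
    return new_total - max_failures


print(failover_choices(8))             # [4, 5, 6, 7]
print(adjusted_after_growth(3, 2, 4))  # 3
```

The second call mirrors the example above: a group of three Data Collectors with a value of 2 tolerates one failure, so growing the group to four raises the value to 3.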

If you set the Collectors Available for Failover field to a value equal to half of your Data Collectors, and an outage takes down 50% of the Data Collectors and then one more of the remaining Data Collectors goes down, no rebalance will occur. If you instead specify a value equal to one-third of the total number of Data Collectors, a rebalance will be attempted until your overall capacity falls below one-third of your Data Collectors. This maximizes your resiliency while minimizing the opportunity for your system to enter an unproductive rebalancing loop.

If the number of available Data Collectors is less than the value in the Collectors Available for Failover field, SL1 will not fail over within the collector group. SL1 will not collect any data from the devices aligned with the failed Data Collector(s) until the failure is fixed on enough Data Collector(s) to equal the value in the Collectors Available for Failover field. SL1 will generate a critical event.
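The threshold check described above reduces to a single comparison. The sketch below is illustrative: the names are hypothetical, and it models only the critical event mentioned in this section (any events SL1 may raise in other situations are not modeled).

```python
from typing import NamedTuple, Optional


class FailoverDecision(NamedTuple):
    fail_over: bool
    event_severity: Optional[str]  # None means no event is modeled here


def decide_failover(available: int, required: int) -> FailoverDecision:
    """Compare available Data Collectors with the configured field value."""
    if available < required:
        # Too few collectors remain: no failover, and a critical event.
        return FailoverDecision(False, "critical")
    return FailoverDecision(True, None)


# Eight Data Collectors with the field set to 4:
print(decide_failover(available=4, required=4).fail_over)  # True
print(decide_failover(available=3, required=4).fail_over)  # False
```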

Using Message Collectors in a Collector Group

If you need to collect syslog and trap messages from the devices monitored by a collector group that includes multiple Data Collectors, you must include a Message Collector in the collector group.

If your monitored devices generate a large amount of syslog and trap messages, a collector group can include multiple Message Collectors.

In this configuration, a monitored device can send syslog and trap messages to either Message Collector.

NOTE: Each syslog and trap message should be sent to only one Message Collector.

A third-party load-balancing solution can be used to distribute syslog and trap messages evenly among the Message Collectors in a collector group.

NOTE: ScienceLogic does not recommend a specific product for this purpose and does not provide technical support for configuring or maintaining a third-party load-balancing solution.
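As a purely conceptual illustration of how a load balancer can satisfy the rule that each message goes to only one Message Collector, the sketch below hashes the sending device's IP address to choose a single destination. This is one possible strategy, not a description of any particular product, and the function name is hypothetical. Hashing spreads sources roughly evenly across collectors but does not guarantee an equal message count on each.

```python
import hashlib
from typing import List


def pick_message_collector(source_ip: str, message_collectors: List[str]) -> str:
    """Map a source IP to exactly one Message Collector, deterministically."""
    if not message_collectors:
        raise ValueError("at least one Message Collector is required")
    # A stable hash keeps the mapping identical across runs and processes.
    digest = hashlib.sha256(source_ip.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(message_collectors)
    return message_collectors[index]
```

Because the choice depends only on the source address and the collector list, all messages from one device reach the same Message Collector until the list changes.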

One or more Message Collectors can be included in multiple collector groups.

In this configuration, each managed device in collector group A and collector group B must use a unique IP address when sending syslog and trap messages. The IP address used to send syslog and trap messages is called the primary IP. For example, if a device monitored by collector group A and a device monitored by collector group B use the same primary IP address for data collection, one of the two devices must be configured to use a different IP address when sending syslog and trap messages.
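A check for the uniqueness requirement can be sketched as a grouping by primary IP. The tuple layout and function name are assumptions for illustration; in practice the device inventory would come from your own export of SL1 data.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def find_primary_ip_conflicts(
    devices: List[Tuple[str, str, str]]
) -> Dict[str, List[Tuple[str, str]]]:
    """Return primary IPs used by more than one device.

    Each input tuple is (device_name, collector_group, primary_ip), covering
    the devices in collector groups that share a Message Collector.
    """
    by_ip: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for name, group, primary_ip in devices:
        by_ip[primary_ip].append((name, group))
    # Keep only addresses claimed by two or more devices.
    return {ip: owners for ip, owners in by_ip.items() if len(owners) > 1}
```

Any address in the result identifies devices where one of them must be configured to send syslog and trap messages from a different IP address.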

A collector group can have multiple Message Collectors that are also included in other collector groups. It is possible to include every Message Collector in your SL1 System in every collector group in your SL1 System.