Network Logging Requirements
Table of Contents
Goals
- To effectively manage existing and future network platforms logs (and related elements).
- To ensure expedient and reliable transport and storage of emitted logs
- To detect, respond to, and restore services due to faults and failures
- To expedite the correlation and root cause analysis of faults across the IT footprint
- To diminish MTTR->0 and increase MTBF->∞
- To base decision making upon empirical evidence.
- To enact and enable continual Service Improvement.
Background
By virtue of providing a network fabric (thus a service platform to the whole business) it is incumbent upon operations and engineering teams to manage all network resources not just individually, but as an aggregate, and as effectively as possible to ensure a reliable, predictable, and efficient service is provided. This entails standard network and security logging from elements for a wide array of events, triggers, and informational aspects. A key part of operations is situational awareness and without proper instrumentation and logging from devices, one might as well be flying blind.
In-Scope
Organsation_Name
(NEMS) including but not limited to:
ID |
Description |
Importance (Mandatory/Desirable) |
Rationale or Comments (if applicable) |
SC01 |
Cisco routers, switches |
Mandatory |
Existing footprint |
SC02 |
Cisco Call Managers, UC servers and Gatekeepers/CUBE |
Mandatory |
Existing footprint |
SC03 |
Cisco UCS servers, UCS Manager (and Fabric Interconnects) |
Mandatory |
Existing footprint |
SC04 |
Palo Alto firewalls (both hardware and software instantiations) |
Mandatory |
Existing footprint |
SC05 |
Opengear Terminal Servers |
Mandatory |
Existing footprint |
SC06 |
Cisco Wireless Controllers |
Mandatory |
Existing footprint |
SC07 |
Cisco Distributed vSwitch |
Mandatory |
Existing footprint |
SC08 |
Cisco Prime |
Mandatory |
Existing footprint |
SC09 |
Infoblox Appliances (DNS/DHCP/IPAM) |
Mandatory |
Existing footprint |
SC10 |
APC UPS and PDUs |
Mandatory |
Existing footprint |
SC11 |
Infoblox Appliances |
Mandatory |
Existing footprint |
SC12 |
Nimble Storage |
Mandatory |
Existing footprint |
Out-of-Scope
- Amazon VPC (vGW, iGW)
- Teleworkers Cisco 881
- Anything not explicitly listed ”in-scope”.
Assumptions
- All network elements will be reachable from their management control plane interfaces (either in-band via loopbacks or dedicated management interfaces)
- All network elements will provide a reachable layer 3 interface across a management zone or primary data path from a centralized sensor per region (AMER, EMEA, APAC)
- All network elements will require their CPU, memory, storage, and environmental thresholds to be logged when exceeded to generate alerts based upon said pre-defined thresholds
- Fundamental/basic monitoring and trending is required across all network assets to establish normal patterns and to help facilitate capacity management
- The logging platform will run on linux or a *nix derivative
- The logging platform and sensors may be run in a VMWare virtualised environment
- regional collectors are expected to be available.
Note: The following requirements will focus primarily upon the format, content, style, configuration, and handling of emitted logs. The network team will emit all logs (where capable) as RFC5424 syslog compatible formats with Facility Code 23 (local7) and Severity Code 5 (notifications).
Note: The header of the Syslog message contains “priority”, “version”, “timestamp”, “hostname”, “application”, “process id”, and “message id”. It is followed by structured-data, which contains data blocks in the “key=value” format enclosed in square brackets “[]”.
Architectural Requirements
ID |
Description |
Importance (Mandatory/Desirable) |
Rationale or Comments (if applicable) |
Meets |
AR01 |
There shall be collectors/probes in each major geographical region (e.g. AMER, EMEA and APAC). |
Mandatory |
* To minimise latency * For fault tolerance * To minimise inter-regional traffic and logging |
|
AR02 |
The solution shall ingest logs on UDP port 514 |
Mandatory |
* Standard connectionless UDP syslog |
|
AR03 |
The solution shall ingest logs on TCP port 514 |
Mandatory |
* Standard connection-orientated TCP syslog |
|
AR04 |
Elements shall only log to regional ingestion collectors (until cross-regional de-duplication is available as standard). |
Desirable |
* UDP syslog may not get recorded if a local regional collector is down |
|
AR05 |
The solution will expect logs in UTC time thus all network elements will be configured to log based upon UTC. |
Mandatory |
* Standardisation of global logging to allow for simple correlation rather than parsing timezones and manipulating data |
|
AR06 |
The solution will support and expect millisecond timestamps from network elements |
Mandatory |
* Standardisation and requirement for complex interdependencies at machine times (pre/post event dependencies) |
|
AR07 |
The solution will provide plain text search. |
Mandatory |
|
|
AR08 |
The solution shall allow logs to be split and tagged based upon delimiters and regex. |
Mandatory |
|
|
AR09 |
The solution shall be able to ingest syslog on non-standard ports and filter said logs to different groupings or apply tags. |
Desirable |
|
|
Functional Requirements (The ‘What’)
Note: Network Element Logging Requirements
ID |
Description |
Importance (Mandatory/Desirable) |
Rationale or Comments (if applicable) |
Meets |
FR01 |
The logging device shall emit logs with facility Local 7. |
Mandatory |
* This will allow for rapid grouping of NEMs and network operational logs |
|
FR02 |
The logging device shall emit logs with severity 5 (Notification). |
Mandatory |
* This will allow for rapid grouping of NEMs and network operational logs at a severity that provides operational intelligence without too much verbosity |
|
FR03 |
The logging device shall emit logs with millisecond timestamps. |
Mandatory |
* This will allow for immediate and globalised correlation |
|
FR04 |
The logging device shall emit logs for all authentication and authorisation events. |
Mandatory |
* To gain insight in to operators and agents interacting with elements for operational and security purposes (including audit logs) |
|
FR05 |
The logging device shall use its management loopback to source syslog and traps from. |
Mandatory |
* To utilise an always up and consistent source IP address to instrument on the management plane from (even if traversing in-band) |
|
FR06 |
The logging device shall log state of IGP and EGP neighbour state changes. |
Mandatory |
* This allows for distributed state to be tracked in relation to routing events and operational troubleshooting |
|
Reporting Requirements
ID |
Description |
Importance (Mandatory/Desirable) |
Rationale or Comments (if applicable) |
Meets |
RP01 |
The solution shall provide automated and scheduled reporting. |
Mandatory |
* Automation and Continual Service Improvement |
|
RP02 |
The solution shall provide manual and on-demand reporting. |
Mandatory |
* Ad-hoc reporting and configuration of ‘one time’ reports. |
|
RP03 |
The solution shall provide active and visually queried graphing of metrics. |
Mandatory |
* GUI and mouse driven selection of graph subsections which result in new queries and results/graphs. |
|
RP04 |
The solution shall provide for user configurable dashboards. |
Mandatory |
* Custom dashboards for per user, team, and departmental views. |
|
Security Requirements
ID |
Description |
Importance (Mandatory/Desirable) |
Rationale or Comments (if applicable) |
Meets |
SEC01 |
The solution shall provide web based access via TLS based HTTPS for administration/reporting. |
Mandatory |
* Basic security requirements. |
|
SEC02 |
The solution shall support armored data ingestion (e.g. https, SSH, TLS etc) |
Mandatory |
* Basic security requirements. |
|
SEC03 |
The solution shall support strong ciphers for symmetric encryption. |
Desirable |
* Basic security requirements. |
|
SEC04 |
The solution shall support an internal audit log for AAA(Authentication, Authorization, and Accounting) which can be audited by administrators. |
Mandatory |
* Basic auditing and least access principle. |
|
SEC05 |
The solution shall support external authentication methods such as Radius (and/or Active Directory or SAML based authentication) for administration. |
Desirable |
* To support AAA. |
|
SEC06 |
The solution shall provide for 2FA administrator and operator access. |
Desirable |
* Basic 2FA integration preferably via DUO. |
|
SEC07 |
The solution shall support encrypted data storage (Data at Rest). |
Mandatory |
* Basic Security (Data at Rest) |
|
SEC08 |
The solution shall support encrypted channels for instrumentation and telemetry shipping from collectors to analyzers/reporters (Data in Transit). |
Mandatory |
* Basic Security (Data in Transit across untrusted/semi-trusted) |
|
Integration Requirements
ID |
Description |
Importance (Mandatory/Desirable) |
Rationale or Comments (if applicable) |
Meets |
GR01 |
The solution shall provide a rich RESTful API to facilitate automation and custom event actions as part of wider workflows. |
Mandatory |
* Basic API integrations and leveraging other systems and triggers. |
|
GR02 |
The solution shall provide integration with PagerDuty. |
Desirable |
* 3rd party ecosystem for operations, alerting, and escalations |
|
GR03 |
The solution shall provide integration with Slack. |
Desirable |
* 3rd party ecosystem for operations, alerting, and escalations (suppress alerts, update status). |
|
GR04 |
The solution shall provide integration with DataDog. |
Desirable |
* 3rd party ecosystem for operations, alerting, and escalations |
|
GR05 |
The solution shall provide integration with Cisco Prime. |
Desirable |
* Basic monitoring of other management platforms. |
|
Use Cases (‘Stories’)
Network Operations Uses
ID |
Description |
Importance (Mandatory/Desirable) |
Rationale or Comments (if applicable) |
Meets |
ST01 |
Network Operations investigate a WAN or service outage and are trying to determine how often an interface or interfaces were flapping over a time period |
Mandatory |
* Exact times of interface UP/DOWN |
|
ST02 |
Network Operations investigate a WAN or service outage and are trying to correlate to other events, faults, or outages on other platforms or endpoints that may provide root cause or additional context to service impact(s) |
Mandatory |
* Heterogenous systems are interconnected and can become force multipliers exacerbating issues or indeed even causing them in the first place |
|
ST03 |
Network Operations general troubleshooting of routing state changes to help isolate and determine faults or distributed issues in the RIB (Routing Information Base) |
Mandatory |
* Standard operations |
|
ST04 |
Network Operations can validate changes made by a specific operator via planned Change Management (or unplanned). |
Mandatory |
* Standard operations |
|
ST05 |
Network Operations (ENVMON) environmental and hardware platform issues via syslog that are not available via SNMP. |
Mandatory |
* Standard operations |
|
ST06 |
Pre-Network Support : AirSupport (level 1) visibility of user device authentication issues (by 802.1X, Radius, etc) |
Mandatory |
* Standard operations |
|
ST07 |
Pre-Network Support : AirSupport (level 1) custom dashboards for certain event frequencies and volumes that can be presented in a simple visual representation. |
Mandatory |
* Data visualisation |
|
ST08 |
Network Operations : Escalation of performance issues in relation to specific event types detected/matched (via syslog regex?). |
Mandatory |
* Standard operations |
|
ST09 |
Network Operations: Ability to define the ‘business logic’ or ‘operational logic’ for correlating different types of events based upon different scenarios or thresholds of multiple events, which themselves trigger events and alerts. |
Desirable |
* Advanced operations |
|
ST10 |
Network Operations: Alerting on low light from transceivers (via syslog regex?). |
Mandatory |
* Standard operations |
|
ST11 |
Network Operations: Alerting when certain user types/IDs/names log in (via syslog regex?). |
Mandatory |
* Standard operations |
|
ST12 |
Network Operations: Segregation, separation, grouping, or tagging of logs based upon FACILITY. |
Mandatory |
* Standard syslog filtering and local redirection |
|
ST13 |
Network Operations: Segregation or grouping of device types based upon: Local3 -> FW/Security, Local4 -> IAM/DNS/DHCP, Local5 -> ENV (Power/HVAC), Local6 -> LocalWireless LAN Controllers, Local7 -> Routers/Switches (via some manner e.g. FACILITY, SRC IP, hostname etc) |
Mandatory |
* Standard operations |
|
ST14 |
Network Operations: Ability to remap FACILITY based upon some arbitrary data such as received PORT, SRC IP, hostnames etc. |
Desirable |
* Advanced operations |
|