The Incident Response (IR) Lifecycle
- Preparation: “Establishing an incident response capability so that the organization is ready to respond to incidents.”
- Process for handling incidents.
- Handler communications and facilities.
- Incident analysis hardware and software.
- Internal documentation (port lists, asset lists, network diagrams, current baselines of network traffic).
- Identifying training needs.
- Evaluating infrastructure through proactive scanning, network monitoring, vulnerability assessments, and risk assessments.
- Subscribing to third-party threat intelligence services.
- Detection and Analysis
- Alerts [endpoint protection, network security monitoring, host monitoring, account creation, privilege escalation, other indicators of compromise, SIEM, security analytics (baseline and anomaly detection), and user behavior analytics].
- Validate alerts (reducing false positives) and escalate as needed.
- Estimate the scope of the incident.
- Assign an Incident Manager who will coordinate further actions.
- Designate a person who will communicate the incident containment and recovery status to senior management.
- Build a timeline of the attack (a simple event-ordering sketch follows this outline).
- Determine the extent of the potential data loss.
- Notification and coordination activities.
- Containment, Eradication and Recovery
- Containment: Taking systems offline. Considerations for data loss versus service availability. Ensuring systems don’t destroy themselves upon detection.
- Eradication and Recovery: Clean up compromised devices and restore systems to normal operation. Confirm systems are functioning properly. Deploy controls to prevent similar incidents.
- Documenting the incident and gathering evidence (chain of custody).
- Post-mortem
- What could have been done better?
- Could the attack have been detected sooner?
- What additional data would have been helpful to isolate the attack faster?
- Does the IR process need to change? If so, how?
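Building the attack timeline is largely an exercise in normalizing events from many sources and ordering them by time. The sketch below is a minimal illustration in Python; the field names (`time`, `source`, `message`) and the sample events are hypothetical, so adapt them to your actual log schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class TimelineEntry:
    timestamp: datetime
    source: str        # e.g. "SIEM", "endpoint", "cloud management plane"
    description: str

def build_timeline(raw_events):
    """Normalize raw event dicts into a single, time-ordered attack timeline.

    Assumes each dict has hypothetical 'time' (ISO 8601), 'source', and
    'message' keys; real events will need per-source parsing.
    """
    entries = []
    for event in raw_events:
        ts = datetime.fromisoformat(event["time"]).astimezone(timezone.utc)
        entries.append(TimelineEntry(ts, event["source"], event["message"]))
    return sorted(entries, key=lambda e: e.timestamp)

if __name__ == "__main__":
    # Hypothetical sample events from two different sources.
    events = [
        {"time": "2024-05-01T10:22:03+00:00", "source": "SIEM",
         "message": "Privilege escalation alert on host web-01"},
        {"time": "2024-05-01T09:57:40+00:00", "source": "management plane",
         "message": "New access key created for user deploy-svc"},
    ]
    for entry in build_timeline(events):
        print(entry.timestamp.isoformat(), entry.source, entry.description, sep=" | ")
```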
How the Cloud Impacts IR
Each of the phases of the lifecycle is affected to different degrees by a cloud deployment. Some of these are similar to any incident response in an outsourced environment where you need to coordinate with a third party. Other differences are more specific to the abstracted and automated nature of cloud.
Preparation
When preparing for cloud incident response, here are some major considerations:
- SLAs and Governance: Any incident involving a public cloud or hosted provider requires an understanding of service level agreements (SLAs), and likely coordination with the cloud provider. Keep in mind that, depending on your relationship with the provider, you may not have direct points of contact and might be limited to whatever is offered through standard support. A custom private cloud in a third-party data center will have a very different relationship than signing up through a website and clicking through a license agreement for a new SaaS application.
- IaaS/PaaS vs. SaaS: In a multitenant environment, how can data specific to your cloud be provided for investigation? For each major service you should understand and document what data and logs will be available in an incident. Don’t assume you can contact a provider after the fact and collect data that isn’t normally available.
- “Cloud jump kit:” These are the tools needed to investigate in a remote location (as with cloud-based resources). For example, do you have tools to collect logs and metadata from the cloud platform? Do you have the ability to interpret the information? How do you obtain images of running virtual machines and what kind of data do you have access to: disk storage or volatile memory? (A minimal jump-kit sketch follows this list.)
- Architect the cloud environment for faster detection, investigation, and response (containment and recoverability). This means ensuring you have the proper configuration and architecture to support incident response.
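As an illustration of what a jump-kit script might contain, the following sketch uses Python and boto3, assuming an AWS deployment (other providers have equivalent SDKs). It pulls recent management-plane events that reference a suspect instance and snapshots its volumes for offline analysis. The instance ID and lookback window are placeholders.

```python
import boto3
from datetime import datetime, timedelta, timezone

# Placeholder identifiers -- substitute the suspect resource and time window.
INSTANCE_ID = "i-0123456789abcdef0"
LOOKBACK = timedelta(hours=24)

ec2 = boto3.client("ec2")
cloudtrail = boto3.client("cloudtrail")

def collect_management_plane_events(instance_id, lookback):
    """Pull recent CloudTrail events that reference the suspect instance."""
    end = datetime.now(timezone.utc)
    return cloudtrail.lookup_events(
        LookupAttributes=[{"AttributeKey": "ResourceName",
                           "AttributeValue": instance_id}],
        StartTime=end - lookback,
        EndTime=end,
    )["Events"]

def snapshot_instance_volumes(instance_id):
    """Snapshot every EBS volume attached to the instance for offline forensics.

    Note: snapshots capture disk state only; volatile memory requires a
    separate acquisition approach.
    """
    snapshot_ids = []
    reservations = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"]
    for reservation in reservations:
        for instance in reservation["Instances"]:
            for mapping in instance.get("BlockDeviceMappings", []):
                ebs = mapping.get("Ebs")
                if not ebs:
                    continue  # skip non-EBS (instance-store) volumes
                snap = ec2.create_snapshot(
                    VolumeId=ebs["VolumeId"],
                    Description=f"IR evidence for {instance_id}",
                )
                snapshot_ids.append(snap["SnapshotId"])
    return snapshot_ids

if __name__ == "__main__":
    for event in collect_management_plane_events(INSTANCE_ID, LOOKBACK):
        print(event["EventTime"], event["EventName"])
    print("Evidence snapshots:", snapshot_instance_volumes(INSTANCE_ID))
```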
Detection and Analysis
Data sources for cloud incidents can be quite different from those used in incident response for traditional computing. There is significant overlap, such as system logs, but there are differences in terms of how data can be collected and in terms of new sources, such as feeds from the cloud management plane.
Forensics and investigative support will also need to adapt, beyond understanding changes to data sources.
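For example, a detection pipeline might ingest the management-plane audit feed alongside traditional system logs and flag account-, network-, and logging-related events that often indicate compromise. The sketch below scans CloudTrail-style JSON records; the event names and fields shown are those commonly seen in AWS CloudTrail, and the input file name is a placeholder, so adjust both for your provider's audit log format.

```python
import json

# Management-plane event names that commonly warrant analyst review.
# Adjust this list for your provider's audit-log vocabulary.
SUSPICIOUS_EVENTS = {
    "CreateUser", "CreateAccessKey", "AttachUserPolicy",
    "AuthorizeSecurityGroupIngress", "StopLogging", "DeleteTrail",
}

def flag_management_plane_events(log_file):
    """Yield audit records worth investigating from a CloudTrail-style JSON file."""
    with open(log_file) as handle:
        records = json.load(handle).get("Records", [])
    for record in records:
        name = record.get("eventName", "")
        if name in SUSPICIOUS_EVENTS:
            yield record
        # Console logins without MFA are another common indicator.
        elif name == "ConsoleLogin":
            extra = record.get("additionalEventData", {})
            if extra.get("MFAUsed") != "Yes":
                yield record

if __name__ == "__main__":
    # Placeholder file name for a delivered audit-log batch.
    for record in flag_management_plane_events("cloudtrail-delivery.json"):
        print(record.get("eventTime"), record.get("eventName"),
              record.get("sourceIPAddress"))
```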
Containment, Eradication and Recovery
Always start by ensuring the cloud management plane/metastructure is free of an attacker. This will often involve invoking break-glass procedures to access the root or master credentials for the cloud account, in order to ensure that attacker activity isn’t being masked or hidden from lower-level administrator accounts. Remember: You can’t contain an attack if the attacker is still in the management plane. Attacks on cloud assets, such as virtual machines, may sometimes reveal management plane credentials that are then used to bridge into a wider, more serious attack.
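Part of confirming the management plane is clean, once break-glass access is obtained, is enumerating identities and credentials and comparing them against a known-good inventory maintained out of band. The sketch below is a minimal illustration using boto3, assuming an AWS account; the known-good user list is a placeholder you would maintain yourself.

```python
import boto3

# Placeholder: the identities you expect to exist, maintained out of band.
KNOWN_GOOD_USERS = {"alice", "bob", "deploy-svc"}

iam = boto3.client("iam")

def audit_iam_principals(known_good):
    """Report IAM users (and their access keys) not present in the known-good set."""
    unexpected = []
    paginator = iam.get_paginator("list_users")
    for page in paginator.paginate():
        for user in page["Users"]:
            name = user["UserName"]
            if name not in known_good:
                keys = iam.list_access_keys(UserName=name)["AccessKeyMetadata"]
                unexpected.append((name, [k["AccessKeyId"] for k in keys]))
    return unexpected

if __name__ == "__main__":
    for name, keys in audit_iam_principals(KNOWN_GOOD_USERS):
        print(f"Unexpected IAM user: {name} (access keys: {keys})")
```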
The cloud often provides a lot more flexibility in this phase of the response, especially for IaaS.
Software-defined infrastructure allows you to quickly rebuild from scratch in a clean environment, and, for more isolated attacks, inherent cloud characteristics—such as auto-scale groups, API calls for changing virtual network or machine configurations, and snapshots—can speed quarantine, eradication, and recovery processes. For example, on many platforms you can instantly quarantine virtual machines by moving the instance out of the auto-scale group, isolating it with virtual firewalls, and replacing it.
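A quarantine of this kind can be scripted. The sketch below, assuming an AWS IaaS deployment and boto3, detaches a suspect instance from its auto-scale group (letting the group launch a clean replacement), swaps its security groups for a pre-built isolation group that allows only investigator access, and tags it as evidence. The instance, group, and security group identifiers are placeholders.

```python
import boto3

# Placeholders -- substitute the suspect instance, its auto-scale group, and a
# pre-built isolation security group (e.g. one allowing only investigator SSH).
INSTANCE_ID = "i-0123456789abcdef0"
AUTOSCALE_GROUP = "web-tier-asg"
ISOLATION_SG_ID = "sg-0abc1234def567890"

ec2 = boto3.client("ec2")
autoscaling = boto3.client("autoscaling")

def quarantine_instance(instance_id, asg_name, isolation_sg_id):
    """Isolate a suspect instance while its auto-scale group launches a replacement."""
    # 1. Detach from the auto-scale group; leaving desired capacity unchanged
    #    prompts the group to launch a clean replacement instance.
    autoscaling.detach_instances(
        InstanceIds=[instance_id],
        AutoScalingGroupName=asg_name,
        ShouldDecrementDesiredCapacity=False,
    )
    # 2. Swap all security groups for the isolation group so the instance can no
    #    longer reach production, but investigators can still reach it.
    ec2.modify_instance_attribute(InstanceId=instance_id, Groups=[isolation_sg_id])
    # 3. Tag the instance so it is clearly marked as evidence under investigation.
    #    Volume snapshots (as in the jump-kit sketch above) would follow.
    ec2.create_tags(
        Resources=[instance_id],
        Tags=[{"Key": "ir-status", "Value": "quarantined"}],
    )

if __name__ == "__main__":
    quarantine_instance(INSTANCE_ID, AUTOSCALE_GROUP, ISOLATION_SG_ID)
```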
This also means there’s no need to immediately “eradicate” the attacker before you identify their exploit mechanisms and the scope of the breach, since the new infrastructure/instances are clean; instead, you can simply isolate them. However, you still need to ensure the exploit path is closed and can’t be used to infiltrate other production assets. If there is concern that the management plane is breached, be sure to confirm that the templates or configurations for new infrastructure/applications have not been compromised.
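One lightweight way to gain confidence that infrastructure templates have not been tampered with is to compare their current hashes against hashes recorded before the incident and stored somewhere the attacker cannot reach. This is a generic sketch using only the Python standard library; the template directory and hash manifest paths are placeholders.

```python
import hashlib
import json
from pathlib import Path

# Placeholder: a manifest of {relative_path: sha256} captured before the
# incident and stored out of band.
KNOWN_GOOD_MANIFEST = "template-hashes.json"
TEMPLATE_DIR = Path("infrastructure/templates")

def verify_templates(template_dir, manifest_path):
    """Report templates whose current hash differs from the known-good manifest."""
    with open(manifest_path) as handle:
        known_good = json.load(handle)
    tampered = []
    for relative_path, expected in known_good.items():
        current = hashlib.sha256((template_dir / relative_path).read_bytes()).hexdigest()
        if current != expected:
            tampered.append(relative_path)
    return tampered

if __name__ == "__main__":
    for path in verify_templates(TEMPLATE_DIR, KNOWN_GOOD_MANIFEST):
        print(f"Template may be compromised: {path}")
```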
That said, these capabilities are not always universal: With SaaS and some PaaS you may be very limited and will thus need to rely more on the cloud provider.
Post-mortem
As with any attack, work with the internal response team and the provider to figure out what worked and what didn’t, then pinpoint any areas for improvement. Pay particular attention to the limitations in the data collected and figure out how to address the issues moving forward.
It is hard to change SLAs, but if the agreed-upon response time, data, or other support wasn’t sufficient, go back and try to renegotiate.