Ambra Status

Have Questions? Contact Us:
support@ambrahealth.com

Ambra Platform Announcements

Update to January Release Schedule

The Ambra team has updated the upcoming release schedule to the following dates: 

1/18/23 - EU, Asia, APAC UAT
1/19/23 - Private Clouds
1/25/23 - US Production, APAC Production

If you have any questions or concerns, please connect directly with your Customer Success Manager, or submit a ticket via the ServiceNow portal.

RCA: Slowness Event, November 16th
ROOT CAUSE:
  • The Ops team was performing routine system maintenance on the evening of Tuesday 11/15
  • Due to human error, systems were modified in an unintended way
  • The erroneous changes altered system capacity and, because of the error, were far broader and more aggressive than intended
RESOLUTION:
  • Our teams became aware of the issue on the morning of 11/16 when alerts fired, followed by customer reports. The Ops team recognized the error and immediately began corrective action
  • Original systems were restored to their full capacity
  • Additional systems were created and placed online to help with the backlog, leveraging the automation we have built as part of our migration to AWS
NEXT STEPS:
  • The Ops team will hold a retrospective on this incident to reinforce proper change-control processes and identify further internal optimizations that reduce the risk of future maintenance incidents
RCA: Outage Event, September 27th

Root Cause / Resolution:
As part of our AWS migration we have invested heavily in automation to de-risk manual changes that are hard to track and review. Today we used this automation to roll out a small, innocuous change, but it caused a period of partial outage impacting a number of our customers. The issue was traced back to a manual (non-automated) change made just after our migration to AWS on the weekend of 9/18. Because our automation did not include this manual change, it was removed today along with the smaller change.

Next Steps:
(1) The team has done a full review of the change that was made and reinforced to all team members that the only changes allowed are through our automation, to avoid this exact type of issue. Because this manual change was made during a highly sensitive period of the migration itself, it does not represent a typical risk going forward.
(2) The team is researching how to leverage this automation in a more focused manner, using code reviews and test deployments, to identify problems before they reach our production platform.

RCA: Slowness Event, September 20th-22nd

Issue:
PNAP storage engines were experiencing instability that resulted in delays with studies appearing in Ambra and/or slowness in downloading to PACS. However, new studies ingested into AWS, where the majority of our users now send new studies, were not impacted.

Root Cause/Resolution:
After much investigation, the team determined that with the August release Ambra had introduced a new job scheduling system for our storage service, to improve performance in the AWS cloud. The size of the anonymization jobs triggered by our AI service caused the storage engine issue: the job scheduling system performed fine in our AWS-based storage engines, but we saw periodic instability in our legacy PNAP (data center) storage engines.

The issue was resolved by optimizing how the AI anonymization jobs are ingested and persisted in the storage job database. Using compression and some improved logic, the size of these jobs was reduced by over 75%, eliminating the query bottleneck and allowing storage to function normally in all environments, both the PNAP data center and AWS.
Please note that the only customers impacted by this issue were those whose storage was still in the PNAP data center at the time of the event. Most of our customers have already moved to AWS for storage, so their new studies were not impacted; the only potential impact was when accessing historical studies still stored in PNAP.
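
To make the nature of the fix more concrete, here is a minimal sketch in Python, with an in-memory SQLite table standing in for the storage job database, of compressing a job payload before it is persisted and decompressing it on read. The table layout and function names are illustrative assumptions, not Ambra's actual implementation.

    # Minimal sketch only: compress large job payloads before persisting them,
    # as in the anonymization-job optimization described above. SQLite is used
    # here purely as a stand-in for the real storage job database.
    import json
    import sqlite3
    import zlib

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE jobs (id TEXT PRIMARY KEY, payload BLOB)")

    def persist_job(job_id: str, payload: dict) -> None:
        # Serialize and compress so very large jobs stay small in the job table.
        raw = json.dumps(payload).encode("utf-8")
        conn.execute("INSERT INTO jobs (id, payload) VALUES (?, ?)",
                     (job_id, zlib.compress(raw)))

    def load_job(job_id: str) -> dict:
        row = conn.execute("SELECT payload FROM jobs WHERE id = ?", (job_id,)).fetchone()
        return json.loads(zlib.decompress(row[0]))

    # Example: a large, repetitive anonymization job shrinks dramatically once compressed.
    job = {"study": "1.2.840.x", "regions": [[0, 0, 512, 512]] * 1000}
    persist_job("job-1", job)
    assert load_job("job-1") == job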

Upcoming Ambra Health Maintenance, September 18th
Ambra has an upcoming maintenance window scheduled for this Sunday, September 18, at 12:00 AM ET.
  
This maintenance window will last approximately four hours until around 4:00 AM ET. During this time period, the Ambra Platform will be unavailable. We will notify customers via email at the beginning and end of the maintenance period.   
 
The purpose of this maintenance is to transition to a new database architecture. This will require our team to bring the entire Ambra application offline for the full duration of the maintenance window. This means no Ambra services will be available until the maintenance has concluded.
 
Customers that rely on the application during this time must plan to implement business continuity measures.
 
The local Ambra Gateway will cache images during the maintenance downtime and will automatically send them once the maintenance window has concluded and the Ambra application is back online.   
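
As a rough illustration of this store-and-forward behavior, the Python sketch below queues files on local disk while the platform is unreachable and forwards them once it is back. The URLs, file layout, and upload call are hypothetical placeholders, not the actual Ambra Gateway interface.

    # Conceptual sketch only: cache studies locally while the platform is
    # offline and forward them once it is reachable again. All endpoints and
    # paths below are placeholders, not real Ambra Gateway configuration.
    import pathlib
    import time
    import requests

    QUEUE_DIR = pathlib.Path("gateway_queue")          # local cache directory
    HEALTH_URL = "https://platform.example/health"     # hypothetical
    UPLOAD_URL = "https://platform.example/upload"     # hypothetical

    def platform_is_up() -> bool:
        try:
            return requests.get(HEALTH_URL, timeout=5).ok
        except requests.RequestException:
            return False

    def flush_queue() -> None:
        # Send cached studies oldest-first; delete each file only after a
        # successful upload so nothing is lost if the platform goes down again.
        for path in sorted(QUEUE_DIR.glob("*.dcm")):
            with path.open("rb") as f:
                ok = requests.post(UPLOAD_URL, data=f, timeout=60).ok
            if ok:
                path.unlink()

    while True:
        if platform_is_up():
            flush_queue()
        time.sleep(60)   # re-check once a minute during the maintenance window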
 
We appreciate your continued patience and support. If you have any questions or concerns about this maintenance, please connect with our Support Team via the Intelerad Service Portal, or work directly with your Client Success Manager.

The Ambra Health Team 
RCA: Outage & Slowness Event, September 15th

Issue:

Starting at approximately 12:30 PM ET, the Ambra team began receiving reports of extreme slowness on the platform for all types of transactions, including study ingestion.

Times:

12:30 PM ET to 7:00 PM ET – transaction & ingestion slowness

7:00 PM ET to 7:25 PM ET – System Unavailable Maintenance

7:25 PM ET to 7:50 PM ET – UI slowness resolved, with some residual slowness loading data while queued transactions caught up to real time.

Root Cause/Resolution:

The engineering team identified a critical component that had been degraded since the 9/14 hardware event. The component was recycled and came back up healthy.

Note: The AWS migration (which had been in the planning phase for several months) that occurred on 9/18 migrated these same services to AWS. Moving to a cloud-native service backed by AWS resources removes the risk of this type of power issue happening again.

RCA: Outage & Slowness Event, September 14th

Issue:
At approximately 11:30 AM ET on 9/14, the Ambra application experienced an outage, followed by slow response times and a backlog of transactions in system queues, causing slowness in study ingestion and other transactions on the system.

Time:
11:30 AM ET to 2:45 PM ET – Outage
2:25 PM ET to 9:25 PM ET – transaction & ingestion slowness during the backlog catch-up period

Root Cause/Resolution:
The PNAP data center experienced a cascading power issue that impacted our primary and secondary power supplies. The power issue impacted some of our core services, such as DNS and the database. When the system was successfully brought back online, a large backlog of pending gateway transactions had accumulated, which in turn caused slow UI response times and waits for studies and other data transactions to become available online.

Note: The AWS migration (which had been in the planning phase for several months) that occurred on 9/18 migrated these same services to AWS. Moving to a cloud-native service backed by AWS resources removes the risk of this type of power issue happening again.

 

RCA: Slowness Event, September 9th

Issue:

Ambra Platform users experienced overall slow performance, and some users were unable to log in.

Time: 2:30 PM ET to 3:10 PM ET

Root Cause/Resolution:

The platform saw a large surge in API requests, and the front-end web processes were blocked waiting for API calls to complete. To recover, the team cancelled some stalled processes, which freed up enough overhead for the surge to be handled without further impacting performance.
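
For illustration only, here is a sketch of the kind of recovery step described above: terminating long-lived front-end worker processes that appear blocked so the pool can serve new requests. The process name, thresholds, and the assumption that a supervisor respawns workers are all hypothetical, not Ambra's actual tooling.

    # Hypothetical sketch: find worker processes that have been alive a long
    # time but used almost no CPU (i.e. likely blocked waiting on a downstream
    # API call) and terminate them.
    import time
    import psutil

    WORKER_NAME = "frontend-worker"     # hypothetical process name
    MIN_AGE_SECONDS = 300               # only consider workers older than 5 minutes
    MAX_CPU_SECONDS = 1.0               # near-zero CPU suggests the worker is stuck

    now = time.time()
    for proc in psutil.process_iter(["name", "create_time"]):
        if proc.info["name"] != WORKER_NAME:
            continue
        cpu = proc.cpu_times().user + proc.cpu_times().system
        if now - proc.info["create_time"] > MIN_AGE_SECONDS and cpu < MAX_CPU_SECONDS:
            proc.terminate()            # a supervisor is assumed to respawn workers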

RCA: Slowness Event, September 7th

Issue:

At approximately 12:30 PM ET on 9/7, slowness was observed with PNAP job processing, which in turn caused slow study ingestion, slow viewing, and general UI slowness. This issue did not impact AWS storage (see the note below).

Time:

12:30 PM ET to 3:45 PM ET

Root Cause/Resolution:

All Ambra storage engines use a database for managing persistent jobs. It was discovered that the job processing database for PNAP was not holding up under certain types of load (such as large pixel-modification jobs). The short-term fix for PNAP was to reset the database and then replay any outstanding jobs from a backup of the database.
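
A minimal sketch of the "replay from backup" step is shown below, using SQLite purely as a stand-in for the storage job database; the table, columns, and status values are illustrative assumptions, not Ambra's schema.

    # Illustrative sketch only: after resetting the live job database, copy any
    # job that had not reached a terminal state from a backup so it runs again.
    import sqlite3

    SCHEMA = "CREATE TABLE IF NOT EXISTS jobs (id TEXT PRIMARY KEY, payload BLOB, status TEXT)"

    backup = sqlite3.connect("jobs_backup.db")   # backup taken before the reset
    live = sqlite3.connect("jobs_live.db")       # freshly reset job database
    backup.execute(SCHEMA)                       # no-op if the backup already has the table
    live.execute(SCHEMA)

    unfinished = backup.execute(
        "SELECT id, payload FROM jobs WHERE status != 'done'"
    ).fetchall()

    for job_id, payload in unfinished:
        # Re-queue as 'pending' so the scheduler picks the job up again;
        # INSERT OR IGNORE avoids duplicating anything already re-created.
        live.execute(
            "INSERT OR IGNORE INTO jobs (id, payload, status) VALUES (?, ?, 'pending')",
            (job_id, payload),
        )
    live.commit()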

Note: AWS storage was not impacted by this issue as we use a completely different underlying database technology that is significantly more robust and scalable than what is available on individual PNAP storage engines.

HL7 Report Issues Update: 8/25/22
  • At this time, HL7 Report Data has been restored.  Customers should now be able to access previously missing radiology reports.
RCA: Slowness Event, August 24th

Issue:

Ambra Platform users experienced slow performance.

Time: 10:30 AM ET to 12:00 PM ET

Root Cause/Resolution:

A database maintenance action led to the slowness. This was an unfortunate side effect of a critical preparation step in our full migration of all services to AWS.

Note: The silver lining is that this was a one-time event that will not be repeated, and despite the incident the database has been left in a more resilient state, ready for the further improvements we have scheduled.

RCA: Outage, August 23rd

Issue:

Customers using PNAP storage experienced issues with new study ingestion (uploads) to Ambra. (AWS storage was not impacted; see the note below.)

Time: 12:00 AM ET to 9:00 PM ET

Root Cause/Resolution:

This issue appears to be related to specific DICOM data being mishandled by our code (apparently an extreme edge case) because of the new job scheduling engine in storage. The corrective action was to restart the grid services on the PNAP storage nodes whenever they stall so that they resume processing the queue.
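
As a rough sketch of that corrective action, the loop below restarts a storage service whenever its queue stops draining for several consecutive checks. The service name, the way queue depth is read, and the thresholds are hypothetical assumptions for illustration.

    # Hypothetical stall watchdog, not Ambra's actual tooling: if the job queue
    # is non-empty and has not shrunk for three consecutive checks, restart the
    # storage service so it resumes processing the queue.
    import subprocess
    import time

    SERVICE = "grid-storage"                         # hypothetical service name
    DEPTH_FILE = "/var/run/storage/queue_depth"      # hypothetical metric source

    def queue_depth() -> int:
        with open(DEPTH_FILE) as f:
            return int(f.read().strip())

    stalled_checks = 0
    last_depth = None
    while True:
        depth = queue_depth()
        # "Stalled" means the queue is non-empty and not shrinking.
        if last_depth is not None and 0 < last_depth <= depth:
            stalled_checks += 1
        else:
            stalled_checks = 0
        if stalled_checks >= 3:
            subprocess.run(["systemctl", "restart", SERVICE], check=False)
            stalled_checks = 0
        last_depth = depth
        time.sleep(60)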

Note: Within AWS Ambra has split out the storage services to be more resilient and dynamic so this stack was not impacted.

RCA: Incident: August 4th

Issue:

At approximately 11:30 AM ET on Aug 4th, Ambra monitoring observed service queues rapidly rising, and customers started experiencing processing delays.

Duration:

Aug 4, 2022, from 11:30 AM ET to 2:08 PM ET

 Resolution:

  • To assist with processing the backlog of transactions, the team temporarily added additional “queue workers” to enable services to work through the backlog faster.
  • Our newly established alerting mechanisms, set up in the past few months, fired appropriately and informed our teams of the incident only a few minutes after it began, allowing us to act quickly.
  • As a result of our expanded DevOps team structure and our investments in AWS and automation, we were able to deploy the additional queue workers faster than in the past and clear the backlogged queue in record time.

 Next steps:

  • Queue Management options: We are researching new scripts and tools that can help us manually manage the queue in a more dynamic fashion to return our customers to stability more rapidly.
  • Rapid Response options: We continue to refine the process and automation of bringing additional queue processors online for outlier events such as the backlog experienced today.  Further improvements in these areas will again bring about a more rapid response to unforeseen conditions.
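
To illustrate the kind of dynamic queue-worker scaling referred to above, here is a self-contained Python sketch that adds worker processes while a simulated backlog stays above a threshold. The queue contents, thresholds, and "work" are placeholders, not Ambra's services or tooling.

    # Illustrative sketch only: temporarily add queue workers while a backlog
    # remains large, then let them exit once the queue drains.
    import multiprocessing as mp
    import queue
    import time

    def process(job) -> None:
        time.sleep(0.01)                 # stand-in for real job handling

    def worker(jobs) -> None:
        while True:
            try:
                job = jobs.get(timeout=2)
            except queue.Empty:
                return                   # queue drained, extra worker exits
            process(job)

    if __name__ == "__main__":
        manager = mp.Manager()
        jobs = manager.Queue()
        for i in range(5000):            # simulated backlog of transactions
            jobs.put(i)

        workers = [mp.Process(target=worker, args=(jobs,)) for _ in range(2)]
        for w in workers:
            w.start()

        # Scale out: while the backlog stays above a threshold, bring extra
        # "queue workers" online, mirroring the temporary capacity added above.
        while not jobs.empty():
            if jobs.qsize() > 1000 and len(workers) < 8:
                extra = mp.Process(target=worker, args=(jobs,))
                extra.start()
                workers.append(extra)
            time.sleep(0.5)

        for w in workers:
            w.join()
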
RCA: Incident, Aug 3rd

Issue:

At approximately 2:30 PM ET users started experiencing platform slowness.

Duration:

Aug 3, 2022 from 2:30 PM ET to 3:50 PM ET

 Resolution:

The team analyzed the transactions and identified multiple long-running queries, which were “killed” to free up database resources.
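
For illustration, here is a sketch of identifying and terminating long-running queries. It assumes a PostgreSQL backend purely as an example; the RCA does not state which database technology, threshold, or tooling was involved.

    # Hedged example only: the database engine (PostgreSQL here), the 5-minute
    # threshold, and the connection string are assumptions for illustration.
    import psycopg2

    conn = psycopg2.connect("dbname=app")      # placeholder connection string
    conn.autocommit = True
    cur = conn.cursor()

    # Find queries that have been running longer than five minutes.
    cur.execute(
        """
        SELECT pid, now() - query_start AS runtime, query
        FROM pg_stat_activity
        WHERE state = 'active'
          AND now() - query_start > interval '5 minutes'
        """
    )
    for pid, runtime, query in cur.fetchall():
        print(f"terminating pid={pid} after {runtime}: {query[:80]}")
        cur.execute("SELECT pg_terminate_backend(%s)", (pid,))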

 

Next steps:

  • Query Research: The team is researching the root cause of these long-running queries and will decide on the right set of infrastructure and product changes to both improve response time and eliminate the source of these queries completely. While this appears to be a rare event, we will continue to investigate continuous-improvement approaches.
  • AWS Migration Continued: As many of you know, we are continuing to move more of our shared platform services to AWS, where we can leverage additional scale and compute services to further absorb unplanned issues such as these.

RCA: Slowness Event, August 18th

Issue:
Ambra Platform users experienced slow performance on all transaction types, including upload, gateway processing, and UI actions, following the August release.

Time: 6:30 AM ET to 2:00 PM ET

Root Cause/Resolution:
A load-related problem that caused major contention at the database level was identified following our last release. It was linked to a new feature; while the platform remained up and operating for some users, major slowness and delays were encountered. To correct the contention, the feature was backed out to recover performance for all platform clients.
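
Backing a feature out quickly is often done with a runtime kill switch; the generic sketch below shows that pattern. Nothing in this RCA states how Ambra implemented the feature or the back-out, and all names here are hypothetical.

    # Generic illustration only; not a statement about Ambra's implementation.
    import os

    def new_feature_enabled() -> bool:
        # Flip NEW_FEATURE_ENABLED=0 (environment variable or config store)
        # to back the feature out for all clients without a redeploy.
        return os.environ.get("NEW_FEATURE_ENABLED", "1") == "1"

    def handle_with_new_feature(request: str) -> str:
        return f"new code path: {request}"       # hypothetical new feature path

    def handle_with_previous_path(request: str) -> str:
        return f"previous code path: {request}"  # known-good path

    def handle_request(request: str) -> str:
        if new_feature_enabled():
            return handle_with_new_feature(request)
        return handle_with_previous_path(request)

    print(handle_request("study-list"))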

Next Steps:
Intelerad is investing in building out more mature performance and load-testing processes to help avoid this problem in the future.

CVE-2021-44228

Security Advisory: CVE-2021-44228 - Ambra

December 12, 2021 - 15:00 EST


The Intelerad team, which now includes Ambra, has been working diligently to assess and mitigate any risks introduced by  CVE-2021-44228.


At the time of this Security Advisory we can confirm that Ambra software is not affected by this vulnerability.  While some of our software utilizes the Log4j library, the versions utilized are not affected by this specific vulnerability.  No further action from our customers is required at this time.


The Intelerad team continues to monitor this threat closely and will provide additional updates as needed.  Should you have any further questions, please feel free to open a support ticket or contact your account manager directly.


Thank you,

Intelerad Information Security Team

Gateway Directory

Once you have an Ambra Gateway installed, you can request a PACS-to-PACS connection with any facility in the Ambra Gateway Network, visible in the Gateway Directory.

Request a Gateway →