The Ambra team has updated the upcoming release schedule to the following dates:
1/18/23 - EU, Asia, APAC UAT
1/19/23 - Private Clouds
1/25/23 - US Production, APAC Production
If you have any questions or concerns, please connect directly with your Customer Success Manager or submit a ticket via the ServiceNow portal.
Root Cause / Resolution:
As part of our AWS migration, we have invested heavily in automation to reduce the risk posed by manual changes, which are hard to track and review. Today we used this automation to roll out a small, seemingly innocuous change, but it caused a period of partial outage that impacted a number of our customers. The issue was traced back to a manual (non-automated) change made just after our migration to AWS on the weekend of 9/18. Because that manual change had never been captured in our automation, today's automated rollout removed it along with applying the smaller change.
Next Steps:
(1) The team has completed a full review of the change that was made and reinforced to all team members that changes may only be made through our automation, specifically to avoid this type of issue. Because the manual change was made during the highly sensitive migration window itself, it does not represent a typical ongoing risk.
(2) The team is researching how to apply this automation more rigorously, using code reviews and test deployments to identify problems before they reach our production platform.
Issue:
PNAP storage engines experienced instability that resulted in delays in studies appearing in Ambra and/or slowness in downloads to PACS. New studies ingested into AWS storage, where the majority of our users now ingest studies, were not impacted.
Root Cause/Resolution:
After extensive investigation, the team determined that the August release introduced a new job scheduling system for our storage service to improve performance in the AWS cloud. The size of the anonymization jobs triggered by our AI service is what caused the storage engine issue: the job scheduling system performed well on our AWS-based storage engines, but we saw periodic instability on our legacy PNAP (datacenter) storage engines.
The issue was resolved by optimizing how the AI anonymization jobs are ingested and persisted in the Storage job database. Using compression and some improved logic, the size of these jobs was reduced by over 75%, eliminating the query bottleneck and allowing storage to function normally in all environments, both the PNAP datacenter and AWS.
Please note that the only customers impacted by this issue were those whose storage was still in the PNAP data center at the time of the event. Most of our customers have already moved to AWS storage, so their new studies would not have been impacted unless they also experienced issues accessing historical studies still stored in PNAP.
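For illustration only, the following sketch shows the general technique described in the resolution above: compressing a job payload before it is persisted in a job database. This is not Ambra's actual code; the job structure, field names, and use of Python's zlib and JSON modules are assumptions.

    import json
    import zlib

    def pack_job(job: dict) -> bytes:
        # Serialize the job payload and compress it before writing it to the job database.
        raw = json.dumps(job, separators=(",", ":")).encode("utf-8")
        return zlib.compress(raw, level=9)

    def unpack_job(blob: bytes) -> dict:
        # Decompress and deserialize a stored job payload.
        return json.loads(zlib.decompress(blob).decode("utf-8"))

    # A hypothetical anonymization job with a long, repetitive list of per-tag rules,
    # which is the kind of payload that compresses very well.
    job = {
        "type": "anonymize",
        "study_uid": "1.2.840.113619.2.55.3",
        "rules": [{"tag": "0010,0010", "action": "replace", "value": "ANON"}] * 5000,
    }
    packed = pack_job(job)
    print(f"original: {len(json.dumps(job))} bytes, compressed: {len(packed)} bytes")
    assert unpack_job(packed) == job

Repetitive payloads like this typically shrink dramatically under compression, which is why reducing job size alone can relieve a query bottleneck on the job table.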
Issue:
Starting at approximately 12:30 PM ET, Ambra began receiving reports of extreme slowness across all transaction types, including study ingestion.
Times:
12:30 PM ET to 7:00 PM ET – Transaction & ingestion slowness
7:00 PM ET to 7:25 PM ET – System unavailable for maintenance
7:25 PM ET to 7:50 PM ET – UI slowness resolved; some residual slowness loading data while queued transactions caught up to real time
Root Cause/Resolution:
The engineering team identified a critical component that had been degraded since the 9/14 hardware event. The component was recycled and came back up healthy.
Note: The AWS migration (which had been in planning for several months) that occurred on 9/18 moved these same services to AWS. Moving to a cloud-native service backed by AWS resources removes the risk of this type of power issue happening again.
Issue:
At approximately 11:30 AM ET on 9/14, the Ambra application experienced an outage, followed by slow response times and a backlog of transactions in system queues, causing slowness in study ingestion and other transactions on the system.
Time:
11:30 AM ET to 2:45 PM ET – Outage
2:25 PM ET to 9:25 PM ET – Transaction & ingestion slowness during backlog catch-up period
Root Cause/Resolution:
The PNAP data center experienced a cascading power issue that impacted our primary and secondary power supplies. The power issue affected some of our core services, such as DNS and the database. When the system was successfully brought back online, a large backlog of pending gateway transactions had accumulated, which in turn caused slow UI response times and delays before studies and other data transactions became available online.
Note: The AWS migration (which had been in planning for several months) that occurred on 9/18 moved these same services to AWS. Moving to a cloud-native service backed by AWS resources removes the risk of this type of power issue happening again.
Issue:
Ambra Platform users experienced overall slow performance, and some users were unable to log in.
Time: 2:30 PM ET to 3:10 PM ET
Root Cause/Resolution:
The platform experienced a surge in API requests, and the front-end web processes were blocked waiting for API calls to complete. To recover, the team cancelled some stalled processes, which freed enough capacity for the surge to be handled without further impacting performance.
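As a general illustration of how a front end can protect itself from this failure mode, the sketch below adds an explicit timeout to an outbound API call so a slow backend cannot hold a web worker indefinitely. This is a minimal example, not Ambra's implementation; the endpoint, parameters, and use of the Python requests library are assumptions.

    import requests

    API_BASE = "https://api.example.invalid"  # placeholder endpoint, not a real Ambra URL

    def fetch_study_list(session: requests.Session, account_id: str) -> list:
        # Bound both the connect and read phases so a slow backend cannot
        # block this web worker indefinitely during a request surge.
        try:
            resp = session.get(
                f"{API_BASE}/studies",
                params={"account": account_id},
                timeout=(3.05, 10),  # (connect seconds, read seconds)
            )
            resp.raise_for_status()
            return resp.json()
        except requests.Timeout:
            # Fail fast: return an empty result (or a cached one) and let the
            # caller retry later instead of tying up the worker.
            return []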
Issue:
At approximately 12:30 PM ET on 9/7, slowness was observed in PNAP job processing, which in turn caused slow study ingestion, slow viewing, and general UI slowness. This issue did not impact AWS storage (see the note below).
Time:
12:30 PM ET to 3:45 PM ET
Root Cause/Resolution:
All Ambra storage engines use a database to manage persistent jobs. It was discovered that the PNAP job processing database did not hold up under certain types of load (such as large pixel modification jobs). The short-term fix for PNAP was to reset the database and then replay any jobs from a backup of the database.
Note: AWS storage was not impacted by this issue as we use a completely different underlying database technology that is significantly more robust and scalable than what is available on individual PNAP storage engines.
Issue:
Ambra Platform users experienced slow performance.
Time: 10:30 AM ET to 12:00 PM ET
Root Cause/Resolution:
A database maintenance action led to the slowness. This was an unfortunate side effect of a critical preparation step in our full migration of all services to AWS.
Note: The silver lining is that this was a one-time event that will not be repeated, and despite the incident the database has been left in a more resilient state, ready for the further improvements we have scheduled.
Issue:
Customers using PNAP storage experienced issues with new study ingestion (uploads) to Ambra. (AWS storage was not impacted; see the note below.)
Time: 12:00 AM ET to 9:00 PM ET
Root Cause/Resolution:
This issue appears to be related to specific DICOM data being mishandled by our code (apparently an extreme edge case) in the new job scheduling engine in storage. The corrective action is to restart the grid services on the PNAP storage nodes when they stall so that they resume processing the queue.
Note: Within AWS, Ambra has split out the storage services to be more resilient and dynamic, so this stack was not impacted.
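The restart-when-stalled approach described above can be automated with a simple watchdog. The sketch below is illustrative only; the service name, progress-marker file, and use of systemctl are assumptions, not details of Ambra's grid services.

    import subprocess
    import time

    SERVICE = "grid-worker"                               # hypothetical service name
    MARKER_FILE = "/var/run/grid-worker/processed_count"  # hypothetical progress counter
    STALL_SECONDS = 300                                   # how long the queue may sit without progress

    def processed_count() -> int:
        # A monotonically increasing count of jobs the service has completed.
        with open(MARKER_FILE) as f:
            return int(f.read().strip())

    def watchdog() -> None:
        last_count = processed_count()
        last_change = time.monotonic()
        while True:
            time.sleep(60)
            count = processed_count()
            if count != last_count:
                last_count, last_change = count, time.monotonic()
            elif time.monotonic() - last_change > STALL_SECONDS:
                # The queue has not moved; restart the service so it resumes processing.
                subprocess.run(["systemctl", "restart", SERVICE], check=True)
                last_change = time.monotonic()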
Issue:
At approximately 11:30 AM ET on Aug 4, Ambra monitoring observed service queues rising rapidly, and customers started experiencing processing delays.
Duration:
Aug 4, 2022, from 11:30 AM ET to 2:08 PM ET
Resolution:
Next steps:
Issue:
At approximately 2:30 PM ET users started experiencing platform slowness.
Duration:
Aug 3, 2022 from 2:30 PM ET to 3:50 PM ET
Resolution:
The team conducted an analysis of the transactions and identified multiple long-running queries, which were killed to free up database resources.
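For readers curious how such queries are typically found and stopped, the sketch below shows one common approach on a PostgreSQL-style database. The database technology, connection string, and thresholds are assumptions for illustration, not a statement of Ambra's stack.

    import psycopg2

    FIND_SQL = """
    SELECT pid, now() - query_start AS runtime, query
    FROM pg_stat_activity
    WHERE state = 'active'
      AND now() - query_start > %s::interval
    ORDER BY runtime DESC;
    """

    def list_long_running(dsn: str, threshold: str = "5 minutes") -> list:
        # Return (pid, runtime, query) for active queries running longer than the threshold.
        with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
            cur.execute(FIND_SQL, (threshold,))
            return cur.fetchall()

    def cancel_query(dsn: str, pid: int) -> bool:
        # Ask the server to cancel one backend's current query; this is gentler
        # than terminating the backend, which drops the whole connection.
        with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
            cur.execute("SELECT pg_cancel_backend(%s);", (pid,))
            return cur.fetchone()[0]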
Next steps:
RCA: Slowness Event, August 18th
Issue:
Ambra Platform users experienced slow performance on all transaction types, including upload, gateway processing, and UI actions, following the August release.
Time: 6:30 AM ET to 2:00 PM ET
Root Cause/Resolution:
A load-related problem identified following our last release caused major contention at the database level. The problem was linked to a new feature; while the platform remained up and operating for some users, major slowness and delays were encountered. To correct the contention, the feature was backed out to restore performance for all platform clients.
Next Steps:
Intelerad is investing in building out more mature performance and load-testing processes to avoid this type of problem in the future.
Security Advisory: CVE-2021-44228 - Ambra
December 12, 2021 - 15:00 EST
The Intelerad team, which now includes Ambra, has been working diligently to assess and mitigate any risks introduced by CVE-2021-44228.
At the time of this Security Advisory, we can confirm that Ambra software is not affected by this vulnerability. While some of our software uses the Log4j library, the versions in use are not affected by this specific vulnerability. No further action from our customers is required at this time.
The Intelerad team continues to monitor this threat closely and will provide additional updates as needed. Should you have any further questions, please feel free to open a support ticket or contact your account manager directly.
Thank you,
Intelerad Information Security Team