Ambra Status

Have Questions? Contact Us:
support@ambrahealth.com

Ambra Platform Announcements

RCA: Partial Outage, April 10th

Issue:
Customers experienced errors on gateway transactions  

Time: 10:16am EDT to 10:40am EDT

Root Cause:
The SSL certificate for dicomgrid.com expired, and no monitoring rule was in place in AWS to alert on this certificate's expiration.

Resolution:
The SSL certificate for dicomgrid.com was renewed. We also found that certificate alerts were not configured correctly on some recently added devices in AWS. We reviewed and updated all devices that were missing the certificate monitor, and we have updated our playbook so that this monitor is deployed to all new devices in the future.
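For illustration, the snippet below is a minimal sketch in Python of the kind of certificate-expiry check such a monitor could run; the port, timeout, and 30-day alert threshold are examples only and do not reflect Ambra's actual monitoring configuration.

import socket
import ssl
from datetime import datetime, timezone

def days_until_expiry(hostname: str, port: int = 443) -> float:
    """Return the number of days until the host's TLS certificate expires."""
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            cert = tls.getpeercert()
    # notAfter is formatted like "Apr 10 14:00:00 2024 GMT"
    not_after = datetime.strptime(cert["notAfter"], "%b %d %H:%M:%S %Y %Z")
    delta = not_after.replace(tzinfo=timezone.utc) - datetime.now(timezone.utc)
    return delta.total_seconds() / 86400

if __name__ == "__main__":
    remaining = days_until_expiry("dicomgrid.com")
    if remaining < 30:  # illustrative alert threshold
        print(f"ALERT: certificate expires in {remaining:.1f} days")
    else:
        print(f"OK: certificate valid for {remaining:.1f} more days")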

UPDATE: Upcoming April Release Dates

The upcoming release dates for April have been modified to the following: 

4/12/23 - EU, Asia, APAC Production (previously 4/5/23)
4/13/23 - Private Clouds (previously 4/6/23)
4/19/23 - US Production (previously 4/12/23)

If you have any questions or concerns, please connect directly with your Customer Success Manager, or submit a ticket via the ServiceNow portal.

RCA: Partial Outage, March 29th

Issue:
Some customers experienced study ingestion errors, viewing errors, and intermittent errors when logging in or navigating the application.
Time: 4:12pm EDT to 5:54pm EDT

Root Cause:
One of our DNS servers began intermittently timing out, which caused name lookups to sometimes fail. This caused errors connecting to an upstream network service, leading to local disks filling up with the data that couldn't be sent over the network.

Resolution:
The platform team resolved the errors by pointing our servers at an alternate DNS resolver, then cleaned up the accumulated files from disk and restarted the affected services.

Next Steps:
We are further investigating the cause of the DNS errors and why systems didn't automatically utilize a secondary resolver.
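For reference, the snippet below is a minimal sketch in Python of the fallback behavior described above, using the third-party dnspython package; the resolver addresses and timeout are placeholders rather than Ambra's actual configuration.

import dns.exception
import dns.resolver  # third-party: dnspython

PRIMARY_RESOLVER = "10.0.0.2"    # placeholder address
SECONDARY_RESOLVER = "10.0.0.3"  # placeholder address

def resolve_with_fallback(name: str) -> list:
    """Try the primary resolver first; fall back to the secondary on failure."""
    for nameserver in (PRIMARY_RESOLVER, SECONDARY_RESOLVER):
        resolver = dns.resolver.Resolver(configure=False)
        resolver.nameservers = [nameserver]
        try:
            answer = resolver.resolve(name, "A", lifetime=2.0)
            return [record.address for record in answer]
        except (dns.exception.Timeout, dns.resolver.NoNameservers):
            continue  # this resolver is unhealthy; try the next one
    raise RuntimeError(f"all configured resolvers failed for {name}")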

RCA: Performance Event, March 8th

Issue:
Users experienced delays and sporadic errors when ingesting or accessing studies.
Time: 12:20PM ET to 12:36PM ET

Root Cause:
A caching component within our storage infrastructure began exceeding its maximum permitted network bandwidth, leading to throttling of its network traffic. While the system remained available and in use, this throttling caused intermittent delays and sporadic errors when ingesting or accessing studies.

Resolution:
We have increased the resource sizes of the affected systems to allow for much greater network capacity, and we are investigating possible architectural changes to increase the network efficiency of this cache.

RCA: Event, March 3rd
Issue:
A brief service disruption impacted user logins, study ingestion, and image viewing.
Time: 7:41 PM ET to 9:13 PM ET

Root Cause:
A firewall in the PNAP data center failed. The failure triggered a failover to the secondary firewall, but the route to AWS failed to be injected into the routing table.

Resolution:
Manual intervention was required to update the routing table. Additionally, new monitoring has been established upstream of all our networking components to detect a drop in ingested studies. This should help improve response times if a failure of this nature, or any other, occurs in the future.
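For illustration only, the sketch below shows one way a "drop in ingested studies" check could be expressed in Python; the sampling window and 50% drop threshold are assumptions and do not describe Ambra's actual monitoring.

from collections import deque

class IngestRateMonitor:
    """Flag when the current ingestion rate falls well below the recent baseline."""

    def __init__(self, window: int = 30, drop_ratio: float = 0.5):
        self.samples = deque(maxlen=window)  # recent studies-per-minute samples
        self.drop_ratio = drop_ratio         # alert when rate < baseline * drop_ratio

    def record(self, studies_per_minute: float) -> bool:
        """Record a new sample and return True if a significant drop is detected."""
        baseline = sum(self.samples) / len(self.samples) if self.samples else 0.0
        self.samples.append(studies_per_minute)
        if baseline == 0.0:
            return False
        return studies_per_minute < baseline * self.drop_ratio
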
RCA: Partial Outage Event, Feb 22nd

Issue:
Newly ingested studies were not available in Ambra.
Time: 3:00PM ET to 3:45PM ET
Duration: 45 minutes

Root Cause:
A performance issue within our services stack was identified that tied back to the storage issues that began last week on 2/16. Investigation revealed that a services health check was failing due to slower than expected responses from the storage nodes. 

Resolution:
1) To resolve the issue, the timeout parameter for the services health check process was increased (a sketch of such a check appears below).
2) A code update for storage was deployed the evening of 2/22, which should further prevent this issue from recurring and complete the final stabilization efforts from last week's incidents that began on 2/16.
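The following is a minimal sketch in Python of an HTTP-style health check with a configurable timeout, included only to illustrate the parameter change in item 1; the health endpoint and timeout values are assumptions, not Ambra's actual implementation.

import requests

def storage_node_healthy(url: str, timeout_seconds: float = 5.0) -> bool:
    """Return True if the node answers its health endpoint within the timeout."""
    try:
        response = requests.get(url, timeout=timeout_seconds)
        return response.status_code == 200
    except requests.RequestException:
        return False

# Raising timeout_seconds gives slow-but-healthy storage nodes time to respond,
# so they are not marked as failed during periods of degraded performance.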

RCA: Partial Outage Event, Feb 17th

Issue:
The Ambra platform was degraded. Users and gateways were unable to upload or view studies.
Time: 2:07 PM EST to 2:34 PM EST
Duration: 27 Min

Root Cause:
The DevOps team deployed a small code change to a single server to kick off the full fix for the resiliency issues encountered on Feb 16th. This change collided with a rare database maintenance procedure, which then impacted all storage servers and caused an interruption in service.

Resolution:
Since this event we have made code and process changes such that both restarts of our storage service and new code deployments should not intersect with running database maintenance processes, making these changes lower-risk going forward.
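To make the intent concrete, here is a minimal sketch of a pre-deploy guard that refuses to proceed while database maintenance is running; it assumes a PostgreSQL database and the psycopg2 driver purely for illustration, and does not describe Ambra's actual deployment tooling.

import sys
import psycopg2  # third-party driver, used here only for illustration

MAINTENANCE_PREFIXES = ("VACUUM", "ANALYZE", "REINDEX")  # illustrative examples

def maintenance_running(dsn: str) -> bool:
    """Return True if any active session is running a maintenance-style statement."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute("SELECT query FROM pg_stat_activity WHERE state = 'active'")
        return any(
            query and query.strip().upper().startswith(MAINTENANCE_PREFIXES)
            for (query,) in cur.fetchall()
        )

if __name__ == "__main__":
    if maintenance_running("dbname=example"):  # placeholder connection string
        sys.exit("Deploy blocked: database maintenance in progress")
    print("No maintenance detected; safe to proceed with deploy")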

RCA: Partial Outage Event, February 16th

Issue #1
Time: 12:16 PM EST to 12:37 PM EST  
Duration: 21 Minutes 

Root Cause:  
Thread usage on the storage nodes gradually increased during the day and began hitting system resource limits. Investigation revealed that this was due to the combination of additional storage bins being added and a previously unknown environmental limitation. As this limit was hit, the nodes became intermittently unresponsive.

Resolution:  
Due to these limits, the automated recovery was unable to complete and a manual restart of the affected nodes was required.  After the restart the systems began responding normally. 
We continue to closely monitor the resource usage.
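As an illustration of the kind of resource check this implies, the sketch below reads the current process's thread count on Linux and compares it to a ceiling; the 4096-thread limit and 80% alert level are assumptions, not the actual environmental limit referenced above.

THREAD_CEILING = 4096   # illustrative environmental limit
ALERT_FRACTION = 0.8    # illustrative alert threshold

def current_thread_count() -> int:
    """Count the threads of the current process via /proc (Linux only)."""
    with open("/proc/self/status") as status:
        for line in status:
            if line.startswith("Threads:"):
                return int(line.split()[1])
    raise RuntimeError("Threads entry not found in /proc/self/status")

if __name__ == "__main__":
    used = current_thread_count()
    if used > ALERT_FRACTION * THREAD_CEILING:
        print(f"ALERT: {used} threads in use, approaching the limit of {THREAD_CEILING}")
    else:
        print(f"OK: {used} threads in use")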
 

Issue #2:  
Time: 2:43 PM EST to 4:20 PM EST 
Duration: 1 hour 37 min

Root Cause:  
Application nodes were unable to process incoming requests as quickly as they were being received and became saturated.  Investigation is ongoing to determine whether this was caused by the backlogs created by the earlier outage. 

Resolution:  
The impact was gradually mitigated by provisioning additional interactive services nodes.

RCA: Partial Outage Event, February 14th

Issue:  
The Ambra Platform was degraded and study ingestion stalled.

Impact Times: 1:57PM EST to 2:39PM EST and 4:15PM EST to 5:15PM EST.

Root Cause:
On 2/14, while the database was experiencing load spikes during periods of peak user activity, an internal database maintenance process was automatically invoked. The combination of events (load spike and database maintenance process) caused slow database write operations and led to delays in ingesting new studies.

Resolution: 
To resolve the issue, Ambra had to perform emergency maintenance. We increased the specifications of the database server to better handle unexpected spikes in activity. This maintenance took approximately 30 minutes, and during that time storage became unavailable at several points, which also prevented end users from viewing studies. Once the emergency maintenance was completed, the database returned to performing at normal levels and the queues returned to normal.

RCA: Partial Outage Event, January 26th

Issue:

Study ingestion was failing.

Root Cause/Remediation:

As part of the release, there was a change to where DICOM files were stored. AWS auto scaling failed to increase resources quickly enough to handle the sudden change in load on the new storage location.

After discussing the issue with AWS it was determined that the fastest way to restore normal functionality to the Ambra platform was to revert back to the previously used storage location. Once the change was reverted the platform quickly returned to normal operation.

Next Steps:

The Ambra team is working with AWS to ensure that this doesn’t happen again and will update our processes accordingly.

Update to January Release Schedule

The Ambra team has updated the upcoming release schedule to the following dates: 

1/18/23 - EU, Asia, APAC UAT
1/19/23 - Private Clouds
1/25/23 - US Production, APAC Production

If you have any questions or concerns, please connect directly with your Customer Success Manager, or submit a ticket via the ServiceNow portal.

CVE-2021-44228

Security Advisory: CVE-2021-44228 - Ambra

December 12, 2021 - 15:00 EST


The Intelerad team, which now includes Ambra, has been working diligently to assess and mitigate any risks introduced by  CVE-2021-44228.


At the time of this Security Advisory we can confirm that Ambra software is not affected by this vulnerability.  While some of our software utilizes the Log4j library, the versions utilized are not affected by this specific vulnerability.  No further action from our customers is required at this time.
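For context on how such an assessment is typically performed, the snippet below is a minimal sketch in Python that inventories Log4j jars and their versions on a host; the search path is a placeholder, and the affected range (Log4j 2 versions 2.0-beta9 through 2.14.1) comes from the public advisory for CVE-2021-44228 rather than from Ambra's environment.

import pathlib
import re

VERSION_PATTERN = re.compile(r"log4j-core-(.+)\.jar$")

def find_log4j_core_jars(root: str = "/opt"):  # search root is a placeholder
    """Yield (path, version) for every log4j-core jar found under root."""
    for jar in pathlib.Path(root).rglob("log4j-core-*.jar"):
        match = VERSION_PATTERN.search(jar.name)
        if match:
            yield jar, match.group(1)

if __name__ == "__main__":
    # CVE-2021-44228 affects Log4j 2 versions 2.0-beta9 through 2.14.1.
    for path, version in find_log4j_core_jars():
        print(f"{path}: log4j-core version {version}")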


The Intelerad team continues to monitor this threat closely and will provide additional updates as needed.  Should you have any further questions, please feel free to open a support ticket or contact your account manager directly.


Thank you,

Intelerad Information Security Team

Gateway Directory

Once you have an Ambra Gateway installed you can request a PACS to PACS connection with any facility in the Ambra Gateway Network, visible in the Gateway Directory.

Request a Gateway →