AWS Discloses Outage Details: An Operating System Thread Limit Was Exceeded, and Operators Were Unfamiliar with Some of the Workarounds

AWS recently disclosed that adding capacity to an already complex system was what caused last week's wide-reaching unplanned outage in its US-EAST-1 region.
In short, the company added capacity to its Kinesis service, which is used directly by many customers and also underpins other parts of AWS's own operational systems. Servers in the Kinesis fleet need to communicate with one another, and doing so means each one creates a new thread for every other server in the front-end fleet. AWS says "many thousands of servers" are involved, and after new servers are added, news of their arrival can take up to an hour to propagate across the entire fleet.
As a result, the capacity addition "caused all of the servers in the fleet to exceed the maximum number of threads allowed by an operating system configuration."
Once AWS had figured this out, it also understood that fixing the problem meant restarting the entire Kinesis front-end fleet. But servers could only be brought back at a rate of "a few hundred" per hour, and as noted above, Kinesis runs on "many thousands of servers." That explains why recovery after the failure was so slow.
AWS's post explains the entire incident in great detail, along with how it plans to avoid similar events in the future.
Plan one: use larger servers.
"In the very short term, we will be moving to larger CPU and memory servers, reducing the total number of servers and, hence, threads required by each server to communicate across the fleet," the post says. This, it explains, "will provide significant headroom in thread count used, as the total threads each server must maintain is directly proportional to the number of servers in the fleet."
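To make that proportionality concrete, here is a back-of-envelope sketch in Python. The fleet sizes are hypothetical, not AWS's numbers; the point is that with one OS thread per peer, each server's thread count grows linearly with fleet size, so halving the number of servers (by using bigger boxes) halves the threads each server must maintain.

```python
# Illustrative arithmetic only: one OS thread per peer in the fleet.
def peer_threads(fleet_size: int) -> tuple[int, int]:
    per_server = fleet_size - 1            # one thread per *other* front-end server
    fleet_total = fleet_size * per_server  # grows quadratically with fleet size
    return per_server, fleet_total

for n in (10_000, 5_000):                  # hypothetical before/after fleet sizes
    per, total = peer_threads(n)
    print(f"{n:>6} servers -> {per:>6,} threads/server, {total:>12,} fleet-wide")
```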
The company also plans to add "fine-grained alarming for thread consumption in the service," and to increase "thread count limits in our operating system configuration, which we believe will give us significantly more threads per server and give us significant additional safety margin there as well."
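For readers wondering which knobs are involved, the sketch below reads the Linux limits that typically bound how many threads a process can create. This is generic Linux administration, not AWS's actual configuration; which limit bites first depends on the distribution and cgroup setup, and, as AWS says, raising any of them warrants testing first.

```python
# Linux-only sketch: inspect the limits that commonly cap thread creation.
from pathlib import Path
import resource

print("kernel.threads-max :", Path("/proc/sys/kernel/threads-max").read_text().strip())
print("kernel.pid_max     :", Path("/proc/sys/kernel/pid_max").read_text().strip())
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)  # per-user task cap
print("RLIMIT_NPROC       :", soft, hard)
# Raising them (as root) would look like, e.g.:
#   sysctl -w kernel.threads-max=4194304
# or a higher RLIMIT_NPROC via limits.conf / systemd's TasksMax.
```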
Also on the agenda: isolating high-demand services such as CloudWatch onto a separate, partitioned fleet of Kinesis front-end servers.
Dashboards dragged down by dependencies
The post also outlines why Amazon's dashboards offered so little information about the incident: the dashboards themselves depend on services that rely on Kinesis.
AWS had developed a lower-dependency method of pushing information to the Service Health Dashboard, which it uses as its public status page. The post says this backup worked as expected, but "we encountered several delays during the earlier part of the event … as it is a more manual and less familiar tool for our support operators."
AWS therefore used the Personal Health Dashboard, which is visible only to the affected customers, to notify them.
The post closes with an apology: "While we are proud of our long track record of availability with Amazon Kinesis, we know how critical this service, and the other AWS services that were impacted, are to our customers, their applications and end users, and their businesses."
"We will do everything we can to learn from this event and use it to improve our availability even further."
The full AWS incident report follows:
Summary of the Amazon Kinesis Event in the Northern Virginia (US-EAST-1) Region
We wanted to provide you with some additional information about the service disruption that occurred in the Northern Virginia (US-EAST-1) Region on November 25th, 2020.
Amazon Kinesis enables real-time processing of streaming data. In addition to its direct use by customers, Kinesis is used by several other AWS services. These services also saw impact during the event. The trigger, though not the root cause, for the event was a relatively small addition of capacity that began to be added to the service at 2:44 AM PST, finishing at 3:47 AM PST. Kinesis has a large number of “back-end” cell-clusters that process streams. These are the workhorses in Kinesis, providing distribution, access, and scalability for stream processing. Streams are spread across the back-end through a sharding mechanism owned by a “front-end” fleet of servers. A back-end cluster owns many shards and provides a consistent scaling unit and fault-isolation. The front-end’s job is small but important. It handles authentication, throttling, and request-routing to the correct stream-shards on the back-end clusters.
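To illustrate the front-end's routing job, here is a simplified sketch, not AWS internals: hash a partition key and look up the owning shard in a cached shard-map. The shard-map contents here are invented; the MD5 hashing of partition keys is the scheme Kinesis documents publicly.

```python
# Toy shard-map lookup: partition key -> 128-bit MD5 -> owning shard range.
import bisect
import hashlib

# Hypothetical cached shard-map: sorted (range_start, shard_id) entries
# covering the 128-bit hash space.
SHARD_MAP = [(0, "shardId-000"), (2**126, "shardId-001"),
             (2**127, "shardId-002"), (3 * 2**126, "shardId-003")]
_STARTS = [start for start, _ in SHARD_MAP]

def route(partition_key: str) -> str:
    """Return the shard whose hash-key range contains this partition key."""
    h = int.from_bytes(hashlib.md5(partition_key.encode()).digest(), "big")
    return SHARD_MAP[bisect.bisect_right(_STARTS, h) - 1][1]

print(route("user-42"))  # prints the owning shard, e.g. shardId-002
```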
The capacity addition was being made to the front-end fleet. Each server in the front-end fleet maintains a cache of information, including membership details and shard ownership for the back-end clusters, called a shard-map. This information is obtained through calls to a microservice vending the membership information, retrieval of configuration information from DynamoDB, and continuous processing of messages from other Kinesis front-end servers. For the latter communication, each front-end server creates operating system threads for each of the other servers in the front-end fleet. Upon any addition of capacity, the servers that are already operating members of the fleet will learn of new servers joining and establish the appropriate threads. It takes up to an hour for any existing front-end fleet member to learn of new participants.
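A minimal sketch of the thread-per-peer pattern just described, with hypothetical names rather than AWS code: each front-end server dedicates one OS thread to every fleet member it learns about, so each capacity addition raises the thread count on every existing server.

```python
# Thread-per-peer membership handling (illustrative only).
import threading

def listen_to_peer(peer: str) -> None:
    ...  # in the real system: process shard-map/membership messages from this peer

class FrontEndServer:
    def __init__(self) -> None:
        self.peer_threads: dict[str, threading.Thread] = {}

    def on_peer_joined(self, peer: str) -> None:
        # One new OS thread per newly discovered fleet member; with many
        # thousands of peers, this is the count that hit the OS limit.
        t = threading.Thread(target=listen_to_peer, args=(peer,), daemon=True)
        self.peer_threads[peer] = t
        t.start()  # raises RuntimeError if the OS refuses another thread
```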

At 5:15 AM PST, the first alarms began firing for errors on putting and getting Kinesis records. Teams engaged and began reviewing logs. While the new capacity was a suspect, there were a number of errors that were unrelated to the new capacity and would likely persist even if the capacity were to be removed. Still, as a precaution, we began removing the new capacity while researching the other errors. The diagnosis work was slowed by the variety of errors observed. We were seeing errors in all aspects of the various calls being made by existing and new members of the front-end fleet, hampering our ability to separate side effects from the root cause. At 7:51 AM PST, we had narrowed the root cause to a couple of candidates and determined that any of the most likely sources of the problem would require a full restart of the front-end fleet, which the Kinesis team knew would be a long and careful process. The resources within a front-end server that are used to populate the shard-map compete with the resources that are used to process incoming requests. So, bringing front-end servers back online too quickly would create contention between these two needs and result in very few resources being available to handle incoming requests, leading to increased errors and request latencies. As a result, these slow front-end servers could be deemed unhealthy and removed from the fleet, which, in turn, would set back the recovery process. All of the candidate solutions involved changing every front-end server’s configuration and restarting it. While the leading candidate (an issue that seemed to be creating memory pressure) looked promising, if we were wrong, we would double the recovery time as we would need to apply a second fix and restart again. To speed restart, in parallel with our investigation, we began adding a configuration to the front-end servers to obtain data directly from the authoritative metadata store rather than from front-end server neighbors during the bootstrap process.
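The bootstrap change described in the last sentence might look roughly like the sketch below; the flag and function names are all hypothetical. The shape of the fix is what matters: a restarting server reads the shard-map once from the authoritative store instead of fanning out to thousands of peers.

```python
# Sketch of a dual-path bootstrap (all names hypothetical).
import os

def load_from_dynamodb() -> dict:
    return {}  # placeholder: read the full shard-map from the authoritative store

def gossip_from_peer(peer: str) -> dict:
    return {}  # placeholder: shard-map updates learned from one fleet neighbor

def build_shard_map(peers: list[str]) -> dict:
    if os.environ.get("BOOTSTRAP_FROM_AUTHORITATIVE_STORE") == "1":
        return load_from_dynamodb()   # restart path: one read, no peer fan-out
    shard_map: dict = {}
    for peer in peers:                # normal path: converge via peers, which
        shard_map.update(gossip_from_peer(peer))  # is slow during a mass restart
    return shard_map
```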

At 9:39 AM PST, we were able to confirm a root cause, and it turned out this wasn’t driven by memory pressure. Rather, the new capacity had caused all of the servers in the fleet to exceed the maximum number of threads allowed by an operating system configuration. As this limit was being exceeded, cache construction was failing to complete and front-end servers were ending up with useless shard-maps that left them unable to route requests to back-end clusters. We didn’t want to increase the operating system limit without further testing, and as we had just completed the removal of the additional capacity that triggered the event, we determined that the thread count would no longer exceed the operating system limit and proceeded with the restart. We began bringing back the front-end servers with the first group of servers taking Kinesis traffic at 10:07 AM PST. The front-end fleet is composed of many thousands of servers, and for the reasons described earlier, we could only add servers at the rate of a few hundred per hour. We continued to slowly add traffic to the front-end fleet with the Kinesis error rate steadily dropping from noon onward. Kinesis fully returned to normal at 10:23 PM PST.

For Kinesis, we have a number of learnings that we will be implementing immediately. In the very short term, we will be moving to larger CPU and memory servers, reducing the total number of servers and, hence, threads required by each server to communicate across the fleet. This will provide significant headroom in thread count used as the total threads each server must maintain is directly proportional to the number of servers in the fleet. Having fewer servers means that each server maintains fewer threads. We are adding fine-grained alarming for thread consumption in the service. We will also finish testing an increase in thread count limits in our operating system configuration, which we believe will give us significantly more threads per server and give us significant additional safety margin there as well. In addition, we are making a number of changes to radically improve the cold-start time for the front-end fleet. We are moving the front-end server cache to a dedicated fleet. We will also move a few large AWS services, like CloudWatch, to a separate, partitioned front-end fleet. In the medium term, we will greatly accelerate the cellularization of the front-end fleet to match what we’ve done with the back-end. Cellularization is an approach we use to isolate the effects of failure within a service, and to keep the components of the service (in this case, the shard-map cache) operating within a previously tested and operated range. This had been under way for the front-end fleet in Kinesis, but unfortunately the work is significant and had not yet been completed. In addition to allowing us to operate the front-end in a consistent and well-tested range of total threads consumed, cellularization will provide better protection against any future unknown scaling limit.
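Cellularization, as described above, amounts to capping the size of the peer group each server belongs to. Here is a toy sketch under assumed mechanics (the cell size and assignment scheme are invented): once servers only peer within their own cell, per-server thread count is bounded by the tested cell size no matter how large the overall fleet grows.

```python
# Toy cell assignment: bound peers per server at CELL_SIZE - 1.
CELL_SIZE = 200  # hypothetical, previously tested operating range

def assign_cells(servers: list[str]) -> list[list[str]]:
    return [servers[i:i + CELL_SIZE] for i in range(0, len(servers), CELL_SIZE)]

fleet = [f"fe-{i}" for i in range(1_000)]
cells = assign_cells(fleet)
print(len(cells), "cells; max peers per server:",
      max(len(c) for c in cells) - 1)  # stays 199 even as the fleet grows
```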

There were a number of services that use Kinesis that were impacted as well. Amazon Cognito uses Kinesis Data Streams to collect and analyze API access patterns. While this information is extremely useful for operating the Cognito service, this information streaming is designed to be best effort. Data is buffered locally, allowing the service to cope with latency or short periods of unavailability of the Kinesis Data Stream service. Unfortunately, the prolonged issue with Kinesis Data Streams triggered a latent bug in this buffering code that caused the Cognito webservers to begin to block on the backlogged Kinesis Data Stream buffers. As a result, Cognito customers experienced elevated API failures and increased latencies for Cognito User Pools and Identity Pools, which prevented external users from authenticating or obtaining temporary AWS credentials. In the early stages of the event, the Cognito team worked to mitigate the impact of the Kinesis errors by adding additional capacity and thereby increasing their capacity to buffer calls to Kinesis. While this initially reduced impact, by 7:01 AM PST error rates increased significantly. The team was working in parallel on a change to Cognito to reduce the dependency on Kinesis. At 10:15 AM PST, deployment of this change began and error rates began falling. By 12:15 PM PST, error rates were significantly reduced, and by 2:18 PM PST Cognito was operating normally. To prevent a recurrence of this issue, we have modified the Cognito webservers so that they can sustain Kinesis API errors without exhausting their buffers, which is what resulted in these user errors.
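The latent bug and its fix point at a familiar pattern for best-effort telemetry: the buffer in front of a degraded downstream must be bounded and must never block the request path. A minimal sketch under assumed semantics (the queue size and names are hypothetical); the same principle applies to the Lambda metric agents discussed below.

```python
# Non-blocking, bounded buffering for best-effort analytics events.
import queue

_buffer: "queue.Queue[bytes]" = queue.Queue(maxsize=10_000)  # hypothetical cap

def record_api_event(event: bytes) -> None:
    """Called on the webserver request path; must never block."""
    try:
        _buffer.put_nowait(event)
    except queue.Full:
        pass  # downstream is backlogged: drop the sample rather than
              # stall authentication requests on best-effort telemetry
```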

CloudWatch uses Kinesis Data Streams for the processing of metric and log data. Starting at 5:15 AM PST, CloudWatch experienced increased error rates and latencies for the PutMetricData and PutLogEvents APIs, and alarms transitioned to the INSUFFICIENT_DATA state. While some CloudWatch metrics continued to be processed throughout the event, the increased error rates and latencies prevented the vast majority of metrics from being successfully processed. At 5:47 PM PST, CloudWatch began to see early signs of recovery as Kinesis Data Stream’s availability improved, and by 10:31 PM PST, CloudWatch metrics and alarms fully recovered. Delayed metrics and log data backfilling completed over the subsequent hours. While CloudWatch was experiencing these increased errors, both internal and external clients were unable to persist all metric data to the CloudWatch service. These errors will manifest as gaps in data in CloudWatch metrics. While CloudWatch currently relies on Kinesis for its complete metrics and logging capabilities, the CloudWatch team is making a change to persist 3 hours of metric data in the CloudWatch local metrics data store. This change will allow CloudWatch users, and services requiring CloudWatch metrics (including AutoScaling), to access these recent metrics directly from the CloudWatch local metrics data store. This change has been completed in the US-EAST-1 Region and will be deployed globally in the coming weeks.
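In miniature, a local recent-metrics window might look like the sketch below; this is illustrative only, with the three-hour figure taken from AWS's description. Datapoints older than the window are evicted, so readers such as AutoScaling can be served from local memory even when the Kinesis-backed pipeline is impaired.

```python
# Toy 3-hour in-memory metrics window with eviction on write.
import time
from collections import deque

WINDOW_SECONDS = 3 * 60 * 60
_datapoints = deque()  # (timestamp, metric_name, value), oldest first

def put(metric: str, value: float, now=None) -> None:
    now = time.time() if now is None else now
    _datapoints.append((now, metric, value))
    while _datapoints and _datapoints[0][0] < now - WINDOW_SECONDS:
        _datapoints.popleft()  # evict datapoints older than the window

def query(metric: str) -> list:
    return [(ts, v) for ts, m, v in _datapoints if m == metric]
```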

Two services were also impacted as a result of the issues with CloudWatch metrics. First, reactive AutoScaling policies that rely on CloudWatch metrics experienced delays until CloudWatch metrics began to recover at 5:47 PM PST. And second, Lambda saw impact. Lambda function invocations currently require publishing metric data to CloudWatch as part of invocation. Lambda metric agents are designed to buffer metric data locally for a period of time if CloudWatch is unavailable. Starting at 6:15 AM PST, this buffering of metric data grew to the point that it caused memory contention on the underlying service hosts used for Lambda function invocations, resulting in increased error rates. At 10:36 AM PST, engineers took action to mitigate the memory contention, which resolved the increased error rates for function invocations. 

CloudWatch Events and EventBridge experienced increased API errors and delays in event processing starting at 5:15 AM PST. As Kinesis availability improved, EventBridge began to deliver new events and slowly process the backlog of older events. Elastic Container Service (ECS) and Elastic Kubernetes Service (EKS) both make use of EventBridge to drive internal workflows used to manage customer clusters and tasks. This impacted provisioning of new clusters, delayed scaling of existing clusters, and impacted task de-provisioning. By 4:15 PM PST, the majority of these issues had been resolved.

Outside of the service issues, we experienced some delays in communicating service status to customers during the early part of this event. We have two ways of communicating during operational events – the Service Health Dashboard, which is our public dashboard to alert all customers of broad operational issues, and the Personal Health Dashboard, which we use to communicate directly with impacted customers. With an event such as this one, we typically post to the Service Health Dashboard. During the early part of this event, we were unable to update the Service Health Dashboard because the tool we use to post these updates itself uses Cognito, which was impacted by this event. We have a back-up means of updating the Service Health Dashboard that has minimal service dependencies. While this worked as expected, we encountered several delays during the earlier part of the event in posting to the Service Health Dashboard with this tool, as it is a more manual and less familiar tool for our support operators. To ensure customers were getting timely updates, the support team used the Personal Health Dashboard to notify customers who were impacted by the service issues. We also posted a global banner summary on the Service Health Dashboard to ensure customers had broad visibility into the event. During the remainder of the event, we continued using a combination of the Service Health Dashboard, both with global banner summaries and service-specific details, while also continuing to update impacted customers via the Personal Health Dashboard. Going forward, we have changed our support training to ensure that our support engineers are regularly trained on the backup tool for posting to the Service Health Dashboard.

Finally, we want to apologize for the impact this event caused for our customers. While we are proud of our long track record of availability with Amazon Kinesis, we know how critical this service, and the other AWS services that were impacted, are to our customers, their applications and end users, and their businesses. We will do everything we can to learn from this event and use it to improve our availability even further.
Source: 云头条
