Index

A Survey on Improving Service Availability of Cloud Systems
    Review
    Personal Insights
A Survey on Drizzle: Fast and Adaptable Stream Processing at Scale
    Description
    Review
    Personal Insights
A Survey on Soteria: Automated IoT Safety and Security Analysis
    Description
    Review
    Personal Insights
Conclusion
References

A Survey on Improving Service Availability of Cloud Systems

Cloud computing is a shared networking practice that uses a network of remote servers hosted on the Internet. As a service, it has grown in popularity due to its simplicity and low cost, since there is no need to purchase the IT infrastructure, hardware, or licenses needed to operate a physical computer network [9]. Instead, cloud systems use large numbers of physical disk drives as their primary storage component.

Many technology companies such as Amazon, Google, and Microsoft have made cloud computing the foundation of their applications' online services. These applications are used by millions of people around the clock, so service availability is of the utmost importance. Despite high availability targets, however, these services are still prone to failures, resulting in customer dissatisfaction and lost revenue [9]. These failures can be the result of a number of hardware problems, the most important of which is disk failure. Large-scale cloud systems typically use several hundred million disk drives, of which 20% to 57% have experienced at least one failure over 4 to 6 years [9]. This percentage shows the significance of these errors and the importance of predicting disk failures in order to improve the availability of cloud services. To achieve this, many solutions have been proposed that use historical disk-level sensor data (SMART data) to predict disk failures and take preventative actions, such as replacing the faulty disk. These approaches similarly focus on predicting complete disk failure [9]. Unfortunately, before failure-prone disks are proactively replaced, numerous disk errors can occur that negatively impact higher-level services. These errors, called "gray failures," typically go undetected while degrading the quality of cloud software. This paper introduces CDEF (Cloud Disk Error Forecasting), a proactive disk error prediction approach that uses both SMART data and system-level signals to better detect these gray failures [9]. The approach was evaluated on data from Microsoft production cloud systems and was shown to improve over baseline methods, reducing Microsoft Azure virtual machine downtime by sixty-three thousand minutes per month.

Review

The authors faced two major challenges when designing the CDEF prediction model for a large-scale cloud computing service. Running an industrial cloud system like Microsoft Azure requires many disk drives, which brings the first challenge: only about three out of ten thousand disks become faulty on any given day [9]. With such a low percentage, it would be easy for a prediction model to simply classify all disks as healthy, as this would minimize its apparent error rate. Other approaches have used rebalancing techniques to address this issue and produce better results, but in doing so they introduced false positives, which ultimately reduce the usefulness of a prediction model [9].
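To make the imbalance problem concrete, the short sketch below uses the 3-in-10,000 figure cited above (the numbers are otherwise hypothetical, and this is not the authors' code) to show why a model that labels every disk healthy looks nearly perfect by plain accuracy while catching no faulty disks at all.

```python
# Minimal sketch: why plain accuracy is misleading when only ~3 in 10,000
# disks become faulty on a given day (hypothetical numbers, not CDEF code).
n_disks = 10_000
n_faulty = 3  # figure cited in the survey

# A trivial model that predicts "healthy" for every disk:
true_positives = 0
false_negatives = n_faulty
accuracy = (n_disks - n_faulty) / n_disks

print(f"accuracy of 'always healthy': {accuracy:.4%}")   # ~99.97%
print(f"faulty disks caught: {true_positives}")          # 0

# This is why evaluating raw accuracy is not enough, and why naive rebalancing
# that inflates false positives is costly: every false positive triggers an
# unnecessary disk replacement or virtual machine migration.
```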
Another challenge comes from using historical data to make predictions. Some of the information used (particularly the system-level signals) is both time- and environment-sensitive, meaning that the data for a specific disk keeps changing throughout its life within a cloud environment. A prediction model can look accurate on a held-out test dataset yet perform much worse when predicting future data in practice [9]. The authors overcome these difficulties by introducing two new components: a failure-proneness ranking designed for disk drives and a feature selection method that determines which SMART and system-level features provide the greatest distinction between a healthy and an error-prone disk [9]. With feature selection, CDEF is able to filter through a multitude of SMART and system-level disk attributes and identify which ones are most useful for separating healthy and faulty disks [9]. Providing a filtered set of historical data relevant to accurate disk error detection allows the prediction model to focus on the important characteristics of a disk drive and helps ensure that gray failures do not go unnoticed. Rather than taking the simple approach of existing systems and classifying a disk as faulty or not, CDEF instead ranks disks by how failure-prone they are [9]. The previously mentioned problem of imbalanced datasets is significantly mitigated because this ranking perspective does not depend on the class balance of the data: since most disks are healthy, ranking every disk against the others is a more effective way to identify which disks are safest to keep in service. The real novelty of this work lies in these two solutions and in how well they complement each other. Not only does each represent an improvement over other approaches, but the combination of the feature selection method and the ranking model yields more accurate and cost-effective results than existing methods. Although the cross-validation used in other methods reports better numbers than the CDEF evaluation, the CDEF results better reflect how the prediction model actually performs in deployment, because cross-validation does not take into account the time sensitivity of disk data. Furthermore, the CDEF approach has already been applied to the Microsoft Azure cloud service [9] and has proven effective in selecting healthy disks for the service. Considering the many issues that affect the functionality of cloud systems, the authors' work is significant in highlighting existing problems and implementing a solution to one of the most serious ones.

Personal Insights

The authors of the CDEF approach do achieve the goal stated at the beginning of the paper: to develop online prediction software capable of distinguishing between healthy and faulty disk drives in a cloud system in order to improve its availability. Building this software required machine learning techniques such as the FastTree algorithm [5] used in CDEF's ranking component. The algorithm is particularly interesting because it is available in Microsoft's Python library, and this prediction model was tested using a dataset provided by Microsoft Azure systems [6].
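As a rough illustration of the two ideas just described, the sketch below trains on a time-ordered split (mirroring the paper's argument against cross-validation on time-sensitive data) and then ranks disks by predicted error-proneness rather than emitting a hard faulty/healthy label. The data is synthetic, and scikit-learn's GradientBoostingClassifier is only a stand-in for the FastTree algorithm cited in the survey; none of this is the authors' implementation.

```python
# Minimal sketch of CDEF's two ideas as described in the survey, not the
# authors' implementation. Assumptions: synthetic data, a gradient-boosted
# tree as a stand-in for FastTree, and a simple time-ordered split.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)

# Synthetic "disk-days": columns stand in for SMART / system-level features,
# label = disk becomes error-prone soon.
n, d = 20_000, 12
X = rng.normal(size=(n, d))
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=2.0, size=n) > 4.5).astype(int)

# 1) Time-ordered split: train on the past, evaluate on the "future".
split = int(0.8 * n)          # rows are assumed to be ordered by time
X_train, y_train = X[:split], y[:split]
X_test, y_test = X[split:], y[split:]

model = GradientBoostingClassifier().fit(X_train, y_train)

# 2) Rank disks by predicted error-proneness instead of a hard label;
#    the healthiest-ranked disks would be preferred when allocating new VMs.
scores = model.predict_proba(X_test)[:, 1]
ranking = np.argsort(scores)            # lowest predicted risk first
print("10 healthiest test disks:", ranking[:10])
print("10 most error-prone test disks:", ranking[-10:])
```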
This reliance on Microsoft tooling presents a problem for a cloud system like Apple's iCloud, which cannot readily adopt Microsoft's proprietary libraries, since most of Apple's services are built on the Swift programming language developed by Apple [2]. If the CDEF approach becomes more popular, Apple's cloud services could lag behind Microsoft Azure, Google Cloud, and Amazon AWS in service availability. The authors mention in the conclusion that there are many ways to extend this work; one direction worth considering would be to implement this approach for Apple's cloud computing service.

A Survey on Drizzle: Fast and Adaptable Stream Processing at Scale

Alexander Monaco, Florida International University (FIU), Miami, Florida

Description

Stream processing is a type of "Big Data" technology used to process data as it "flows" from where it is produced to where it is consumed. It is used for data related to stock trading, traffic monitoring, smart devices, or any kind of information that must be detected and queried within a short amount of time. Because data arrives very quickly and in varying quantities, stream processing systems must be able to adapt to these changes while maintaining high performance. In particular, they must simultaneously sustain high throughput (the rate at which records are processed) and low latency (the time it takes a record to move through the system) [7]. Existing approaches mainly treat these requirements as mutually exclusive, resulting either in highly adaptable but high-latency systems, or in systems with low latency during normal operation but expensive adaptation. The paper introduces Drizzle [7], a stream processing system built on the observation that the two existing designs have complementary features that can be combined to improve adaptability and reduce latency at the same time.

Review

The authors use their article not only to introduce Drizzle but also to analyze the two main families of existing solutions: continuous operator streaming (e.g., Naiad and Apache Flink) and bulk synchronous processing (e.g., Spark Streaming and FlumeJava) [7]. The paper discusses their strengths and weaknesses and the features borrowed from each to create a stream processing approach that is both fast and adaptable. The first approach analyzed, bulk synchronous processing, is a popular computing framework in which parallel nodes perform local computation and then synchronize at a barrier. In stream processing, this method is adapted by grouping incoming records into micro-batches whose processing time is on the order of seconds. As in the basic bulk synchronous model, the tasks in a micro-batch collect data, process it, and then end at a barrier that gathers the results from all tasks. This approach is advantageous because the barriers allow the streaming system to take "snapshots," that is, to record the physical or logical state of each step, which results in high adaptability and fault tolerance [7]. However, while it is adaptable and safe, the duration of these micro-batches cannot be made small enough to achieve low latency, because the tasks would end up spending more time communicating results with the centralized driver than actually processing data.
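The toy sketch below illustrates the micro-batch/barrier pattern just described: records are grouped into batches, the partitions of each batch are processed in parallel, and a barrier (waiting on all partitions) marks the point where a consistent snapshot can be taken. It is not Drizzle's or Spark Streaming's code; the function names and the fake source are illustrative only.

```python
# Toy sketch of the micro-batch (BSP-style) pattern described above.
from concurrent.futures import ThreadPoolExecutor

def fake_source(n):
    """Pretend stream source: yields n integer records."""
    yield from range(n)

def process_partition(records):
    """Local computation on one partition (here: just a sum)."""
    return sum(records)

def run_micro_batches(source, batch_size=1000, parallelism=4):
    snapshots = []                          # "snapshots" taken at each barrier
    batch, batch_id = [], 0
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        for record in source:
            batch.append(record)
            if len(batch) == batch_size:
                # split the micro-batch into partitions and process in parallel
                parts = [batch[i::parallelism] for i in range(parallelism)]
                futures = [pool.submit(process_partition, p) for p in parts]
                results = [f.result() for f in futures]   # <-- the barrier
                # at the barrier the system can record a consistent snapshot
                snapshots.append((batch_id, sum(results)))
                batch, batch_id = [], batch_id + 1
    return snapshots

print(run_micro_batches(fake_source(5000)))
# Frequent barriers mean frequent coordination with a central scheduler,
# which is why pure bulk synchronous designs struggle to reach very low latency.
```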
The continuous operator streaming approach eliminates scheduling and communication with a central driver, implementing a barrier only when necessary. When data enters the system, it is processed by long-running operator tasks. Unlike bulk synchronous processing, continuous operator streaming relies on checkpoints rather than barrier snapshots to recover from failures [7]. Overall, this approach prioritizes flexibility and speed over safety and recovery cost: if a node in the system fails, all nodes must roll back to a checkpoint and replay the data from that point. My fascination with Drizzle lies in its novelty and in how it combines the features that make both approaches effective: the bulk synchronous model is used for task scheduling and fault tolerance, while high throughput and low latency are achieved with continuous-operator-style execution.

Personal Insights

Of the two approaches combined in Drizzle, the one that required the most rework was bulk synchronous processing. Bulk synchronous processing uses barriers to simplify fault tolerance and increase adaptability; however, when trying to reduce latency, frequent barriers force tasks to spend time coordinating with a centralized driver instead of processing data, which creates significant overhead. Drizzle therefore makes deliberate design decisions to reduce the cost of these barriers [7]. Another work, titled "Breaking the MapReduce Stage Barrier," also addresses how barriers reduce performance and introduces techniques and algorithms that operate without barriers to maximize performance [8]. The authors plan to explore additional techniques to improve Drizzle's performance; a good starting point might be to find a way to implement barrier-free functionality while maintaining Drizzle's level of adaptability and fault tolerance.

A Survey on Soteria: Automated IoT Safety and Security Analysis

Alexander Monaco, Florida International University (FIU), Miami, Florida

Description

The Internet of Things (IoT) is a concept that has become more important to individuals as the technologies classified under it have become more advanced. IoT broadly refers to devices that are digitally connected to one another, such as smartphones, computers, smart cars, smart TVs, and so on. Unfortunately, the increased convenience of connected devices comes with many security issues, even though many IoT technologies have advanced greatly since the birth of IoT. Many technology companies have established guidelines describing how to regulate security within devices [3], but there are few tools and algorithms that evaluate IoT safety and security. This paper introduces Soteria, a static analysis system for validating the safety and security of IoT applications and environments [3]. To validate an IoT application or environment, its source code is first translated into an intermediate representation (IR), a data structure used by a compiler to represent the source code. From the IR, Soteria builds a model of the application's lifecycle, entry points, event handler methods, and call graphs. The IR is then used to extract a state model containing the states and transitions within the IoT application, which Soteria then checks against safety and security properties.
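The sketch below gives a feel for the kind of state-model checking described above. The smart-lock app, its states and transitions, and the safety property are all hypothetical, and the hand-rolled reachability search is only a stand-in: Soteria itself extracts the model from the IR of real application code and uses formal model checking rather than this toy search.

```python
# Toy sketch of checking a safety property over an IoT app's state model.
from collections import deque

# Hypothetical state model of an imaginary smart-lock app: (mode, door) pairs.
transitions = {
    ("home", "locked"):   [("home", "unlocked")],   # user unlocks the door
    ("home", "unlocked"): [("away", "unlocked"),    # bug: leaving without locking
                           ("home", "locked")],
    ("away", "unlocked"): [("away", "locked")],
    ("away", "locked"):   [("home", "locked")],
}

def violates_property(state):
    """Safety property: the door must never be unlocked while the user is away."""
    mode, door = state
    return mode == "away" and door == "unlocked"

def check(initial):
    """Breadth-first search over the state model for a reachable violation."""
    seen, queue = {initial}, deque([(initial, [initial])])
    while queue:
        state, path = queue.popleft()
        if violates_property(state):
            return path                      # counterexample trace
        for nxt in transitions.get(state, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [nxt]))
    return None

trace = check(("home", "locked"))
print("violation trace:", trace)   # shows how the unsafe state can be reached
```

A real analysis would of course operate on the state model extracted from the application's intermediate representation rather than on a hand-written dictionary like this one.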