Case Study: Google Search Engine

Index
Introduction
Part of the Design Architecture
Scalability, Availability and Security
Google Distributed File System
Communication Protocols
Introduction
Google is recognized as the largest search engine company in the world, with a large number of users around the globe. It runs more than a million servers in data centers around the world, integrates global information, processes hundreds of millions of search requests every day, and automatically "browses" web pages and scores them one by one. Users only need to enter keywords on the search homepage; the Google search engine finds the highest-ranked relevant pages among those it has visited and displays them in less than a second, so that everyone can get the desired information.
Google has grown into a company with a dominant share of the Internet search market thanks to the effectiveness of the ranking algorithms that underlie its search engine. The underlying search system handles more than 88 billion searches per month. During this time, the main search engine has never experienced an outage, and users can expect query results in approximately 0.2 seconds.
Part of the Design Architecture
The Google search engine is implemented in C or C++, which is efficient, and it can run on Solaris or Linux. This section gives a high-level overview of how the entire system is designed.
At Google, web crawling is performed by several distributed crawlers. A URL server sends lists of URLs to the crawlers, and the crawlers send every page they fetch to the store server. The store server compresses the pages and stores them in a repository. Every web page has an associated ID number, called a docID, which is assigned whenever a new URL is parsed out of a web page.
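
The data flow of this section can be made concrete with a minimal sketch under simplifying assumptions: each URL receives an integer docID, each document is turned into (word, position) hits grouped by docID (a forward index), and the hits are then regrouped by word (an inverted index). The function names, the sample pages, and the use of plain words in place of wordIDs are illustrative assumptions, not Google's actual implementation.

```python
from collections import defaultdict


def assign_doc_ids(urls):
    """Give each distinct URL a small integer docID, in first-seen order."""
    doc_ids = {}
    for url in urls:
        doc_ids.setdefault(url, len(doc_ids))
    return doc_ids


def build_forward_index(pages, doc_ids):
    """Map docID -> list of (word, position) hits for that document."""
    forward = {}
    for url, text in pages.items():
        words = text.lower().split()
        forward[doc_ids[url]] = [(word, pos) for pos, word in enumerate(words)]
    return forward


def invert(forward):
    """Regroup hits by word: word -> list of (docID, position), in docID order."""
    inverted = defaultdict(list)
    for doc_id in sorted(forward):
        for word, pos in forward[doc_id]:
            inverted[word].append((doc_id, pos))
    return dict(inverted)


# Hypothetical sample data, for illustration only.
pages = {
    "http://example.com/a": "distributed search engine",
    "http://example.com/b": "search index",
}
doc_ids = assign_doc_ids(pages)
inverted = invert(build_forward_index(pages, doc_ids))
print(inverted["search"])  # [(0, 1), (1, 0)]
```

Looking up a query word in the inverted structure returns every document and position where it occurs, which is why a search can be answered without scanning the documents themselves.
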
The indexer performs several functions: it reads the repository, extracts the documents and parses them. Each document is converted into a set of word occurrences called hits. A hit records the word, its position in the document, an approximation of its font size, and its capitalization. The indexer distributes these hits into a series of "barrels", creating a partially sorted forward index. The indexer has another important function: it parses out all the links in every web page and stores important information about them in an anchors file. This file contains enough information to determine where each link points from and to, as well as the text of the link.
The URL resolver reads the anchors file and converts relative URLs to absolute URLs, and then to docIDs. It puts the anchor text into the forward index, associated with the docID that the anchor points to. It also generates a database of links, which are pairs of docIDs. The links database is used to compute the PageRank of all documents.
The sorter takes the barrels, which are sorted by docID, and re-sorts them by wordID to generate the inverted index. This operation requires some temporary space. The sorter also produces a list of wordIDs and offsets into the inverted index. A program called DumpLexicon takes this list together with the lexicon produced by the indexer and generates a new lexicon for the searcher. The searcher is run by a web server and answers queries using the lexicon built by DumpLexicon together with the inverted index and the PageRanks.
Scalability, Availability and Security
From a distributed systems point of view, the Google search engine is a fascinating case study: it copes with extremely demanding requirements, especially in terms of scalability, reliability, availability and security.
Scalability refers to the effective and efficient operation of a distributed system at different scales (from small business intranets to the Internet). A scalable system remains effective as the number of resources and users increases.
There are three challenges in achieving scalability.
(1) Control the cost of physical resources. When the demand for resources increases, it should be possible to expand the system at reasonable cost to meet that demand. For example, if a search engine's servers cannot handle all incoming access requests, the number of servers must be increased to avoid a performance bottleneck. In this regard, Google considers scalability in three dimensions: being able to process more data (x), being able to process more queries (y), and seeking better results (z). Judging from the figures in the Introduction, Google's search engine does very well in these respects. To scale, however, the other functions, including indexing, sorting and searching, also require highly distributed solutions.
(2) Control performance loss. When a distributed system serves a large number of users or resources, it must manage very large data sets, and managing them places heavy demands on performance. Here, hierarchical algorithms scale better than linear ones, but some performance loss cannot be avoided entirely. Because Google's search engine is highly interactive, it must keep latency as low as possible: the better the performance, the more reliably a web search completes within 0.2 seconds. Low latency in turn supports Google's revenue from advertising sales. Annual advertising revenue reaches $32 billion, which suggests that Google outperforms other search engines in how it uses the underlying network, storage and computing resources.
(3) Prevent software resources from running out. The search engine relies on 32-bit Internet addresses. If too many addresses are needed, the supply of Internet addresses will be exhausted.
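
As a quick worked illustration of why a 32-bit address space can run out, the following sketch counts the addresses available with 32-bit and 128-bit addressing using Python's standard `ipaddress` module. The variable names are illustrative; the figures follow directly from the address widths.

```python
import ipaddress

# The network 0.0.0.0/0 covers the entire 32-bit (IPv4) address space,
# and ::/0 covers the entire 128-bit (IPv6) address space.
ipv4_total = ipaddress.ip_network("0.0.0.0/0").num_addresses
ipv6_total = ipaddress.ip_network("::/0").num_addresses

print(ipv4_total)                         # 4294967296, i.e. 2**32
print(ipv6_total == 2**128)               # True
print(ipv6_total // ipv4_total == 2**96)  # True
```

Roughly 4.3 billion addresses is fewer than the number of devices that may want one, whereas the 128-bit space is larger by a factor of 2**96.
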
Google currently has no good solution to this problem, because adopting 128-bit Internet addresses would undoubtedly require many software components to be modified.
The availability of a distributed system depends mainly on the extent to which new resource-sharing services can be added and used by multiple clients. Because Google's search engine must meet very demanding requirements in the shortest possible time when crawling, indexing and sorting the web, availability is also a key requirement. To meet these needs, Google has developed its own physical architecture. The middle layer defines an overall distributed system infrastructure, which not only allows new applications and services to reuse the underlying system services, but also helps preserve the integrity of Google's huge code base.
Distributed systems hold many information resources of high value to users, so protecting these resources is very important. The security of information resources has three parts: confidentiality (preventing disclosure to unauthorized persons), integrity (preventing modification or damage) and availability (preventing interference with the means of accessing the resources). In examining the security of Google's search engine, we found that Google has not been very successful in this respect: it has reportedly even admitted publicly to disclosing user information for its own benefit, so users of Google's software cannot assume that the security of their information is guaranteed.
Google Distributed File System
The Google File System (GFS) was implemented to meet Google's rapidly growing needs for processing and managing big data. On top of this demand, GFS faces the challenge of managing distribution and the increased risk of hardware failures. Keeping data safe while scaling to thousands of computers and managing multiple terabytes of data can therefore be considered the main challenges faced by GFS.
Google therefore made the important decision not to use any existing distributed file system and instead developed a new one. The biggest difference from other file systems is that GFS is optimized for large files (gigabytes to multiple terabytes), most of which are treated as immutable: they are written once and then read many times.
A GFS cluster consists of a single master and multiple chunk servers, and is accessed by multiple clients. Each of these machines is typically a commodity Linux machine running a user-level server process. A chunk server and a client can run on the same machine as long as the machine's resources permit. Files are divided into fixed-size chunks, each identified by a globally unique 64-bit chunk handle. Chunk servers store chunks on local disks as Linux files, and read or write chunk data specified by a chunk handle and a byte range. For reliability, each chunk is replicated on multiple chunk servers, three by default. The master maintains the metadata of the entire file system. It periodically communicates with each chunk server through HeartBeat messages to give it instructions and collect its state. All data-bearing communication goes directly between clients and chunk servers; because GFS does not provide the POSIX API, it does not need to hook into the Linux vnode layer.
Neither the client nor the chunk server caches file data. Client caches would offer little benefit, because most working sets are too large to be cached, and doing without them simplifies the client and the overall system by eliminating cache coherence issues. Chunk servers do not need to cache file data because chunks are stored as local files, so the Linux buffer cache already keeps frequently accessed data in memory, which benefits GFS performance.
Communication Protocols
The choice and design of communication protocols is very important to the overall design of a system. Google adopts a simple, minimal and efficient remote procedure call (RPC) protocol.
Remote procedure call communication requires a serialization component to convert the data of a procedure call. Google therefore developed protocol buffers, a simplified, high-performance serialization component. Google also uses a separate protocol for publish-subscribe communication.
Protocol buffers focus on describing data and then serializing it. They aim to provide a simple, efficient and extensible way to specify and serialize data that is independent of language and platform. The serialized data can be stored, transmitted, or used in any scenario that requires a serialized data format. There are three reasons why Google chose to use protocol buffers. The downside of Google's design is that it is not as expressive as XML. Since i.
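
To illustrate what compact serialization means in practice, here is a hand-written sketch of how the protocol buffers wire format encodes one integer field: a key built from the field number and wire type, followed by the value as a base-128 varint. This is not the protocol buffers library API; the helper names `encode_varint`, `decode_varint` and `encode_int_field` are hypothetical and exist only for this example.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint (7 bits per byte)."""
    if value < 0:
        raise ValueError("this sketch handles non-negative integers only")
    out = bytearray()
    while True:
        low_bits = value & 0x7F
        value >>= 7
        if value:
            out.append(low_bits | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low_bits)         # high bit clear: last byte
            return bytes(out)


def decode_varint(data: bytes, offset: int = 0):
    """Decode a varint starting at offset; return (value, next_offset)."""
    result = 0
    shift = 0
    while True:
        byte = data[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, offset
        shift += 7


def encode_int_field(field_number: int, value: int) -> bytes:
    """Encode one integer field: key = (field_number << 3) | wire type 0."""
    key = (field_number << 3) | 0  # wire type 0 means "varint"
    return encode_varint(key) + encode_varint(value)


message = encode_int_field(1, 150)
print(message.hex())              # 089601
print(decode_varint(message, 1))  # (150, 3)
```

The field with number 1 and value 150 takes three bytes here, whereas a textual encoding such as XML would spend many more bytes on tag names alone. That compactness comes at the price of the reduced expressiveness noted above.
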