The Anatomy of a Large-Scale Hypertextual Web Search Engine, by Sergey Brin and Lawrence Page
I find it rather hard to believe that I haven't posted this link before, much earlier.
Read it: it's the paper where Mr Page and Mr Rank, er, Brin (gag by cam), present the foundations behind Google.
Read it; beyond whatever interest the topic holds for you, it's a very good example of an engineering paper, something quite rare in the academic world.
This part is incredible:
=========================================
It turns out that running a crawler which connects to more than half a million servers, and generates tens of millions of log entries generates a fair amount of email and phone calls. Because of the vast number of people coming on line, there are always those who do not know what a crawler is, because this is the first one they have seen. Almost daily, we receive an email something like, «Wow, you looked at a lot of pages from my web site. How did you like it?»

There are also some people who do not know about the robots exclusion protocol, and think their page should be protected from indexing by a statement like, «This page is copyrighted and should not be indexed», which needless to say is difficult for web crawlers to understand. Also, because of the huge amount of data involved, unexpected things will happen. For example, our system tried to crawl an online game. This resulted in lots of garbage messages in the middle of their game! It turns out this was an easy problem to fix. But this problem had not come up until we had downloaded tens of millions of pages. Because of the immense variation in web pages and servers, it is virtually impossible to test a crawler without running it on large part of the Internet.

Invariably, there are hundreds of obscure problems which may only occur on one page out of the whole web and cause the crawler to crash, or worse, cause unpredictable or incorrect behavior. Systems which access large parts of the Internet need to be designed to be very robust and carefully tested. Since large complex systems such as crawlers will invariably cause problems, there needs to be significant resources devoted to reading the email and solving these problems as they come up.
=========================================
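Incidentally, the robots exclusion protocol they mention is still the standard way to keep a crawler off your pages: a plain robots.txt file at the root of the site, not an English sentence embedded in the page. Here is a minimal sketch in Python of how a polite crawler might check it before fetching a URL, using the standard library's urllib.robotparser; the site URL and user-agent string are made up purely for illustration.

    import urllib.robotparser

    # Hypothetical crawler identity and target site, for illustration only.
    USER_AGENT = "ExampleCrawler/0.1"
    SITE = "http://example.com"

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(SITE + "/robots.txt")
    rp.read()  # fetch and parse the site's robots.txt

    # Ask the parsed rules whether each URL may be fetched by this user agent.
    for path in ["/", "/private/page.html"]:
        url = SITE + path
        if rp.can_fetch(USER_AGENT, url):
            print("allowed:   ", url)
        else:
            print("disallowed:", url)

The point of the anecdote stands, though: no amount of protocol support saves you from the sheer variety of what's out there once you crawl tens of millions of pages.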
From my point of view, it's an excellent article, an excellent read.
It's really good, yes. The guys, or at least that's how I read it, still had no intention of building those cheap clusters they run on nowadays; they were still taking a centralized approach to the whole thing.
It's quite impressive how many concepts they toss out and how obvious those concepts seem to them.
It's marvelous. An example of humility.