
What's the cost and architecture of running such a site and crawling all those pages?


A single bare-metal server!


Oh wow. Do you need a proxy or something to be able to crawl so many pages? Can I email you with some technical questions? Thanks.


No proxy yet, but I am considering one, as many sites are redirecting my crawler based on its IP, which is causing indexing issues.

The hardest part BY FAR is the crawler: initially I was using Apache Nutch, but it got slower and slower as the index grew, so I replaced it with my own crawler that I wrote in PHP (comfortable for me) and made it run in parallel as multiple worker processes using Supervisor.

The second hardest part was the amount of security I had to build in to prevent bots running spam searches and hogging my infra.

I'll try to write a blog soon and post it here.


Do you have multiple IPs? I am trying to build something that needs just the published-at and updated-at date fields for thousands of links, and I am afraid my IP will get blocked quickly.


Just one IP for now. You are right to worry about being blocked from crawling, though; it has already happened to me on a few sites. The key things that help mitigate this are:

1. Always identify your crawler via a consistent user-agent string that explains it's a web search crawler and not a generic web browser.

2. Always obey the directives in robots.txt.

3. Make sure your crawler is not too aggressive (low frequency of requests).

(updated for formatting)
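
The three points above could be sketched roughly as follows in Python. This is only an illustration, not the crawler described in this thread (which is written in PHP): the user-agent string, the `PolitenessPolicy` class, its method names, and the 5-second default delay are all made up for the example. It uses the standard library's `urllib.robotparser`, and it also honours a `Crawl-delay` line if the site publishes one, which is an assumption beyond what was said above.

```python
import time
from urllib import robotparser
from urllib.parse import urlparse

# Hypothetical user-agent: one consistent string that says this is a crawler.
USER_AGENT = "ExampleSearchBot/0.1 (+https://example.com/bot-info)"


class PolitenessPolicy:
    """Tracks robots.txt rules and a minimum gap between requests per host."""

    def __init__(self, min_delay_seconds=5.0, clock=time.monotonic):
        self.min_delay_seconds = min_delay_seconds  # assumed default, tune per site
        self.clock = clock
        self.robots = {}        # host -> parsed robots.txt
        self.last_request = {}  # host -> clock value of the last request

    def load_robots(self, host, robots_txt):
        """Parse robots.txt text that was already fetched for this host."""
        parser = robotparser.RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self.robots[host] = parser

    def allowed(self, url):
        """True only if robots.txt for the host is loaded and permits the URL."""
        parser = self.robots.get(urlparse(url).netloc)
        if parser is None:
            return False  # be conservative until robots.txt has been read
        return parser.can_fetch(USER_AGENT, url)

    def seconds_to_wait(self, url):
        """How long to sleep before the next request to this URL's host."""
        host = urlparse(url).netloc
        last = self.last_request.get(host)
        if last is None:
            return 0.0
        delay = self.min_delay_seconds
        parser = self.robots.get(host)
        if parser is not None:
            crawl_delay = parser.crawl_delay(USER_AGENT)
            if crawl_delay is not None:
                # Use whichever is slower: our own limit or the site's request.
                delay = max(delay, float(crawl_delay))
        return max(0.0, delay - (self.clock() - last))

    def record_request(self, url):
        """Call right after sending a request so the next one is spaced out."""
        self.last_request[urlparse(url).netloc] = self.clock()
```

In a fetch loop you would skip any URL where `allowed(url)` is false, sleep for `seconds_to_wait(url)`, send the request with `USER_AGENT` in the `User-Agent` header, and then call `record_request(url)`.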



