What's the cost and architecture of running such a site and crawling all those p...

saltysalt · 2025-08-05T21:51:29 1754430689

A single bare-metal server!

busymom0 · 2025-08-06T01:34:11 1754444051

Oh wow. Do you need a proxy or something to be able to crawl so many pages? Can I email you with some technical questions? Thanks.

saltysalt · 2025-08-06T10:07:04 1754474824

No proxy yet, but I am considering one, as many sites are redirecting my crawler based on its IP, which is causing indexing issues.

The hardest part BY FAR is the crawler. Initially I was using Apache Nutch, but it got slower and slower as the index grew, so I replaced it with my own crawler, written in PHP (which I'm comfortable with) and made multi-threaded using Supervisor.

The second hardest part was the amount of security I had to build in to prevent bots running spam searches and hogging my infra.

I'll try to write a blog post soon and share it here.

busymom0 · 2025-08-06T20:54:31 1754513671

Do you have multiple IPs? I am trying to build something which needs just the published at and updated at date fields for thousands of links and I am afraid my IP will get blocked quickly.

saltysalt · 2025-08-07T13:26:44 1754573204

Just one IP for now. You are right to worry about being blocked from crawling, though: it has already happened to me on a few sites. The key things that help mitigate this are:

1. Always identify your crawler via a consistent user-agent string that explains it's a web search crawler and not a generic web browser.

2. Always obey the directives in robots.txt.

3. Make sure your crawler is not too aggressive (low frequency of requests).

(updated for formatting)
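
For anyone who wants to see those three points in code, here is a minimal sketch, assuming Python and only the standard library. It is not the PHP crawler described above. The `PolitenessGate` class, the `ExampleSearchBot` user-agent string, and the 5-second default delay are all made up for illustration. The caller is expected to download robots.txt itself and pass the text in.

```python
import time
import urllib.robotparser
from urllib.parse import urlparse

# Hypothetical identity: use your crawler's real name and an info URL.
USER_AGENT = "ExampleSearchBot/0.1 (+https://example.com/bot-info)"


class PolitenessGate:
    """Decides whether, and how soon, a URL may be fetched."""

    def __init__(self, min_delay=5.0, clock=time.monotonic):
        self.min_delay = min_delay   # assumed minimum gap per host, in seconds
        self.clock = clock           # injectable so the logic can be tested
        self.robots = {}             # host -> RobotFileParser
        self.next_allowed = {}       # host -> earliest time of the next fetch

    def load_robots(self, host, robots_txt):
        """Parse robots.txt text that the caller has already downloaded."""
        parser = urllib.robotparser.RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self.robots[host] = parser

    def allowed(self, url):
        """True only if robots.txt for the host is loaded and permits the URL."""
        parser = self.robots.get(urlparse(url).netloc)
        # No robots.txt loaded for this host yet: be conservative and refuse.
        return parser is not None and parser.can_fetch(USER_AGENT, url)

    def wait_time(self, url):
        """Seconds to wait before this URL's host may be requested again."""
        host = urlparse(url).netloc
        return max(0.0, self.next_allowed.get(host, 0.0) - self.clock())

    def record_fetch(self, url):
        """Call right after a request; honours Crawl-delay if it is larger."""
        host = urlparse(url).netloc
        parser = self.robots.get(host)
        delay = parser.crawl_delay(USER_AGENT) if parser else None
        self.next_allowed[host] = self.clock() + max(self.min_delay, delay or 0)
```

Usage would be roughly: check `allowed(url)`, sleep for `wait_time(url)`, send the request with `USER_AGENT` in the `User-Agent` header, then call `record_fetch(url)`. Keeping the delay per host rather than global means one slow site does not stall the rest of the queue.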