No proxy yet, but I am considering one, as many sites are redirecting my crawler based on its IP, which is causing indexing issues.
The hardest part BY FAR is the crawler: initially I was using Apache Nutch but it got slower and slower as the index grew, so I replaced it with my own crawler that I wrote in PHP (comfortable for me) and made that multi-threaded using Supervisor.
The second hardest part was the amount of security I had to build in to prevent bots running spam searches and hogging my infra.
Do you have multiple IPs? I am trying to build something that needs just the published-at and updated-at date fields for thousands of links, and I am afraid my IP will get blocked quickly.
Just one IP for now. You are right to worry about being blocked from crawling, though; it has already happened to me on a few sites. The key things that help mitigate this are:
1. Always identify your crawler via a consistent user-agent string that explains it's a web search crawler and not a generic web browser.
2. Always obey the directives in robots.txt.
3. Make sure your crawler is not too aggressive (low frequency of requests).
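
To make those three points concrete, here is a rough Python sketch of how they might fit together (my own crawler is PHP, so this is an illustration and not what I actually run). The bot name, the info URL, and the 5-second default delay are all made-up placeholders, and `PolitenessPolicy` and `polite_fetch` are hypothetical names. It only uses the standard library's `urllib.robotparser` and `urllib.request`.

```python
import time
import urllib.request
import urllib.robotparser
from urllib.parse import urlparse

# Hypothetical crawler identity: replace with your own bot name and info page.
USER_AGENT = "ExampleSearchBot/1.0 (+https://example.com/bot-info)"
DEFAULT_DELAY = 5.0  # assumed minimum seconds between requests to one host


class PolitenessPolicy:
    """Tracks robots.txt rules and request timing per host."""

    def __init__(self, user_agent=USER_AGENT, default_delay=DEFAULT_DELAY):
        self.user_agent = user_agent
        self.default_delay = default_delay
        self.robots = {}        # host -> parsed robots.txt
        self.last_request = {}  # host -> time of the most recent request

    def load_robots(self, host, robots_txt):
        """Parse robots.txt text that was already downloaded for this host."""
        parser = urllib.robotparser.RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self.robots[host] = parser

    def allowed(self, url):
        """Point 2: obey robots.txt. Hosts with no rules loaded are refused."""
        parser = self.robots.get(urlparse(url).netloc)
        return parser is not None and parser.can_fetch(self.user_agent, url)

    def delay_for(self, host):
        """Point 3: use the larger of our default delay and any Crawl-delay."""
        parser = self.robots.get(host)
        crawl_delay = parser.crawl_delay(self.user_agent) if parser else None
        return max(self.default_delay, float(crawl_delay or 0))

    def wait_time(self, url, now=None):
        """Seconds still to wait before this host may be requested again."""
        host = urlparse(url).netloc
        now = time.monotonic() if now is None else now
        last = self.last_request.get(host)
        if last is None:
            return 0.0
        return max(0.0, last + self.delay_for(host) - now)

    def record(self, url, now=None):
        """Remember when we last hit this URL's host."""
        host = urlparse(url).netloc
        self.last_request[host] = time.monotonic() if now is None else now


def polite_fetch(policy, url):
    """Fetch a URL only if allowed, after waiting out the per-host delay."""
    if not policy.allowed(url):
        return None
    time.sleep(policy.wait_time(url))
    policy.record(url)
    # Point 1: send the same identifying user-agent on every request.
    request = urllib.request.Request(url, headers={"User-Agent": policy.user_agent})
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read()
```

You would download each host's robots.txt once (with the same user-agent), pass the text to `load_robots`, and then route every page request through `polite_fetch`. Refusing hosts whose robots.txt has not been loaded is a deliberately conservative choice in this sketch; a real crawler would also need to decide how to treat a missing or unreachable robots.txt.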