rrainn Spider

General

rrainn Spider is a web crawler used for multiple rrainn projects, including (but not limited to) contextual search experiences and web search engines. You might see rrainn Spider make requests to your site to gather information about its contents.

Identifying rrainn Spider

All rrainn Spider requests will contain a User-Agent header with a value similar to rrainnSpider/1.0.0 (+https://rrainn.com/spider) (the 1.0.0 segment will change depending on which rrainn Spider version accessed your site).
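
For example, one way to find rrainn Spider requests in your web server access log is to search for that User-Agent string (the log path below is an assumption based on a default nginx setup; adjust it for your server):

$ grep "rrainnSpider/" /var/log/nginx/access.log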

Additionally, rrainn Spider has reverse DNS set up for every IP address that rrainn Spider makes requests from. All of these reverse DNS entries will end with spider.rrainn.com. For example, you can use the host command to check whether an IP address reverse-resolves to a spider.rrainn.com subdomain:

$ host 165.232.132.144
144.132.232.165.in-addr.arpa domain name pointer 165-232-132-144.spider.rrainn.com.

Since anyone can set the reverse DNS entry of an IP address they control to a spider.rrainn.com domain name, it is important to verify that the forward DNS record for that domain name also points back to the IP address:

$ host 165-232-132-144.spider.rrainn.com
165-232-132-144.spider.rrainn.com has address 165.232.132.144

If the reverse DNS entry does not end with spider.rrainn.com, or the spider.rrainn.com domain name does not point back to the IP address, the request was not made by rrainn Spider.
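
The two checks above can be combined into a small shell sketch (this assumes the host utility is available; the IP address shown is the example from above):

ip=165.232.132.144
# Reverse lookup: extract the PTR hostname and strip the trailing dot.
ptr=$(host "$ip" | awk '{ print $NF }' | sed 's/\.$//')
case "$ptr" in
  *.spider.rrainn.com)
    # Forward lookup: the hostname must resolve back to the original IP.
    if host "$ptr" | grep -qF "has address $ip"; then
      echo "verified rrainn Spider"
    else
      echo "not rrainn Spider (forward lookup mismatch)"
    fi
    ;;
  *)
    echo "not rrainn Spider (reverse DNS does not end with spider.rrainn.com)"
    ;;
esac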

Improve crawl content

rrainn Spider looks for many things when crawling your site, one of which is a script[type=application/ld+json] object. Creating that object with a schema.org @context will help rrainn Spider gather data about your site. Additionally, adding a Sitemap entry to your robots.txt file will greatly assist rrainn Spider in crawling your site.
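
For example, a minimal JSON-LD object with a schema.org @context might look like the following (the name and URL are illustrative placeholders):

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "WebSite",
  "name": "Example Site",
  "url": "https://example.com/"
}
</script>

A Sitemap entry in robots.txt is a single line (again with a placeholder URL):

Sitemap: https://example.com/sitemap.xml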

Limit rrainn Spider request frequency

rrainn Spider takes multiple steps to help ensure your servers aren't overloaded with requests. However, if you would like to further limit the frequency of requests made by rrainn Spider, you can use the robots.txt Crawl-Delay option. Within your robots.txt file, rrainn Spider will look for the rrainnSpider User-Agent, and fall back to the * User-Agent if rrainnSpider does not exist. rrainn Spider will interpret this value as the number of seconds it should wait before making subsequent requests.
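
For example, to ask rrainn Spider to wait at least 10 seconds between requests, your robots.txt could contain:

User-Agent: rrainnSpider
Crawl-Delay: 10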

Additionally, rrainn Spider will honor all 429 HTTP status code responses. In the event you do not send a Retry-After header, rrainn Spider will default to not crawling your site again for 24 hours.
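
For example, a rate-limiting response that asks the crawler to wait one hour before retrying might look like the following (Retry-After is expressed in seconds here; the HTTP standard also allows a date):

HTTP/1.1 429 Too Many Requests
Retry-After: 3600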

Prevent rrainn Spider from crawling my site

rrainn Spider respects the robots.txt file. All Allow and Disallow commands will be respected by rrainn Spider (with the exception of the robots.txt file itself). rrainn Spider uses the rrainnSpider User-Agent, and falls back to the * User-Agent if rrainnSpider does not exist in your robots.txt file. In the future we plan to add support for preventing crawling based on meta[name=robots] tags, however this functionality is not currently part of rrainn Spider.
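
For example, to prevent rrainn Spider from crawling anything under /private/ while leaving the rest of your site crawlable (the path is illustrative):

User-Agent: rrainnSpider
Disallow: /private/

Or, to prevent rrainn Spider from crawling your site entirely:

User-Agent: rrainnSpider
Disallow: /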

Our spider will also respect all a[rel=nofollow] links. We will not add these links to our system to be crawled. However, if other places link to the resulting URL, we will crawl that URL, but we will not attribute it to the site containing the a[rel=nofollow] link.
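
For example (with a placeholder URL):

<a href="https://example.com/some-page" rel="nofollow">Some page</a>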

Finally, you can add a meta[name=rrainnSpider] or meta[name=robots] tag (the former taking priority) with a content attribute to apply additional restrictions. We support the following commands for that value:
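
As an illustration, such a tag might look like the following (the noindex directive shown is a common robots command used here purely as a placeholder; it is not confirmed to be among the supported commands):

<meta name="rrainnSpider" content="noindex">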

Other information

rrainn Spider Development

We also run a rrainn Spider Development system. More information about that can be found here.

Additional questions

We would be more than happy to answer any additional questions, resolve issues, or listen to feedback about rrainn Spider. Please email us at [email protected] and we will get back to you as soon as possible.