rrainn Spider
General
rrainn Spider is a web crawler used for multiple rrainn projects, including (but not limited to) contextual search experiences and web search engines. You might see rrainn Spider make requests to your site to gather information about its contents.
Identifying rrainn Spider
All rrainn Spider requests will contain a `User-Agent` header with the value set to a string similar to `rrainnSpider/1.0.0 (+https://rrainn.com/spider)` (the `1.0.0` segment will change depending on which rrainn Spider version accessed your site).
Additionally, rrainn Spider has reverse DNS set up for every IP address that rrainn Spider makes requests from. All of these reverse DNS entries end with `spider.rrainn.com`. For example, you can use the `host` command to check that the IP address resolves to a `spider.rrainn.com` subdomain:
```
$ host 165.232.132.144
144.132.232.165.in-addr.arpa domain name pointer 165-232-132-144.spider.rrainn.com.
```
Since anyone can point the reverse DNS entry for an IP address at a `spider.rrainn.com` domain name, it is important to verify that the DNS record also points back to the IP address:
```
$ host 165-232-132-144.spider.rrainn.com
165-232-132-144.spider.rrainn.com has address 165.232.132.144
```
If the reverse DNS entry does not end with `spider.rrainn.com`, or the `spider.rrainn.com` domain name does not point back to the IP address, the request was not made by rrainn Spider.
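The forward-confirmed reverse DNS check described above can also be scripted. Here is a minimal sketch using Python's standard `socket` module (the function name is illustrative, not part of any rrainn tooling, and it performs live DNS lookups):

```python
import socket

SPIDER_SUFFIX = ".spider.rrainn.com"

def is_rrainn_spider(ip: str) -> bool:
    """Forward-confirmed reverse DNS check for an IP claiming to be rrainn Spider."""
    try:
        # Reverse lookup: IP -> hostname (PTR record)
        hostname, _, _ = socket.gethostbyaddr(ip)
    except OSError:
        return False
    # The PTR record must end with spider.rrainn.com
    if not hostname.endswith(SPIDER_SUFFIX):
        return False
    try:
        # Forward lookup: hostname -> IP addresses
        _, _, addresses = socket.gethostbyname_ex(hostname)
    except OSError:
        return False
    # The forward lookup must include the original IP
    return ip in addresses
```

An IP whose reverse DNS does not resolve, or resolves outside `spider.rrainn.com`, fails the check.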
Improve crawl content
rrainn Spider looks for many things when crawling your site. One of these is a `script[type=application/ld+json]` object. Creating that object with a schema.org `@context` will help rrainn Spider gather data about your site. Additionally, adding a `Sitemap` directive to your `robots.txt` file will greatly assist rrainn Spider in crawling your site.
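For example, a minimal JSON-LD object with a schema.org `@context` might look like the following (the `WebSite` type and field values are placeholders to replace with your own site's data):

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "WebSite",
  "name": "Example Site",
  "url": "https://example.com/"
}
</script>
```

And a `Sitemap` directive is a single line in `robots.txt` (the URL is a placeholder):

```
Sitemap: https://example.com/sitemap.xml
```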
Limit rrainn Spider request frequency
rrainn Spider takes multiple steps to help ensure your servers aren't overloaded with requests. However, if you would like to further limit the frequency of requests made by rrainn Spider, you can use the `robots.txt` `Crawl-Delay` option. Within your `robots.txt` file, rrainn Spider will look for the `rrainnSpider` User-Agent and fall back to the `*` User-Agent if `rrainnSpider` does not exist. rrainn Spider will interpret this value as the number of seconds it should wait before making subsequent requests.
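For example, a `robots.txt` that asks rrainn Spider to wait between requests might look like this (the 10-second delay is an arbitrary example value):

```
User-agent: rrainnSpider
Crawl-Delay: 10
```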
Additionally, rrainn Spider will honor all 429 HTTP status code responses. In the event you do not send a `Retry-After` header, rrainn Spider will default to not crawling your site again for 24 hours.
Prevent rrainn Spider from crawling my site
rrainn Spider respects the `robots.txt` file. All `Allow` and `Disallow` commands will be respected by rrainn Spider (with the exception of the `robots.txt` file itself). rrainn Spider uses the `rrainnSpider` User-Agent and falls back to the `*` User-Agent if `rrainnSpider` does not exist in your `robots.txt` file. In the future we plan to add support for preventing crawling based on `meta[name=robots]` tags; however, this functionality is not currently part of rrainn Spider.
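For example, a `robots.txt` that blocks rrainn Spider from part of a site might look like this (the paths are placeholders):

```
User-agent: rrainnSpider
Disallow: /private/
Allow: /private/overview.html
```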
Our spider will also respect all `a[rel=nofollow]` links. We will not add these links to our system to be crawled. However, if other places link to the resulting URL, we will crawl the URL but not attribute it to the site that has the `a[rel=nofollow]` tag.
Finally, you can add a `meta[name=rrainnSpider]` or `meta[name=robots]` tag (the former taking priority) with a `content` attribute to add additional restrictions. We support the following commands for that value:
- `all` - no restrictions
- `nofollow` - will not follow any links on the page
- `noindex` - will not index the page to be linked to or suggested in rrainn products
- `none` - same as `nofollow, noindex`
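For example, to apply both restrictions at once with the rrainn-specific tag (`none` being equivalent to `nofollow` plus `noindex` per the list above):

```html
<meta name="rrainnSpider" content="none">
```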
Other information
- We work hard to ensure that rrainn Spider respects your site, hosting infrastructure, and the overall internet community. If you feel like rrainn Spider is doing something malicious, please email us at [email protected] so we can work to resolve your concerns.
- rrainn Spider may cache your `robots.txt` file for up to 60 minutes. In the future we plan to add a system that enables website owners/maintainers to force purge that cache, forcing rrainn Spider to re-retrieve your `robots.txt` file before accessing your site again; however, this functionality is not currently available. If you need to force purge your `robots.txt` file from our cache, please email [email protected].
- An inaccessible `robots.txt` file is treated as allow all with no restrictions. This can be due to an error status code, a timeout, or another network error. Therefore, it is important to ensure our bot has access to your `robots.txt` file so we can abide by any restrictions it may include.
rrainn Spider Development
We also run a rrainn Spider Development system. More information about that can be found here.
Additional questions
We would be more than happy to answer any additional questions, resolve issues, or listen to feedback about rrainn Spider. Please email us at [email protected] and we will get back to you as soon as possible.