rrainn Spider
General
rrainn Spider is a web crawler used for multiple rrainn projects, including (but not limited to) contextual search experiences and web search engines. You might see rrainn Spider make requests to your site to gather information about its contents.
Identifying rrainn Spider
All rrainn Spider requests will contain a `User-Agent` header with the value set to a string similar to `rrainnSpider/1.0.0 (+https://rrainn.com/spider)` (the `1.0.0` segment will change depending on which rrainn Spider version accessed your site).
Additionally, rrainn Spider has reverse DNS set up for every IP address that rrainn Spider makes requests from. All of these reverse DNS entries end with `spider.rrainn.com`. For example, you can use the `host` command to check that the IP address resolves to a `spider.rrainn.com` subdomain:
```
$ host 165.232.132.144
144.132.232.165.in-addr.arpa domain name pointer 165-232-132-144.spider.rrainn.com.
```
Since anyone can point the reverse DNS entry for an IP address at a `spider.rrainn.com` domain name, it is important to verify that the DNS record also points back to the IP address:
```
$ host 165-232-132-144.spider.rrainn.com
165-232-132-144.spider.rrainn.com has address 165.232.132.144
```
If the reverse DNS entry does not end with `spider.rrainn.com`, or the `spider.rrainn.com` domain name does not point back to the IP address, the request was not made by rrainn Spider.
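The forward-confirmed reverse DNS check described above can also be scripted. Here is a minimal sketch using Python's standard `socket` module (the function name is illustrative, not part of any rrainn tooling, and it performs live DNS lookups):

```python
import socket

SPIDER_SUFFIX = ".spider.rrainn.com"

def is_rrainn_spider(ip: str) -> bool:
    """Forward-confirmed reverse DNS check for an IP claiming to be rrainn Spider."""
    try:
        # Reverse lookup: IP -> hostname (PTR record)
        hostname, _, _ = socket.gethostbyaddr(ip)
    except OSError:
        return False
    # The PTR record must end with spider.rrainn.com
    if not hostname.endswith(SPIDER_SUFFIX):
        return False
    try:
        # Forward lookup: hostname -> IP addresses
        _, _, addresses = socket.gethostbyname_ex(hostname)
    except OSError:
        return False
    # The forward lookup must include the original IP
    return ip in addresses
```

An IP whose reverse DNS does not resolve, or resolves outside `spider.rrainn.com`, fails the check.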
Improve crawl content
rrainn Spider looks for many things when crawling your site. One of these is a `script[type=application/ld+json]` object. Creating that object with a schema.org `@context` will help rrainn Spider gather data about your site. Additionally, adding a `Sitemap` directive to your `robots.txt` file will greatly assist rrainn Spider in crawling your site.
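For example, a minimal JSON-LD object with a schema.org `@context` might look like the following (the `WebSite` type and field values are placeholders to replace with your own site's data):

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "WebSite",
  "name": "Example Site",
  "url": "https://example.com/"
}
</script>
```

And a `Sitemap` directive is a single line in `robots.txt` (the URL is a placeholder):

```
Sitemap: https://example.com/sitemap.xml
```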
Limit rrainn Spider request frequency
rrainn Spider takes multiple steps to help ensure your servers aren't overloaded with requests. However, if you would like to further limit the frequency of requests made by rrainn Spider, you can use the `robots.txt` `Crawl-Delay` option. Within your `robots.txt` file, rrainn Spider will look for the `rrainnSpider` User-Agent and fall back to the `*` User-Agent if `rrainnSpider` does not exist. rrainn Spider will interpret this value as the number of seconds it should wait before making subsequent requests.
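For example, a `robots.txt` that asks rrainn Spider to wait between requests might look like this (the 10-second delay is an arbitrary example value):

```
User-agent: rrainnSpider
Crawl-Delay: 10
```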
Additionally, rrainn Spider will honor all 429 HTTP status code responses. In the event you do not send a `Retry-After` header, rrainn Spider will default to not crawling your site again for 24 hours.
Prevent rrainn Spider from crawling my site
rrainn Spider respects the `robots.txt` file. All `Allow` and `Disallow` commands will be respected by rrainn Spider (with the exception of the `robots.txt` file itself). rrainn Spider uses the `rrainnSpider` User-Agent and falls back to the `*` User-Agent if `rrainnSpider` does not exist in your `robots.txt` file. In the future we plan to add support for preventing crawling based on `meta[name=robots]` tags; however, this functionality is not currently part of rrainn Spider.
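For example, a `robots.txt` that blocks rrainn Spider from part of a site might look like this (the paths are placeholders):

```
User-agent: rrainnSpider
Disallow: /private/
Allow: /private/overview.html
```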
Our spider will also respect all `a[rel=nofollow]` links. We will not add these links to our system to be crawled. However, if other places link to the resulting URL, we will crawl the URL but not attribute it to the site that has the `a[rel=nofollow]` tag.
Finally, you can add a `meta[name=rrainnSpider]` or `meta[name=robots]` tag (the former taking priority) with a `content` attribute to add additional restrictions. We support the following commands for that value:
- `all` - no restrictions
- `nofollow` - will not follow any links on the page
- `noindex` - will not index the page to be linked to or suggested in rrainn products
- `none` - same as `nofollow, noindex`
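For example, to apply both restrictions at once with the rrainn-specific tag (`none` being equivalent to `nofollow` plus `noindex` per the list above):

```html
<meta name="rrainnSpider" content="none">
```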
Other information
- We work hard to ensure that rrainn Spider respects your site, hosting infrastructure, and the overall internet community. If you feel like rrainn Spider is doing something malicious, please email us at [email protected] so we can work to resolve your concerns.
- rrainn Spider may cache your `robots.txt` file for up to 60 minutes. In the future we plan to add a system that enables website owners/maintainers to force purge that cache, forcing rrainn Spider to re-retrieve your `robots.txt` file before accessing your site again; however, this functionality is not currently available. If you need to force purge your `robots.txt` file from our cache, please email [email protected].
- An inaccessible `robots.txt` file is treated as allow all with no restrictions. This can be due to an error status code, a timeout, or another network error. Therefore, it is important to ensure our bot has access to your `robots.txt` file so we can abide by any restrictions it may include.
rrainn Spider Development
We also run a rrainn Spider Development system. More information about that can be found here.
Additional questions
We would be more than happy to answer any additional questions, resolve issues, or listen to feedback about rrainn Spider. Please email us at [email protected] and we will get back to you as soon as possible.