rrainn Spider is a web crawler used for multiple rrainn projects, including (but not limited to) contextual search experiences and web search engines. You might see rrainn Spider make requests to your site to gather information about its contents.
Identifying rrainn Spider
All rrainn Spider requests will contain a User-Agent header set to a string similar to rrainnSpider/1.0.0 (+https://rrainn.com/spider) (the 1.0.0 segment will change depending on which rrainn Spider version accessed your site).
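As a quick first-pass filter, you can match the header against the documented shape. This is a hypothetical sketch (the exact version string will vary, and a User-Agent can be spoofed, so treat a match only as a hint and rely on the reverse DNS verification below for proof):

```python
import re

# Matches strings like "rrainnSpider/1.0.0 (+https://rrainn.com/spider)".
# Assumes a three-part semantic version; adjust if rrainn changes the format.
UA_PATTERN = re.compile(
    r"^rrainnSpider/\d+(\.\d+){2} \(\+https://rrainn\.com/spider\)$"
)

def looks_like_rrainn_spider(user_agent: str) -> bool:
    return UA_PATTERN.match(user_agent) is not None
```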
Additionally, rrainn Spider has reverse DNS set up for every IP address it makes requests from. All of these reverse DNS entries will end with spider.rrainn.com. For example, you can use the host command to check that the IP address resolves to a spider.rrainn.com domain name:
$ host 165.232.132.144
144.132.232.165.in-addr.arpa domain name pointer 165-232-132-144.spider.rrainn.com.
Since anyone can point an IP address's reverse DNS entry at a spider.rrainn.com domain name, it is important to verify that the DNS record also points back to the IP address:
$ host 165-232-132-144.spider.rrainn.com
165-232-132-144.spider.rrainn.com has address 165.232.132.144
If the reverse DNS entry does not end with spider.rrainn.com, or the spider.rrainn.com domain name does not point back to the IP address, the request was not made by rrainn Spider.
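The two-step check above can be sketched in Python. Here is_rrainn_spider is a hypothetical helper, not an official rrainn API; the lookup functions are injectable so the logic can be exercised without live DNS:

```python
import socket

def is_rrainn_spider(ip,
                     reverse_lookup=lambda ip: socket.gethostbyaddr(ip)[0],
                     forward_lookup=socket.gethostbyname):
    """Forward-confirmed reverse DNS check, per the steps above."""
    try:
        hostname = reverse_lookup(ip)            # step 1: reverse DNS
        if not hostname.endswith(".spider.rrainn.com"):
            return False
        return forward_lookup(hostname) == ip    # step 2: forward confirmation
    except OSError:
        return False
```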
Improve crawl content
rrainn Spider looks for many things when crawling your site. One of these is a script[type=application/ld+json] object. Creating that object with a @context will help rrainn Spider gather data about your site. Additionally, adding a Sitemap directive to your robots.txt file will greatly assist rrainn Spider in crawling your site.
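For illustration, here is one way to produce such a script[type=application/ld+json] object. The Organization fields and URLs are hypothetical examples, not requirements:

```python
import json

# Illustrative structured data; the @context key is what rrainn Spider
# looks for, per the text above.
structured_data = {
    "@context": "https://schema.org",
    "@type": "Organization",
    "name": "Example Co",
    "url": "https://example.com/",
}

snippet = '<script type="application/ld+json">%s</script>' % json.dumps(structured_data)
```

A matching robots.txt addition would be a single line such as Sitemap: https://example.com/sitemap.xml.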
Limit rrainn Spider request frequency
rrainn Spider takes multiple steps to help ensure your servers aren't overloaded with requests. However, if you would like to further limit the frequency of its requests, you can use the Crawl-Delay option. Within your robots.txt file, rrainn Spider will look for the rrainnSpider User-Agent and fall back to the * User-Agent if rrainnSpider does not exist. rrainn Spider will interpret this value as the number of seconds it should wait before making subsequent requests.
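The lookup order can be seen with Python's standard robots.txt parser; the delay values below are illustrative:

```python
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: rrainnSpider
Crawl-delay: 10

User-agent: *
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

parser.crawl_delay("rrainnSpider")  # 10: the specific group wins
parser.crawl_delay("SomeOtherBot")  # 5: falls back to the * group
```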
Additionally, rrainn Spider will honor all 429 HTTP status code responses. In the event you do not send a Retry-After header, rrainn Spider will default to not crawling your site again for 24 hours.
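The back-off rule above can be sketched as follows. This is a hypothetical helper, and for brevity it only handles the delta-seconds form of Retry-After, not the HTTP-date form:

```python
DEFAULT_BACKOFF_SECONDS = 24 * 60 * 60  # 24 hours, per the default above

def backoff_seconds(headers):
    """Seconds to wait after receiving a 429 response."""
    value = headers.get("Retry-After")
    if value is None:
        return DEFAULT_BACKOFF_SECONDS
    try:
        return int(value)  # delta-seconds form, e.g. "120"
    except ValueError:
        return DEFAULT_BACKOFF_SECONDS  # HTTP-date form not handled here
```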
Prevent rrainn Spider from crawling my site
rrainn Spider respects the robots.txt file. All Disallow rules will be respected by rrainn Spider (with the exception of the robots.txt file itself). rrainn Spider uses the rrainnSpider User-Agent, and falls back to the * User-Agent if rrainnSpider does not exist in your robots.txt file. In the future we plan to add support for preventing crawling based on meta[name=robots] tags; however, this functionality is not currently part of rrainn Spider.
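How a Disallow rule in the rrainnSpider group is evaluated can be seen with Python's standard robots.txt parser; the paths are illustrative:

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "User-agent: rrainnSpider",
    "Disallow: /private/",
])

parser.can_fetch("rrainnSpider", "https://example.com/private/page")  # False
parser.can_fetch("rrainnSpider", "https://example.com/public/page")   # True
```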
Our spider will also respect all a[rel=nofollow] links. We will not add these links to our system to be crawled. However, if other places link to the resulting URL, we will crawl the URL, but not attribute it to the site that has the nofollow link.
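A crawler-side sketch of skipping a[rel=nofollow] links, using Python's standard HTML parser (LinkCollector is a hypothetical helper, not rrainn's actual implementation):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect hrefs from <a> tags, skipping any marked rel=nofollow."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        rel_values = (attr_map.get("rel") or "").lower().split()
        if "nofollow" not in rel_values and attr_map.get("href"):
            self.links.append(attr_map["href"])

collector = LinkCollector()
collector.feed('<a href="/about">About</a> <a href="/ad" rel="nofollow">Ad</a>')
collector.links  # ["/about"]
```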
Finally, you can add a meta[name=robots] tag (the former taking priority) with a content attribute to add additional restrictions. We support the following commands for that value:
all - no restrictions
nofollow - will not follow any links on the page
noindex - will not index the page to be linked to or suggested in rrainn products
none - same as noindex and nofollow combined
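One way to reduce such a content value to index/follow permissions (parse_robots_meta is a hypothetical helper, not an official rrainn API):

```python
def parse_robots_meta(content):
    """Interpret a meta[name=robots] content value.

    Assumes comma-separated directives; "none" expands to noindex plus
    nofollow, and "all" imposes no restrictions.
    """
    directives = {part.strip().lower() for part in content.split(",")}
    if "none" in directives:
        directives |= {"noindex", "nofollow"}
    return {
        "index": "noindex" not in directives,
        "follow": "nofollow" not in directives,
    }
```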
- We work hard to ensure that rrainn Spider respects your site, hosting infrastructure, and the overall internet community. If you feel that rrainn Spider is doing something malicious, please email us at [email protected] so we can work to resolve your concerns.
- rrainn Spider may cache your robots.txt file for up to 60 minutes. In the future we plan to add a system that lets website owners/maintainers purge that cache, forcing rrainn Spider to re-retrieve your robots.txt file before accessing your site again; however, this functionality is not currently available. If you need to force purge your robots.txt file from our cache, please email [email protected].
- An inaccessible robots.txt file is treated as allow all with no restrictions. Inaccessibility can be due to an error status code, a timeout, or other network errors. Therefore, it is important to ensure our bot has access to your robots.txt file so we can abide by any restrictions it may include.
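The caching and fallback behaviour in the notes above can be sketched as follows. The helper, cache shape, and error handling are illustrative assumptions, not rrainn's actual implementation:

```python
import time

CACHE_TTL_SECONDS = 60 * 60  # robots.txt may be cached for up to 60 minutes

_robots_cache = {}  # host -> (fetched_at, body or None)

def cached_robots(host, fetch, now=time.time):
    """Return a cached robots.txt body, refetching after the TTL expires.

    `fetch(host)` returns the robots.txt body; any OSError (timeouts and
    other network failures) is cached as None, i.e. allow all.
    """
    entry = _robots_cache.get(host)
    if entry is None or now() - entry[0] >= CACHE_TTL_SECONDS:
        try:
            body = fetch(host)
        except OSError:
            body = None  # inaccessible robots.txt -> no restrictions
        _robots_cache[host] = (now(), body)
    return _robots_cache[host][1]
```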
rrainn Spider Development
We also run a rrainn Spider Development system. More information about that can be found here.
We would be more than happy to answer any additional questions, resolve issues, or listen to feedback about rrainn Spider. Please email us at [email protected] and we will get back to you as soon as possible.