A Crawl Task defines how our robot crawls your website. You can define as many Crawl Tasks as your current plan allows. Each task can have only one running process at a time. We create a "default" task for you as a starter template.
How and Why to Use Tasks
With tasks you can define different crawl types for different sections of your website. For example, you might have a huge product listing with several thousand products. You can set up a separate crawl task for it and run it less often than the common website sections.
Another scenario is testing a development version of the website, or a feature deployed with restricted access. You can add an HTTP Auth user and password to give the crawler access to the restricted area.
Each task can be configured, run and scheduled individually.
Each task can be configured using the following parameters:
- Name of the task. It is used in reports and notification emails. Use it to distinguish different crawl tasks.
- Start URL
- URL of the start page of the crawl. The robot first loads this page and follows the links found on it.
- Link Following RegEx
- Regular expression matched against discovered URLs. If this regex is set, the crawler only follows matching URLs.
- Exclude Link RegEx
- Regular expression matched against discovered URLs. If this regex is set, the crawler will not follow matching URLs.
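The interaction of the two regex parameters can be sketched in Python. The `should_follow` helper below is hypothetical, not part of the product; it only illustrates how an include and an exclude pattern combine:

```python
import re

def should_follow(url, follow_re=None, exclude_re=None):
    """Return True when a discovered URL should be crawled.

    Hypothetical helper for illustration only. If follow_re is set,
    only matching URLs pass; if exclude_re is set, matching URLs are
    skipped (exclusion wins when both match).
    """
    if follow_re is not None and not re.search(follow_re, url):
        return False
    if exclude_re is not None and re.search(exclude_re, url):
        return False
    return True

# Follow only /blog/ URLs, but skip tag archives.
print(should_follow("https://example.com/blog/post-1", r"/blog/"))             # True
print(should_follow("https://example.com/blog/tag/seo", r"/blog/", r"/tag/"))  # False
```

Note that a URL matching both patterns is excluded, so the Exclude Link RegEx can be used to carve holes out of the area allowed by the Link Following RegEx.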
- Max Depth
- Sets the maximum depth for links to be crawled. Each slash in a URL adds one level of depth. This parameter is relative to the start URL. For example, if you set the start URL to http://example.com/one/index.html and set Max Depth to 1, then only links like /one/ and /two/ will be crawled, and no links like /one/two/.
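One plausible reading of this rule is "depth equals the number of directory levels in the URL path". The `url_depth` helper below is a hypothetical sketch of that interpretation (it assumes file names contain a dot), not the crawler's actual logic:

```python
from urllib.parse import urlparse

def url_depth(url):
    """Count directory levels in the URL path.

    Illustrative sketch of the Max Depth rule: /one/index.html and
    /two/ are depth 1, /one/two/ is depth 2.
    """
    path = urlparse(url).path
    segments = [s for s in path.split("/") if s]
    if segments and "." in segments[-1]:  # trailing file name, not a directory
        segments = segments[:-1]
    return len(segments)

def within_max_depth(url, max_depth):
    return url_depth(url) <= max_depth

print(within_max_depth("http://example.com/two/", 1))      # True
print(within_max_depth("http://example.com/one/two/", 1))  # False
```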
- Rate Limit
- This sets the maximum number of requests per second our crawler makes to your website within this task. Our crawler can use multiple threads to connect to your website; each thread requests a page and waits until the response is received. If your website is very fast, or if you want to limit the load from our crawler, you can set the limit to the desired value. For example, a value of 5 means that all crawler threads combined will not make more than 5 requests in any second. The lowest possible value is 1 request per second. For maximum performance set it to 0, which means "no limit": we will request new pages as fast as your website can handle them. The maximum value is subject to your current plan.
- Sets the maximum number of concurrent connections to your website, subject to your current plan limits. The more connections, the faster we can crawl your website, but more connections also mean higher load on your server. Please ensure your server can handle the additional load at the specified RPS rate.
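A rate limit like this can be sketched as a thread-safe limiter that spaces requests at least 1/RPS apart. This is a minimal illustration of the concept, not the crawler's actual implementation:

```python
import threading
import time

class RateLimiter:
    """Allow at most max_rps acquisitions per second across threads.

    Illustrative sketch: requests are spaced 1/max_rps seconds apart;
    a limit of 0 means "no limit", mirroring the Rate Limit setting.
    """
    def __init__(self, max_rps):
        self.interval = 0.0 if max_rps == 0 else 1.0 / max_rps
        self.lock = threading.Lock()
        self.next_time = time.monotonic()

    def acquire(self):
        """Block until the caller is allowed to send the next request."""
        if self.interval == 0.0:
            return  # no limit
        with self.lock:
            now = time.monotonic()
            wait = self.next_time - now
            self.next_time = max(now, self.next_time) + self.interval
        if wait > 0:
            time.sleep(wait)

limiter = RateLimiter(max_rps=5)
start = time.monotonic()
for _ in range(10):  # at 5 rps, 10 requests span roughly 2 seconds
    limiter.acquire()
elapsed = time.monotonic() - start
print(f"elapsed: {elapsed:.2f}s")
```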
- Max Requests
- Maximum number of requests per crawl process. Each crawl process starts from the same start URL. It is recommended to test new settings with a small number of requests and, if everything looks good, increase this number. Set it to 0 to crawl the whole website. Please note that crawling the whole website generates outgoing traffic depending on your website size, which may be billed by your hosting provider.
- Custom User-Agent
- Our robot's default user agent is seocharger-robot/X.X. If your website uses the User-Agent field to change behavior for different UAs, you can specify the desired UA here.
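To see the effect of this setting, here is how a request carrying a custom UA string could be built with Python's standard library. The version "1.0" stands in for the real X.X and is purely illustrative:

```python
import urllib.request

# Build a request that identifies itself with a custom User-Agent.
# "seocharger-robot/1.0" is an illustrative stand-in for the real UA string.
req = urllib.request.Request(
    "https://example.com/",
    headers={"User-Agent": "seocharger-robot/1.0"},
)
print(req.get_header("User-agent"))  # urllib stores header names capitalized
```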
- Check for Broken Resources
We can check links, images, and linked elements like styles and scripts while crawling. If an element cannot be accessed, we add a corresponding SEO issue to the website report. Requests made while checking for broken resources do not count towards the total number of requests made.
The following options are available:
- Disabled - the check is not performed.
- Local resources only - resources on external websites will not be checked. This is useful if you have a lot of external resources that are blocked in robots.txt and cannot be accessed by our robot.
- Local and external resources - all resources will be checked.
Each resource is checked only once per crawl, so if you include a CSS file on every page of your website, we will not request it thousands of times.
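The once-per-crawl behavior amounts to caching check results by URL. A minimal sketch, with a hypothetical `fetch` callable standing in for the real HTTP check:

```python
checked = {}  # resource URL -> reachable?

def check_once(url, fetch):
    """Check a resource only if it has not been checked in this crawl."""
    if url not in checked:
        checked[url] = fetch(url)
    return checked[url]

requests_made = []
def fake_fetch(url):
    # Stand-in for a real HTTP check; records each outgoing request.
    requests_made.append(url)
    return True

# A stylesheet linked from three pages is requested only once.
for _ in range(3):
    check_once("https://example.com/style.css", fake_fetch)
print(len(requests_made))  # 1
```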
- Basic Auth User and Password
- Use these fields to give our crawler access to protected pages of your website. For example, you can use them to crawl a staging server or the development version of a new feature.
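Under the hood, HTTP Basic Auth is just a base64-encoded "user:password" pair in the Authorization header. A sketch with hypothetical staging credentials, for illustration only:

```python
import base64
import urllib.request

# Hypothetical staging credentials, for illustration only.
user, password = "crawler", "s3cret"
token = base64.b64encode(f"{user}:{password}".encode()).decode()

req = urllib.request.Request(
    "https://staging.example.com/",
    headers={"Authorization": f"Basic {token}"},
)
print(req.get_header("Authorization"))
```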
- Send email notification
- Bigger crawls can take significant time to finish. If you check this checkbox, we will send you an email notification when the crawl task finishes. You can track crawl progress on the website dashboard or on the Crawl Task page in the process list. Results are loaded during the crawl, so you can start reviewing crawled pages while the crawl is in progress.
If you need help configuring your crawl, or you need a feature that is not yet implemented, contact support.