robots.txt — How to control the crawling of your website
The robots.txt is a text file in the root directory of a website. It tells search engine crawlers which areas of the site may be crawled and which should be ignored. The aim is to control crawling activity and avoid unnecessary traffic or duplicate content.
Why is robots.txt important?
Search engines like Google regularly crawl websites to capture their content and save it in the index. The robots.txt helps to set priorities:
- Protect performance: Large websites can steer crawlers deliberately to avoid unnecessary server load.
- Exclude unnecessary pages: For example, login pages, internal search results, or test environments.
- Avoid duplicate content: Parameter URLs or print versions can be excluded (see the example after this list).
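For illustration, rules for such cases could look like the following sketch; the paths here are placeholders, and crawlers such as Google and Bing additionally understand * as a wildcard, which is useful for parameter URLs:
User-agent: *
Disallow: /login/
Disallow: /suche/
Disallow: /*?print=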
Important: The robots.txt does not prevent pages from appearing in the index; it only stops the crawler. If a URL is known, for example from external links, it can still be listed, but without page content.
Building a robots.txt
A robots.txt consists of so-called user agents and directives. A simple example:
User-agent: *
Disallow: /intern/
Allow: /intern/übersicht.html
Explanation:
- User-agent: * affects all crawlers.
- Disallow prohibits crawling of the /intern/ directory.
- Allow makes a specific subpage of it accessible again.
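You can also define separate rule groups for different crawlers; each crawler then follows only the most specific group that matches its user agent. A sketch with placeholder paths:
User-agent: Googlebot
Disallow: /test/
User-agent: *
Disallow: /intern/
Here, Googlebot would obey only its own group and ignore the /intern/ rule, while all other crawlers follow the general block.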
Best Practices
- Always place the robots.txt in the root directory (example.com/robots.txt)
- Provide a sitemap so crawlers know which pages to index:
Sitemap: https://example.com/sitemap.xml
- Don't “hide” sensitive content via robots.txt; the file is publicly accessible.
- Don't use it for SEO-critical pages: If you want to specifically remove a page from the index, use the noindex meta tag or the X-Robots-Tag HTTP header instead (see the snippet below).
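For example, a page can be kept out of the index with the following meta tag in its HTML head or with the equivalent HTTP response header:
<meta name="robots" content="noindex">
X-Robots-Tag: noindex
Note that the page must remain crawlable for the noindex directive to be seen at all.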
Create robots.txt — step by step
Even without developer knowledge, you can create a simple robots.txt yourself. Here's how to go about it:
1. Create text file
Open a simple text editor (e.g. Notepad, VS Code) and create a new file. Save it under the name robots.txt — exactly as it is, without an additional extension.
2. Target crawlers
Determine which crawlers you want to target with the file. User-agent: * means: The rules apply to all search engines.
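If you only want to address one specific crawler, name it explicitly; Google's main crawler, for example, identifies itself as Googlebot:
User-agent: Googlebot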
3. Exclude paths
Specify which areas of your site should not be crawled:
Disallow: /intern/
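Each path gets its own Disallow line, so several areas can be excluded at once; the second path here is only a placeholder:
Disallow: /intern/
Disallow: /test/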
4. Release individual pages (optional)
If you still want to unblock certain pages within an excluded directory, use Allow:
Allow: /intern/übersicht.html
5. Upload the file
Upload the finished robots.txt to your domain's main directory (root level), e.g.
https://example.com/robots.txt
Only there will it be recognized by search engines.
6. Link to a sitemap (recommended)
At the very bottom of the file, you can also reference your sitemap, which additionally helps search engines:
Sitemap: https://example.com/sitemap.xml
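Putting the steps together, the finished file from this example looks like this:
User-agent: *
Disallow: /intern/
Allow: /intern/übersicht.html
Sitemap: https://example.com/sitemap.xml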
Note for Webflow users:
In the project settings, under SEO → Custom robots.txt, Webflow offers a separate field for this file. You can paste your content directly there; there is no need to upload it via FTP.
Conclusion
The robots.txt is a simple but important tool for SEO and crawling control. Used correctly, it improves the efficiency of search engines while protecting sensitive or irrelevant areas of a website.