robots.txt — How to control the crawling of your website
The robots.txt is a text file in the root directory of a website. It tells search engine crawlers which areas of the site may be crawled and which should be ignored. The aim is to control crawling activity and avoid unnecessary traffic or duplicate content.
Why is robots.txt important?
Search engines like Google regularly crawl websites to capture their content and save it in the index. The robots.txt helps to set priorities:
- Protect performance: Large websites can steer crawlers deliberately to avoid unnecessary server load.
- Exclude unnecessary pages: For example, login pages, internal search results, or test environments.
- Avoid duplicate content: Parameter URLs or print versions can be excluded (see the example after this list).
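For illustration, rules for such cases could look like the following sketch; the paths here are placeholders, and crawlers such as Google and Bing additionally understand * as a wildcard, which is useful for parameter URLs:
User-agent: *
Disallow: /login/
Disallow: /suche/
Disallow: /*?print=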
Important: The robots.txt does not prevent pages from appearing in the index; it only stops the crawler. If a URL is known, for example from external links, it can still be listed, but without page content.
Building a robots.txt
A robots.txt consists of so-called user agents and directives. A simple example:
User-agent: *
Disallow: /intern/
Allow: /intern/übersicht.html
Explanation:
- User-agent: * affects all crawlers.
- Disallow prohibits crawling of the /intern/ directory.
- Allow makes a specific subpage of it accessible again.
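You can also define separate rule groups for different crawlers; each crawler then follows only the most specific group that matches its user agent. A sketch with placeholder paths:
User-agent: Googlebot
Disallow: /test/
User-agent: *
Disallow: /intern/
Here, Googlebot would obey only its own group and ignore the /intern/ rule, while all other crawlers follow the general block.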
Best Practices
- Always place the robots.txt in the root directory (example.com/robots.txt)
- Provide a sitemap so crawlers know which pages to index:
Sitemap: https://example.com/sitemap.xml
- Don't “hide” sensitive content via robots.txt; the file is publicly accessible.
- Don't use it for SEO-critical pages: If you want to specifically remove a page from the index, use the noindex meta tag or the X-Robots-Tag HTTP header instead (see the snippet below).
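For example, a page can be kept out of the index with the following meta tag in its HTML head or with the equivalent HTTP response header:
<meta name="robots" content="noindex">
X-Robots-Tag: noindex
Note that the page must remain crawlable for the noindex directive to be seen at all.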
Create robots.txt — step by step
Even without developer knowledge, you can create a simple robots.txt yourself. Here's how to go about it:
1. Create text file
Open a simple text editor (e.g. Notepad, VS Code) and create a new file. Save it under the name robots.txt — exactly as it is, without an additional extension.
2. Target crawlers
Determine which crawlers you want to target with the file. User-agent: * means: The rules apply to all search engines.
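If you only want to address one specific crawler, name it explicitly; Google's main crawler, for example, identifies itself as Googlebot:
User-agent: Googlebot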
3. Exclude paths
Specify which areas of your site should not be crawled:
Disallow: /intern/
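Each path gets its own Disallow line, so several areas can be excluded at once; the second path here is only a placeholder:
Disallow: /intern/
Disallow: /test/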
4. Release individual pages (optional)
If you still want to unblock certain pages within an excluded directory, use Allow:
Allow: /intern/übersicht.html
5. Upload the file
Upload the finished robots.txt to your domain's main directory (root level), e.g.
https://example.com/robots.txt
Only there will it be recognized by search engines.
6. Link to a sitemap (recommended)
At the very bottom of the file, you can also reference your sitemap, which additionally helps search engines:
Sitemap: https://example.com/sitemap.xml
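Putting the steps together, the finished file from this example looks like this:
User-agent: *
Disallow: /intern/
Allow: /intern/übersicht.html
Sitemap: https://example.com/sitemap.xml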
Note for Webflow users:
In the project settings, under SEO → Custom robots.txt, Webflow offers a separate field for this file. You can paste your content directly there; there is no need to upload it via FTP.
Conclusion
The robots.txt is a simple but important tool for SEO and crawling control. Used correctly, it improves the efficiency of search engines while protecting sensitive or irrelevant areas of a website.