hungry agents
For a while I've noticed crawlers consuming large volumes of dynamic pages to feed LLMs and whatever the current language-model fad is.
I don't consent and the robots file gets ignored, so it's best to give them data; they seem to want it. How about sending a few KB that expands to GBs? Hopefully that'll slow down the crawler, as its memory is finite, just like the server resources it's consuming.
Use your normal page template, if you have one: the top of the page should be called head.html, the bottom tail.html.
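If you don't have a template to hand, a minimal sketch will do; the markup below is placeholder, and the only thing the later commands depend on is the two file names:

cat >head.html <<'EOF'
<!DOCTYPE html>
<html><head><title>welcome</title></head><body>
EOF
cat >tail.html <<'EOF'
</body></html>
EOF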
(cat head.html; dd if=/dev/zero bs=1024 count=500000000; cat tail.html ) | brotli -9 >file.brotli
(cat head.html; dd if=/dev/zero bs=1024 count=50000000; cat tail.html ) | gzip -9 >file.gz
Copy file.{gz,brotli} to your document root.
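As a quick sanity check, the files should be small on disk but enormous once expanded. The counts below stream the whole payload through the decompressor, so expect them to take a while:

ls -lh file.brotli file.gz
brotli --decompress --stdout file.brotli | wc -c
gzip -dc file.gz | wc -c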
How do we send the compressed data in a way that tells the receiver that it's compressed?
Fortunately, there's a header, Content-Encoding, that you can use to say that the data is in a compressed format. Obviously you don't want to read in and compress all the data on every load, so instead you tell the server that this file on disk is already compressed, and that's what the page load will be served from.
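What we're aiming for is a response that looks, to the crawler, like any other compressed page; a sketch of the headers, not a capture:

HTTP/1.1 200 OK
Content-Type: text/html
Content-Encoding: br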
Note: you may also like mod-gzip-disk, which helps skip the compression stage at delivery time for static content.
We can't send this when robots.txt is the request, though. Let's build up the rules, starting with URIs we know are junk probes:
RewriteCond %{REQUEST_URI} ^/(wp-login\.php|wp|cgi-bin/luci)
RewriteRule ^ - [E=trash:1]
We know that we don't have these URIs present, so we can 'trash' them straight away. This sets the environment variable trash, which we use later on.
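(One assumption about your setup, not spelled out in the rules themselves: mod_rewrite needs to be loaded and rewriting enabled in the same context, vhost or .htaccess, before any of the RewriteCond/RewriteRule lines:

RewriteEngine On

mod_headers needs to be loaded too, for the Header directives further down.)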
RewriteCond %{REQUEST_URI} !^/robots\.txt
RewriteCond %{HTTP_USER_AGENT} "^.*(Yandex|MJ12bot).*" [NC]
RewriteRule ^ - [E=trash:1]
We do want robots.txt to be available: we want the bots to go away, and robots.txt is what says so.
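For reference, a minimal robots.txt that tells these two crawlers to stay away entirely (the user-agent tokens match the ones in the rule above):

User-agent: Yandex
Disallow: /

User-agent: MJ12bot
Disallow: /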
RewriteCond %{ENV:trash} 1
RewriteCond %{HTTP:Accept-Encoding} "br" [NC]
RewriteRule ^ - [E=trash-br:1]
If we've set trash and brotli is in the Accept-Encoding header, great, we'll set trash-br for later.
RewriteCond %{ENV:trash} 1
RewriteCond %{ENV:trash-br} 1
RewriteRule ^ /file.brotli [L,E=no-gzip,E=trash-br:1]
RewriteCond %{ENV:trash} 1
RewriteRule ^ /file.gz [L,E=no-gzip,E=trash-gz:1]
In the above two, we send one of the compressed files: if trash-br was set, the brotli-compressed file; otherwise, we'll make the wild assumption that gzip is supported and just send that.
Header set Content-Encoding "br" env=trash-br
Header set Content-Encoding "gzip" env=trash-gz
Header set Content-Type "text/html" env=trash
The above three lines set content headers. If trash-br was set, then Content-Encoding: br; if trash-gz, then Content-Encoding: gzip; and if trash, then Content-Type: text/html.
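To watch the whole thing fire from outside (example.org and the path are placeholders; MJ12bot is one of the user agents trapped above):

curl -sS -o /dev/null -D - \
  -H 'User-Agent: MJ12bot' \
  -H 'Accept-Encoding: br' \
  https://example.org/any-page

You should get Content-Encoding: br back, with a Content-Length of a few KB rather than the expanded size.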
gzip doesn't compress repeating information like this down quite as well as brotli, which is why the zero padding is greater for brotli than for gzip.
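You can see the gap on a small sample; byte counts vary with compressor versions, so treat these as illustrative:

dd if=/dev/zero bs=1024 count=10240 2>/dev/null | gzip -9 | wc -c
dd if=/dev/zero bs=1024 count=10240 2>/dev/null | brotli -9 | wc -c

The brotli number should come out markedly smaller, as brotli's window is far larger than deflate's 32 KB.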
If I were on the other end of this, I think I'd try to process it as a stream, rather than putting it on disk/RAM/DB; you still have to walk through it to some extent, even if you give up after N bytes of non-alpha content.
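A sketch of that defence from the crawler's side (the URL is a placeholder, and --compressed only handles br if your curl was built with brotli support): cap what you're willing to expand and let the broken pipe kill the transfer:

curl -sS --compressed https://example.org/page | head -c 1048576 >page.html

head exits after 1 MB, curl dies on the broken pipe, and the gigabytes of zeros never land anywhere.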