How do I block bots that don't obey robots.txt?

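I recently changed my robots.txt to keep bots from making expensive search API queries. They are still allowed on every other page; only /q?..., which is a search API query and expensive, is disallowed.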

User-agent: *
Disallow: /q?

Sitemap: /sitemap.xml.gz

I am still seeing bots in my logs, though. Is it Google, or just something "Googlebot compatible"? How can I block bots from /q completely?

2014-10-18 21:04:23.474 /q?query=category%3D5030%20and%20cityID%3D4698187&o=4 200 261ms 7kb Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) module=default version=disallow
66.249.79.28 - - [18/Oct/2014:12:04:23 -0700] "GET /q?query=category%3D5030%20and%20cityID%3D4698187&o=4 HTTP/1.1" 200 8005 - "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"  ms=261 cpu_ms=108 cpm_usd=0.050895 app_engine_release=1.9.13 instance=00c61b117cdfd20321977d865dd08cef54e2fa

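Can I blacklist specific bots based on the HTTP headers in my request handler, or in my dos.yaml, if robots.txt can't do this? When I run the following log search, it returns 50 matches for the last 2 hours: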

path:/q.* useragent:.*Googlebot.*

The log lines look like this, and they appear to come from Googlebot:

2014-10-19 00:37:34.449 /q?query=category%3D1030%20and%20cityID%3D4752198&o=18 200 138ms 7kb Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) module=default version=disallow 66.249.79.102 - - [18/Oct/2014:15:37:34 -0700] "GET /q?query=category%3D1030%20and%20cityID%3D4752198&o=18 HTTP/1.1" 200 7965 - "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "www.classifiedsmarket.appspot.com" ms=138 cpu_ms=64 cpm_usd=0.050890 app_engine_release=1.9.13 instance=00c61b117c781458f46764c359368330c7d7fdc4

1 Answer

Yes, any visitor or bot can claim to be Googlebot/2.1 simply by changing its User-Agent header.

You can verify that it was the real Googlebot with a reverse DNS lookup.
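
The reverse DNS check can be sketched in Python roughly as follows. This is a minimal sketch, not an official implementation: the function name `is_verified_googlebot` and the injectable `reverse_lookup` / `forward_lookup` parameters are made up for illustration, and the domain list assumes the crawler hostnames end in googlebot.com or google.com. It resolves the IP to a hostname, checks the domain, then resolves the hostname back and confirms it maps to the same IP. `socket.gethostbyname_ex` only handles IPv4.

```python
import socket

# Assumption: genuine crawler hostnames end in one of these domains.
GOOGLE_CRAWLER_DOMAINS = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Return True if `ip` passes a reverse + forward DNS check.

    The two lookup parameters are hypothetical hooks so the function can be
    exercised without network access; by default the socket module is used.
    """
    reverse_lookup = reverse_lookup or socket.gethostbyaddr
    forward_lookup = forward_lookup or socket.gethostbyname_ex

    # Step 1: reverse lookup, IP -> hostname.
    try:
        hostname = reverse_lookup(ip)[0]
    except OSError:  # socket.herror and socket.gaierror are OSError subclasses
        return False

    # Step 2: the hostname must belong to an expected crawler domain.
    if not hostname.lower().rstrip(".").endswith(GOOGLE_CRAWLER_DOMAINS):
        return False

    # Step 3: forward lookup, hostname -> IPs, which must include the original IP.
    try:
        addresses = forward_lookup(hostname)[2]
    except OSError:
        return False
    return ip in addresses


if __name__ == "__main__":
    # IP taken from the log line in the question; the result depends on live DNS.
    print(is_verified_googlebot("66.249.79.28"))
```

Doing this lookup on every request would be slow, so in practice the result would probably be cached per IP.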

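Judging by the IPs in your logs, it appears to be the real bot. Your robots.txt is also correct, so it should only be a matter of time until Google picks up the new rule, after which all of these requests should stop.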
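
As a rough local sanity check of the rule, the robots.txt from the question can be fed to Python's standard `urllib.robotparser`. This is only a sketch: Python's parser is not Google's, so it approximates how a crawler would read the file, and the `ROBOTS_TXT` constant is simply the file from the question copied into a string.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt from the question, copied verbatim.
ROBOTS_TXT = """\
User-agent: *
Disallow: /q?

Sitemap: /sitemap.xml.gz
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A search API URL like the ones in the logs should be disallowed.
search_allowed = parser.can_fetch("Googlebot", "/q?query=category%3D5030&o=4")
# Other pages should remain crawlable.
home_allowed = parser.can_fetch("Googlebot", "/")

print("search URL allowed:", search_allowed)
print("home page allowed:", home_allowed)
```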

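Bots that don't respect robots.txt can of course be blocked from accessing the resource, but (depending on the criteria you use to identify bots) this risks blocking human visitors as well.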
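
One way to block by HTTP header in the request handler is a small WSGI middleware, sketched below. Everything named here is an assumption for illustration: the `block_bots_on_search` wrapper, the `BOT_PATTERN` list, and the choice of a 403 response are not from the question. Because the User-Agent header is trivially spoofed, this only stops bots that identify themselves, and an overly broad pattern could reject real visitors.

```python
import re

# Hypothetical pattern; adjust to the user agents actually seen in the logs.
BOT_PATTERN = re.compile(r"googlebot|bingbot|crawler|spider", re.IGNORECASE)


def block_bots_on_search(app, blocked_path="/q"):
    """Wrap a WSGI app so self-identified bots get 403 on the search path."""

    def middleware(environ, start_response):
        # PATH_INFO excludes the query string, so /q?query=... arrives as "/q".
        path = environ.get("PATH_INFO", "")
        user_agent = environ.get("HTTP_USER_AGENT", "")

        if path == blocked_path and BOT_PATTERN.search(user_agent):
            body = b"Forbidden for crawlers\n"
            start_response(
                "403 Forbidden",
                [("Content-Type", "text/plain"),
                 ("Content-Length", str(len(body)))],
            )
            return [body]

        # Everything else is passed through to the wrapped application.
        return app(environ, start_response)

    return middleware
```

The wrapper would be applied once at startup, for example `application = block_bots_on_search(application)`, where `application` stands for whatever WSGI app the site already exposes.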

Answered 2024-05-14T07:46:47+08:00
