buyabag posted on 2021-8-5 20:45:10

How can Apache block junk crawlers?

I run a batch of sites, and lately CPU and disk I/O have been pegged by all kinds of junk crawlers. Could the experts here advise on how to effectively block these junk crawlers with Apache?

I've tried several httpd.conf methods for blocking bot user agents, but they don't seem to work.

loquat posted on 2021-8-5 21:09:19

Block junk spiders via Apache .htaccess:

RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} "^$|^-$|MSNbot|Webdup|AcoonBot|SemrushBot|CrawlDaddy|DotBot|Applebot|AhrefsBot|Ezooms|EdisterBot|EC2LinkFinder|jikespider|Purebot|MJ12bot|DingTalkBot|DuckDuckBot|WangIDSpider|WBSearchBot|Wotbox|xbfMozilla|Yottaa|YandexBot|Barkrowler|SeznamBot|Jorgee|CCBot|SWEBot|PetalBot|spbot|TurnitinBot-Agent|mail.RU|curl|perl|Python|Wget|Xenu|ZmEu|EasouSpider|YYSpider|python-requests|oBot|MauiBot"
RewriteRule !(^robots\.txt$) http://en.wikipedia.org/wiki/Robots_exclusion_standard
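Before deploying, the alternation can be sanity-checked locally with `grep -E`, since `RewriteCond` uses the same extended-regex style. A minimal sketch, using only an excerpt of the blocklist above:

```shell
# Excerpt of the UA blocklist regex from the RewriteCond above
pattern='^$|^-$|SemrushBot|AhrefsBot|MJ12bot|DotBot|PetalBot|python-requests|curl|Wget'

# A real SemrushBot UA string should match, i.e. it would be blocked
ua='Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)'
if printf '%s\n' "$ua" | grep -Eq "$pattern"; then
  echo blocked
else
  echo allowed
fi
```

Note that `RewriteCond` matching is case-sensitive unless the `[NC]` flag is added, so casing in the list matters.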



Nginx blocking:



    # Block scraping by tools such as Scrapy
    if ($http_user_agent ~* (Scrapy|Curl|HttpClient)) {
      return 403;
    }
    # Block the listed UAs and requests with an empty UA
    if ($http_user_agent ~ "FeedDemon|JikeSpider|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|YisouSpider|HttpClient|MJ12bot|heritrix|EasouSpider|LinkpadBot|Ezooms|^$" )
    {
      return 403;
    }
    # Block request methods other than GET|HEAD|POST
    if ($request_method !~ ^(GET|HEAD|POST)$) {
      return 403;
    }
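As a hypothetical alternative sketch: nginx's documentation discourages `if` inside `server{}` where avoidable, so the same UA blocklist (names copied from the snippet above) could also be expressed with a `map`:

```
# In the http{} block: 1 = block, 0 = allow; empty UA is also blocked
map $http_user_agent $block_ua {
    default                                                0;
    ""                                                     1;
    ~*(SemrushBot|AhrefsBot|MJ12bot|EasouSpider|YYSpider)  1;
}

# In the server{} block
if ($block_ua) {
    return 403;
}
```

The `map` is evaluated lazily per request, and keeping the list in one place makes it easier to extend than scattered `if` blocks.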

buyabag posted on 2021-8-5 21:13:19

loquat posted on 2021-8-5 21:09
Block junk spiders via Apache .htaccess:

RewriteEngine on


That was a quick reply, I'll give it a try shortly.

隔壁老王 posted on 2021-8-5 21:30:16

Bookmarking this for later.

河小马 posted on 2021-8-6 09:20:17

Many crawlers don't respect robots.txt,

so blocking directly by User-Agent is the way to go.
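To decide which UAs are worth blocking, a quick tally of the access log helps. A minimal sketch, assuming a combined-format Apache log (the sample lines and `/tmp` path below are made up for illustration):

```shell
# Create a tiny sample log in combined format (illustrative data only)
cat > /tmp/sample_access.log <<'EOF'
1.2.3.4 - - [05/Aug/2021:20:00:01 +0800] "GET / HTTP/1.1" 200 512 "-" "SemrushBot/7"
5.6.7.8 - - [05/Aug/2021:20:00:02 +0800] "GET /a HTTP/1.1" 200 512 "-" "Mozilla/5.0"
9.9.9.9 - - [05/Aug/2021:20:00:03 +0800] "GET /b HTTP/1.1" 200 512 "-" "SemrushBot/7"
EOF

# In combined format the User-Agent is the 6th double-quote-delimited field;
# count requests per UA, busiest first
awk -F'"' '{print $6}' /tmp/sample_access.log | sort | uniq -c | sort -rn
```

On a real log, the top few non-browser entries are usually the crawlers responsible for the load.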

binge2018 posted on 2021-12-26 13:19:17

My English-language site keeps getting crawled day after day by 360 and Alibaba Cloud. Should I just block mainland China IPs outright?