I’m the administrator of kbin.life, a general purpose/tech orientated kbin instance.

  • 0 Posts
  • 26 Comments
Joined 2 years ago
cake
Cake day: June 29th, 2023

help-circle


  • So on my mbin instance, it’s on cloudflare. So I filter the AS numbers there. Don’t even reach my server.

    On the sites that aren’t behind cloudflare. Yep it’s on the nginx level. I did consider firewall level. Maybe just make a specific chain for it. But since I was blocking at the nginx level I just did it there for now. I mean it keeps them off the content, but yes it does tell them there’s a website there to leech if they change their tactics for example.

    You need to block the whole ASN too. Those that are using chrome/firefox UAs change IP every 5 minutes from a random other one in their huuuuuge pools.


  • Yeah, I probably should look to see if there’s any good plugins that do this on some community submission basis. Because yes, it’s a pain to keep up with whatever trick they’re doing next.

    And unlike web crawlers that generally check a url here and there, AI bots absolutely rip through your sites like something rabid.


  • If you’re running nginx I am using the following:

    if ($http_user_agent ~* "SemrushBot|Semrush|AhrefsBot|MJ12bot|YandexBot|YandexImages|MegaIndex.ru|BLEXbot|BLEXBot|ZoominfoBot|YaK|VelenPublicWebCrawler|SentiBot|Vagabondo|SEOkicks|SEOkicks-Robot|mtbot/1.1.0i|SeznamBot|DotBot|Cliqzbot|coccocbot|python|Scrap|SiteCheck-sitecrawl|MauiBot|Java|GumGum|Clickagy|AspiegelBot|Yandex|TkBot|CCBot|Qwantify|MBCrawler|serpstatbot|AwarioSmartBot|Semantici|ScholarBot|proximic|GrapeshotCrawler|IAScrawler|linkdexbot|contxbot|PlurkBot|PaperLiBot|BomboraBot|Leikibot|weborama-fetcher|NTENTbot|Screaming Frog SEO Spider|admantx-usaspb|Eyeotabot|VoluumDSP-content-bot|SirdataBot|adbeat_bot|TTD-Content|admantx|Nimbostratus-Bot|Mail.RU_Bot|Quantcastboti|Onespot-ScraperBot|Taboolabot|Baidu|Jobboerse|VoilaBot|Sogou|Jyxobot|Exabot|ZGrab|Proximi|Sosospider|Accoona|aiHitBot|Genieo|BecomeBot|ConveraCrawler|NerdyBot|OutclicksBot|findlinks|JikeSpider|Gigabot|CatchBot|Huaweisymantecspider|Offline Explorer|SiteSnagger|TeleportPro|WebCopier|WebReaper|WebStripper|WebZIP|Xaldon_WebSpider|BackDoorBot|AITCSRoboti|Arachnophilia|BackRub|BlowFishi|perl|CherryPicker|CyberSpyder|EmailCollector|Foobot|GetURL|httplib|HTTrack|LinkScan|Openbot|Snooper|SuperBot|URLSpiderPro|MAZBot|EchoboxBot|SerendeputyBot|LivelapBot|linkfluence.com|TweetmemeBot|LinkisBot|CrowdTanglebot|ClaudeBot|Bytespider|ImagesiftBot|Barkrowler|DataForSeoBo|Amazonbot|facebookexternalhit|meta-externalagent|FriendlyCrawler|GoogleOther|PetalBot|Applebot") { return 403; }

    That will block those that actually use recognisable user agents. I add any I find as I go on. It will catch a lot!

    I also have a huuuuuge IP based block list (generated by adding all ranges returned from looking up the following AS numbers):

    AS45102 (Alibaba cloud) AS136907 (Huawei SG) AS132203 (Tencent) AS32934 (Facebook)

    Since these guys run or have run bots that impersonate real browser agents.

    There are various tools online to return prefix/ip lists for an autonomous system number.

    I put both into a single file and include it into my web site config files.

    EDIT: Just to add, keeping on top of this is a full time job! EDIT 2: Removed Mojeek bot as it seems to be a normal web crawler.



  • I did a routine upgrade on my mbin server, where I had an old version with changes I made myself.

    Well turns out I upgraded something (probably redis) that broke symfony that broke everything.

    So I had a fun afternoon upgrading to the latest mbin version. I mean I needed to anyway but my hand was forced.

    Yep sometimes an innocent looking update will change your weekend plans.

    Anyways, any reason not to use ssh?








  • Well I run an ntp stratum 1 server handling 2800 requests a second on average (3.6mbit/s total average traffic), and a flight radar24 reporting station, plus some other rarely used services.

    The fan only comes on during boot, I’ve never heard it used in normal operation. Load averages 0.3-0.5. Most of that is Fr24. Chrony takes <5% of a single core usually.

    It’s pretty capable.



  • But isn’t that the point? You pay a low fee for inconvenient access to storage in the hope you never need it. If you have a drive failure you’d likely want to restore it all. In which case the bulk restore isn’t terrible in pricing and the other option is, losing your data.

    I guess the question of whether this is a service for you is how often you expect a NAS (that likely has redundancy) to fail, be stolen, destroyed etc. I would expect it to be less often than once every 5 years. If the price to store 12TB for 5 years and then restore 12TB after 5 years is less than the storage on other providers, then that’s a win, right? The bigger thing to consider is whether you’re happy to wait for the data to become available. But for a backup of data you want back and can wait for it’s probably still good value. Using the 12TB example.

    Backblaze, simple cost. $6x12 = $72/month which over a 5-year period would be $4320. Depending on whether upload was fast enough to incur some fees on the number of operations during backup and restore might push that up a bit. But not by any noticeable amount, I think.

    For amazon glacier I priced up (I think correctly, their pricing is overly complicated) two modes. Flexible access and deep archive. The latter is probably suitable for a NAS backup. Although of course you can only really add to it, and not easily remove/adjust files. So over time, your total stored would likely exceed the amount you actually want to keep. Some complex “diff” techniques could probably be utilised here to minimise this waste.

    Deep archive
    12288 put requests @ $0.05 = $614.40
    Storage 12288GB per month = $12.17 x 60 = $729.91
    12288 get requests @ $0.0004 = $4.92
    12288GB retrieval @ $0.0025 / GB x 12288 = $30.72 (if bulk possible)
    12288GB retrieval @ $0.02 / GB x 12288 = $245.76 (if bulk not possible)

    Total: $1379.95 / $1594.99

    Flexible
    12288 put requests @ $0.03 = $368.64
    Storage 12288GB per month = $44.24 x 60 = $2654.21
    12288 get requests @ $0.0004 = $4.92
    12288GB retrieval @ $0.01 / GB x 12288 = $122.88

    Total: $3150.65

    In my mind, if you just want to push large files you’re storing on a high capacity NAS somewhere they can be restored on some rainy day sometime in the future, deep archive can work for you. I do wonder though, if they’re storing this stuff offline on tape or something similar, how they bring back all your data at once. But, that seems to me to be their problem and not the user’s.

    Do let me know if I got any of the above wrong. This is just based on the tables on the S3 pricing site.


  • So there’s three problems you are very likely to encounter.

    1. Most providers now almost certainly filter their egress for netblocks under their control to prevent ip spoofing. So it’s likely the packets would never make it out at all.

    2: if it does work the return path would be over the normal Internet route and not via the vpn. Only the sent packets would go via the vpn host.

    3: if the client is behind nat the router will not recognise the response packets as belonging to an open connection and will drop them.

    I’m really not sure what your intention is.




  • Well mobile data is very different. With fibre optic you can generally keep provisioning more cables and a single cable already carries a huge amount already.

    Radio has an absolute efficiency limit for the bandwidth of a signal and we’re pretty damn close to that now.

    5g uses wider bandwidth channels, with more cells closer together and uses things like beamforming. But there’s still always going to be an upper limit that is considerably lower than fibre.

    This is why they likely want to discourage 5g becoming a full alternative to wired, because there’s just not the capacity to do it on the same scale.