Hmm, I took an original list and added to it. You got a website I can check? If so I’ll happily remove. I don’t mind slow web crawlers at all.
I’m the administrator of kbin.life, a general purpose/tech orientated kbin instance.
So my mbin instance is behind Cloudflare, and I filter the AS numbers there; they don’t even reach my server.
On the sites that aren’t behind Cloudflare, yep, it’s at the nginx level. I did consider the firewall level, maybe with a specific chain for it, but since I was already blocking in nginx I just did it there for now. It keeps them off the content, but yes, it does still tell them there’s a website there to leech from if they change their tactics, for example.
You need to block the whole ASN too. The ones using Chrome/Firefox UAs change IP every 5 minutes, hopping to a random other address in their huuuuuge pools.
Yeah, I should probably look to see if there are any good plugins that do this on some community-submission basis. Because yes, it’s a pain to keep up with whatever trick they’re doing next.
And unlike web crawlers that generally check a url here and there, AI bots absolutely rip through your sites like something rabid.
If you’re running nginx I am using the following:
if ($http_user_agent ~* "SemrushBot|Semrush|AhrefsBot|MJ12bot|YandexBot|YandexImages|MegaIndex.ru|BLEXbot|BLEXBot|ZoominfoBot|YaK|VelenPublicWebCrawler|SentiBot|Vagabondo|SEOkicks|SEOkicks-Robot|mtbot/1.1.0i|SeznamBot|DotBot|Cliqzbot|coccocbot|python|Scrap|SiteCheck-sitecrawl|MauiBot|Java|GumGum|Clickagy|AspiegelBot|Yandex|TkBot|CCBot|Qwantify|MBCrawler|serpstatbot|AwarioSmartBot|Semantici|ScholarBot|proximic|GrapeshotCrawler|IAScrawler|linkdexbot|contxbot|PlurkBot|PaperLiBot|BomboraBot|Leikibot|weborama-fetcher|NTENTbot|Screaming Frog SEO Spider|admantx-usaspb|Eyeotabot|VoluumDSP-content-bot|SirdataBot|adbeat_bot|TTD-Content|admantx|Nimbostratus-Bot|Mail.RU_Bot|Quantcastboti|Onespot-ScraperBot|Taboolabot|Baidu|Jobboerse|VoilaBot|Sogou|Jyxobot|Exabot|ZGrab|Proximi|Sosospider|Accoona|aiHitBot|Genieo|BecomeBot|ConveraCrawler|NerdyBot|OutclicksBot|findlinks|JikeSpider|Gigabot|CatchBot|Huaweisymantecspider|Offline Explorer|SiteSnagger|TeleportPro|WebCopier|WebReaper|WebStripper|WebZIP|Xaldon_WebSpider|BackDoorBot|AITCSRoboti|Arachnophilia|BackRub|BlowFishi|perl|CherryPicker|CyberSpyder|EmailCollector|Foobot|GetURL|httplib|HTTrack|LinkScan|Openbot|Snooper|SuperBot|URLSpiderPro|MAZBot|EchoboxBot|SerendeputyBot|LivelapBot|linkfluence.com|TweetmemeBot|LinkisBot|CrowdTanglebot|ClaudeBot|Bytespider|ImagesiftBot|Barkrowler|DataForSeoBo|Amazonbot|facebookexternalhit|meta-externalagent|FriendlyCrawler|GoogleOther|PetalBot|Applebot") { return 403; }
That will block those that actually use recognisable user agents. I add any I find as I go on. It will catch a lot!
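If you want to sanity-check that the rule is actually matching, something like this works (Python with the requests library; the hostname and UA strings are just examples):

```python
# Send a request with a blocked bot UA and one with a normal browser UA,
# then compare the status codes. Replace example.com with your own site.
import requests

checks = {
    "blocked bot": "Mozilla/5.0 (compatible; ClaudeBot/1.0; +claudebot@anthropic.com)",
    "normal browser": "Mozilla/5.0 (X11; Linux x86_64; rv:126.0) Gecko/20100101 Firefox/126.0",
}

for label, ua in checks.items():
    r = requests.get("https://example.com/", headers={"User-Agent": ua}, timeout=10)
    print(f"{label}: HTTP {r.status_code}")  # expect 403 for the bot UA, 200 for the browser
```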
I also have a huuuuuge IP based block list (generated by adding all ranges returned from looking up the following AS numbers):
AS45102 (Alibaba Cloud), AS136907 (Huawei SG), AS132203 (Tencent), AS32934 (Facebook)
Since these guys run or have run bots that impersonate real browser agents.
There are various tools online to return prefix/ip lists for an autonomous system number.
I put both into a single file and include it into my web site config files.
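Roughly, generating that file can be scripted. A minimal sketch (Python, using RIPEstat’s public announced-prefixes endpoint, which is one of the tools of that kind; the output filename is just an example):

```python
# Fetch the announced prefixes for each ASN and emit an nginx "deny" list.
# The output filename is only an example; adjust to taste.
import json
import urllib.request

ASNS = ["AS45102", "AS136907", "AS132203", "AS32934"]  # Alibaba, Huawei SG, Tencent, Facebook

def prefixes_for(asn):
    url = f"https://stat.ripe.net/data/announced-prefixes/data.json?resource={asn}"
    with urllib.request.urlopen(url, timeout=30) as resp:
        data = json.load(resp)
    return [entry["prefix"] for entry in data["data"]["prefixes"]]

with open("blocked-asn.conf", "w") as out:
    for asn in ASNS:
        out.write(f"# {asn}\n")
        for prefix in sorted(set(prefixes_for(asn))):
            out.write(f"deny {prefix};\n")
```

Then an include of that file in the site config does the rest, same as the user-agent rule.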
EDIT: Just to add, keeping on top of this is a full-time job!
EDIT 2: Removed Mojeek bot as it seems to be a normal web crawler.
The sun always shines on pc.
I did a routine upgrade on my mbin server, which was running an old version with changes I’d made myself.
Well, it turns out I upgraded something (probably Redis) that broke Symfony, which broke everything.
So I had a fun afternoon upgrading to the latest mbin version. I mean I needed to anyway but my hand was forced.
Yep sometimes an innocent looking update will change your weekend plans.
Anyways, any reason not to use ssh?
I think it had its uses in the past, specifically when it had battery-backed cache memory to prevent full array rebuilds and loss of cached data on power failure.
Also, at the height of RAID controller use (I would say the 90s and 2000s) there probably were some compute savings from shifting the work to a dedicated controller.
In modern day, completely agree.
I’m sure I’ve seen paid software that will detect and read data from several popular hardware controllers. Maybe there’s something free that can do the same.
For the future, I’d say that with modern copy-on-write filesystems, so long as you don’t mind the long rebuild on power failures, software RAID is fine for most people.
I found this, which seems to be someone trying to do something similar with a drive array built with an Intel RAID controller.
Note that they are using drive images; you should be too.
The OP made clear it was a controller failure or entire system (I read hardware here) failure. Which does complicate things somewhat.
Sorry. I chose .local and I’m sticking to it.
I’m in the ntppool.org pool for the UK. It randomly assigns servers, which could be any stratum really (but there is quality control on the time provided). I also have stratum 2 servers in .fi and .fr (which are dedicated servers I also use for other things, rather than a Raspberry Pi).
No. A GPS (with PPS) hat. That counts as a stratum 0 time source, making the NTP server stratum 1.
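If you’re curious what stratum (and offset) a given server reports back, here’s a quick way to peek (Python, third-party ntplib package assumed; the hostname is just an example):

```python
# Query an NTP server and print the stratum it claims, plus the measured offset/delay.
# ntplib is a third-party package (pip install ntplib); the hostname is only an example.
import ntplib

client = ntplib.NTPClient()
response = client.request("0.uk.pool.ntp.org", version=3)

print(f"stratum: {response.stratum}")        # 1 means it has a stratum 0 reference, e.g. GPS+PPS
print(f"offset:  {response.offset:+.6f} s")  # local clock offset relative to the server
print(f"delay:   {response.delay:.6f} s")    # round-trip delay to the server
```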
Well, I run an NTP stratum 1 server handling 2800 requests a second on average (3.6 Mbit/s total average traffic), and a Flightradar24 reporting station, plus some other rarely used services.
The fan only comes on during boot; I’ve never heard it in normal operation. Load averages 0.3-0.5, and most of that is Fr24. Chrony usually takes <5% of a single core.
It’s pretty capable.
Not sure; this wasn’t clear to me from their pricing page. There were four stars next to that item, but the explanation for them didn’t elaborate on bulk retrieval.
I assumed there was some minimum number of operations, or it had to be the entire backup restored to count as bulk.
But isn’t that the point? You pay a low fee for inconvenient access to storage in the hope you never need it. If you have a drive failure you’d likely want to restore it all, in which case the bulk restore isn’t terribly priced, and the other option is losing your data.
I guess the question of whether this is a service for you is how often you expect a NAS (that likely has redundancy) to fail, be stolen, be destroyed, etc. I would expect that to be less often than once every 5 years. If the price to store 12TB for 5 years and then restore 12TB after 5 years is less than the storage on other providers, then that’s a win, right? The bigger thing to consider is whether you’re happy to wait for the data to become available, but for a backup of data you want back and can wait for, it’s probably still good value. Using the 12TB example:
Backblaze: simple cost. $6 x 12 = $72/month, which over a 5-year period would be $4,320. Depending on upload speed, per-operation fees during backup and restore might push that up a bit, but not by any noticeable amount, I think.
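For the record, the simple part of that sum, ignoring any per-operation or egress fees (an assumption):

```python
# Back-of-envelope Backblaze cost for the 12TB example, ignoring per-operation/egress fees.
tb_stored = 12
price_per_tb_month = 6.00   # USD, the $6/TB/month rate used above
months = 5 * 12

monthly = tb_stored * price_per_tb_month
total = monthly * months
print(f"${monthly:.2f}/month -> ${total:.2f} over 5 years")  # $72.00/month -> $4320.00 over 5 years
```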
For Amazon Glacier I priced up (I think correctly; their pricing is overly complicated) two modes: Flexible Retrieval and Deep Archive. The latter is probably suitable for a NAS backup, although of course you can only really add to it, and not easily remove/adjust files, so over time your total stored would likely exceed the amount you actually want to keep. Some complex “diff” techniques could probably be utilised here to minimise this waste.
Total: $1379.95 / $1594.99
Total: $3150.65
In my mind, if you just want to push the large files you’re storing on a high-capacity NAS somewhere they can be restored from on some rainy day in the future, Deep Archive can work for you. I do wonder, though, if they’re storing this stuff offline on tape or something similar, how they bring back all your data at once. But that seems to me to be their problem and not the user’s.
Do let me know if I got any of the above wrong. This is just based on the tables on the S3 pricing site.
So there are three problems you are very likely to encounter.
1: Most ISPs filter packets with spoofed source addresses at their edge, so the traffic may never get out of your network in the first place.
2: If it does work, the return path would be over the normal Internet route and not via the VPN. Only the sent packets would go via the VPN host.
3: If the client is behind NAT, the router will not recognise the response packets as belonging to an open connection and will drop them.
I’m really not sure what your intention is.
Yeah, not sure how many ISPs block it. They didn’t use to, 10 years or so ago. I used to block unknown IPs at my egress.
But they should, and I’m hoping they do now.
I’m also not too sure what the point would be for the OP. Even if their ISP allows the IP spoofing, the response would take the normal route back to the VPN client.
I’m not using Lemmy. But I was thinking of making a process to periodically scan the object storage, check each object for a reference from a post, comment, etc., and delete it if none are found. In most cases the images are deleted, but sometimes they don’t seem to be.
Lemmy could probably have a similar process created; a rough sketch of the idea is below.
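Very roughly, the idea would be something like this (a sketch only: the bucket, table and column names are made up, and it assumes boto3 plus psycopg2):

```python
# Sketch: find (and optionally delete) object-storage files that no post/comment still references.
# The bucket, table and column names are hypothetical; adapt them to the instance's actual schema.
import boto3
import psycopg2

s3 = boto3.client("s3")
db = psycopg2.connect("dbname=mbin user=mbin")
BUCKET = "media.example.org"

def is_referenced(key):
    with db.cursor() as cur:
        cur.execute("SELECT 1 FROM image WHERE file_path = %s LIMIT 1", (key,))
        return cur.fetchone() is not None

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET):
    for obj in page.get("Contents", []):
        if not is_referenced(obj["Key"]):
            print("orphan:", obj["Key"])
            # s3.delete_object(Bucket=BUCKET, Key=obj["Key"])  # uncomment once the query is trusted
```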
Well, mobile data is very different. With fibre optic you can generally keep provisioning more cables, and a single cable already carries a huge amount.
Radio has an absolute efficiency limit (the Shannon limit) on how much data you can push through a given signal bandwidth, and we’re pretty damn close to it now.
5G uses wider channels, more cells closer together, and things like beamforming, but there’s still always going to be an upper limit that is considerably lower than fibre.
This is likely why they want to discourage 5G from becoming a full alternative to wired: there’s just not the capacity to do it on the same scale.
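To put a number on that ceiling: the Shannon-Hartley formula C = B * log2(1 + SNR) is the hard upper bound for a single channel. A quick illustration (the bandwidth and SNR figures are just examples, not measurements):

```python
# Shannon-Hartley capacity C = B * log2(1 + SNR) for a single radio channel.
# The bandwidth and SNR values below are illustrative only.
from math import log2

def capacity_bps(bandwidth_hz, snr_linear):
    return bandwidth_hz * log2(1 + snr_linear)

bandwidth = 100e6        # 100 MHz channel, typical of mid-band 5G
snr = 10 ** (20 / 10)    # 20 dB signal-to-noise ratio

print(f"{capacity_bps(bandwidth, snr) / 1e6:.0f} Mbit/s")  # about 666 Mbit/s, shared by everyone on the cell
```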
Didn’t have the link to hand, but a search turned this one up: https://reggiodigital.com/blog/nginx-rule-blocking-bad-bots/ (it looks to be the same list, and you can see the ones I’ve added at the end of it).