The challenge

For one of the Magento shops we host, I wanted to create a clone of the current production code to create a local development environment. I've started from the command line a sftp job to copy all files & folders to my development machine. During the execution of sftp job I found out that would take days to download all the folders & files. Although I know that Magento has a lot of folders and files by default & I had already cleaned the cache, generated and var folders before downloading and yes I know I've could have created first an archive, one folder in particularly was taking an extremely long time to download. The output of the sftp job showed it was downloading PNG images named with the following pattern d139063a685bb37283208f6d9df6cb68.png. After 4 hours of activity on just this folder alone, I decided to kill the job and to start investigating why it was taking so long.

After killing the job I quickly found out that the folder was located here httpdocs/pub/media/captcha/base and it had captcha in the name. A light bulb went on in my head and clarified the contents of the folder, but not yet why there where so many files.

Running $ ls -1 | wc -l showed that the folder incorporated 11.998.456 items. Wowzers!!!

The average files size of these images is 8.3kB ($ wc -c * devided by $ ls | wc -l). At the point of identification the captcha folder was around ~99,5GB big. Our webservers have plenty of SSD space and that is the main reason why it didn't popup in the server logs yet.

Googling on this topic it became clear that Magento 2.x stores generated images of captcha's to disk and apperently these images are not automatically removed after they expire. Time to cleanup and fix the unwanted situation.

The problem of rm and to many files.

To clean up the files in the folder I executed the normal command $ sudo -u magento_user rm -rf *, but this outputted immediately the following feedback:

/bin/rm: cannot execute [Argument list too long]

The captcha files are owned by the magento user and therefore we need to execute the command as magento_user ($ sudo -u magento_user command).

The reason this occurs is because bash actually expands the asterisk to every matching file, producing a very long command line. This expanded command line is too long. This not a user issue, but a system issue and is related to execve and ARG_MAX constant. There is more than enough of documentation about that (see man execve, debian's wiki).

Basically, the expansion produces a command (with its parameters) that exceeds the ARG_MAX limit. On kernel 4.4.0-131, the limit is to 2097152 Bytes. To validate on your system, you can get see value by executing:

$ getconf ARG_MAX

The solution to remove all the captcha files from the folder without removing the folder and its permissions, is run the following command:

$ sudo -u magento_user find . -maxdepth 1 -name '*.png' -delete

We could have easily removed the folder of course, but I didn't want to break the running production system. Removing of the files is saver then removing a folder in use. The folder is now empty, but this doesn't solve the main challenge.

Enabling the cronjob

Looking into the log files I found out that the cronjob(s) where not running. Although I was pretty sure the cronjobs did run in the past and there was even right now no complaints in the admin tool, the logfiles told for sure differently.

In the magento.cron.log there was a repeated message telling us the following:

Notice: Undefined index: jobs in /usr/share/nginx/ on line 39

cron:run [--group GROUP] [--bootstrap BOOTSTRAP]

And in the update.log it reported the following line:

setup-cron.ERROR: Database 'magento_database' does not exist or specified database server user does not have privileges to access this database.

Looking into the database I've found out quickly that in this case the issue was casued by the Magento 1.x to 2.2.x migration. Two of the cron jobs migrated from Magento 1, in core_config_data where not converted to the new format expected by Magento 2. Magento 2 expects the format to be crontab/{groupname}/jobs/{code}, while the imported ones miss the group name (e.g. default). If you've get the same messages like I get and you did migrate from Magento 1 to Magento 2, you might want to delete the wrong entries in core_config_data, or update them to map the new cron jobs. I decided to delete this 2 records from the MySQL database.

To fix this incorrect migrated configuration rows in the database we need to execute the following steps:

$ mysql -u root -p
Enter password:
mysql> USE mage_database;
mysql> DELETE FROM core_config_data WHERE path LIKE 'crontab/jobs%';
Query OK, 2 rows affected (0.01 sec)
mysql> \q

After this step inmediatelly the cronjobs are sucessfully executed again deliviering tons of e-mails and doing all the other jobs the should. Most of the cronjobs in Mangento run every minute. The folder with all the expired captcha images was also cleaned up and listed now only 8 items. The last item of the log output below here was the one we are searching for.

[2018-08-19 13:20:13] main.INFO: Cron Job outdated_authentication_failures_cleanup is run [] []
[2018-08-19 13:20:13] main.INFO: Cron Job outdated_authentication_failures_cleanup is successfully finished. Statistics: {"sum":0.0050489902496338,"count":1,"realmem":0,"emalloc":1128,"realmem_start":138412032,"emalloc_start":134830360} [] []
[2018-08-19 13:20:13] main.INFO: Cron Job newsletter_send_all is run [] []
[2018-08-19 13:20:13] main.INFO: Cron Job newsletter_send_all is successfully finished. Statistics: {"sum":0.039923906326294,"count":1,"realmem":0,"emalloc":285336,"realmem_start":138412032,"emalloc_start":134838152} [] []
[2018-08-19 13:20:14] main.INFO: Cron Job magento_newrelicreporting_cron is run [] []
[2018-08-19 13:20:14] main.INFO: Cron Job magento_newrelicreporting_cron is successfully finished. Statistics: {"sum":0.00068521499633789,"count":1,"realmem":0,"emalloc":752,"realmem_start":138412032,"emalloc_start":135516320} [] []
[2018-08-19 13:20:14] main.INFO: Cron Job captcha_delete_expired_images is run [] []

Why this happend?

We should have noticted this before but we didn't and this is the biggest learning of the day. We did upgrade the Mangeto shop from 2.2.3 to 2.2.5 last week and this is probably when the cronjobs started to fail. We've added new alerts on the logfiles and the next time this happens we get alerts on our dashboards that are informing us about the state of the webshop.

Thanks for reading.