I would like to upload large image files to Google Cloud for machine learning purposes in RStudio.
Each image zip is around 4.7 GB, and it takes longer to unzip than to download. I would like to know whether there is a way to upload the image files to Google Cloud using the existing Kaggle URLs, such as: https://www.kaggle.com/c/5174/download/Images_1.zip
or https://www.kaggle.com/c/avito-duplicate-ads-detection/data
and extract them quickly on a VM running RStudio for data analytics?
Best Answer
Have you installed RStudio in a Linux VM? If so, you can SSH into your instance with the command:
sudo gcloud compute ssh <your-instance-name> --zone <your-instance-zone>
and then use wget from inside your instance to download the file. wget might get disconnected during a long download, but it has options that help the download finish, such as -t to retry the download several times and -c to resume a partially downloaded file.
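A minimal sketch of such a download, using the first URL from the question (note that Kaggle competition downloads normally require you to be logged in, so you may need to supply your session cookies, for example via wget's --load-cookies option):

```shell
# Retry up to 5 times (-t 5) and resume a partial download (-c)
# if the connection drops mid-transfer.
wget -t 5 -c "https://www.kaggle.com/c/5174/download/Images_1.zip"
```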
After the file is downloaded you can use 7-Zip to extract it in the directory where it was downloaded:
7z e Images_1.zip
Note that 7z e extracts every file into the current directory; if you want to keep the archive's folder structure (usually what you want for an image dataset), use 7z x Images_1.zip instead.
You can then copy the extracted files to a Cloud Storage bucket with:
gsutil cp Images_1 gs://<your-bucket-name>
When uploading a directory with many image files, add -r to copy recursively and -m to run the copies in parallel, which is much faster: gsutil -m cp -r Images_1 gs://<your-bucket-name>
If wget and 7-Zip are not installed in the VM, you can install them first. For example, on an Ubuntu or Debian VM:
sudo apt-get update
sudo apt-get install wget p7zip-full