Upload large online files to Google Cloud

google-app-engine · google-cloud-platform · google-cloud-storage · google-compute-engine · google-kubernetes-engine

I would like to upload large image files to Google Cloud for machine learning purposes in RStudio.

Each image zip is around 4.7 GB, and it takes longer to unzip than to download. Is there a way to upload the image files to Google Cloud directly from the current Kaggle URLs, such as https://www.kaggle.com/c/5174/download/Images_1.zip or https://www.kaggle.com/c/avito-duplicate-ads-detection/data, and extract them quickly on the RStudio VM for data analysis?

Best Answer

Have you installed RStudio in a Linux VM? If so, you can SSH into your instance with the command gcloud compute ssh <your-instance-name> --zone <your-instance-zone> (sudo is not needed here, and running gcloud under sudo can pick up the wrong credentials), and then use wget from inside your instance to download the file:

wget https://www.kaggle.com/c/5174/download/Images_1.zip
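For example, a minimal end-to-end session might look like the following sketch. The instance name and zone are placeholders, and note that Kaggle downloads require a logged-in session, so you may need to export your browser's kaggle.com cookies to a file and pass them to wget:

# SSH into the VM (replace the name and zone with your own)
gcloud compute ssh rstudio-vm --zone us-central1-a

# Inside the VM: download using exported Kaggle session cookies
wget --load-cookies cookies.txt https://www.kaggle.com/c/5174/download/Images_1.zip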

wget might get disconnected during the download, but options described in the wget manual can help you complete it, such as -t to retry the download a given number of times and -c to resume a partially downloaded file.
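For instance, a sketch combining both options (the retry count of 5 is an arbitrary choice):

# Retry up to 5 times; -c resumes a partial download instead of restarting
wget -t 5 -c https://www.kaggle.com/c/5174/download/Images_1.zip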

After the file is downloaded, you can use 7-Zip to unzip it in the directory where it was downloaded, using the command: 7z e Images_1.zip
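Note that 7z e extracts all files into the current directory and flattens any folder structure inside the archive. If you want to keep the archive's directory layout, the x command does that; the output directory name below is just an assumed target:

# x preserves paths; -o (no space after it) sets the output directory
7z x Images_1.zip -oImages_1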
You can then copy the extracted files to a GCP bucket using the command:
gsutil cp Images_1 gs://<your-bucket-name>
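If Images_1 is a directory of extracted images rather than a single file, gsutil needs -r to copy recursively, and -m parallelizes the transfer, which speeds up uploading many small files; the bucket name remains a placeholder:

gsutil -m cp -r Images_1 gs://<your-bucket-name>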

If wget and 7-Zip are not installed in the VM, you can install them as follows. These examples are for Ubuntu or Debian Linux VMs:

sudo apt-get update
sudo apt-get install wget
sudo apt-get install p7zip-full
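If you prefer, the same steps can be run non-interactively in one line; the -y flag auto-confirms the install prompts:

sudo apt-get update && sudo apt-get install -y wget p7zip-full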

On other distributions, just follow the corresponding installation instructions for wget and 7-Zip.