A manual and automated CSV file upload solution for a website

Tags: api, asp.net, ftp, iis, windows

Problem

I run a website which has a requirement for a new feature – to enable users to upload CSV files into the website. The users will only ever need to upload 1 CSV file at a time, and only 1 or 2 CSV files per day on average. I get to define the file format of the CSV, so I will be specifying what the fields are and the data types of each field, but I don't know how many rows will be in the file and I don't know what the specific data will be either. You can assume the CSV file will never be bigger than 1MB in size.

The website already exists; however, the feature to upload CSV files is still in the planning stages. The website currently enables users to log in and manage their data (ie. it has privilege control at the user level) and it's all managed through an SQL Server database. The website runs ASP.NET on IIS 7.5 – this cannot be altered. The server is not NATed but it is firewalled. If you need more information about the website, please ask.

From the user's perspective, they will be generating the CSV file on their PC using their own programs. 99% of the time the CSV file will be generated by some form of commercial software package. The choice of package is up to them – there are many products in the market that generate the kinds of CSV file they will be uploading to my website (I don't want to be too specific here, so that I can maintain my privacy and that of my users). Some users may even occasionally generate the CSV file by creating or editing it in Notepad or MS Excel. I will certainly recommend that they never use MS Excel, however what they do on their own PC is beyond my control.

There are two scenarios for users uploading the CSV file into my website:

Manual upload

A user logs in to the website and uploads a CSV file

Automatic upload

The user somehow configures their CSV-generating software and their PC to automatically upload CSV files to my website. I have no control over how they choose to do this, though I have spoken with some of my users and they have told me that their software is able to automatically generate the required CSV files into a directory (ie. new CSV files appear in a directory about once a day without human intervention). They could then upload these CSV files automatically to my website via some form of client kicked off by a scheduled task (again, I have no control over the client the user chooses, though I can offer advice). This is by no means a requirement – they can do the steps I have described manually, by clicking on their CSV-generating software, creating a file, and then running the client themselves to upload the CSV into my website. However, the whole point of the automatic upload is that they can leave it to run by itself so that IT support can use their time more productively.

The upload would generally be on a daily basis, but I will not be imposing any limits through my site on the number of uploads done per day. If they want to upload a new CSV file every hour then that's fine by me. However, from my discussions with users it sounds like a frequency of daily is going to be most convenient, and the time of day doesn't matter.

If the automatic CSV upload fails then my website will notify the user of this fact. I haven't decided exactly what form this notification will take yet since it really depends on the chosen solution. In the case where the upload fails due to the uploaded CSV file containing bad data then the user will need to manage this by exception (ie. a human will have to get involved – this is not something that software can fix). They will need to investigate the reason for the fault and correct the CSV file, probably by correcting the data in their database then using their software package to re-generate the CSV file. They can then upload the corrected CSV file at their leisure – uploads are not time-critical, and bad CSV files will be simply ignored by my website, rather than being partially loaded into my database.

Functional requirements

  1. Users must be able to upload files manually themselves through the website
  2. Users must be able to upload files automatically at scheduled intervals from their own PCs
  3. Only CSV files (or the equivalent data a CSV file contains) will be processed into my website's database
  4. The entire process MUST occur without intervention from my IT support (and preferably their IT support too, though that may be out of my control, depending on the chosen solution)
  5. The file upload process must not introduce any security risks that would not also be present when submitting data through a form in the website
  6. On my website's side, the file upload permissions must be configurable on a per-user basis using either IIS, ASP.NET or a table in SQL Server and the firewall must not block the uploads
  7. Whatever solution I implement must be well supported for at least the next 10 years into the future (as of 2016)
  8. The chosen solution must use standard software that is already cheaply available to users (preferably at no cost)
  9. There MUST be automatic feedback to the user when bad data is uploaded so they can correct it and re-upload. The shorter the time delay between uploading and receiving feedback the better.

(If you think I've missed some requirement for a manual or automated CSV-file-upload interface then please let me know in the comments and I can add these in)

Some thoughts on users uploading bad-data

It's almost a certainty that users will upload CSV files containing bad data. This may not even be intentional: they might accidentally put integers in a date-only field, for example. However, it could also be intentional, such as malicious users uploading executable files containing malware. For this reason the CSV file should only ever be read and parsed, never executed.
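To make the read-and-parse-only idea concrete, here is a minimal validation sketch. The three-column format (an integer id, an ISO date, and a free-text note) is purely an assumption for illustration – the real field list would come from my published file format. The key point is that any error rejects the whole file, so nothing is ever partially loaded:

```python
import csv
import io
from datetime import datetime

def validate_csv(text):
    """Parse CSV text and return a list of error strings.

    An empty list means the file is acceptable. Any error means the
    whole file is rejected, never partially loaded. The three-column
    format (id, date, note) is only an example.
    """
    errors = []
    reader = csv.reader(io.StringIO(text))
    for lineno, row in enumerate(reader, start=1):
        if len(row) != 3:
            errors.append(f"line {lineno}: expected 3 fields, got {len(row)}")
            continue
        id_field, date_field, _note = row
        if not id_field.isdigit():
            errors.append(f"line {lineno}: id '{id_field}' is not an integer")
        try:
            datetime.strptime(date_field, "%Y-%m-%d")
        except ValueError:
            errors.append(f"line {lineno}: date '{date_field}' is not YYYY-MM-DD")
    return errors
```

Because the content is only ever tokenised and type-checked, an uploaded executable simply fails validation like any other bad data; it is never interpreted or run.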

Solutions I have considered

The manual upload via the website is trivial and is not really the focus of this question. Users would log in as they currently do, and upload the file via the standard multipart/form-data interface in their web browser. Once the upload is processed, they would receive feedback of any errors or a success message, along with auditing to show the history of uploads.

However I am trying to decide between two possible solutions for the automated upload interface and I can't decide which meets my functional requirements better (they both meet functional requirements 2, 3 and 4 equally in my opinion):

FTPS/SFTP

During the sign-up process, the website code would create a directory for each user to upload files to. They would each have unique login credentials for the FTP and these would be listed on the website once a user is logged in.

A Windows service would constantly scan the directories and when it detected that a user had uploaded a new file then it would process this file into the database.

If a CSV file with incorrect data were uploaded then I would send an email to the user, warning them that their CSV-file-upload has failed. The user could then login to the website to check the specific errors in that file. They could then re-try the upload either manually via the website (multipart/form-data interface) or via FTPS/SFTP.

As part of the file-upload process the user would need to signal to my Windows service that the upload is complete, either by renaming the uploaded CSV file, or by uploading a signal-file (eg. csv-xyz.complete) once the CSV file is uploaded.

Looking at this solution in light of the functional requirements…

  1. IIS can only handle FTPS; it cannot handle SFTP. Also, SFTP runs on a single port, whereas FTPS runs on multiple ports, which is probably no good with my firewall (I will test and update this). Are there any SFTP servers which run outside IIS that can be configured via ASP.NET? Is it possible to set up SFTP/FTPS directories with .NET? At a glance it seems so, but I would like to know if there are any limitations.

  2. How likely is SFTP/FTPS to be around in 10 years? It seems like an extremely popular technology today, but I would like to know if it is about to be superseded or is going obsolete without my knowledge.

API

The website already features an API, however this API does not currently have any file-upload capabilities. However, I could implement such a feature in an API and then users could upload the CSV-file data via this interface. The data would then never need to be placed into a CSV file at all – it would simply be security checked and sanity checked, then placed straight into the database.

Feedback to the user would be immediate – if any bad data is encountered then the response to the upload would contain an error message with specific details.

However it seems to me that it would require a lot of effort on the user's part to create the program to communicate with my API. I do not know of any standardized API file-upload clients to read a CSV and transmit the data to an API, though maybe some exist? If they do not then I doubt that many of my users would go to the effort of creating a whole new program just to upload a file to my website.
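That said, the per-user effort may be smaller than it first appears: a client can be little more than "read the file, POST it as multipart/form-data, check the status code". Here is a sketch using only the Python standard library; the URL, the `file` field name, and the `X-Api-Token` header are hypothetical placeholders for whatever the real API would define:

```python
import urllib.request
import uuid

def build_upload_request(url, csv_bytes, filename, token):
    """Build (but do not send) a multipart/form-data POST request.

    'url' and the 'X-Api-Token' header are placeholders for whatever
    endpoint and authentication scheme the real API would specify.
    """
    boundary = uuid.uuid4().hex
    body = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: text/csv\r\n\r\n"
    ).encode() + csv_bytes + f"\r\n--{boundary}--\r\n".encode()
    req = urllib.request.Request(url, data=body, method="POST")
    req.add_header("Content-Type", f"multipart/form-data; boundary={boundary}")
    req.add_header("X-Api-Token", token)
    return req

# Sending would then be:
#   with urllib.request.urlopen(req) as resp:
#       ...  # a non-2xx status raises HTTPError with the error details
```

A user's scheduled task could run a script like this against each new CSV file, so "weeks of development" may be pessimistic – but it does presuppose someone on their side who is comfortable with basic scripting.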

I have considered building a client and distributing it to users so that they can use it to upload CSVs via the API. But that raises a whole set of new problems – how would new releases be distributed? Would it be compatible with users' PCs? Would we get lots of queries of the form "it's broken, how do I …"? Would file permissions be a problem? Etc.

So again, looking at this solution in light of the functional requirements…

  1. If I were to develop an API client then IT support would probably get a lot of calls of the type "how do I do x using your client". On the other hand, if I did not develop a client, then IT support would probably still get calls about the specifics of the API, and since these would be of a technical nature I would probably have to respond to and work closely with the end users myself to resolve them.

  2. Would the API offer any security advantages not already available to SFTP/FTPS? The main difference would be that no CSV file would be uploaded, and so there would never be the possibility of someone loading malware onto the server (even if it is immediately detected as not being a CSV file and deleted).

  3. This would definitely be compatible with IIS

  4. If I were to create an executable client for users then this may become obsolete in future versions of Windows, and would probably not function on Mac or Linux. It is also an extra piece of code floating around that would require bugfixes and upgrades.

  5. I am not aware of any standard for an API which implements a file upload. Google Drive seems to have implemented one, but I don't know how widely accepted this is, and as far as I know they don't have a client which runs on a PC with access to the filesystem. If I don't develop an API client to give out to users then each user would be looking at weeks of development just to be able to upload a file into my website.

  6. The feedback to the user is more immediate and probably more "automatable" from a user perspective. It's not very easy for a program to scan for emails indicating a fault has occurred, however it is easy to take action when an HTTP response contains an error.

In conclusion

I'm leaning towards SFTP/FTPS simply because it satisfies more of the functional requirements. However, it does depend a bit on my further research to answer the questions raised above. Answers or thoughts on these would be very welcome.

I'm also after some advice from people who have already been there and tried one or both of these methods. What works well? What doesn't? What are the non-obvious pitfalls? If you have no personal experience in this area but have links where I can read up on this comparison, please post them.

Best Answer

You want to connect two endpoints you cannot change – the file-generation program, and the HTTP or FTP endpoint as your user sees it. And since there is no standard "plug", you need to create one, which has to be a script or program. If you want this to be user-friendly, and not only usable by "some IT experts", better bite the bullet and write and deploy a program.

You are not the first one to do this successfully, and you can offer maintenance to every paying user (in fact, it is a motivation for your customers to pay for your service). If you write the program yourself, it does not actually matter whether your program uses FTP, FTPS, SFTP, or an HTTP-based API. For what you have in mind, you might also have a look at curl or libcurl. Consider making your program cross-platform, either using Java, or Qt if you are a C++ guy.

Concerning this "must be supported for at least the next 10 years" – IMHO you are thinking way too much in terms of "I never want to be bothered by my users". Better to think in terms of "let's offer the users a solution which works now, and support them (in return for some money) if the solution stops working in the future". Internet communication technologies are so widely used that, whatever you choose, if it is not too exotic, chances are high it will still be supported in 10 years for reasons of backwards compatibility. But even if it is not, it is not such a big deal as long as your company is still alive in that period and can provide support. Just provide an update of your program to your users using a different technology under the hood.
