Hi, founder of Chartio here. Other than the title (which makes me a little sad) I liked your post. It's great to fully inform people of the security tradeoffs and you've done a nice job of laying out the levels and options of security that we've spent a lot of time developing.
In anything that is cloud based there is going to be some level where some hacker could get in and destroy everything. Most people on this site use cloud hosted servers, all of which would be at risk if Amazon or Rackspace got hacked. BI in the cloud is a new space and will be cautiously entered by some, but the benefits will outweigh the potential risks, just as has happened in every other segment of cloud computing.
Let's go for less sad. I've changed the title a little to better reflect the intent of the post.
With regards to cloud - you're right, Amazon and Rackspace and so on are a single point of failure for a lot of businesses... but they also have a lot of people dedicated specifically to keeping their systems secure. The average startup, on the other hand, doesn't.
I'm curious, what was the reason you chose to highlight Chartio out of all the companies in the cloud BI space? We actually feel that we have the best security practices in the space, mostly due to the fact that we're the only ones not doing data warehousing, where you're required to upload a copy of your database to the provider.
Luck of the draw, actually. Someone in an IRC channel I'm in mentioned it [in the context of "someone asked me to set this up, I told them heck no"], I glanced at the page, did a double-take.
It's debatable whether DWH is more or less secure than your approach - and it also depends heavily on how the DWH is done. Having to explicitly move data around also gives an opportunity to scrub it.
For the record, the proactive approach you're taking with your responses here is heartening. The goal of my posts is always, in the end, to push for something better, not just tear down what's there. Glad to see that you've an open mind towards improvements.
Those guys for sure have a lot more manpower than we do (so far!) but I might also point out that security isn't totally about how many people you have working on a problem, but how simple you're able to make it. But - that's not me getting into a security argument...
I'm a founder of Emergent One, a startup that uses similar agent-based technology to Chartio to access production databases and build out RESTful APIs. Some of our APIs are write-enabled, which makes the proposed risk even higher than that of Chartio's.
We've spent a lot of time thinking about security risks and writing code to reduce them. I thought I'd share a few things we do and that we've learned from our experience:
* The agent approach is the most popular because it allows for a system administrator to easily sever the connection from the database server without having to worry about writing queries to revoke user access.
* We never run unindexed queries without an explicit request from a customer and a manual entry from an Emergent One employee.
* We're currently looking into security consultants to continuously test our production environment.
* We're building an appliance version of our software much like Github Enterprise in order to accommodate the customers that aren't comfortable with their data hitting the cloud.
* We strive to have very quick and personal customer service directly from engineers. The vast majority of the responses are within the hour.
and last but certainly not least...
* The very best thing we can do is be honest and straightforward about the inherent risk behind our platform. Being able to build and maintain a pristine level of trust is the only thing that will keep us in business.
I'm sure Chartio does things very similarly. Direct-database access technology is not for everyone, but it's also proving to be extremely valuable for both Chartio's customers and ours. The cloud advantage that makes most SaaS software great is still there.
Having a product that also involves connecting to other people's databases, we're pretty well versed in this problem. I will admit it's a bold step to allow an external service access to your production databases, but as a trade-off of convenience vs security the former does win more often than people would think. This is especially true for people using PaaS/DBaaS providers which by default allow access from all inbound IPs (i.e. not whitelisted).
One thing we try to do is be open about all this and we explicitly mention it in our docs[1]. This includes instructing users to explicitly limit the permissions that they grant.
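As a rough illustration of what "explicitly limit the permissions" can look like, here is a minimal sketch that builds the statements for a read-only PostgreSQL role. PostgreSQL is an assumption here, and the function name, the role `dashboards_ro`, the database `shop`, and the table names are all hypothetical; the role's password would be set separately.

```python
import re
from typing import List, Sequence

# Only accept plain lowercase identifiers so nothing can be smuggled into the SQL text.
_IDENT = re.compile(r"^[a-z_][a-z0-9_]*$")


def read_only_grants(role: str, database: str, tables: Sequence[str],
                     schema: str = "public") -> List[str]:
    """Return PostgreSQL statements granting SELECT-only access to the listed tables.

    Hypothetical helper: it only builds the statements, it does not run them.
    """
    for name in (role, database, schema, *tables):
        if not _IDENT.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    statements = [
        f"CREATE ROLE {role} LOGIN;",
        f"GRANT CONNECT ON DATABASE {database} TO {role};",
        f"GRANT USAGE ON SCHEMA {schema} TO {role};",
    ]
    # One explicit grant per table; anything not listed stays inaccessible to the role.
    statements += [f"GRANT SELECT ON TABLE {schema}.{table} TO {role};" for table in tables]
    return statements


if __name__ == "__main__":
    for statement in read_only_grants("dashboards_ro", "shop", ["orders", "signups"]):
        print(statement)
```

Listing tables one by one is deliberate in this sketch: the external service only ever sees what was explicitly handed to it.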
Another poster mentions loading a pre-built VM. That's actually our goal for the enterprise, as there will always be systems that are not (and should not be!) accessible to the open Internet. In the meantime, though, there are plenty of folks happy with the convenience of living in the cloud.
While there are clearly some significant security considerations, the author writes Chartio off without considering applications where it could be a good fit.
Also, the oozing, patronizing tone of the author is really annoying. Just make your points and be nice about it.
From the article: "It’s quite possible that the risks I’ve highlighted above are ones that you feel it’s okay to take, and in that case, go for it – as long as you’re respecting your end users’ interests as well."
Chartio certainly isn't the only company guilty of this.
Take Splunk, a $4 billion (in market cap) machine log analytics service. They rolled out a new service called DB connect which allows you to integrate database analytics with their platform. Their implementation of DB connect is equally intrusive. However, I can still see these services being useful for visualizing operations that are less critical.
One could write a whole essay on the inability of this $4bn company to fix the most basic bugs in their product. I'd love to see evidence of people actually usefully using the features like this that Splunk provides.
I work in the BI / analytics space, with a fair amount of what's called "big data". The security of something like cloud IO is typically not too much of a concern on the kind of projects I work on: more often than not, our clients deploy dedicated analytics databases. Perhaps I'm biased as we tend to work from the outside, but I'd be surprised to find many sysadmins allowing chart IO to connect to the production DB. The ETL process for analytics databases tends to either obscure user data or aggregate it to an impersonal level of detail.
I'm more concerned about the more enterprise-y products aggressively advertising 'enterprise' security features like Active Directory integration, smoothing over their total lack of transparency in vulnerabilities and bugs.
In our case we set up a dedicated cheap database server only for ChartIO. Then we have a script that copies to it analytics data from our production servers every few hours (while also anonymizing what needs to be just to be extra safe).
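A minimal sketch of that kind of copy-and-scrub job, for illustration only. It uses in-memory SQLite databases as stand-ins for both the production and the reporting server, and the `orders` table, its columns, and the salt are hypothetical. Note that a salted hash is pseudonymization rather than full anonymization, so treat it as one possible scrubbing step, not a guarantee.

```python
import hashlib
import sqlite3


def pseudonymize(value: str, salt: str) -> str:
    """Replace an identifier with a truncated salted SHA-256 digest (hypothetical scheme)."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:16]


def copy_orders(prod: sqlite3.Connection, reporting: sqlite3.Connection, salt: str) -> int:
    """Copy orders into the reporting database, replacing each email with a pseudonym."""
    reporting.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(id INTEGER PRIMARY KEY, customer TEXT, total REAL)"
    )
    rows = prod.execute("SELECT id, email, total FROM orders").fetchall()
    scrubbed = [(oid, pseudonymize(email, salt), total) for oid, email, total in rows]
    # INSERT OR REPLACE keeps the job idempotent when it is re-run every few hours.
    reporting.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?, ?)", scrubbed)
    reporting.commit()
    return len(scrubbed)


def demo() -> list:
    """Build a tiny fake production table, run one copy, and return the reporting rows."""
    prod = sqlite3.connect(":memory:")
    prod.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, email TEXT, total REAL)")
    prod.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, "a@example.com", 9.5), (2, "b@example.com", 20.0)],
    )
    reporting = sqlite3.connect(":memory:")
    copy_orders(prod, reporting, salt="demo-salt")
    return reporting.execute("SELECT id, customer, total FROM orders ORDER BY id").fetchall()


if __name__ == "__main__":
    for row in demo():
        print(row)
```

In a real setup the two connections would point at different servers and the job would run from a scheduler; only the reporting side would ever be exposed to the dashboard service.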
This is a good approach. I work with ERP systems for large companies, and it's typical to separate the transactional database (OLTP) from the analytics database (OLAP). There's usually a replication process to copy transactions into the reporting system.
Great use case! We don't support an API that you can push data to, but that's because whatever API we build, it would never be easier to learn or have as much language support as just spinning up a PostgreSQL instance and writing the data to it. Quite a few of our customers use us this way.
You guys should really suggest this approach in the setup guide (it can be phrased in a way not to scare potential users). The alternative of directly accessing production database servers is really risky.
This seems like what you'd do just to protect your production server from some crazy load/etc. issues from analysis, too. I can't imagine giving read-only access to a core rw database server to anything which didn't need it.
+1 for getting non-operational stuff off of the production servers. In my opinion, production servers should be left untouched by analytics and what not. I'd prefer not tying up resources that could better serve customers.
These types of services just can't be SaaS. Ideally they would provide a prebuilt VM to drop in your environment (like GitHub Enterprise does). I had been struggling with the same quandary with regard to a code review system and a monitoring system I have been working on.
Sadly the alternative is data warehousing (which all of our competitors do). It's the worst solution, as it
1. gives a copy of your data to someone else
2. leaves it legally questionable whether the data is still yours
3. is stagnant (as you usually only back up once a week)
4. leaves you no way to track where that data goes
Well, I'd say 1, 2 and 4 also apply to ChartIO ;) We like your service after comparing various ones but for us the advantage vs data warehousing is a much easier setup and UX.
The options typically are:
1) Run this package on your own (SQL/HDFS) server and pay us for a licence. Keep your own data and maintain your own servers.
2) Send us the data and we will store a replica which we serve back to you in dashboards.
3) Let your users use your data on your servers through our web interface.
What we have done is use Chartio but connect it to our Slave database. I like this approach for many reasons. Our charts implicitly test that our slave connections are working well. If compromised, it won't immediately impact our site. You could argue it will eventually. All the load from Chartio is on our slave, which again does not impact our production.
A Y-combinator startup asked me to set up chartio for them. The script generated some random error during setup, so I logged a support ticket with chartio. A month later they still hadn't bothered replying. I'll forgive you having a buggy setup script, but I won't forgive you when you don't even bother replying to my support query about your buggy product.
Super odd... We use Zendesk and I don't have a record of any open support tickets. Feel free to send me a note directly to dave at chartio.com if you're still having an issue.
I see it is closed now. I opened the ticket on June 25 2012 and it was resolved on April 1 2013 (although he didn't actually resolve the problem). Anyway, the client used google analytics instead.
Also your support system is annoying: why do we need to register for support if we're already registered on your site? And having to have a password with uppercase and symbol is overkill for a support system (and is pointless anyway, as you'll probably have to write the password down).
I've seen stuff like this pretty regularly, and I'm always dumbfounded by how people don't even think about how bad this is.
This is the exact definition of a "crunchy center", and I wouldn't be surprised if they do other horrible things. You know for certain there are no passphrases on those SSH keys either.
(will write more soon)