points by antoncohen 10 years ago

This feature is implemented at a low level, and works on the command line.

For example, if you have a directory that is stored entirely in the cloud, you can `cd` into it without any network delay, you can run `ls -lh` and see a listing with real sizes without a delay (e.g., see that an ISO is 650 MB), and you can run `du -sh` and see that all the files are taking up zero space.

If you open a file in that directory, it will open, even from the command line. Run `du -sh` again and you'll see that that file is now taking up space, while all the others in the directory still are not.

You can right-click to pin files and directories to be stored locally, and right-click to send them back to the cloud so they don't take up space.

This is actually very different than traditional network file systems like SMB, NFS, WebDAV, and SSHFS. With a normal network file system over the WAN you would have major latency problems trying to `cd` and `ls` the remote file system. Most of them also don't have any ability to cache files locally when offline, or the ability to manually select which files are stored locally and which are remote. AFS does have some similar capabilities.

vmarsy 10 years ago

This is how Bitcasa works, I think. You would see all the files in a fake hard drive of unlimited space. Some magic (machine learning and smart decisions) would try to figure out my data access patterns ahead of time.

So I would see my full list of pictures; if I decide to open the first one, it takes a few seconds to download, and as I start browsing through them, it figures out that I plan to look at all of them and pre-fetches them from the cloud, so on average there is no perceived latency.
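That heuristic doesn't need machine learning to be useful. Here's a toy sketch of the sequential-prefetch idea, assuming a hypothetical `download(path)` call that fetches one file from the cloud (nothing here is a real Bitcasa or Dropbox API):

```python
# Toy sequential prefetcher: if the user opens file n right after
# file n-1, speculatively fetch the next few files in the listing.
# `download` is a hypothetical stand-in for the cloud fetch.
def make_prefetcher(files, download, window=3):
    last_opened = {"index": None}

    def open_file(index):
        data = download(files[index])
        # Detect a sequential access pattern (opened n-1, then n).
        if last_opened["index"] == index - 1:
            for i in range(index + 1, min(index + 1 + window, len(files))):
                download(files[i])  # would run asynchronously in practice
        last_opened["index"] = index
        return data

    return open_file
```

A real implementation would prefetch asynchronously and detect far richer patterns, but even this naive version hides most of the latency for a linear browse through a photo folder.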

  • bigiain 10 years ago

    Was working (didn't they just pull the plug on all their "unlimited" accounts?)

    • j_s 10 years ago

      They pulled the plug on all their "unlimited" accounts a long time ago.

      They recently shut down their free plan.

      • wspeirs 9 years ago

        Datto Drive just launched today and is giving away a TB of space for free: www.dattodrive.com

KindDragon 10 years ago

So, if I try to search inside my Dropbox or Documents folder, will it download all files from Dropbox to my computer?

  • mnsc 10 years ago

    Valid question. I wouldn't be happy if I hit Ctrl-F in "My Documents", ran a search, and a 1 TB download started up invisibly in the background, filling my hard drive.

    • boduh 10 years ago

      I believe this is not the case here. In order for the files to start taking space on your drive you would actually need to right click that folder and choose "Save a local copy".

    • btown 10 years ago

      Could Dropbox detect repeated access patterns from the same process, and/or whitelist processes as known "searchers," and start returning blank files? This seems like the kind of problem only a unicorn would dare to tackle, but as luck would have it...?

      • serge2k 10 years ago

        So search is broken in these directories?

        That seems like a lousy tradeoff.

        • mjs7231 10 years ago

          You want to save space by not having data on your local system but use a local search to look in the contents of files not on that system? You can't have your cake and eat it too.

          • borski 10 years ago

            Sure you can: index it once, and then stream it forevermore.

            • ctoshok 10 years ago

              except remote contents can change

              • SturgeonsLaw 10 years ago

                Store a hash or checksum in the index and also allow the remote API to return the hash/checksum to see if the file has changed
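A minimal sketch of that invalidation scheme, assuming hypothetical `remote_hash(path)` and `fetch(path)` API calls (neither is a real Dropbox API): the index keeps a content hash per entry and only re-downloads when the remote hash differs.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def refresh_entry(index, path, remote_hash, fetch):
    """Re-download and re-index `path` only if its remote hash changed.

    `index` maps path -> {"hash": ..., "text": ...}.
    `remote_hash(path)` returns the server-side content hash;
    `fetch(path)` downloads the file bytes. Both are hypothetical.
    Returns True if the file was (re)fetched, False if the cached
    index entry was still valid and nothing was downloaded.
    """
    current = remote_hash(path)
    entry = index.get(path)
    if entry is not None and entry["hash"] == current:
        return False  # index still valid; no download needed
    data = fetch(path)
    index[path] = {
        "hash": sha256_hex(data),
        "text": data.decode("utf-8", "replace"),
    }
    return True
```

The search itself then runs entirely against the local index; the only network traffic is one hash lookup per file, plus downloads for files that actually changed.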

    • Angostura 10 years ago

      Perhaps the integration will extend to triggering a search on the back end.

      • mjs7231 10 years ago

        I suppose any company that is giving all their encrypted data to Dropbox to begin with may be OK with it. But most companies are already sketched out by the mere fact that their data is accessible to anyone outside the company.

        In any event, if they were to index and provide search as a service as well, I wouldn't think it's something they would do quietly. It would most likely come with its own huge marketing campaign.

  • dack 10 years ago

    This is what I was wondering - we'd have to be careful writing a script that happened to traverse into the dropbox folder, because it might try to inflate all the files. It still seems like a cool idea, but I wonder if they have a workaround.

    • antoncohen 10 years ago

      Well, the feature isn't released yet. I'm sure aspects of it will change based on test user feedback.

  • tedmiston 10 years ago

    I wonder how this will work with Spotlight.

    • gutnor 10 years ago

      Spotlight is enabled by default _and_ left enabled on basically all Macs, and Mac users are a big userbase for Dropbox. It is very unlikely the Dropbox team will forget that Spotlight indexing is running in the background.

      That doesn't mean the files will get indexed, but there is no chance that Spotlight will trigger an unexpected terabyte download in the background.

  • antoncohen 10 years ago

    (First off, as a disclaimer, I no longer work for Dropbox, I don't speak on their behalf. I've only used the feature as a user.)

    I don't know of a common search/find system that open()s or read()s files during the search by default. AFAIK Spotlight and Windows search are indexed searches. As for the indexing operations, I don't know how those are handled; they could disable indexing for remote files, or they could somehow integrate with indexing.

    Based on my testing of a pre-released version of the feature (it isn't released yet), if you were to do something like `find ~/Dropbox -type f -exec md5 {} +`, it would download files.

    As a user it did exactly what I expected, and I was truly amazed. It was totally seamless.

    Compared to the complexity of what has already been implemented, solving the problem of "I want to recursively open/read every file in my Dropbox, but I don't want it to download terabytes of data and fill my hard drive" seems fairly simple. For example, there could be a setting for the maximum amount of space Dropbox will use, e.g., 40 GB, and Dropbox could be smart enough to detect disk usage. If you `grep -R`, it may download/open/read the files; once you reach 40 GB, or get near your disk capacity, Dropbox could start removing local copies of files that are not pinned to be local, i.e., remove the files that were downloaded because of the open()/read(), not the files you explicitly told it to keep local. I don't know how the team will choose to implement these features, but I'm confident that it will be well thought out and tested.
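A rough sketch of that quota-plus-eviction idea (all names here are hypothetical; this is not how Dropbox actually implements it): pinned files are never touched, while files that were materialized by an open()/read() are evicted least-recently-used-first once local usage passes the cap.

```python
from collections import OrderedDict

class LocalCache:
    """Sketch of a local-space quota for a cloud-backed folder.

    Unpinned files fetched on demand are evicted (turned back into
    zero-block placeholders) LRU-first once usage exceeds `limit`
    bytes; pinned files are never evicted.
    """

    def __init__(self, limit):
        self.limit = limit
        self.pinned = {}               # path -> size; never evicted
        self.unpinned = OrderedDict()  # path -> size; LRU order

    def _used(self):
        return sum(self.pinned.values()) + sum(self.unpinned.values())

    def pin(self, path, size):
        """User right-clicked 'keep local': exempt from eviction."""
        self.unpinned.pop(path, None)
        self.pinned[path] = size

    def touch(self, path, size):
        """Record an on-demand download (e.g. from open()/read()).

        Returns the paths whose local copies can be dematerialized.
        """
        if path in self.pinned:
            return []
        self.unpinned.pop(path, None)
        self.unpinned[path] = size  # most recently used goes last
        evicted = []
        # Never evict the file just accessed (it is the last entry).
        while self._used() > self.limit and len(self.unpinned) > 1:
            old_path, _ = self.unpinned.popitem(last=False)
            evicted.append(old_path)
        return evicted
```

The key design point is the two-tier split: eviction only ever reclaims space from files the sync client downloaded implicitly, never from files the user explicitly asked to keep local.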

    Remember, Dropbox is the company that essentially monkey patched the Finder to get the sync icons (http://mjtsai.com/blog/2011/03/22/disabling-dropboxs-haxie/). They will go to great lengths for a seamless user experience, and do a ton of testing. I have no doubt that when Project Infinite is widely available it will be amazing, seamless, and have functionality many people thought wasn't possible or only dreamed existed.

    • xyzzy_plugh 10 years ago

      > I don't know a common search/find system that open()s or read()s files during the search by default.

      ... grep?

      • mjs7231 10 years ago

        and antivirus and I'm sure there's more.

      • reledi 10 years ago

        I don't think antoncohen meant searching file contents, but point taken.

  • pbreit 10 years ago

    Why would it do that? Can't it just be added to the search index and then "archived"?

    • eropple 10 years ago

      grep doesn't have a search index.

      • JoBrad 10 years ago

        But wouldn't you expect `grep` to download the files so their contents could be searched, as opposed to `locate` or `find`, which I wouldn't expect to do that?

      • pbreit 10 years ago

        I thought I was inquiring about Dropbox, not grep.

rsync 10 years ago

"This is actually very different than traditional network file systems like SMB, NFS, WebDAV, and SSHFS. With a normal network file system over the WAN you would have major latency problems trying to `cd` and `ls` the remote file system."

Is that still true for sshfs ?

People used to ask us if they should rsync to us directly or sshfs mount and then rsync to the mount, and we told them not to do that since the original rsync stat would basically download all files simply to look at them / size them.

But I don't think that's the case anymore. I think sshfs (or perhaps something about FUSE underneath) is smart about that now ... isn't it ?

  • antoncohen 10 years ago

    I haven't used FUSE SSHFS in around 8 years. I'm sure it has improved a lot since then. I could imagine it handling file listing and stat better than other network protocols (cd/ls over ssh works well over most WAN connections). It looks like it now caches directory contents too (https://github.com/libfuse/sshfs). It probably wasn't fair for me to include SSHFS in the list since I haven't used it in so long. It was troublesome when I used it.

  • mrsteveman1 10 years ago

    > People used to ask us if they should rsync to us directly or sshfs mount and then rsync to the mount

    Are you guys allowing full access to the machine now through LXC containers or some sort of VM?

    • rsync 10 years ago

      "Are you guys allowing full access to the machine now through LXC containers or some sort of VM?"

      No - it is the customer, on the client side, that creates an sshfs mount representing their rsync.net account.

      It works very well and it is very nice to have a plain old mount point that represents your rsync.net account - especially since you can just browse right into your historical ZFS snapshots, etc.

      But in the past, people did that and they got the bright idea to rsync to that local mount point, to do their backups, and that didn't work well.

      But my understanding is that nowadays it would work better - you wouldn't download every single file that rsync simply stat'd or listed ...

      We still don't recommend it, though. No reason to add that complexity.

kame3d 10 years ago

This is similar to the idea behind git-annex.

  • DanielDent 10 years ago

    git-annex also has the advantage of letting you keep data in multiple destinations (including in cold storage), which I think is becoming increasingly important.

machbio 10 years ago

Isn't this the same way SkyDrive on Windows 8+ works?

dingo_bat 10 years ago

This is sort of like how OneDrive used to work on Windows 8. But they dropped the feature later.

the8472 10 years ago

> and you can do `du -sh` and see that all the files are taking up zero space.

That seems wrong to me. It would violate the assumptions of software that does stat() on directory entries and not only verifies presence but also non-zero size.

So it's risking buggy behavior to gain a latency edge over other networked filesystems. I think smart prefetching while preserving correctness would be better.

  • catwell 10 years ago

    du uses `st_blocks`, not `st_size`, so it should be fine for most applications. It is similar to how sparse files behave:

        $ ls
        $ truncate -s 1M foo
        $ ls -lhp
        total 0
        -rw-r--r-- 1 catwell wheel 1.0M Apr 26 18:39 foo
        $ du
        0    .
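The same distinction is visible from `os.stat`: `st_size` is the logical size that `ls -l` reports, while `st_blocks` (in 512-byte units) is what `du` sums. A quick check with a sparse file:

```python
import os
import tempfile

# A sparse file reports its full logical size (st_size) while
# allocating no data blocks (st_blocks) -- the same ls-vs-du
# difference shown in the shell session above.
fd, path = tempfile.mkstemp()
os.close(fd)
os.truncate(path, 1024 * 1024)  # 1 MiB logical size, nothing written

st = os.stat(path)
print(st.st_size)    # 1048576
print(st.st_blocks)  # typically 0 on filesystems with sparse-file support
os.unlink(path)
```

So software that stats a placeholder file sees the real size, and only code that inspects block usage (like `du`) notices that no data is local.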
    • the8472 10 years ago

      All is good then.

    • Freaky 10 years ago

      du -A for the "apparent" size.

tedmiston 10 years ago

I came to ask if this is really that different than Selective Sync. Glad to hear it is.

bobwaycott 10 years ago

So, is it working off some kind of always-in-sync (assuming a live network connection) manifest?

I confess, I did not watch the video, and only briefly skimmed the announcement.

Edit: I'm asking a genuine question of real technical interest here. How can this be implemented with no latency and real file sizes immediately available for inspection, while taking up no disk space? I went back and read the announcement again, and there are no hard details I can see that I missed in my initial skim. There has to be something stored locally, right? Hell, I'm running gigabit fiber here, and I still notice latency in the CLI for anything that requires a network connection. Perhaps I misunderstood the parent?