g9yuayon a year ago

The github repo has 5 million lines of C++ code (headers included), 1.6 million lines of C code, and even nearly 1 million lines of Scala + Java code. We'd need some serious docs to adopt this technology.

The most interesting part of YT is Cypress. I'm particularly interested in how they make their master cluster horizontally scalable.

  • gritukan a year ago

    Historically, the master server of YTsaurus was a single RSM (replicated state machine) that contained all the meta-information about the cluster. This included the tree of the distributed filesystem, transactions, information about users and tables, placement of chunks, and much more.

    However, this approach proved to be non-scalable as the memory amount and throughput of the master server soon became insufficient. To address this issue, we implemented Multicell technology. With Multicell, there are multiple RSMs called secondary masters that store information about chunks of the tables and their placement. The primary master still stores information about the distributed filesystem and transactions but is now single and non-sharded.

    After a few years, the masters became overloaded again, and we implemented Portals. With Portals, one can select a subtree of Cypress and place it in one of the secondary masters. This technology is used nowadays, and home directories of some active users are hosted on secondary masters.

    However, we anticipate that this approach will also become insufficient in a few years. Therefore, we are currently working on a new technology called Sequoia, which stores information about the Cypress tree shape in horizontally scalable dynamic tables.

    It is hard to describe all aspects of master server internals in one comment. Therefore, feel free to join our chat at t.me/ytsaurus for further discussion!

    • ddorian43 a year ago

      > Therefore, we are currently working on a new technology called Sequoia, which stores information about the Cypress tree shape in horizontally scalable dynamic tables.

      Why not just use a database for the metadata? Something that can be sharded and has transactions like YugabyteDB/Yandex-ydb/etc?

      • gritukan a year ago

        Our objects have complex semantics of changes so it seems challenging to implement the whole Cypress over k-v storage.

        Storing Cypress nodes in ad hoc RSMs and information about the tree in k-v storage seems like a good compromise: it is scalable and still allows any functionality for objects to be implemented efficiently.

  • karsinkk a year ago

    I was going over some of the code in the core folder for concurrency, threading and compression. What surprised me is that there are absolutely no comments whatsoever. Agree that unless there’s excellent documentation, open source maintenance might be challenging.

    Having said that, this definitely does look to be an impressive feat of engineering!

ALLTaken a year ago

I've read the whole article. It's very long, but it was written in such a refreshingly clean way. Hats off!

I believe the engineers really spent a lot of their time building this from the ground up. I'm thankful that those large companies open-source their software and algorithms.

gritukan a year ago

Hello! I work at YT and would like to answer a question that was asked in a flagged thread about the comparison between YT and Hive and Zookeeper.

Both Cypress and Zookeeper are fault-tolerant distributed hierarchical filesystems that can be used for distributed coordination, but Cypress has much richer functionality.

Recall that Zookeeper's data model is just a tree consisting of homogeneous nodes that can be either ephemeral or persistent, along with a set of sessions that control the lifetime of ephemeral nodes. This simple model makes it possible to implement several distributed synchronization primitives, such as leader election, exactly-once queue processing, or two-phase commits.

However, it is not always easy to integrate Zookeeper with third-party systems. For example, if you want to elect a leader via Zookeeper and use it to insert data into a database, it is mandatory that the instance remains the leader during the commit into the database, which is not easy to implement without races or some additional assumptions.

In YTsaurus, transactions permeate our entire system. You can start a transaction and acquire an exclusive lock on some Cypress node (which is one way to run a leader election), and after that, the transaction becomes the leader lease. You can then modify Cypress, run MapReduce or YQL operations using the transaction as a prerequisite, lock some files and tables in the same transaction, and do many other things. Currently, we are working on the ability to use Cypress locks as prerequisites for dynamic table commits.

There are many other features in Cypress that are not implemented in Zookeeper, such as symlinks and automatic expiration of unused nodes. Moreover, Cypress can be sharded using Portals, which I wrote about in a previous comment, so this filesystem is scalable, unlike Zookeeper. Even without sharding, a single YTsaurus primary master can hold tens of gigabytes of Cypress metadata, while Zookeeper's state size is limited to hundreds of megabytes according to the etcd vs. Zookeeper comparison [1].
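
To make the "transaction as leader lease" idea above concrete, here is a toy in-memory sketch. It is not the YTsaurus API: `ToyCypress`, `lock_exclusive`, `try_become_leader`, and the `prerequisite_tx` parameter are all names invented for illustration, and a real cluster would do this over RPC with replicated state. The point it tries to show is that the liveness check and the write happen atomically, so a stale leader cannot commit.

```python
import threading
import uuid


class LockConflict(Exception):
    """Another live transaction already holds the exclusive lock."""


class PrerequisiteFailed(Exception):
    """The transaction named as a prerequisite is no longer alive."""


class ToyCypress:
    """In-memory stand-in for a lock service (hypothetical, for illustration)."""

    def __init__(self):
        self._mutex = threading.Lock()
        self._live_tx = set()    # ids of transactions that have not ended
        self._lock_owner = {}    # node path -> id of the owning transaction
        self.data = {}           # node path -> stored value

    def start_transaction(self):
        tx_id = uuid.uuid4().hex
        with self._mutex:
            self._live_tx.add(tx_id)
        return tx_id

    def abort_transaction(self, tx_id):
        # Ending a transaction releases every lock it held.
        with self._mutex:
            self._live_tx.discard(tx_id)
            self._lock_owner = {
                path: owner
                for path, owner in self._lock_owner.items()
                if owner != tx_id
            }

    def lock_exclusive(self, path, tx_id):
        with self._mutex:
            if tx_id not in self._live_tx:
                raise PrerequisiteFailed(tx_id)
            owner = self._lock_owner.get(path)
            if owner is not None and owner != tx_id:
                raise LockConflict(path)
            self._lock_owner[path] = tx_id

    def set(self, path, value, prerequisite_tx):
        # The liveness check and the write share one critical section,
        # so a leader that lost its lease cannot sneak a write in.
        with self._mutex:
            if prerequisite_tx not in self._live_tx:
                raise PrerequisiteFailed(prerequisite_tx)
            self.data[path] = value


def try_become_leader(cypress, lock_path):
    """Return a transaction id acting as the leader lease, or None."""
    tx_id = cypress.start_transaction()
    try:
        cypress.lock_exclusive(lock_path, tx_id)
    except LockConflict:
        cypress.abort_transaction(tx_id)
        return None
    return tx_id


cypress = ToyCypress()
leader_tx = try_become_leader(cypress, "//election/lock")
rival_tx = try_become_leader(cypress, "//election/lock")  # lock is taken
cypress.set("//state/owner", "instance-a", prerequisite_tx=leader_tx)
cypress.abort_transaction(leader_tx)  # lease lost, e.g. the leader died
new_leader_tx = try_become_leader(cypress, "//election/lock")
```

In this sketch the rival gets `None` while the first lease is alive, and any later write that names the aborted transaction as its prerequisite raises `PrerequisiteFailed` instead of being applied.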

One major disadvantage of Cypress compared to Zookeeper is the lack of watches, so all change tracking has to be done via short polling. The good news is that Cypress is well-optimized for read queries, with the possibility to read from followers and from multiple threads, so this is not a big problem. In the meantime, we are considering the possibility of adding some kind of watches to Cypress.
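
As a rough illustration of what "change tracking via short polling" means for a client, here is a minimal sketch. Nothing in it is a YTsaurus call: `poll_for_changes` and its parameters are hypothetical, and `read_value` stands in for whatever read request the client would issue. Note that, unlike a watch, polling only sees the state at each poll, so changes that happen and revert between two polls go unnoticed.

```python
import time
from typing import Any, Callable, Optional


def poll_for_changes(
    read_value: Callable[[], Any],
    on_change: Callable[[Any, Any], None],
    interval_seconds: float = 1.0,
    max_polls: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Call on_change(old, new) whenever read_value() returns a new value."""
    previous = read_value()
    polls = 0
    while max_polls is None or polls < max_polls:
        sleep(interval_seconds)
        current = read_value()
        if current != previous:
            on_change(previous, current)
            previous = current
        polls += 1


# Usage with a scripted reader instead of a real cluster.
snapshots = iter(["v1", "v1", "v2", "v2", "v3"])
observed = []
poll_for_changes(
    read_value=lambda: next(snapshots),
    on_change=lambda old, new: observed.append((old, new)),
    interval_seconds=0.0,
    max_polls=4,
    sleep=lambda _seconds: None,  # skip real sleeping in the example
)
```

With the scripted reader above, `observed` ends up holding the two transitions, `("v1", "v2")` and `("v2", "v3")`.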

The big difference between Cypress and Zookeeper is the replicated state machine implementation. With all due respect to Zookeeper developers, Zookeeper was implemented over 15 years ago when the world of distributed algorithms was different. Today we see that ZAB (the consensus algorithm used in Zookeeper) has some shortcomings in failover speed and stability. There are multiple reports of Zookeeper being unstable under heavy load. In YTsaurus, we use an in-house library called Hydra for RSM implementation. It implements our own consensus algorithm, which is very similar to Raft and has proven itself to be both efficient and fault-tolerant. We use Hydra for master servers, clock servers, and tablet cells (RSMs that store data in dynamic tables). I even had an idea to implement a Zookeeper API using Hydra, both to simplify migration to YTsaurus and to check Hydra performance and correctness via the many tests implemented for Zookeeper (Jepsen, for instance), but did not have enough time to finish this project.

This comment is already quite long, so I will write about the YT vs Hive comparison in another comment later on.

[1] -- https://etcd.io/docs/v3.3/learning/why/

  • thriftwy a year ago

    I wonder if this thing is some kind of descendant of Elliptics used in Yandex internally a decade ago.

ushakov a year ago

How many companies exist that will utilise its full capabilities? My bet would be 50-100

  • xpl a year ago

    My guess is it could be successfully used for relatively small (terabyte-scale) workloads.

    It's the same as when people use k8s not utilizing its full capabilities, only to be able to massively scale up when needed.

  • arivkin a year ago

    Why do you think so? How many companies use Apache Hadoop?

  • RobotToaster a year ago

    Most people don't utilise the full capabilities of the tools they use.

    Sure it would be overkill for a lot of applications, but so is redis, react.js, etc.

karsinkk a year ago

This looks very impressive! As another commenter echoed, the code base is ~5million lines of C++ code, but almost no comments at all. Unless the documentation is excellent, maintenance/open source work is going to be difficult.

  • xpl a year ago

    The docs, for the reference: https://ytsaurus.tech/docs/en/

    P.S. I wonder if LLMs could be used to generate docs and comments for big hairy codebases. Seems that the current generation of LLMs lack context to do it, but maybe it's "just one or two more papers down the line"®...

    • klysm a year ago

      The cost of wrong docs is pretty high. You’d need someone knowledgeable to make corrections

VWWHFSfQ a year ago

It's my understanding that the original Yandex founders and tech left Russia several years ago. Is this new technology? Or is it stuff being extricated from the (now) Kremlin-controlled codebase?

  • xpl a year ago

    It isn't new, in the article they state that it originated back in 2006. It's been in development since then.

garbagecoder a year ago

Didn't this just leak?

  • xpl a year ago

    AFAIK it wasn't in the leak (that "YT" platform specifically). Also, open-sourcing proprietary projects of this scale is a pretty hard job and it can't be done quickly — seems they had been doing it for a long time, starting long before the leak.

    • arivkin a year ago

      That's true. We had been preparing to open YT for a long time when we found out about the leak. We even joked inside that they couldn't even do it normally and we still have to open our part of the code ourselves :)

      It's truly hard work, because we were very tightly tied to the Yandex infrastructure and we had to learn how to deploy in k8s from scratch. You also need a new brand, and you need to clean irrelevant things out of your documentation... All this takes months.

magundu a year ago

Is it an alternative to Snowflake?

  • alephone1 a year ago

    We provide YQL for running large-scale OLAP SQL queries. In this regard YT can be compared to Snowflake. However, we target on prem deployment in the first place, while Snowflake runs in aws/gcp/azure, and queries are performed over data that sits in S3.

  • rch a year ago

    I work at Cloudera and I see it as roughly analogous to CDP, but I didn't see anything about hybrid on-prem/cloud deployments.

    • arivkin a year ago

      It's true. We are very close to CDP and Apache Hadoop. We are only about on-prem right now. It's not a secret that we didn't need a cloud deployment at Yandex. But we see and understand the demand for it. And we have a plan:)

      • rch a year ago

        Interesting! I'm looking forward to seeing what you have lined up.

reisse a year ago

There is an interesting take that Russian part of the Yandex group opensources as much as possible, in order for the overseas companies of the group to leverage technologies without legal or financial ties with Russia.

For me this seems very plausible, as for the last year they first did everything to distance from anything related to politics (e. g. they sold their news and their blogging platform to the basically state-owned VK), and then to separate Russian and overseas businesses as much as possible.

  • drewda a year ago

    It does look like Clickhouse transmogrified from a Russia-based Yandex open-source project into a San Francisco-based VC-backed C-corp incorporated in Delaware just in the nick of time.

    (Not suggesting this is wrong. I'm just offering an interpretation based on skimming public blog posts over the past ~2 years.)

  • 0xDEF a year ago

    Yandex is incorporated in the Netherlands and the founders live in Israel.

    I think they will create several spin-off open source companies (like ClickHouse Inc.) outside Russia to continue doing B2B business with the outside world.

    • malaya_zemlya a year ago

      Unfortunately, only one founder, Arkady Volozh, is still alive. Ilya Segalovich died 10 years ago.

    • deepsun a year ago

      It doesn't matter where the company is incorporated -- when Kremlin wants the access to the data Yandex cannot say no. One cannot operate in Russia and not play by Kremlin rules, especially media companies.

  • slt2021 a year ago

    russian Yandex no longer belongs to original Yandex founders/owners - it belongs to and is controlled by kremlin.

    makes sense that original engineers/founders create their own stuff via opensourcing their original work

    • reisse a year ago

      Then why is the original Yandex founder under EU sanctions?

      Spoiler: because until June 22 (effectively until the sanctions hit) he was a Yandex CEO and owner of 8% shares (45% voting shares).

      There is no "almighty Kremlin" that owns everything. There is, however, a set of rules you must comply with if you want to do multibillion dollar business in Russia. You either bend, or sell your business to more compliant oligarchs. Durov chose the latter, Volozh chose the former.

  • tiffanyh a year ago

    Yandex has definitely made it extremely easy to leverage their tech, by selecting to use the Apache 2.0 license.

  • the_mitsuhiko a year ago

    Would be a clever move but I have doubts that no longer Yandex has much to say in Russian Yandex. The latter seems very close to the Kremlin now.

galkk a year ago

There is a conspiracy theory that recent open sourcing of Yandex tech and to some extent even a leak is a preparation for global Yandex exodus from Russia.

  • speed_spread a year ago

    You can take Yandex out of Russia... but you can not take Russia out of Yandex. Admittedly, the same could probably be said about any other $big_search:$nuclear_power pair. It's just that the other companies either can't or have no reason for exile.

    And so I suspect Yandex leaving its home turf would essentially be a covert invasion of wherever they'd swarm to. Somehow, money tells me UK would be a likely target. Source: worked briefly for an exiled Russian company. Would not repeat.

    • ClumsyPilot a year ago

      > a covert invasion of wherever they'd swarm to.

      > an exiled Russian company.

      Either you are exiled or you are invading, how can you be both?

      • speed_spread a year ago

        Deception. Moving a piece on a game board should have more than one effect. One should preferably make it so that the most obvious effect (exile) is not the one that's actually the most valuable in the long term (taking root in adverse country).

        Bonus "inception" points if you can make the adversary believe that you did it because they forced you to (sanctions).

      • helge9210 a year ago

        For example, you wholeheartedly support the policy, but are not comfortable living under the sanctions.

        • ClumsyPilot a year ago

          My understanding of Exile is that you are rejected by the society you are exiled from, due to your actions or beliefs. Like profound disagreement with the policy.

          I have certainly seen the kind of people you talk about, they support (or at least used to) Putin, but don't want their kids to live in Russia. I would call them more like emigrants of convenience.

  • slt2021 a year ago

    exodus has already happened

IYasha a year ago

Funny to see 0 of 50 comments on a HN tech article being tech-related...

  • l0b0 a year ago

    Agreed, but this is presumably also an absolutely massive project hardly anyone here has even used before. So it's not surprising that there are no big tech insights on the day of the release. An `scc` printout might be interesting, but any in-depth analysis is going to take a long time.

    • arivkin a year ago

      There are some developers from YT here who can help you and answer some technical questions. Also, YT is a huge and old project. I believe you can find a lot of ex-Yandex people who worked with YT.

  • booi a year ago

    I think what you're seeing is a resurgence of tech ethics where the source and support of a project is as important as the technology itself.

  • hotstickyballs a year ago

    It’s not technically interesting since a lot of big data solutions already solve this problem so I guess only the geopolitics are left.

    • hamilyon2 a year ago

      I am obviously biased, but, yes, technically it is very, very interesting. Distributed transactions, Cypress (kiparis), and YQL are interesting. Another aspect: the only other open alternative is Hadoop, HBase, and Hive. They don't compete with YT on usability and developer experience. YT is much more polished, despite historical quirks.

    • xpl a year ago

      I have doubts about "a lot". Also, "already solve" does not mean "solve better" or even "good enough".

      It would be very interesting to see some in-depth comparisons with already-existing open source technology (like Hadoop, Hive, Iceberg, ZooKeeper) to get a sense of when and where YT could be more effective.

cynicalsecurity a year ago

[flagged]

  • bertil a year ago

    I’m all for holding companies that have supported dangerous regimes to account. However, when it comes to data management, totalitarian regimes rarely indicate inadequate implementation. IBM’s role in Germany in the 40s was horrific, but it proved their ideas of tabulations and files were promising. Just like with rocketry, there were many valuable things to learn that defined the rest of the XXth century.

    The FSB likely has a lot of crimes to atone for. Still, suppose one of their specialists publishes something on data management or how to manage hundreds of sock-puppet social media accounts. In that case, I’d be tempted to listen and learn from a likely expert—unless you suspect that they think this article is not sincere and meant as a distraction from actual good practices.

    Similarly, the CIA has done very problematic things, but the people who worked in the disguise department have a creative take on changing your appearance. I’m unsure when I would have to do that, but I’m always curious about how data is stored efficiently. And yes, like the FSB, the NSA has opinions about that, and those are typically well-informed.

    Was their practice constitutional? Seemingly not, IANAL. But do they have good insights into caching video files at scale? Definitely.

  • throwaw12 a year ago

    brainwashing at its peak.

    When USA does things, it is for the good of society, democracy. When Russia does things, it is hurting people, bad for society.

    Come on buddy, time to wake up and understand every country does things for its own good and whatever your media is telling about Russia is bad, it is because they're applying 3 letter agency brainwashing methods on to you.

    Code is open source, if you read code you will not get under Russian propaganda.

    • Nginx487 a year ago

      Would you remind when US committed genocide, mass execution of POWs, execution and torture of civilians, mass rape of women and children, and absolutely widespread looting in occupied towns? I'm waiting.

    • garbagecoder a year ago

      [flagged]

      • throwaw12 a year ago

        this discussion will not add any value to the tech community of HN. My main concern with the parent comment is that it is adding politics to the discussion of a recently open sourced piece of tech, instead of focusing on its capabilities, trade-offs and shortcomings and using it as a chance to learn the constraints of another system (in this case Yandex)

        it's always whataboutism when it doesn't fit the narrative, why haven't you asked same question to the parent comment when they tried to politicise technical product discussion?

      • newaccount2023 a year ago

        [flagged]

        • garbagecoder a year ago

          look at all of these throw away accounts making bad arguments.

          Yes, we look at everyone's record. Both records.

      • medo-bear a year ago

        nice head in the sand

        • dang a year ago

          Hey, could you please stop posting unsubstantive comments and flamebait? You've unfortunately been doing it repeatedly. It's not what this site is for, and destroys what it is for.

          Also, it looks like you've been using HN primarily for political/national/ideological battle, and that's another (distinct) line at which we ban accounts—regardless of what you're battling for or against. Past explanations here: https://hn.algolia.com/?sort=byDate&dateRange=all&type=comme....

          If you wouldn't mind reviewing https://news.ycombinator.com/newsguidelines.html and taking the intended spirit of the site more to heart, we'd be grateful.

        • garbagecoder a year ago

          That's a hilarious response from a Russia apologist. You think telling other people they're naive about the CIA means you can ignore all the bad stuff about Russia but that's not how the world works and you are either paid to say this or are doing it for free and I'm not sure which is worse.

          I try not to put myself in the hands of any spy agency.

          "Other team bad so my team good" is reasoning barely worthy of the human mind.

          • medo-bear a year ago

            there is no need to slander me. im definitely not paid by anyone to give my opinions

            imperialism is imperialism. it should be denounced

            if you dont think there is a US imperialist element involved in Ukraine you need to inform yourself a bit more. you can start with reading about North Stream 2

            before the war in Ukraine i used to think that, in comparison to the US, at least Russia bombs countries without calling it freedom [1]. since Russia started its "special operation" against Ukraine i now think that at best Putin is trolling the US, and at worst he is trying to be the US

            [1] https://en.m.wikipedia.org/wiki/Operation_Enduring_Freedom

  • ivan_gammel a year ago

    Maybe this is exactly the reason why it’s worth checking? They open-sourced it.

    • Aldipower a year ago

      Yeah, like Apache Mesos is running the NSA data center in Utah. One of the biggest in the world.

      • mdaniel a year ago

        Do you happen to have any link where I can read more about this?

  • omgtehlion a year ago

    Oh come on already...

    Of course FSB has a backdoor to *data* collected by Yandex, but the code itself (as well as coders) have nothing to do with any three-letter-agency within Russia.

    • screamingninja a year ago

      > the code itself (as well as coders) have nothing to do with any three-letter-agency within Russia.

      citation needed

      • eddsh1994 a year ago

        Why don't you read the code? The amount of people commenting on this, if there was something to hide they'd have probably caught it already and if not surely by next week.

      • dna_polymerase a year ago

        Let's keep in mind, the only companies that were ever found to collude with three-letter agencies were the big players in the US. But for some reason, Yandex needs to prove something here…

  • jones6ofMont a year ago

    [flagged]

    • Idiot_in_Vain a year ago

      Russia now is a democratic country???

      Only a complete idiot would think that.

    • stef25 a year ago

      > remember FSB is NOT KGB because Russia now is a democratic country and not a communist USSR

      Secret services in a democratic country being better / worse than their communist counterparts aside, "democracy" is a bit of stretch when describing Russia.

  • DanTheManPR a year ago

    As opposed to the IT department of the NSA?

perryh2 a year ago

[flagged]

  • maxdo a year ago

    It's a problem of entire Russia. My Asian girlfriend from china was denied to enter a restaurant in Moscow because of her origins...

    It's absolutely legal to restrict your rental apartment by nationality. On Yandex you can see ads like "will lease my house only to Russians". It's everywhere.

    If you ride a taxi in Russia, offensive words about other nationalities/ethnicities are everywhere.

    The problem is so big that even non-Russian/non-Slavic people have taken to offending each other. I've been in a taxi when two people of Middle Eastern descent started screaming "churka" at each other, even though the word could apply to both of them.

    I personally trying not to invest into anything Russian, simply because that society is very, very sick and they are very far away even from recognition of this problem. They think it's their strength...

    • TechBro8615 a year ago

      > I've been in a taxi when two people of Middle Eastern descent started screaming "churka" at each other

      Another way of interpreting this sort of culture is that they know not to take things too personally and that intent behind words is more important than the words themselves. It's a kind of liberating way of interacting with each other, and not uncommon to see in environments like fraternities or sports teams. Some people will call that culture exclusionary, but I might call those people neurotic.

      To put it another way, who are you to be offended on behalf of the people calling each other churka? Clearly they're not offended by it, so shouldn't you let them have their fun?

    • konart a year ago

      > My Asian girlfriend from china was denied to enter a restaurant in Moscow because of her origins...

      While I'm not questioning your statement - this sounds very questionable, considering the number of Chinese tourists in Moscow every year.

      >It's absolutely legal to restrict your rental apartment by nationality

      >I personally trying not to invest into anything Russian, simply because that society is very, very sick

      Do you as a gaijin invest in anything japanese in this case? It is not uncommon to see a "gaijins are not allowed" sign in Japan either.

      While it is true that Russia has (and will probably have for a long time) some racism problems (just like superiority complex) - I wouldn't say it's as bad as you think it is (especially compared to 90s for example)

    • ClumsyPilot a year ago

      > It's absolutely legal to restrict your rental apartment by nationality.

      I believe it is still illegal 'hypothetically', just like hypothetically Russia has elections.

      > They think it's their strength

      Yeah, the delusion is serious

      • orbital-decay a year ago

        It's illegal but the rental contract is private, so it can be denied without explanation; good luck suing and proving that you're being discriminated against.

        The discrimination in public ads and places like restaurants is strictly illegal, though essentially not enforced unless you sue, for a multitude of reasons - racism and lax anti-discrimination policies in particular, and also a million others (the entire rental housing market is almost unregulated, for one). Lately, most popular ad sites have implemented their own discrimination bans, however it still doesn't guarantee that the landlord won't be an asshole.

        • medo-bear a year ago

          > It's illegal but the rental contract is private, so it can be denied without explanation; good luck suing and proving that you're being discriminated.

          How is this different than in other places in Europe

          • orbital-decay a year ago

            When worded like this, it's not that different. It's the little details that add up, depending on the actual country. You're much less likely to be discriminated against in UK or Germany than in places like Bulgaria, Ukraine, or Russia. Due to both the attitude and enforcement. The rental market in Germany seems over-regulated, but my black friend of Ethiopian descent (he's Russian, born and raised) had no problem finding a place to live there, while in Russia he's been overtly or silently rejected by landlords so often that he had to rent the apartment from me for a year despite it being far away from his work.

            • medo-bear a year ago

              i wonder if blatant discrimination against black people is less likely to occur in places like UK, Germany, France, Netherlands, Belgium, Australia, Canada, and US (ok not really the US but lets say for sake of argument) because of the collective guilt white people in those countries carry for colonization, slavery, and other shitty things; whereas slavic and other eastern european people never really did any of those really shitty things to black people. not only that but if i remember correctly ussr and early china provided quite a bit of aid to third world countries in order to overthrow colonialism. that said any discrimination against people is plain disgusting

    • trallnag a year ago

      What were you doing in Russia if it's so bad?

      • IYasha a year ago

        Rollin', Hatin' )

  • metafates a year ago

    It was used for filter lists in the search engine. I think you understand that Google is using the exact same technique. It just hasn't been leaked

  • TylerLives a year ago

    Does anyone know how they used the slurs?

    • marwis a year ago

      Looking at the OP comment before it was flagged, someone just did `s/slave/n****r/g` on the codebase.

      BTW It's pretty ridiculous that American censorship makes it impossible to even call out and criticize racism like in the case of OP.

    • slt2021 a year ago

      banlist of words to filter out from search results, something like that

sciencesama a year ago

Who cares ?

  • konart a year ago

    Same people who use Clickhouse or YDB at least.