Launch HN: Matano (YC W23) – Open-Source Security Lake Platform (SIEM) for AWS

140 points by wizwit999 3 years ago

Hi HN! We’re Shaeq and Samrose, co-founders of Matano (https://matano.dev). Matano is a high-scale, low-cost alternative to traditional SIEM (e.g. Splunk, Elastic) built around a vendor-agnostic security data lake that deploys to your AWS account.

Don’t worry — we’ll explain all this jargon in a second.

SIEM stands for “Security Information and Event Management” and refers to log management tools used by security teams to detect threats from an organization's security logs (network, host, cloud, SaaS audit logs, etc.) and send alerts about them. Security engineers write detection rules inside the SIEM as queries to detect suspicious activity and create alerts. For example, a security engineer could write a detection rule that checks the fields in each CloudTrail log and creates an alert whenever an S3 bucket is modified with public access, to prevent data exfiltration.

Traditional SIEM tools (e.g. Splunk, Elastic) used to analyze security data are difficult to manage for security teams on the cloud. Most don’t scale because they are built on top of a NoSQL database or search engine like Elasticsearch. And they are expensive — the enterprise SIEM vendors have costly ingest-based licenses. Since security data from SaaS and cloud environments can exceed hundreds of terabytes, teams are left with unsatisfactory options: either not collect some data, leave some data unprocessed, pay exorbitant fees to an enterprise vendor, or build their own large-scale solution for data storage (aka “data lake”).

Companies like Apple, HSBC, and Brex do the latter: they build their own security data lakes to analyze their security data without breaking the bank. “Data lake” is jargon for heterogeneous data that is too large to be kept in a standard database and is analyzed directly from object storage like S3. A “security data lake” is a repository of security logs parsed and normalized into a common structure and stored in object storage for cost-effective analysis. Building your own data lake is a fine option if you’re big enough to justify the cost — but most companies can’t afford it.

Then there’s the vendor lock-in issue. SIEM vendors store data in proprietary formats that make it difficult to use outside of their ecosystem. Even with "next-gen" products that leverage data lake technology, it's nearly impossible to swap out your data analytics stack or migrate your security data to another tool because of a tight coupling of systems designed to keep you locked in.

Security programs also suffer because of poor data quality. Most SIEMs today are built as search engines or databases that query unstructured/semi-structured logs. This requires you to heavily index data upfront which is inefficient, expensive and makes it hard to analyze months of data. Writing detection rules requires analysts to use vendor-specific DSLs that lack the flexibility to model complex attacker behaviors. Without structured and normalized data, it is difficult to correlate across data sources and build effective rules that don’t create many false positive alerts.

While the cybersecurity industry has been stuck dealing with these legacy architectures, the data analytics industry has seen a ton of innovation through open-source initiatives such as Apache Iceberg, Parquet, and Arrow, delivering massive cost savings and performance breakthroughs.

We encountered this problem when building out petabyte-scale data platforms at Amazon and Duo Security. We realized that most security teams don't have the resources to build a security data lake in-house or take advantage of modern analytics tools, so they’re stuck with legacy SIEM tools that predate the cloud.

We quit our jobs at AWS and started Matano to close the gap between these two worlds by building an OSS platform that helps security teams leverage the modern data stack (e.g. Spark, Athena, Snowflake) and efficiently analyze security data from all the disparate sources across an organization.

Matano lets you ingest petabytes of security and log data from various sources, store and query them in an open data lake, and create Python detections as code for realtime alerting.

Matano works by normalizing unstructured security logs into a structured realtime data lake in your AWS account. All data is stored in optimized Parquet files in S3 object storage for cost-effective retention and analysis at petabyte scale. To prevent vendor lock-in, Matano uses Apache Iceberg, a new open table format that lets you bring your own analytics stack (Athena, Snowflake, Spark, etc.) and query your data from different tools without having to copy any data. By normalizing fields according to the Elastic Common Schema (ECS), we help you easily search for indicators across your data lake, pivot on common fields, and write detection rules that are agnostic to vendor formats.

We support native integrations to pull security logs from popular SaaS, Cloud, Host, and Network sources and custom JSON/CSV/Text log sources. Matano includes a built-in log transformation pipeline that lets you easily parse and transform logs at ingest time using Vector Remap Language (VRL) without needing additional tools (e.g. Logstash, Cribl).

Matano uses a detection-as-code approach which lets you use Python to implement realtime alerting on your log data, and lets you use standard dev practices by managing rules in Git (test, code review, audit). Advanced detections that correlate across events and alerts can be written using SQL and executed on a scheduled basis.

We built Matano to be fully serverless using technologies like Lambda, S3, and SQS for elastic horizontal scaling. We use Rust and Apache Arrow for high performance. Matano works well with your existing data stack, allowing you to plug in tools like Tableau, Grafana, Metabase, or Quicksight for visualization and use query engines like Snowflake, Athena, or Trino for analysis.

Matano is free and open source software licensed under the Apache-2.0 license. Our use of open table and common schema standards gives you full ownership of your security data in a vendor neutral format. We plan on monetizing by offering a cloud product that includes enterprise and collaborative features to be able to use Matano as a complete replacement to SIEM.

If you're interested to learn more, check out our docs (https://matano.dev/docs), GitHub repository (https://github.com/matanolabs/matano), or visit our website (https://matano.dev).

We’d love to hear about your experiences with SIEM, security data tooling, and anything you’d like to share!

sullivanmatt 3 years ago

This issue exists to the right of your solution and is (for now) out of scope, but the biggest issue I have with security data lakes is the need to (easily) get both row-based data and visualizations. Back when I had access to a well-built and cared for Splunk environment, I would constantly run queries, build visualizations, go back to the results index, tweak the query, go back to viz, etc. This feedback loop is important and allows for fast iteration, especially if you are conducting a high-stakes investigation and need answers rapidly. I should be able to look at my available fields and tweak the viz accordingly in under a few seconds; preferably in one mouse click.

Now I live on an ELK stack and I experience nothing but full-time agony as I switch between Kibana and Kibana Lens constantly. It's clear they are two completely separate "products" built for different use-cases. The experience reminds you constantly that they were not purpose-built for how I use them, unlike Splunk.

Increasingly we are moving towards the reality of a security data lake, and all I can think is that I'm about to lose what little power I had left as I have to move to something like Mode, Sisense, or Tableau which again, were not purpose-built for these use-cases and even further separate the query/data discovery and visualization layers.

I hate how crufty and slow Splunk has gotten as an organization, and they use their accomplishments from 15 years ago to justify the exorbitant price they charge. I really hope the OSS/next-gen SaaS options can fill this need and security data lake becomes a reality. But for that to happen, more focus is needed on the user experience as well.

Regardless, very cool stuff and could definitely fill a need for organizations that are just starting to dip toes into security data lakes. I wish you success!

shaeqahmed 3 years ago

I completely agree with you and the need for a fully integrated solution with great visualizations without hosting additional tools that aren't purpose built! Unfortunately there are very few SIEMs that get this right today..
Here's how we are thinking of it. We think it's important for a successful security program to first have high quality data and this is why we want help every organization build structured security data lakes to power their analysis using our open source project. The Matano security lake can sit alongside their SIEM and be incrementally adopted for a data sources that wouldn't be feasible to analyze otherwise.
Our larger goal as a company though is to build a complete platform that allows a security data lake to fully replace traditional SIEM -- including a UI and collaborative features that give you that great feedback loop for fast iteration in detection engineering and threat hunting as you mentioned. Stay tuned I think you will be excited by what we are building!
- sullivanmatt 3 years ago
  
  For sure. Pull a dbt and get everybody hooked on your tool, then slap a SaaS platform ecosystem to the farthest right and watch the revenue flow.
  
  mox1 3 years ago
  
  Splunk is HEAVILY pushing their SaaS offering at the moment. They are the most obnoxious vendor we currently deal with.
  We are fine on prem, pay big $$ license fees, but not enough. They want that sweet SaaS revenue.
  I would be wary of pushing this, being a non-SaaS platform could be an advantage here.
  
  tomhallett 3 years ago
  
  I’m assuming the difference is: “big $$ license fees” for on-prem is $X a year, while “sweet saas revenue” is $A a year, $B per user, $C for compute, $D for storage, and $E for requests.
  As a large company, what are the things you are more than happy to pay for with on-prem?
  The reason I’m asking: this feels like the largest issue with cloud saas, which is one of the more popular implementations of open-core for B2B. Not saying Splunk is open-core, but it’s related to above/dbt cloud discussion.
  Enterprise customers have the highest propensity to pay, but don’t need or want their cloud offering.
  Mid-tier customers actually prefer a managed service by their cloud provider, aws/gcp/azure, because it strikes a balance between easy AND it works within their vpc/iam/devops. But this cuts off open-core companies main revenue, so they start making ELv2 licenses (elastic, airbyte, etc) which makes things harder on mid-tier.
  Small customers are the ones who love saas the most, but have the least ability to pay, have the least need for powerful tools, and will probably grow out of being a small customer…
  I’m curious if there are any companies which are: source code available, commercial license, allow you to fork/modify the source code, only offer on-prem (no cloud saas offering), want the mega-clouds to offer a managed service. BUT the commercial license requires any companies over 250 employees or $X revenue (docker desktop style) to pay a yearly license fee.
  
  feanaro 3 years ago
  
  Indeed, no more SaaS. I've had enough of this cloud nonsense already.
  
  baetylus 3 years ago
  
  Say more? Are you tired of it personally or is it troublesome at work?
wrldos 3 years ago

The biggest issue I have with data lakes is they always without fail turn into a data cesspool. The more you add the less ROI you get out. And yes using Splunk as an example it becomes an organisational cost problem. I have spent way too many hours arguing with them over billing.
The only viable solution is design metrics into your platform properly from the ground up rather than trying to suck them out of a noisy datasource for megabucks.

mdaniel 3 years ago

I loaded the GitHub link, bracing myself for yet another AGPL license, but no, it's Apache 2! So I wanted to say thank you for that and I hope to take a deeper look when I'm back at my desk because trying to keep Splunk alive and happy is a monster pain point. There are so many data sources we'd love to throw at it but we don't have the emotional energy to put up with Splunk crying about it

shaeqahmed 3 years ago

Thank you! We definitely believe in open source and don't need AGPL. Sending you love as you deal with that Splunk instance.
P.S. feel free to open some issues for any log sources you'd like to see supported in Matano

jillesvangurp 3 years ago

Interesting project.

A few remarks though.

- Doing real time data processing on tera/peta bytes involves a lot of IO, which is a significant part of of the cost in AWS. Things like Athena are simply not cheap to run at that scale.

- With time series data, the emphasis is usually on querying recent data, not all of the data. You retain older data for auditing for some time. But this can essentially be cold storage.

- Especially alerting related querying is effectively against recent data only. There's no good reason for this to be slow.

- People tend to scale Elasticsearch for the whole data set instead of just recent data. However, with suitable data stream and index life cycle management policies, you can contain the cost quite effectively.

- Elastic Common Schema is nice but also adds a lot of verbosity to your data, and queries. Bloating individual log entries to a KB or more. Parquet is a nice option for sparsely populated column oriented data of course. Probably the online disk storage is not massively different from a well tuned elastic index.

- Elastic and Opensearch have both announced stateless as a their next goal. So, architecturally similar to this and easier to scale horizontally.

- SIEM is just one use case. What about APM, log analytics, and other time series data? Security events usually involve looking at all of that.

shaeqahmed 3 years ago

Matano is completely serverless and stores all data in ZSTD compressed parquet files in dirt-cheap object storage, allowing you to bring your own analytics stack for queries on large amounts of data for things like investigations and threat hunts. Since we store data in a columnar format and plug in query engines like Snowflake that are optimized for analytical processing the queries on specific columns will run much faster than they would run if executed on a search engine database like Elasticsearch which would require maintenance to scale.
I think it's important to understand that search engines and OLAP/data warehouse query engines have fundamental architectural differences that offer pros/cons for different use cases.
For enterprise security analytics on things like network or endpoint logs which can hit 10-100TB+/day, using anything other than a data lake is simply not a cost-effective option. Apache Iceberg was created as a big data table format for exactly this type of use case at companies like Netflix and Apple.
- jillesvangurp 3 years ago
  
  You are not wrong, but I do think realtime and olap have been converging a bit for a while.
  Stateless elasticsearch and opensearch are actually moving to a similar model as what Matano proposes. They both have made announcements for stateless versions of their mutual forks. Data at rest with that will live in s3 and there are no more clusters, just auto scaling ingest and query nodes that coordinate via s3 and that allow you to scale your writes and reads independently. Internal elasticearch and opensearch data formats are of course heavily optimized and compact as well. Recent versions have e.g. added some more compression options and sparse colunn data support.
  But they are also optimized for good read performance. There's a tradeoff. If you write once and read rarely, you'd use more heavy compression. If you expect to query vast amounts of data regularly, you need something more optimal because it takes CPU overhead to de-compress.
  For search and aggregations, you either have an index or you basically need to scan through the entirety of your data. Athena does that. It's not cheap. Lambda functions still have to run somewhere and receive data. They don't run locally to buckets. Ultimately you pay for compute, bandwidth, and memory. Storing data is cheap but using it is not. That's the whole premise of how AWS makes money.
  Splunk and Elasticsearch are explicitly aimed at real-time use cases (dashboards, alerts, etc.), which is also what Matano seems to be targeting. But it can also deal with cold storage. Index life cycle management allows you to move data from hot, warm, and cold storage. Cold here means snapshot in S3 that can be restored on demand for querying. It also has rollovers and a few other mechanisms to save a bit on storage. So, it's not that black and white.
  Computing close to where the data lives is a good way to scale. Having indexing and caching can cut down on query overhead. That's hard with lambdas, athena, and related technology. But those are more suited for one off queries where you don't care that it might take a few seconds/minutes/hours to run. Different use case.

molsongolden 3 years ago

Excited to give this a try and follow your progress!

In case anybody else is wondering how Matano compares to Panther (my first thought reading this launch post) there's a comparison on the Matano website[0].

Quick note to the Matano team, the "Elastic Common Schema (ECS)" link in the readme[1] seems to be broken.

[0] https://www.matano.dev/alternative-to/panther

[1] https://github.com/matanolabs/matano#-log-transformation--da...

wizwit999 3 years ago

Thank you, fixed the link!

protoduction 3 years ago

Hi Shaeq and Samrose - congrats on the launch! Matano looks great.

Out of curiosity, at some point I believe you were working on a predecessor called AppTrail whic tackled (customer-facing) audit logs, it was something I was interested in at the time (and still am! I would've loved to use that).

Would you perhaps be willing to share your learnings from that product, and (I assume) why it evolved into Matano?

shaeqahmed 3 years ago

Thank you! Yes with AppTrail we wanted to solve the pain points around SaaS audit logs but since it was a product that needed to be sold and integrated into B2B startups rather than the enterprises that felt the pain points and needed audit logs in their SIEM, we couldn't find a big enough market to sell it.
We realized that the big problem was that most SIEM out there today did a poor job with pulling and handling the data from the multitude of SaaS and Cloud log sources that orgs have today, and decided to build Matano as a cloud-native SIEM alternative :)

bovermyer 3 years ago

Oh, this very much has my attention. I'll be checking this out in depth.

brunes 3 years ago

How do you position this against AWS's own Security Lake announced at re:Invent in November (https://aws.amazon.com/security-lake/) ?

Your architecture diagram looks like a carbon copy of theirs.

shaeqahmed 3 years ago

We launched before Amazon Security Lake :)
Amazon Security Lake's main value prop is that it is a single place where AWS / partner security logs can be stored and sent to downstream vendors. As such, Amazon only writes OCSF normalized logs to the parquet-based data lake for it's own data in a fully managed way (VPC flow logs, Cloudtrail, etc.) and leaves it to the customers to handle the rest.
For partner sources, the integration approach has been to tell customers to set up infrastructure themselves to accomplish OCSF normalization, parquet conversion, etc. For example, here is okta's guide using Firehose and Lambda, https://www.okta.com/blog/2022/11/an-automated-approach-to-c...
The Amazon Security Lake offering is built on top of Lake Formation, which itself is an abstraction around services such as Glue, Athena, and S3. Security Lake is built using the legacy Hive style approach and does not use Athena Iceberg. There is a per-data cost associated with the service, in addition to the costs incurred by other services for your data lake. Looks like the primary use case of the service is being able to store first-party AWS logs across all your accounts in a data lake and being able to route them to analytical partners (SIEM) without much effort. It does not seem very useful for an organization that is looking to build its own security data lake with more advanced features, as you will still have to do all the work yourself.
Matano, has a broader goal to help orgs in every step of transforming, normalizing, enriching and storing all of their security logs into a structured data lake, as well as giving users a platform to build detection-as-code using Python & SQL for correlation on top of it (SIEM augmentation/alternative). All processing and data lake management (conversion to parquet, data compaction, table management) is fully automated by Matano, and users do not need to write any custom code to onboard data sources.
Matano can ingest data from Cloud, Endpoint, SaaS, and practically any custom source using the in-built Log transformation pipeline (think serverless Logstash). We are built around the Elastic Common Schema, and use Apache Iceberg (ACID support, recommended for Athena V2+). Matano's data lake is also vendor neutral and can be queried by any Iceberg-compatible engine without having to copy any data around (Snowflake, Spark, etc.).

boundlessdreamz 3 years ago

What distinguishes a SIEM from traditional log analysis? I know the feature set is oriented towards SIEM but it seems like a super set of regular log analysis. I don't have a need for a SIEM now but this looks good even for non security logs.

shaeqahmed 3 years ago

Yep, SIEM is just a superset of Log Management as it needs to do things like alerting + correlation + detection etc. in addition to ingesting logs to be considered a SIEM.
It is a common use case to send application logs along with security logs to something like Matano or Splunk for analysis as well, so feel free to use Matano to analyze your non-security logs!
Do keep in mind this will be a better fit if you have structured logs (you can also use VRL transformation to parse them at ingest) as the query language will be SQL.

waihtis 3 years ago

I'm a vendor in the cyberspace so not a potential customer (feel free not to waste time answering) but am just intellectually curious who you're targeting this at. High-skill tech companies who are just building up a security program? I don't see most security teams building their own SIEM'ish solution just because they really don't have the chops or resource to do it. OTOH, it would be a big rip-out operation for F100 companies to change to this from Splunk et al.

shaeqahmed 3 years ago

Many enterprises using Splunk are already being forced to purchase products like Cribl to route some of their data to a data lake because writing it all to Splunk is just way too expensive at that scale 1-100TB+/day (7 figures $).
But a data lake shouldn't just be a dump of data right? Matano OSS helps organizations build high value data lakes in S3 and reduce their dependency on SIEM by centralizing high throughput data in object storage using Matano to power investigations. To give you an example, one company is using Matano to collect, normalize, and store VPC Flow logs from hundreds of AWS accounts which was too expensive with traditional SIEM.
Matano is also completely serverless and automates the maintenance of all resources/tables using IaC so it's perfect for smaller security teams on the cloud dealing with a large amount of data and wanting to use a modern data stack to analyze it.
- waihtis 3 years ago
  
  nice thanks, makes a lot of sense
lmeyerov 3 years ago

(not them, but in this space with major enterprises and gov agencies deploying Graphistry)
We are pretty active here with security cloud/on-prem data lakes teams as a way to augment their Splunk with something more affordable & responsive for bigger datasets. Imagine stuffing netflow or winlogs somewhere at TB/PB scale and not selling your first born child. A replacement/fresh story may happen at a young/midage tech co, and a bunch of startups pitching that. But for most co's, we see augmentation and still needing to support on-prem detection & response flows.
It's pretty commodity now to dump into say databricks, and we work with teams on our intelligence tier with GPU visual analytics, GPU AI, GPU graph correlation, etc. to make that usable. Most use us to make sense of regular alert data in Splunk/neo4j/etc. However, it's pretty exciting when we do something like looking at vpc flow logs from a cloud-native system like databricks and can thus look at more session context and run fusion AI jobs like generating correlation IDs for pivoting + visualizing.
Serverless is def interesting but I've only seen for light orchestration. Everyone big has on-prem footprint which is an extra bit of fun for the orchestration vs investigation side.
- waihtis 3 years ago
  
  Thanks, this is interesting. I work a bit closer to the "source" as in doing detection things on on-prem & cloud side, so not too well-versed on the data processing and management side.

debarshri 3 years ago

I have been exploring this realm of SIEM, XDR, NDR etc. Sure, all proprietary SIEMs are expensive. But what is not clear is how you are going to price it. Security teams have dedicated budget. If you are coming cheaper than them, they you are destroying your TAM because I know customer would not mind paying those license fees. OSS GTM might work but might against your TAM.

0x4e53 3 years ago

At least from my time at SpaceX - this is untrue.
SIEM costs were rapidly ballooning, and we were being charged by RAM. RAM?? Of all things!!
After our SIEM costs for ELK ramped up to where Splunk was - we just bought Splunk instead. I imagine there are many security teams out there that would entertain a cheaper alternative that isn't priced by RAM.
- slt2021 3 years ago
  
  the reason for that is near real-time detection of threats requires aggregation of terabytes of data according to rules (continuous GROUP BY on thousands columns on a sliding window) - and these aggregates by design have to be stored in RAM.
  Otherwise these detections stop being near-realtime and become offline detection instead, just like any other sql server.
  
  0x4e53 3 years ago
  
  To be clear - we were hosting on-premise, and being charged for our own RAM. Servers we had to buy, and then pay for the privilege of using with ELK.
shaeqahmed 3 years ago

We think building a more efficient solution using data lakes is a win-win because it unlocks additional use cases for customers and allows them to analyze larger datasets within the same budget.
Solutions that offer a magnitude of order better performance than what is available today are critical for the industry because the amount of data teams are dealing with is growing much faster than their budgets!

badrabbit 3 years ago

What distinguishes Matano'd existing or planned products from Google Chronicle? Would you have any limits on data ingestion or retention?

Also, python detections sounds horrible! I love python but it sounds like you haven't considered the challenges of detection engineering. This one of my main "expertise" if you will. You should think more in the lines of flexible sql than python. People who write detection rules to the most part don't know python and even if they do it would be a nightmare to use for many reasons.

I hope someone from your team reads this comment: DO NOT try to invent your own query language but if you do, DON'T start from scratch. Your product could be the best people who like the fabulous splunk need to also like it. And for a security data lake, you must support Sigma rule conversion into your query/rule format. Python is a general purpose language, there are very good reasons why no one else from Splunk,elastic, graylog, Google,Microsoft use Python. Don't learn this hard lesson with your own money. Querying it needs to be very simple and most importantly you need to support regex with capture groups and the equivalent of "|stats" command from splunk if you want to quickly capture market share. I have used and evaluated many of these tools and have written a lot of detection content.

Your users are not coders, DB admins or exploit developers. They are really smart people whose focus is understand threat actors and responding to incidents -- not coding or anything sophisticated. FAANG background founders/devs have a hard time grasping this reality.

shaeqahmed 3 years ago

Some big differences:
- Matano has realtime Python + SQL detections as code with advanced correlation support. Chronicle uses inflexible YARA-like detection rules iirc
- Matano supports Sigma detections by automatically transpiling them to the Python detection format
- Matano has an OSS Vendor Agnostic Security Data Lake and can work with multiple clouds / let's you bring your own query engine (Snowflake, Spark, Athena, BigQuery Omni). Chronicle is a proprietary SIEM that uses BigQuery under the hood and cannot be used with other tooling.
There are no limits on data retention or ingestion with Matano, it's your S3 bucket and the compute scales horizontally.
- badrabbit 3 years ago
  
  Thanks for the response. Chronice uses Yara-l and bigquery uses sql on steroids. Both are difficult to start working with them. I would want someone that has never even looked at python code to be able to query the data. Having a different query langauge than detection language is also a big problem (e.g.graylog). I will keep an open mind, I prefer python but it is not ideal for getting a wider audience (general IT staff) to use it. Junior staff prefer chronicle over splunk because they can put in an IP or domain and just get results. Now ask them to learn python and you have a revolt.
  I looked at your sample detection on the home page. This is have for me but I can't get others to use it. I promise you, doing a little market research on thid outside of the tech bubble will save you a lot of money and resources.
  
  shaeqahmed 3 years ago
  
  Long term, I believe Python (along with good ol' SQL for correlation) is the best language to model the kind of attacker behaviours companies are dealing with in the cloud and a lot of the difficulties with it are not inherent but around tooling. For example, in our cloud offering we plan on building abstractions that let you search for an IP or domain and get results with a click of a button as well the ability to automatically import Sigma rules and test Python logic directly with an instant feedback loop of a "low-code" workflow.
  Currently we focus on more modern companies with smaller teams that have engineers that can write Python detections and actually prefer it over a custom DSL that needs to be learned and has restrictions.
  Keep in mind there are more people in general that know Python than are trained in a vendor-specfic DSL so perhaps long term the role of a security analyst will evolve to overlap with that of an engineer. We are already seeing more and more roles require basic proficiency in Python as attacks on the cloud become increasingly complex :)
  
  badrabbit 3 years ago
  
  > get results with a click of a button as well the ability to automatically import Sigma rules and test Python logic directly with an instant feedback loop of a "low-code" workflow.
  Ok, so importing sigma rules is the easy part, it takes on average 2-3 hours of tuning a sigma imported rule to get it to where it is usable in a large environment where you have all sorts of false positives. The language in question should not me making a fuss about indentation or importing the right module. You never (to my knowledge) need loops, classes,etc... Python is great just not purpose built for this use case. Most companies, even fortune 50 companies can't get many people in their security team who know or are willing to learn python well. You need someone to write/maintain it, someome to review it and the people responding to detections would want to read it and understand it. I am not saying python is difficult, just that you have to take the time to learn it. Detection engineering is all about matching strings and numbers and analyzing or making decisions on them. You have to encode/decode things in python, deal with all kinds of exceptions, it is very involved compared to the alternatives like eql,spl,yara-l,etc... But them again, maybe your customers who want to run their own siem datalake in the cloud might also have armies of python coders. But generally speaking, it is rare (but it happens) to find people interested in learning python but also doing boring blue team work. I would love python so long as I don't have to deal with newer python versions requiring refactoring rules.
  > Currently we focus on more modern companies with smaller teams that have engineers that can write Python detections and actually prefer it over a custom DSL that needs to be learned and has restrictions.
  Fair enough, honestly, if your focus is silicon valley Python is great. You will just get a reputation about what your product demands of users if you ever want to branch out. The only time I have ever done a coding interview wss with a statup like typical YC funded type company. I am just warning you the world is different outside the bubble. I would want to recommend your product and I will probably mention it to others but it looks like you know what you want.
  > Keep in mind there are more people in general that know Python than are trained in a vendor-specfic DSL so perhaps long term the role of a security analyst will evolve to overlap with that of an engineer. We are already seeing more and more roles require basic proficiency in Python as attacks on the cloud become increasingly complex :)
  Attacks on the cloud are not that complex but python does not make the job easier just more complicated. And I spend at least 10-15% of my time writing Python so I am not hating on it.
  The golden standard is splunk. Nothing, I mean absolutley no technology exists that even comes close to splunk. Not by miles. Not any DSL or programming language. Do you know why CS Falcon is the #1 EDR, similary all alone at the top? Splunk!
  Even people that leave splunk to start a competitor like Graylog and Cribl can't get close.
  A detection engineer is a data analyst (not a scientist or researcher) that understands threat actor's TTPs and the enterprise they are defending well. I wish I wasn't typing on mobile so I can give you an example of what I mean. None of the sigma rules out there come close to the complexity of some the rules I have seen or written. Primarily, I need to piece together conditions and analysis functions rapidly to generate some content and ideally be able to visualise it. It doesn't matter how good you are with a language, can you work with it easily and rapidly enough to analyze the data and make sense of it? Maybe you can get python to do this, I haven't tried. But you are not going to compete with Splunk or Kusto like that. The workflow is more akin to shell scripting than coding where you cam easily pipe and redirect IO.
  E.g.: "Find GCP service account authentication where subsequently the account was used to perform operations it rarely performs and from IP addresses located in countries from which there has not been a recent login for that project in the last 60 days"
  I am just giving you an example of what a detection engineer might want to do, especially if they've been spoiled by something like Kusto or Splunk SPL. That's the future not simple matches.
  The role of a security analyst and engineer already overlap, modern security teams have a lot of cross functionality where everyone in part is involved and embedded with other teams that have the same security objective (detect threat actors in this case).
  Just to show you the state of things. I can't get a team solely dedicated to security automation to write one simple python script to solve humongoud problem that we have. We spent over a year arguing and battling over the solution. The hangup was they would be responsible for maintaining it, so instead they wanted an outside vendor to do it.
  At a completely different company, I wrote a small python script to make my life analyzing certain incidents easier and that started a territorial battle.
  What you consider a simple python script requires consultants and many meetings amd reviews and approvals at bigcompanies and they're constantly worried the guy who wrote the script can't be replaces easily. I am telling you all this so you understand how people making purchasing decisions think: they are very much on the buy side of things than build once a company gets to a certain size when their main business is not technology services.
  So I hope you also think about offering managed detection/maintenace services (kind of like where socprime is going) in the long term.
  Finally, I think your strategy is to have an exit/ipo in a few years and if you don't see long past that, I have no doubt you will succeed. And I am very happy to hear about another player in this field. I have even pitched building an in house datalake solution similar to this except with tiered storage where you drop/enrich/summarize data at each tier (you need lots of data immediately but less details and more analysis of that data at each tier).
  I wish you the best of luck!
  
  shaeqahmed 3 years ago
  
  Thank you for typing up a long detailed response. I think a lot of the points and concerns you bring up are valid, and we are mostly agreed upon.
  In Matano however, we see Python as a viable component in security operations for narrowly tracking atomic signals while the language for writing detections and hunting threats will be SQL, which works perfectly well for use cases like the detection example you provided, albeit verbose. We have thought of also building a transpiler that would let analysts actually use the succinct syntax of SPL and compile that to SQL under the hood. This could be a great way to get adoption in companies where using Python would be difficult.
  If you are interested, I would love to find some time to chat and share thoughts. Can you email me at shaeq at matano dot dev?
  
  badrabbit 3 years ago
  
  Thanks for the well thought out response. I hope Matano succeeds. I can't email you since my hn presence isn't public/social but I might be involved in evaluating your product some day soon and would chat and share thoughts with your folks then.
  
  carom 3 years ago
  
  Do you have contact info to consult about this stuff in a few months? Building something adjacent and analyst usability is top of mind.
  I started my career doing detections (Snort / ClamAV) but have been out of the loop doing development for a while. A fresh perspective would be helpful.
  
  badrabbit 3 years ago
  
  Sorry,can't consult but if I see a post from you asking about this, I will be sure to respond/discuss that in detail.

alecbell 3 years ago

This is awesome. Nice work open-sourcing it! I used Splunk at Expedia and it was super expensive and slow. While I wasn't using it for security purposes, it could take 15-30 min for us to detect error logs, and I can imagine that's not okay for security purposes. Good luck guys!

sullivanmatt 3 years ago

Wowsa. Somebody didn't do their job right if it took anywhere near that amount of time to get logs back. Sorry it was so painful.

slt2021 3 years ago

Question to Matano authors - won't your solution simply enrich AWS by blowing up my cloud bill ?

Did you estimate how many times lambda will get invoked and what will be AWS bill for 1 million events ingested? I am curious to learn the price to pay for serverless SIEM

shaeqahmed 3 years ago

The code is written in high performance multi-threaded Rust and uses the [1] Arrow compute framework. We also batch events and target about 32MB of event data per lambda invocations. As a result it can process tens of thousands of events per second per thread, depending on the number of transformations.
That said, we are working on performance estimates and a benchmark on some real world data for Matano to help users like you better understand the cost factors. Stay tuned.
[1] https://github.com/jorgecarleitao/arrow2

wdb 3 years ago

Anyone aware of a similar solution for Google Cloud / GCP?

shaeqahmed 3 years ago

We are working on a solution for GCP and Azure :) GCP recently announced Iceberg support with BigLake and support for federation across multi-cloud lakes so it would be perfect use cases.
If you are interested in using Matano for GCP, feel free to reach out and join our Discord community! We are FOSS so would love to collaborate on a solution.
- parkerhiggins 3 years ago
  
  Definitely looking forward to GCP.

simonebrunozzi 3 years ago

Shaeq and Samrose: for us investors here, where are you in terms of fundraising? I'm an ex AWS (google me, you'll have a few laughs!), turned VC in the past few years. $HN_username at gmail if you want to reach out and chat!

Edit: here's me with Andy, from a millenium ago [0].

[0]: https://www.youtube.com/watch?v=bWL0_Xdntzo&t=2907s

napolux 3 years ago

Super random question... I wonder if the name is related to Frank Matano, the italian youtuber/comedian.

simonebrunozzi 3 years ago

Funny, I had the same in mind! (he's really funny!)
shaeqahmed 3 years ago

Nope, Matano is named after one of the deepest lakes in the world - Lake Matano of Indonesia ;)
- napolux 3 years ago
  
  TIL
  Thanks!