Show HN: Open-Source AI Embedding Pre-Processing Editor

Vasyl_R 2 years ago

We are thrilled to open-source Embedditor.AI.

Embedditor is the open-source MS Word equivalent for embedding pre-processing. It helps you get the most out of your vector search while saving up to 30% on embedding and storage costs.

This solution is inspired by the experiences of over 30,000 IngestAI users. Our insights revealed a common bottleneck in AI and LLM-related applications, one that goes beyond LLM hallucinations or token limits, which are far easier to resolve. The prevailing issue lies in the GIGO (garbage in, garbage out) principle.

There is no one-size-fits-all approach to chunking and embedding: certain models excel with individual sentences, while others thrive on chunks of 250 to 500 tokens. Blindly splitting content by a fixed number of characters or tokens, and embedding it without normalization and with up to 30% redundant noise (such as punctuation, stop-words, and low-relevance frequent terms), often leads to suboptimal vector search results and low-performing LLM applications that rely on semantic or generative search. Trying to improve vector search with existing tools proved as challenging for our users as creating an outstanding document in a basic .txt format.
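To make the difference concrete, here is a minimal sketch contrasting a blind character split with a sentence-aware split under a token budget. It is only an illustration, not Embedditor's code: the function names are made up for this example, and "tokens" are approximated by whitespace-separated words rather than a real model tokenizer.

```python
import re

def split_by_chars(text, size):
    """Blind split: cut every `size` characters, even mid-sentence."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def split_by_sentences(text, max_tokens):
    """Group whole sentences into chunks of at most `max_tokens` words.

    Assumption: words stand in for tokens. A single sentence longer than
    the budget is kept whole as its own (oversized) chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        # Start a new chunk if adding this sentence would exceed the budget.
        if current and count + n > max_tokens:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

text = ("Vector search needs clean input. "
        "Chunks should respect sentence boundaries. "
        "Noise hurts recall.")
char_chunks = split_by_chars(text, 40)          # may cut a sentence in half
sentence_chunks = split_by_sentences(text, 8)   # never cuts inside a sentence
```

The character split is simpler but can leave half a sentence in each chunk, which is exactly the kind of input that tends to hurt retrieval quality.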

We decided to address the root problem, so we developed Embedditor, the Microsoft Word equivalent for embedding pre-processing. It enables users with no background in data science or technical skills to improve the performance of their vector search while saving up to 40% on embedding and storage. We've made Embedditor open-source and accessible to all because we genuinely believe that by improving vector search performance and cost-efficiency at the same time, Embedditor may have a significant impact on the current NLP and LLM industry.

>>>FEATURES

>>>RICH EDITOR GUI

->Join and split one or multiple chunks with a few clicks;

->Edit embedding metadata and tokens;

->Exclude words, sentences, or even parts of chunks from embedding;

->Select the parts of a chunk you want to be embedded;

->Add additional information to your embeddings, like URL links or images;

->Get nice-looking HTML markup for your AI search results;

->Save your pre-processed embedding files in .veml or .json formats;

>>>PRE-PROCESSING AUTOMATION

->Filter out most of the 'noise', like punctuation or stop-words, before vectorization;

->Remove insignificant, frequently used words from embedding with the TF-IDF algorithm;

->Normalize your embedding tokens before vectorization;
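
For readers who want a feel for what these three automation steps do, here is a rough standard-library sketch. It is an assumption-laden illustration, not Embedditor's actual implementation: the stop-word list, the function names, and the `min_score` threshold are all hypothetical, and the TF-IDF variant used is the plain `tf * log(N / df)` form.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (hypothetical; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "to", "in", "for"}

def normalize(text):
    """Lowercase the text and drop punctuation, returning word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def tfidf_filter(chunks_tokens, min_score):
    """Keep only terms whose TF-IDF score in their chunk is >= min_score.

    Terms that occur in every chunk get idf = log(1) = 0 and are removed.
    """
    n = len(chunks_tokens)
    # Document frequency: in how many chunks does each term appear?
    df = Counter(term for tokens in chunks_tokens for term in set(tokens))
    filtered = []
    for tokens in chunks_tokens:
        tf = Counter(tokens)
        total = len(tokens) or 1
        keep = {t for t in tf
                if (tf[t] / total) * math.log(n / df[t]) >= min_score}
        filtered.append([t for t in tokens if t in keep])
    return filtered

def preprocess(chunks, min_score=0.01):
    """Normalize, remove stop-words, then apply the TF-IDF filter."""
    tokenized = [remove_stop_words(normalize(c)) for c in chunks]
    return [" ".join(tokens) for tokens in tfidf_filter(tokenized, min_score)]

chunks = [
    "The product is a fast vector database.",
    "The product is an editor for embeddings!",
    "The product supports TF-IDF filtering.",
]
cleaned = preprocess(chunks)
```

In this toy input the word "product" appears in every chunk, so it carries no distinguishing signal and is dropped along with the stop-words and punctuation; what remains is the text that would actually be sent for embedding.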

GitHub: https://github.com/embedditor/embedditor

We hope you love it, and we would love to hear your feedback.