Episode 104. It's all about Apache Tika, the project that lets you index EVERYTHING.
Description
We continue to bring guests onto the show to talk about interesting things, and this time the topic is Apache Tika, an incredible tool for file processing and metadata extraction in support of search. Suppose you have tons of unstructured files, like emails or documents, and you want to extract their content, index it, and then search it. That is Tika's purpose. And who better to walk us through how it does its magic than the chair of its Project Management Committee (PMC), Tim Allison! So take a listen as we go deeper on ingesting tons of content (which is fundamental for things like training LLMs).

http://www.javapubhouse.com/datadog
We thank DataDogHQ for sponsoring this podcast episode.

Don't forget to SUBSCRIBE to our cool NewsCast OffHeap! http://www.javaoffheap.com/

Apache Tika
* https://tika.apache.org/

OpenSearch Project and OpenSearch Neural Plugin Tutorials
* https://opensearch.org/
* https://opensearch.org/docs/latest/search-plugins/neural-search/
* https://opster.com/guides/opensearch/opensearch-machine-learning/how-to-set-up-vector-search-in-opensearch/
* https://opster.com/guides/opensearch/opensearch-machine-learning/opensearch-hybrid-search/
* https://sease.io/2024/01/opensearch-knn-plugin-tutorial.html
* https://sease.io/2024/04/opensearch-neural-search-tutorial-hybrid-search.html

Selected Advanced File Processing toolkits/services
* https://unstructured.io/
* https://aws.amazon.com/textract/
* https://azure.microsoft.com/en-us/products/ai-services/ai-document-intelligence

Selected Hybrid Search/RAG toolkits (there are _MANY_ others!)
* Haystack: https://haystack.deepset.ai/
* LangChain: https://www.langchain.com/
* LangStream: https://langstream.ai/

Search/Relevance Conferences
* https://haystackconf.com/
* https://2024.berlinbuzzwords.de/
* https://mices.co/

Tim's personal project
* JavaFX (ahem) tika-config writer UI: https://github.com/tballison/tika-gui-v2

Do you like the episodes? Want more? Help us out! Buy us a beer! https://www.javapubhouse.com/beer
And follow us! https://www.twitter.com/javapubhouse
More Episodes
We have a great time talking to Matt Topol from Voltron Data on one of his Apache Software Foundation projects called Apache Arrow. It's both a spec and implementation of a columnar data format that is not only efficient, but cross-language compatible. We walk through the scenarios that it covers...
Published 03/19/24
Ok, so it's an incredible time to be in the Java ecosystem, and one of the biggest frameworks out there just dropped its three-point-oh version! That's right! Spring Boot is now officially 3.0, and it has Java 17 as its baseline! (oohh!!). So we brought in the big guns to talk about what does...
Published 02/16/23
Whew! So we took a big break over the summer (like Bob said, we were just swamped with work.. oof), but we are BACK! And like always, we are ready to explore even deeper Java topics for the professional developer. This time we set our sights on Apache Kafka, one of the (if not THE) dominant...
Published 11/08/22