Show HN: Samchika – A Java Library for Fast, Multithreaded File Processing
29 comments
·May 23, 2025
Calzifer
for(int i=0;i<10000; ++i){
// do nothing just compute hash again and again.
hash = str.hashCode();
}
https://github.com/MayankPratap/Samchika/blob/ebf45acad1963d...
"Do nothing" is correct, "again and again" not so much. Java caches the hash code for Strings, and since the JIT knows that (at least in recent versions[1]), it might even remove this loop entirely.
hyperpape
Even in older versions, if the compiler can see that there are no side-effects, it is free to remove the loop and simply return the value from the first iteration.
I'm actually pretty curious to see what this method does on versions that don't have the optimization to treat hashCodes as quasi-final.
A quick test using Java 17 shows it's not being optimized away _completely_, but it's taking...~1 ns per iteration, which is not enough to compute a hash code.
Edit: I'm being silly. It will just compute the hashcode the first time, and then repeatedly check that it's cached and return it. So the JIT doesn't have to do any _real_ work to make this skip the hash code calculation.
So most likely, the effective code is:
computeHashCode();
for (int i = 0; i < 10000; i++) {
    if (false) { // pretend this wouldn't have dead code elimination, and the boolean is actually checked
        computeHashCode();
    }
}
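For readers who want to see the caching behaviour described above in isolation, here is a minimal sketch. It is a simplified model of the pattern, not the actual java.lang.String source; the class name CachedHashDemo and the computations counter are made up for illustration.

```java
// Simplified model of a lazily cached hash code, as discussed above.
// NOT the real java.lang.String implementation; names here are hypothetical.
final class CachedHashDemo {
    private final char[] value;
    private int hash;        // 0 means "not computed yet" (or the hash really is 0)
    int computations = 0;    // illustration-only counter of real hash computations

    CachedHashDemo(String s) {
        this.value = s.toCharArray();
    }

    @Override
    public int hashCode() {
        int h = hash;
        if (h == 0 && value.length > 0) {
            computations++;
            // Same polynomial that String.hashCode() is documented to use.
            for (char c : value) {
                h = 31 * h + c;
            }
            hash = h;
        }
        return h;
    }

    public static void main(String[] args) {
        CachedHashDemo s = new CachedHashDemo("hello world");
        int last = 0;
        for (int i = 0; i < 10_000; i++) {
            last = s.hashCode(); // only the first call does real work
        }
        System.out.println("hash = " + last + ", computed " + s.computations + " time(s)");
    }
}
```

After the first call, every later call is just a field read and a comparison, which fits the roughly 1 ns per iteration measured above.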
sureglymop
Perhaps I misunderstand something but doesn't reading from a file require a system call? And when there is a system call, the context switches? So wouldn't using multiple threads to read from a file mean that they can't really read in parallel anyway because they block each other when executing that system call?
porridgeraisin
> system call, the context switches
No, there is no separate kernel "executing". When you do a syscall, your thread switches to kernel mode and executes the function behind the syscall; when it's done, your thread reverts to user mode.
A context switch is when one thread is being swapped out for another. Now the syscall could internally spawn a thread and context switch to that, but I'm not sure if this happens in read() or any syscall for that matter.
mike_hearn
System calls aren't context switches. They flip a permission bit in the CPU but don't do the work a context switch involves like modifying the MMU, flushing the TLBs, modifying kernel structures, doing scheduling etc.
Also, modern filing systems are all thread safe. You can have multiple threads reading and even writing in parallel on different CPU cores.
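To make the "multiple threads reading in parallel" point concrete, here is a rough sketch using positional reads on a single shared FileChannel. The class ParallelReadDemo and its methods are hypothetical names for illustration, and this is not how Samchika reads files. It relies on FileChannel.read(ByteBuffer, long), which reads at an explicit offset without touching the channel's own position, so each worker can scan its own slice.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelReadDemo {
    /** Counts '\n' bytes in the file, splitting it into one byte range per worker thread. */
    static long countNewlines(Path file, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = ch.size();
            long slice = (size + threads - 1) / threads; // ceiling division
            List<Future<Long>> parts = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                long start = t * slice;
                long end = Math.min(size, start + slice);
                Callable<Long> task = () -> countInRange(ch, start, end);
                parts.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> f : parts) {
                total += f.get(); // wait for every slice before the channel is closed
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    /** Scans bytes [start, end) using positional reads, which do not move the shared channel position. */
    private static long countInRange(FileChannel ch, long start, long end) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(64 * 1024);
        long count = 0;
        long pos = start;
        while (pos < end) {
            buf.clear();
            buf.limit((int) Math.min(buf.capacity(), end - pos));
            int n = ch.read(buf, pos);
            if (n < 0) {
                break; // unexpected end of file
            }
            for (int i = 0; i < n; i++) {
                if (buf.get(i) == '\n') {
                    count++;
                }
            }
            pos += n;
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("parallel-read", ".txt");
        tmp.toFile().deleteOnExit();
        Files.write(tmp, Collections.nCopies(100_000, "some line of text"));
        System.out.println("newlines: " + countNewlines(tmp, 4));
    }
}
```

Whether this is actually faster than a single sequential reader depends on the storage device and the page cache; the sketch only shows that the threads do not have to take turns.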
bionsystem
If you open() read-only I don't think it blocks (some other process writing to it might block though).
SillyUsername
An ArrayList for huge numbers of add operations is not performant. LinkedList will see your list throughput performance at least double. There are other optimisations you can do but in a brief perusal this stood out like a sore thumb.
pkulak
I've literally never seen a linked list be faster than an array list in a real application, so if you're right, this is kinda huge for me.
fedsocpuppet
Huh? It'll be slower and eat a massive amount of memory too.
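Anyone who wants to check the ArrayList vs LinkedList claim on their own machine can start from something like the sketch below. It is a crude timing loop, not a proper benchmark (a harness such as JMH would be the careful way to do it), and ListAppendDemo is a made-up name. The only fixed facts are that ArrayList appends are amortized O(1) into a backing array, while LinkedList allocates a separate node object per element.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

final class ListAppendDemo {
    /** Appends n boxed integers to the given list and returns the elapsed time in nanoseconds. */
    static long timeAppends(List<Integer> list, int n) {
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            list.add(i);
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        // Repeat a few rounds so the JIT has a chance to warm up; early rounds are not representative.
        for (int round = 0; round < 5; round++) {
            long arrayNanos = timeAppends(new ArrayList<>(), n);
            long linkedNanos = timeAppends(new LinkedList<>(), n);
            System.out.printf("round %d: ArrayList %d ms, LinkedList %d ms%n",
                    round, arrayNanos / 1_000_000, linkedNanos / 1_000_000);
        }
    }
}
```

Numbers will vary with JVM version, heap settings and GC, so treat the output as a rough indication only.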
sidcool
It would be even more amazing if it had tests. It's already pretty good.
ldjkfkdsjnv
I could write this library with an llm in a few hours
codetiger
Do you have a benchmark comparison with other similar tools?
VWWHFSfQ
Am I wrong in thinking that this is duplicating lines in memory repeatedly when buffering lines into batches, and then submitting batches to threads? And then again when calling the line processor? Seems like it might be a memory hog
gavinray
Please don't do this.
Have the OS handle memory paging and buffering for you and then use Java's parallel algorithms to do concurrent processing.
Create a "MappedByteBuffer" and mmap the file into memory.
If the file is too large, use an "AsynchronousFileChannel" and asynchronously read + process segments of the buffer.
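A minimal sketch of the memory-mapping approach suggested above, assuming the file fits in a single MappedByteBuffer (under 2 GiB). MmapLineCount is a hypothetical name, and counting newlines stands in for whatever per-line work is actually needed.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Collections;

final class MmapLineCount {
    /** Maps the whole file read-only and counts '\n' bytes; the OS pages data in on demand. */
    static long countLines(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = ch.size();
            if (size == 0) {
                return 0;
            }
            if (size > Integer.MAX_VALUE) {
                // A single MappedByteBuffer cannot cover this; map in segments instead.
                throw new IllegalArgumentException("file too large for one MappedByteBuffer");
            }
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, size);
            long lines = 0;
            while (buf.hasRemaining()) {
                if (buf.get() == '\n') {
                    lines++;
                }
            }
            return lines;
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("mmap-demo", ".txt");
        tmp.toFile().deleteOnExit();
        Files.write(tmp, Collections.nCopies(50_000, "a line of sample text"));
        System.out.println("lines: " + countLines(tmp));
    }
}
```

The mapping stays valid after the channel is closed, which is why the buffer can be created inside the try block without copying the file contents onto the heap.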
papercrane
If you're using a newer JVM you can also map a "MemorySegment", which doesn't have the 2GiB limit that byte buffers have.
gavinray
Good point, have written about this in the past
https://gavinray97.github.io/blog/panama-not-so-foreign-memo...
90s_dev
Knowing nothing about Java or compsci, I am very curious to see the in depth discussion by all you Java/compsci experts that your comment invites.
SillyUsername
Better caveat that with, "but watch memory consumption, given the nature of the likes of CopyOnWriteArrayList". GC will be a bitch.
switchbak
Memory mapping is fun, but shouldn't we have some kind of async IO / io_uring support by now? If you're looking at really high-perf I/O, mmapping isn't really state of the art right now.
Then again, if you're in Java/JVM land you're probably not building bleeding edge DBs à la ScyllaDB. But I'm somewhat surprised at the lack of projects in this space. One would think this would pair well with some of the reactive stream implementations so that you wouldn't have to reimplement things like backpressure, etc.
exabrial
try not to be a dick
johnisgood
[flagged]
pritambarhate
What's wrong with it? An LLM is a tool that makes one more productive.
johnisgood
I said "I am not saying it is wrong", but it is getting a bit tiring that every single README.md is the same. All I wanted to know is if it is wrong to assume.
It is not wrong, but at least put yourself into it a bit.
Hi HN, I built a Java library called SmartFileProcessor to make high-performance, multi-threaded file processing simpler and more maintainable.
Most Java file processing solutions either involve a lot of boilerplate or don’t handle concurrency, backpressure, or metrics well out of the box. I needed something fast, clean, and production-friendly — so I built this.
Key features:
Multi-threaded line/batch processing using a configurable thread pool
Producer/consumer model with built-in backpressure
Buffered, asynchronous writing with optional auto-flush
Live metrics: memory usage, throughput, thread times, queue stats
Simple builder API — minimal setup to get going
Output metrics to JSON, CSV, or human-readable format
Use cases:
Large CSV or log file parsing
ETL pre-processing
Line-by-line filtering and transformation
Batch preparation before ingestion
I’d really appreciate your feedback — feature ideas, performance improvements, critiques, or whether this solves a real problem for others. Thanks for checking it out!
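For readers unfamiliar with the "producer/consumer model with built-in backpressure" mentioned in the feature list, the general idea can be sketched with a bounded queue. This is a generic illustration and not the library's actual implementation; BackpressureDemo, process and the sentinel value are all hypothetical.

```java
import java.util.Collections;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

final class BackpressureDemo {
    // Unique sentinel object telling a consumer to stop; compared by identity, never by content.
    private static final String POISON = new String("<<end>>");

    /** Feeds lines through a bounded queue to worker threads and returns the total character count. */
    static long process(Iterable<String> lines, int workers, int queueCapacity)
            throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(queueCapacity);
        AtomicLong totalChars = new AtomicLong();
        Thread[] consumers = new Thread[workers];
        for (int i = 0; i < workers; i++) {
            consumers[i] = new Thread(() -> {
                try {
                    while (true) {
                        String line = queue.take();
                        if (line == POISON) {
                            return;
                        }
                        totalChars.addAndGet(line.length()); // stand-in for real line processing
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            consumers[i].start();
        }
        for (String line : lines) {
            queue.put(line); // blocks while the queue is full: this is the backpressure
        }
        for (int i = 0; i < workers; i++) {
            queue.put(POISON); // one stop signal per consumer
        }
        for (Thread t : consumers) {
            t.join();
        }
        return totalChars.get();
    }

    public static void main(String[] args) throws InterruptedException {
        long total = process(Collections.nCopies(100_000, "sample line"), 4, 64);
        System.out.println("total characters processed: " + total);
    }
}
```

Because put() blocks once queueCapacity lines are waiting, a fast reader cannot run arbitrarily far ahead of slow workers, which bounds memory use.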