
MarkItDown: Python tool for converting files and office documents to Markdown

simonw

If you have uv installed you can run this against a file without first installing anything like this:

    uvx markitdown path-to-file.pdf
(This will cache the necessary packages the first time you run it, then reuse those cached packages on future invocations.)

I've tried it against HTML and PDFs so far and it seems pretty decent.

wrboyce

Is uvx just part of uv? I keep a few python packages around via pipx (itself via homebrew) but am a big fan of uv for python projects… Do I just need to install uv globally (via brew?) to do this? Is there a mechanism to also have the installed utils available in my PATH (so I can invoke them without a uvx prefix)?

karl42

You can install to your path with 'uv tool install'.

uvx is just an alias for 'uv tool run'.

wrboyce

Thank you! I should explore the uv docs properly.

buibuibui

Wow that is magic! I just installed uv because of your comment.

irskep

I worked on an in-house version of this feature for my employer (turning files into LLM-friendly text). After reading the source code, I can say this is a pretty reasonable implementation of this type of thing. But I would avoid using it for images, since the LLM providers let you just pass images directly, and I would also avoid using it for spreadsheets, since LLMs are very bad at interpreting Markdown tables.

There are a lot of random startups and open source projects who try to make this space sound fancy, but I really hope the end state is a simple project like this, easy to understand and easy to deploy.

I do wish it had a knob to turn for "how much processing do you want me to do." For PDF specifically, you either have to get a crappy version of the plain text using heuristics in a way that is very sensitive to how the PDF is exported, or you have to go full OCR, and it's annoying when a project locks you into one or the other. I'm also not sure I'd want to use the speech-to-text features here since they might have very different performance characteristics than the text-to-text stuff.

themanmaran

The reason there's a lot of startups in the OCR space (us being one of them) is the classic 80/20 rule. Any solution that's 80% accurate just doesn't work for most applications.

Converting a clean .docx into markdown is 10 lines of Python. But what about the same document with a screenshot of an Excel file? Or complex table layouts? The .NORM files that people actually use. Definitely agree with having a toggle between rules-based/OCR. But if you're looking at company-wide docs, you won't always know which to pick.
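For reference, the clean-file case really is that small. Here is a stdlib-only sketch (not MarkItDown's code; the function name is mine) that pulls paragraph text and heading styles out of a .docx, and silently drops tables, images, and embedded screenshots, which is exactly the failure mode described above:

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def docx_to_markdown(path):
    """Naive .docx -> Markdown: paragraph text and heading levels only."""
    with zipfile.ZipFile(path) as z:
        root = ET.fromstring(z.read("word/document.xml"))
    lines = []
    for para in root.iter(f"{W}p"):
        text = "".join(t.text or "" for t in para.iter(f"{W}t"))
        # Map Heading1/Heading2/... paragraph styles to #/## prefixes
        style = para.find(f"{W}pPr/{W}pStyle")
        val = (style.get(f"{W}val") or "") if style is not None else ""
        if val.startswith("Heading") and val[7:].isdigit():
            text = "#" * int(val[7:]) + " " + text
        if text:
            lines.append(text)
    return "\n\n".join(lines)
```

Everything hard (tables, images, footnotes, change tracking) falls straight through a sketch like this, which is why the messy 20% is where the startups live.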

Example with one of our test files:

Input: https://omni-demo-data.s3.us-east-1.amazonaws.com/zerox/Omni...

MarkItDown: https://omni-demo-data.s3.us-east-1.amazonaws.com/zerox/mark...

Ours: https://omni-demo-data.s3.us-east-1.amazonaws.com/zerox/omni...

The response from MarkItDown seems pretty barebones. I expected it to convert the clean pdf table element into a markdown table, but it just pulls the plaintext, which drops the header/column relationship.

irskep

> Any solution that's 80% accurate just doesn't work for most applications.

And yet people use LLMs, for which "80% accuracy" is still mostly an aspiration. :-)

I think it's reasonably likely most companies end up using open source libraries, at least partly because it lets them avoid adding another GDPR sub-processor. Unstructured.io, one of your competitors, goes as far as having an AWS Marketplace setup so customers can use their own infrastructure but still pay them.

LLMs might get better at consuming badly-formatted data, so the data only needs to meet that minimum bar, vs the admittedly very nice output you showed.

themanmaran

> LLMs might get better at consuming badly-formatted data

Oh agreed. There's definitely a meeting in the middle between better ingestion and smarter models. LLMs are already a great fuzzing layer for that type of interpretation. And even with a perfect WYSIWYG text extraction, you're still limited by how coherent the original document was in the first place.

cosmie

From your experience, what would be the best way to handle spreadsheets?

simonw

I don't think tabular data of any sort is a particularly good fit for LLMs at the moment. What are you trying to do with it?

If you want to answer questions like "how many students does Everglade High School have?" and you have a spreadsheet of schools where one of the columns is "number of students" I guess you could feed that into an LLM, but it doesn't feel like a great tool for the job.

I'd instead use systems like ChatGPT Code Interpreter where the LLM gets to load up that data programmatically and answer questions by running code against it. Text-to-SQL systems could work well for that too.
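A toy version of that pattern, using only the stdlib (the data and names here are made up for illustration): load the sheet into an in-memory SQLite table, then have the LLM generate SQL rather than read the table itself:

```python
import csv
import io
import sqlite3

# Hypothetical spreadsheet of schools, as CSV text
CSV_DATA = """school,num_students
Everglade High School,1450
Riverside High School,980
"""

def answer_with_sql(csv_text, query):
    """Load CSV rows into an in-memory SQLite table, then run a
    (possibly LLM-generated) SQL query against it instead of asking
    the model to read the table token-by-token."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    con = sqlite3.connect(":memory:")
    con.execute(f"CREATE TABLE schools ({', '.join(header)})")
    con.executemany(
        f"INSERT INTO schools VALUES ({', '.join('?' * len(header))})", data
    )
    return con.execute(query).fetchall()
```

The model only ever has to produce the query string; the row lookup and any arithmetic happen in the database, where off-by-one column errors can't creep in.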

btown

This is an active area of research: https://github.com/SpursGoZmy/Awesome-Tabular-LLMs is a good starting point!

cosmie

For me personally, a lot of times it's for table augmentation purposes. Appending additional columns to a dataset, such as a cleaned/standardized version of another field, extracting a value from another field, or appending categorization attributes (sometimes pre-seeded and sometimes just giving it general direction).

Or sometimes I'll manually curate a field like that, and then ask it to generate an Excel function that can be used to produce as similar a result as possible for automated categorization in the future.

So in most cases I both want to provide it with tabular data, and also want tabular data back out. In general I've gotten decent results for these sorts of use cases, but when it falls down it's almost always addressable by tinkering with the formatting related instructions – sometimes by tweaking the input and sometimes by tweaking the instructions for the desired output.

nprateem

Give it the data as separate columns. For each cell give it the row index and the data.

That way it's just working with lists, but can easily keep track that, e.g., all this data is in row 3, etc. Tell it to correlate data by the first value in each pair.
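A minimal sketch of that layout (the helper name is mine, not from any library): each column becomes its own list of (row_index, value) pairs, so the model can correlate cells across columns by the shared index:

```python
def table_to_keyed_lists(header, rows):
    """Reshape a table so each column is a separate list of
    (row_index, value) pairs, with row indices starting at 1."""
    return {
        name: [(r_i, row[h_i]) for r_i, row in enumerate(rows, start=1)]
        for h_i, name in enumerate(header)
    }
```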

__mharrison__

LLMs are decent at Pandas.

I say "decent" because most of the available training data for Pandas does things in a naive way.

OTOH, they are horrible at Polars. (I figure this is mostly a lack of training data.)

disgruntledphd2

> I say "decent" because most of the available training data for Pandas does things in a naive way.

They're around the level of the median user, which is pretty bad as pandas is a big and complicated API with many different approaches available (as is base R, in case people think I'm just hating on pandas).

danielmarkbruce

Many LLMs are ok with json and html tables. Not perfect, but not terrible.

simonw

I've seen enough examples of an LLM misinterpreting a column or row - resulting in returning the incorrect answer to a question because it was off by one in one of the directions - that I'm nervous about trusting them for this.

JSON objects are different - there the key/value relationship is closer in the set of tokens which usually makes it more reliable.
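A sketch of that difference (the helper name is hypothetical): emitting one JSON object per row keeps every value adjacent to its key in the token stream, at the cost of repeating the header tokens on each row:

```python
import json

def rows_to_json_records(header, rows):
    """Emit one JSON object per table row, so each value sits
    right next to its column name rather than many tokens away
    from a header line, as in a Markdown table."""
    return "\n".join(json.dumps(dict(zip(header, row))) for row in rows)
```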

layer8

Markdown isn’t suitable for most spreadsheets in the first place, IMO.

irskep

The only reason I'm not immediately answering is because I need to check whether it's a trade secret. We do our own thing that I haven't seen anywhere else and works super well. Sorry for being mysterious, I'll try to get an OK to share.

Edit: yeah I can't talk about it, sorry

dragonwriter

LLM providers also let you send PDFs directly, too.

OTOH, sometimes you are the LLM provider, and you may not be using a multimodal LLM. (Or, even though feeding an LLM is a common use, you may be using the Markdown for another purpose.)

Ambix

> LLMs are very bad at interpreting Markdown tables

Which table format is better for LLMs? Do you have some insights there?

btown

For PDFs it's entirely a wrapper around https://pdfminersix.readthedocs.io/en/latest/tutorial/highle... - https://github.com/microsoft/markitdown/blob/main/src/markit...

So if that's your use case, PDFMiner might be better to integrate with directly!

persedes

or just use pymupdf

E_Bfx

pymupdf has a commercial licence that could be a problem if used in a company.

figomore

Pandoc (https://pandoc.org) can be used to convert a .docx file to markdown and other file formats like djot and typst. I don't think pandoc can convert powerpoint and excel files.

zamadatix

The hard part about document conversion is not finding a tool that can convert the formats but finding the one that does it best. I wonder how MarkItDown ranks at the task for the various file types.

jez

The README of MarkItDown mentions "indexing and text analysis" as the two motivating features, whereas Pandoc is more interested in document preparation via conversion that maintains rich text formatting.

Since my personal use leans towards the latter, I'm hesitant to believe that this tool will work better for me but others may have other priorities.

gbraad

MarkItDown feels like running strings; the output is great for text extraction and processing, not for reading by humans.

disgruntledphd2

Yeah that was the interesting part to me, at least. Plus, it's Microsoft so hopefully it will work for their files.

_rs

That was the first thing I checked, and it looks like they’re using some existing python package to parse docx files. I wonder if they contributed to it or vetted it strongly

disgruntledphd2

Wow, I dunno if that's good or bad, certainly it's not what I expected.

LordDragonfang

...I did not catch that it was from Microsoft. I was wondering why a random markdown converter was so notable.

starkparker

I index a lot of tabletop RPG books in PDF format, which often have complex visual layouts and many tables that parsers typically have difficulty with. If this is just a wrapper around PDFMiner, as noted in another comment, I don't see any value added by this tool.

This handles them... fine. It either doesn't recognize or never attempts to handle tables, which makes it fundamentally a non-starter for my typical usage, but to its credit it seems to have at least some sense of table cells; it organizes columns in a manner that isn't fully readable but isn't as broken as some other solutions, either.

It otherwise handles text that's in variable-width columns or wrapped in complex ways around artwork rather well. It inserts extraneous spaces on fully justified text, which is frustrating but not unusual, and sometimes adds extraneous line breaks on mid-sentence column breaks.

The biggest miss, though, is how it completely misses headings! This seems fundamental for any use case, including grooming sources for LLM training. It doesn't identify a single heading in any PDF I've thrown at it so far.

benatkin

Nary a mention of LLMs in the readme. That was an unexpected but pleasant surprise, when the idea of converting something to markdown for LLMs is floated as if it's new and the greatest thing since sliced bread. https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...

It's interesting to read the code. It's mostly glue code, and most of it is in a single 1101-line file. But it does indeed do what the README says it does. Here is the special handling for Wikipedia: https://github.com/microsoft/markitdown/blob/main/src/markit...

Edit: good to see the one from yesterday flagged. I tried to assume good intent, but also wondered if it was a place to draw a line in the sand. https://news.ycombinator.com/item?id=42405758

Edit 2: ah, it came down to simple violation of the Show HN rules. I didn't notice, but yeah, that's definitely the case.

zamadatix

> Nary a mention of LLMs in the readme. That was an unexpected but pleasant surprise

No surprise it has still managed to come up in the comments in spite of that!

benatkin

Yep, and that’s fine. It’s just that there are a lot of false assumptions and magical thinking going around about LLMs and Markdown, and I was glad to not find any in the README.

markhneedham

Quite curious how this compares to docling - https://github.com/DS4SD/docling

docling uses an LLM IIRC, so that's already a difference in approach

phren0logy

In my use, docling has not involved an LLM. There are a few choices for OCR, but I don't think a vision model is one of them.

It's certainly touted as a solution to digest documents into plain text for LLM use, but (unless I just haven't run into that part of it) it does not employ an LLM for its functions.

ekianjo

docling does not use LLMs...

hks0

This is amazing and really useful, love the idea; but let me tell you a story, it's a bit of a tangent but relevant enough:

In an online language class we were sending the assignments to our teacher via slack, the teacher would then mark our mistakes and send it back.

I, as a true hater of all the heavy weight text formats for everyday communications, autonomously fired up the terminal, wrote my assignment in my_name.md and happily sent it without giving it any thought. This is what I hear the next session:

"... and everybody did a great job! Although someone just sent me their assignment in a stupid format. I don't know what it was! I could neither highlight it nor make the text bold or anything. Don't do that to me again please".

Before that I never dreamed of meeting someone who preferred a word document _after_ opening a .md file, and I also learned if I had chosen product design as a career, everyone would've suffered immensely (or maybe not, I would've just ended up jobless).

powersnail

> Before that I never dreamed of meeting someone who preferred a word document _after_ opening a .md file

That's like 90% of the people I know outside of computer/engineering circles. Most people have probably never opened a plaintext file in their life. They would have no idea what to do with a `.md` file.

In fact, some older engineers would not know what markdown is either, since it's only been around for two decades or so, but they can probably work with it anyway (the strength of plain text).

hks0

Exactly! Hence the "please don't try product design role" advice for me. I seem to live in an all-engineers bubble.

zelphirkalt

Engineers are people too. Engineers use products as well. Maybe you would have gone into a saner direction than most products go.

ciscodz

I got into a similar predicament helping format some course outline documents for a university friend of mine with RSI issues... I foolishly assumed a series of documents that are to be viewed online would be better in a Markdown or HTML format... before realising by the end of the day I had unwittingly thrown myself into the gears of war between paper and digital. Modern universities are essentially an elaborate Microsoft Word shilling scheme with an obsession with virtual paper!

EasyMark

If you are talking about an online language class as in "I'm learning Yiddish", then I don't understand the surprise that someone who isn't a coder or writer (and that's a big if) doesn't know what the heck markdown is, and hence wouldn't want to deal with it, since they're used to MS Word or another word processor app. That's probably like 95% of the population at least.

hks0

It doesn't confuse anyone, quite the opposite. The irony for me was my own isolation with the non-tech folks.

LittleTimothy

This is... interesting. From my understanding - and people can correct me if I'm wrong - didn't Microsoft spend an extremely large amount of effort essentially trying to screw people who made things like this in the 2000s? Interoperability and the Open Office movement were pretty hard fought. It's kind of crazy to see MSFT do this today. Did I just misunderstand and the underlying formats (docx etc.) were actually pretty friendly, or have the formats evolved a lot since then? Or is it more a case of "It doesn't matter if it looks terrible because we're feeding it to the AI beast anyway"?

A cynic might say it became suddenly easy when MSFT had a reason to allow you to generate markdown to feed into its AI?

badlibrarian

Microsoft filed a covenant not to sue and made all the formats open ~20 years ago. A lot of people bitched at the time but there's a long list of software that supports the format now. It is complicated because the apps themselves are complicated and decades old, and imperfect because the format or app you're converting to likely doesn't support all of the features and certainly none of the quirks.

https://en.wikipedia.org/wiki/Office_Open_XML

It took browsers 15 years just to render HTML whitespace nearly consistently, so keep that in mind as you read that history.

dmonitor

I don't think that's a cynical take considering the description

> (e.g., for indexing, text analysis, etc.)

konfekt

Though it promises to convert everything to Markdown, it seems to be a worse version of what already-existing tools such as pdftotext, docx2txt, pptx2md, ... collected [here] do without even pretending to export to Markdown. Looking at its [source], it indeed seems to be a wrapper around Python variants of those. Making the pool smaller can hardly improve the output.

[here] https://github.com/Konfekt/vim-office [source] https://github.com/microsoft/markitdown/blob/main/src/markitdown/_markitdown.py

theanonymousone

Why is the repository 95% "HTML" code?

sphars

There's some very large HTML files in the test directory, including an offline version of the Microsoft Wikipedia page

caterama

And the core code mostly calls other libraries for heavy lifting -- eg `mammoth`: https://github.com/mwilliamson/python-mammoth

valbaca

tests

kepano

Never thought I'd see the day. Yet... not surprising because plain text is the ideal format for analysis, LLM training, etc.

The question businesses will start to ask is why are we putting our data into .docx files in the first place?

mdaniel

I can't tell if you're trolling or what but the idea of most business users (a) knowing markdown (b) reverting to html for the damn near infinite layout and/or styling things that markdown doesn't support (c) ignoring mail merge (d) wanting change tracking ... makes your comment laughable