[okular] [Bug 395660] New: okular cannot preserve annotations in some pdf files.

Discussion:

iuri soter viana segtovich

2018-06-20 16:11:11 UTC

https://bugs.kde.org/show_bug.cgi?id=395660

Bug ID: 395660
Summary: okular cannot preserve annotations in some pdf files.
Product: okular
Version: unspecified
Platform: Other
OS: Linux
Status: UNCONFIRMED
Severity: normal
Priority: NOR
Component: PDF backend
Assignee: okular-***@kde.org
Reporter: ***@gmail.com
Target Milestone: ---

Hi, I noticed a problem when annotating PDF files with Okular: it only occurs
with PDF files that I had processed in pdfsam. I thought it might concern the
Okular developers, but I have posted this issue on the pdfsam issue tracker as
well. The problem is described as follows:

* pdfsam version 3.3.2, when using the Split tool, creates PDF files which
Okular 0.24.2 can open and annotate, but Okular cannot preserve the annotations
in these files when using the "save-as" or "export to archive" functions.
* Okular worked fine annotating and preserving annotations in the original
file, before splitting with pdfsam.
* The problem persisted after updating pdfsam to 3.3.5.

--
You are receiving this mail because:
You are the assignee for the bug.

iuri soter viana segtovich

2018-06-20 16:15:27 UTC

--- Comment #1 from iuri soter viana segtovich <***@gmail.com> ---
Created attachment 113468
--> https://bugs.kde.org/attachment.cgi?id=113468&action=edit
an original PDF created in LibreOffice and two pages split using pdfsam

uploading a "good" and a "bad" pdf file.
okular can annotate in the original file (Untitled 2.pdf) and preserve
annotations upon "save as" or "export to archive".
okular can annotate on the files processed in pdfsam, but these are not
preserved upon "save as" or "export to archive".


Oliver Sander

2018-06-21 07:13:35 UTC

Oliver Sander <***@tu-dresden.de> changed:

What |Removed |Added
----------------------------------------------------------------------------
CC| |***@tu-dresden.de

--- Comment #2 from Oliver Sander <***@tu-dresden.de> ---
Hi Iuri,

okular 0.24.2 is quite old. There are reasons to believe that new versions
handle these annotations better. Could you please try that first?

Thanks,
Oliver


Albert Astals Cid

2018-06-21 09:13:40 UTC

Albert Astals Cid <***@kde.org> changed:

What |Removed |Added
----------------------------------------------------------------------------
CC| |***@kde.org

--- Comment #3 from Albert Astals Cid <***@kde.org> ---
It does fail on current version, would need someone to investigate why,
probably a poppler issue


Tobias Deiminger

2018-06-21 22:08:06 UTC

Tobias Deiminger <***@posteo.de> changed:

What |Removed |Added
----------------------------------------------------------------------------
CC| |***@posteo.de

--- Comment #4 from Tobias Deiminger <***@posteo.de> ---
(In reply to Albert Astals Cid from comment #3)

Post by Albert Astals Cid
It does fail on current version, would need someone to investigate why,
probably a poppler issue

I reproduced the error with a standalone poppler application, to rule out
errors in Okular. Poppler immediately gave first hints about what's wrong:
"Error: Couldn't find trailer dictionary"
"Error: Invalid XRef entry"

Looking a bit deeper, there are two characteristics of 'Untitled 1.pdf' that
make poppler fail:
- The document has an "XRef stream" instead of an "XRef table". XRef streams
are available since PDF 1.5 and legitimately have no "trailer" keyword.
- The first object in the XRef stream is 1 (see "17 0 obj <<... /Index [1 17]
...>>") instead of 0.

The start-at-1 thing causes XRef::entries[0].type = xrefEntryNone (see
initialization in XRef::resize).

Then, upon document save, PDFDoc::saveIncrementalUpdate iterates over entries
ranging from 0 to (getNumObjects-1). Accessing entries[0] where type ==
xrefEntryNone causes poppler to think this is a damaged file, and it tries to
reconstruct the xref table with XRef::constructXRef. Now XRef::constructXRef
wants a "trailer" keyword. But there is no "trailer" keyword in the file
(that's not an error because we've got a PDF 1.5 XRef stream). XRef::constructXRef
can't work without one, and bails out with an error.
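
The chain of events above can be modelled in a few lines of Python. This is a
toy sketch, not poppler code: build_entries and first_unreadable_entry are
hypothetical helpers made up for illustration, and only the idea of an entries
array whose untouched slots stay at "none" is taken from the description above.

```python
# Toy model of the failure described above. This is NOT poppler code;
# both helpers are hypothetical and exist only for illustration.
XREF_ENTRY_NONE = "none"
XREF_ENTRY_USED = "used"


def build_entries(first_obj, count):
    """Size the table for objects 0..first_obj+count-1.

    Slots that the XRef stream does not describe stay at 'none',
    which is how entries[0] ends up empty for /Index [1 17].
    """
    entries = [XREF_ENTRY_NONE] * (first_obj + count)
    for num in range(first_obj, first_obj + count):
        entries[num] = XREF_ENTRY_USED
    return entries


def first_unreadable_entry(entries, start=0):
    """Return the first object number whose entry is 'none', else None."""
    for num in range(start, len(entries)):
        if entries[num] == XREF_ENTRY_NONE:
            return num
    return None


entries = build_entries(1, 17)  # /Index [1 17]
print(first_unreadable_entry(entries, start=0))  # iteration from 0 trips on object 0
print(first_unreadable_entry(entries, start=1))  # iteration from 1 finds no gap
```

In this model, a save loop that starts at 0 hits the empty slot immediately,
while one that starts at 1 never sees it, which matches the observation that
starting the iteration at 1 avoids the reconstruction attempt.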

I believe there are two things to fix in poppler:
- XRef::constructXRef should support PDF 1.5 XRef streams without trailer
dictionary.
- First object number > 0 doesn't indicate a damaged file, but it's valid (am
unsure about this). No need to reconstruct XRef at all. Actually, everything
works fine if I trick poppler to start iteration at 1 in saveIncrementalUpdate.

There's no problem with the second document 'Untitled 2.pdf', because it uses
XRef table with trailer dictionary and has objects 0..22.

Albert, does this sound reasonable? This was my first play on XRef, so the
observation may be somewhat wrong. Anyway, we should open a bug at poppler.


Tobias Deiminger

2018-06-24 21:41:08 UTC

--- Comment #5 from Tobias Deiminger <***@posteo.de> ---
(In reply to Tobias Deiminger from comment #4)

Post by Tobias Deiminger
- First object number > 0 doesn't indicate a damaged file, but it's valid
(am unsure about this)

After investigating a bit more, now I think not having an object 0 is invalid.
This would mean '1_PDFsam_Untitled 2.pdf' is invalid, and poppler is NOT to
blame (maybe poppler could provide a workaround, though).

Standard section 7.5.4 is explicit that an old fashioned XRef table needs a
special object 0:
"The first entry in the table (object number 0) shall always be free and shall
have a generation number of 65,535; it is shall be the head of the linked list
of free objects."

Now '1_PDFsam_Untitled 2.pdf' has no XRef table but an XRef stream, and it
seems a bit ambiguous whether the above statement about object 0 applies to
XRef streams too. This needs to be clarified before we can actually blame
either poppler or pdfsam. Maybe ask at adobe forum, or poppler list?

The XRef stream in '1_PDFsam_Untitled 2.pdf' looks like this (needed to decode
/Filter /FlateDecode first)

$ dd if=1_PDFsam_Untitled\ 2.pdf ibs=1 skip=5841 count=64 \
  | python3 -c 'import sys, zlib; sys.stdout.buffer.write(zlib.decompress(sys.stdin.buffer.read()))' \
  | hexdump -e '5/1 " %02X" "\n"'
01 13 73 00 00 # Object 1. Type 1 (used, not compressed), object offset = 0x1373, generation 0
01 00 0F 00 00 # Object 2. Type 1 (used, not compressed), object offset = 0xf, generation 0
02 00 01 00 00 # Object 3. Type 2 (compressed), stored in object nr.1, index in object stream 0
02 00 01 00 01 # Object 4. Type 2 (compressed), stored in object nr.1, index in object stream 1
02 00 01 00 02 # Object 5. Type 2 (compressed), stored in object nr.1, index in object stream 2
02 00 01 00 03 # Object 6. Type 2 (compressed), stored in object nr.1, index in object stream 3
02 00 01 00 04 # Object 7. Type 2 (compressed), stored in object nr.1, index in object stream 4
02 00 01 00 05 # Object 8. Type 2 (compressed), stored in object nr.1, index in object stream 5
02 00 01 00 06 # Object 9. Type 2 (compressed), stored in object nr.1, index in object stream 6
01 00 6F 00 00 # Object 10. Type 1 (used, not compressed), object offset = 0x6f, generation 0
02 00 01 00 07 # Object 11. Type 2 (compressed), stored in object nr.1, index in object stream 7
02 00 01 00 08 # Object 12. Type 2 (compressed), stored in object nr.1, index in object stream 8
02 00 01 00 09 # Object 13. Type 2 (compressed), stored in object nr.1, index in object stream 9
01 01 0D 00 00 # Object 14. Type 1 (used, not compressed), object offset = 0x10d, generation 0
01 02 3E 00 00 # Object 15. Type 1 (used, not compressed), object offset = 0x23e, generation 0
01 15 ED 00 00 # Object 16. Type 1 (used, not compressed), object offset = 0x15ed, generation 0
01 16 01 00 00 # Object 17. Type 1 (used, not compressed), object offset = 0x1601, generation 0

You see, no special object 0 here. It would look something like this:
00 00 00 FF FF # Object 0. Type 0 (member of linked list of free objects), generation nr. 65535
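
The manual reading of the rows above can be sketched in Python. This assumes
the field widths are /W [1 2 2] (one type byte, then two 2-byte big-endian
fields), which is inferred from the 5-byte rows and not quoted from the file;
parse_xref_rows is a hypothetical helper, and the sample holds only the first
three rows of the dump.

```python
import struct

# Assumption: /W [1 2 2], i.e. 1 byte type + two 2-byte big-endian fields.
ROW_FORMAT = ">BHH"
ROW_SIZE = struct.calcsize(ROW_FORMAT)  # 5 bytes per row


def parse_xref_rows(data, first_obj):
    """Map object number -> (type, field2, field3) for one /Index subsection."""
    entries = {}
    usable = len(data) - len(data) % ROW_SIZE  # ignore a trailing partial row
    for offset in range(0, usable, ROW_SIZE):
        row = struct.unpack(ROW_FORMAT, data[offset:offset + ROW_SIZE])
        entries[first_obj + offset // ROW_SIZE] = row
    return entries


# First three rows of the decoded stream shown above; /Index starts at 1.
sample = bytes.fromhex("0113730000" "01000F0000" "0200010000")
entries = parse_xref_rows(sample, first_obj=1)

for num, (kind, field2, field3) in sorted(entries.items()):
    if kind == 1:
        print(f"object {num}: uncompressed, offset {field2:#x}, generation {field3}")
    elif kind == 2:
        print(f"object {num}: in object stream {field2}, index {field3}")
    else:
        print(f"object {num}: free, next free {field2}, generation {field3}")

print("object 0 present:", 0 in entries)
```

Because numbering starts at the first value of /Index, object 0 never appears
in the resulting map for this file.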


Albert Astals Cid

2018-06-24 23:02:32 UTC

--- Comment #6 from Albert Astals Cid <***@kde.org> ---
I think the important question is, does Adobe Reader let you save stuff in that
broken file? If so we should try to do the same, and if we can't make it happen
i guess we'd need some kind of visual warning (we have one in the command line
when saving fails, but that's hardly enough)


Tobias Deiminger

2018-06-25 21:05:41 UTC

--- Comment #7 from Tobias Deiminger <***@posteo.de> ---
(In reply to Albert Astals Cid from comment #6)

Post by Albert Astals Cid
I think the important question is, does Adobe Reader let you save stuff in
that broken file?

Yes, Adobe Reader can save annotations in '1_PDFsam_Untitled 1.pdf'. Okular can
view the saved file afterwards. See details below.

Post by Albert Astals Cid
If so we should try to do the same, and if we can't make
it happen i guess we'd need some kind of visual warning (we have one in the
command line when saving fails, but that's hardly enough)

Nothing is impossible :) I'd take it as a learning story, with an open end and
no guarantees. As this may take a looooong time, let's better add the visual
warning as interim solution. Or are there some experienced poppler guys out
there to join?

Some details.

On full rewrite ("Save As..."), Adobe Reader created a new XRef stream for
objects 0..13. So there was an object 0 after save.

On incremental update ("Save"), Adobe Reader instead added a new XRef stream
with /Index[2 2 6 1 18 11] to the end of the file.
The original XRef stream with /Index [1 17] was preserved. In that case there
was still no object 0 after save.
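
To see which object numbers such an /Index array covers, the pairs can be
expanded as in the sketch below. It relies only on /Index being a flat list of
(first object, count) pairs; expand_index is a hypothetical helper name.

```python
def expand_index(index):
    """Expand a flat /Index array of (first, count) pairs into object numbers."""
    if len(index) % 2 != 0:
        raise ValueError("/Index must hold (first, count) pairs")
    numbers = []
    for first, count in zip(index[0::2], index[1::2]):
        numbers.extend(range(first, first + count))
    return numbers


original = expand_index([1, 17])                  # XRef stream of the split file
incremental = expand_index([2, 2, 6, 1, 18, 11])  # stream appended on "Save"

print(incremental)
print("object 0 covered:", 0 in original or 0 in incremental)
```

Neither array lists object 0, consistent with the observation that the
incrementally saved file still has none.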

The content of the full rewrite XRef looked as follows
$ dd if='1_PDFsam_Untitled 1.pdf' ibs=1 skip=12306 count=52 \
  | ./unpredict_png.py | hexdump -e '4/1 " %02X" "\n"'
00 00 00 00 # obj 0 free, next free object = 0, use gen 0 if reused
01 1D FB 00
01 20 D8 00
01 2D 8A 00
01 2E 59 00
01 2F 3E 00
02 00 01 00
02 00 01 01
02 00 01 02
02 00 01 03
02 00 03 00
02 00 03 01
02 00 03 02
02 00 04 00

Adobe saves the stream with /DecodeParms<</Columns 4/Predictor 12>>
/Filter/FlateDecode.
So to analyze it, one has to inflate the stream and undo the PNG prediction
first. I used this quick and dirty python script:

Listing unpredict_png.py

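#!/usr/bin/python3
# Inflate an XRef stream from stdin and undo PNG "Up" prediction
# (/Predictor 12, /Columns 4). Assumes every row uses the Up filter.
import zlib
import sys

predicted = zlib.decompress(sys.stdin.buffer.read())
# Each row is 1 filter-type byte followed by 4 data bytes; drop the type byte.
rows = [predicted[i+1:i+5] for i in range(0, len(predicted), 5)]
prev = bytearray(4)  # the row above the first one counts as all zeros
for row in rows:
    for byte in range(len(row)):
        # Up filter: each stored byte is the difference to the byte above it.
        prev[byte] = (row[byte] + prev[byte]) & 0xFF
    sys.stdout.buffer.write(prev)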


Nate Graham

2018-06-27 17:29:28 UTC

Nate Graham <***@kde.org> changed:

What |Removed |Added
----------------------------------------------------------------------------
CC| |***@kde.org


Tobias Deiminger

2018-06-27 21:33:55 UTC

--- Comment #8 from Tobias Deiminger <***@posteo.de> ---
(In reply to Tobias Deiminger from comment #7)

Post by Tobias Deiminger
guarantees. As this may take a looooong time, let's better add the visual
warning as interim solution.

It's probably not that bad; here's a poppler patch:
https://bugs.freedesktop.org/show_bug.cgi?id=107057
It's sufficient to fix the bug, if the approach is valid.

Another related poppler issue would be to support XRef streams, and the
discovery of objects inside object streams, in XRef::constructXRef. I did some
experiments that partially work, but it's more difficult and I'm not sure if
it's worth the while.


Albert Astals Cid

2018-11-21 22:32:25 UTC

Albert Astals Cid <***@kde.org> changed:

What |Removed |Added
----------------------------------------------------------------------------
Status|REPORTED |RESOLVED
Resolution|--- |FIXED

--- Comment #9 from Albert Astals Cid <***@kde.org> ---
I'm going to close assuming that
https://gitlab.freedesktop.org/poppler/poppler/issues/139 fixed it.

Tobias please complain if it isn't correct.
