Extracting strings in right-to-left languages from a PDF file with iTextSharp

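I searched for a solution for extracting strings in right-to-left languages with iTextSharp, but I could not find any way to do it. Is it possible to extract strings in a right-to-left language from a PDF file using iTextSharp? Thanks.

EDIT: This code gives very good results: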

private void writePdf2()
    {
        using (var document = new Document(PageSize.A4))
        {
            var writer = PdfWriter.GetInstance(document, new FileStream(@"C:\Users\USER\Desktop\Test2.pdf", FileMode.Create));
            document.Open();

            FontFactory.Register("c:\\windows\\fonts\\tahoma.ttf");
            var tahoma = FontFactory.GetFont("tahoma", BaseFont.IDENTITY_H);


            var reader = new PdfReader(@"C:\Users\USER\Desktop\Test.pdf");
            int intPageNum = reader.NumberOfPages;
            string text = null;
            for (int i = 1; i <= intPageNum; i++)
            {
                text = PdfTextExtractor.GetTextFromPage(reader, i, new LocationTextExtractionStrategy());
                text = Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(text.ToString()));
                text = new UnicodeCharacterPlacement
                {
                    Font = new System.Drawing.Font("Tahoma", 12)
                }.Apply(text);

                File.WriteAllText("page-" + i + "-text.txt", text.ToString());
            }
            reader.Close();
            ColumnText.ShowTextAligned(
                        canvas: writer.DirectContent,
                        alignment: Element.ALIGN_RIGHT,
                        phrase: new Phrase(new Chunk("Test. Hello world. Hello people. سلام. کلمه سلام. سلام مردم", tahoma)),
                        //phrase: new Phrase(new Chunk(text, tahoma)),
                        x: 300,
                        y: 300,
                        rotation: 0,
                        runDirection: PdfWriter.RUN_DIRECTION_RTL,
                        arabicOptions: 0);
        }

        System.Diagnostics.Process.Start(@"C:\Users\USER\Desktop\Test2.pdf");
    }

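But "phrase: new Phrase(new Chunk(text, tahoma))" does not produce correct output for all strings in the PDF. Therefore I used "PdfStamper" to make a PDF that is suitable for the "PdfReader" in "iTextSharp".

1 Answer

Reproducing the issue

As the OP initially could not provide a sample file, I first tried to reproduce the issue with a file generated by iTextSharp itself.

My test method first creates a PDF using ColumnText.ShowTextAligned with the string constant which, according to the OP, returns a good result. Then it extracts the text content of that file. Finally it creates a second PDF containing a line created using the good ColumnText.ShowTextAligned call with the string constant, followed by several lines created using ColumnText.ShowTextAligned with the extracted string, with or without the post-processing instructions from the OP's code (UTF-8 encoding and decoding; applying UnicodeCharacterPlacement).

I could not immediately find the UnicodeCharacterPlacement class the OP uses. So I googled a bit and found one such class. I hope this is essentially the class the OP uses.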

public void ExtractTextLikeUser2509093()
{
    string rtlGood = @"C:\Temp\test-results\extract\rtlGood.pdf";
    string rtlGoodExtract = @"C:\Temp\test-results\extract\rtlGood.txt";
    string rtlFinal = @"C:\Temp\test-results\extract\rtlFinal.pdf";
    Directory.CreateDirectory(@"C:\Temp\test-results\extract\");

    FontFactory.Register("c:\\windows\\fonts\\tahoma.ttf");
    Font tahoma = FontFactory.GetFont("tahoma", BaseFont.IDENTITY_H);

    // A - Create a PDF with a good RTL representation
    using (FileStream fs = new FileStream(rtlGood, FileMode.Create, FileAccess.Write, FileShare.None))
    {
        using (Document document = new Document())
        {
            PdfWriter pdfWriter = PdfWriter.GetInstance(document, fs);
            document.Open();

            ColumnText.ShowTextAligned(
                        canvas: pdfWriter.DirectContent,
                        alignment: Element.ALIGN_RIGHT,
                        phrase: new Phrase(new Chunk("Test. Hello world. Hello people. سلام. کلمه سلام. سلام مردم", tahoma)),
                        x: 500,
                        y: 300,
                        rotation: 0,
                        runDirection: PdfWriter.RUN_DIRECTION_RTL,
                        arabicOptions: 0);
        }
    }

    // B - Extract the text for that good representation and add it to a new PDF
    String textA, textB, textC, textD;
    using (PdfReader pdfReader = new PdfReader(rtlGood))
    {
        textA = PdfTextExtractor.GetTextFromPage(pdfReader, 1, new LocationTextExtractionStrategy());
        textB = Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(textA.ToString()));
        textC = new UnicodeCharacterPlacement
        {
            Font = new System.Drawing.Font("Tahoma", 12)
        }.Apply(textA);
        textD = new UnicodeCharacterPlacement
        {
            Font = new System.Drawing.Font("Tahoma", 12)
        }.Apply(textB);

        File.WriteAllText(rtlGoodExtract, textA + "\n\n" + textB + "\n\n" + textC + "\n\n" + textD + "\n\n");
    }
    using (FileStream fs = new FileStream(rtlFinal, FileMode.Create, FileAccess.Write, FileShare.None))
    {
        using (Document document = new Document())
        {
            PdfWriter pdfWriter = PdfWriter.GetInstance(document, fs);
            document.Open();

            ColumnText.ShowTextAligned(
                        canvas: pdfWriter.DirectContent,
                        alignment: Element.ALIGN_RIGHT,
                        phrase: new Phrase(new Chunk("Test. Hello world. Hello people. سلام. کلمه سلام. سلام مردم", tahoma)),
                        x: 500,
                        y: 600,
                        rotation: 0,
                        runDirection: PdfWriter.RUN_DIRECTION_RTL,
                        arabicOptions: 0);

            ColumnText.ShowTextAligned(
                        canvas: pdfWriter.DirectContent,
                        alignment: Element.ALIGN_RIGHT,
                        phrase: new Phrase(new Chunk(textA, tahoma)),
                        x: 500,
                        y: 550,
                        rotation: 0,
                        runDirection: PdfWriter.RUN_DIRECTION_RTL,
                        arabicOptions: 0);

            ColumnText.ShowTextAligned(
                        canvas: pdfWriter.DirectContent,
                        alignment: Element.ALIGN_RIGHT,
                        phrase: new Phrase(new Chunk(textB, tahoma)),
                        x: 500,
                        y: 500,
                        rotation: 0,
                        runDirection: PdfWriter.RUN_DIRECTION_RTL,
                        arabicOptions: 0);

            ColumnText.ShowTextAligned(
                        canvas: pdfWriter.DirectContent,
                        alignment: Element.ALIGN_RIGHT,
                        phrase: new Phrase(new Chunk(textC, tahoma)),
                        x: 500,
                        y: 450,
                        rotation: 0,
                        runDirection: PdfWriter.RUN_DIRECTION_RTL,
                        arabicOptions: 0);

            ColumnText.ShowTextAligned(
                        canvas: pdfWriter.DirectContent,
                        alignment: Element.ALIGN_RIGHT,
                        phrase: new Phrase(new Chunk(textD, tahoma)),
                        x: 500,
                        y: 400,
                        rotation: 0,
                        runDirection: PdfWriter.RUN_DIRECTION_RTL,
                        arabicOptions: 0);
        }
    }
}

The final result:

rtlFinal.pdf

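Thus,

I cannot reproduce the issue. To me the Arabic content of the last two variants looks identical to the original line. In particular I could not observe the switch from "سلام" to "سالم". Most likely the content of the PDF C:\Users\USER\Desktop\Test.pdf (from which the OP extracts the text in his test) is somehow special, so that text extracted from it is drawn with that switch.
Applying the UnicodeCharacterPlacement class to the extracted text is necessary to get it into the right order.
The other post-processing line,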

text = Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(text.ToString()));

does not make any difference and should not be used.
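To illustrate why that line changes nothing, here is a minimal sketch using only the .NET base class library. The class name Utf8RoundTrip and its methods are made up for this illustration and are not part of iTextSharp. It assumes the input is a well-formed .NET string; a string containing unpaired surrogates would come back with replacement characters instead.

```csharp
using System;
using System.Text;

// Hypothetical helper, for illustration only.
public static class Utf8RoundTrip
{
    // Encodes a string to UTF-8 bytes and decodes those bytes again,
    // which is what the post-processing line in the question does.
    public static string RoundTrip(string text)
    {
        return Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(text));
    }

    // True when the round trip returns exactly the same string.
    public static bool IsNoOp(string text)
    {
        return string.Equals(RoundTrip(text), text, StringComparison.Ordinal);
    }
}
```

Calling Utf8RoundTrip.IsNoOp on the string constant from the question, or on any text returned by PdfTextExtractor.GetTextFromPage, is a quick way to check this for a concrete input.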

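For further analysis we need the PDF C:\Users\USER\Desktop\Test.pdf.

Inspecting salamword.pdf

Eventually the OP could provide a sample PDF, salamword.pdf:

I use "PrimoPDF" to create a PDF file with this content: "Test. Hello world. Hello people. سلام. کلمه سلام. سلام مردم". Next I read this PDF file. Then I received this output: "Test. Hello world. Hello people. م . م . مدد".

Indeed, I could reproduce this behavior. So I analyzed how the Arabic writing is encoded inside the file...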

Some background information to start with:

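Fonts in PDFs can have (and in the case at hand do have) a completely custom encoding. In particular, embedded subsets are often generated by choosing codes as the characters come, e.g. the first character of a given font used on a page is encoded as 1, the second distinct character as 2, the third as 3, etc.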
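A minimal sketch of such an ad-hoc encoding, assuming for simplicity that each char of the input is one glyph (real PDF producers assign codes to glyphs, which for Arabic include contextual forms and ligatures). The class name AdHocEncoding is hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration of "codes assigned as the characters come".
public static class AdHocEncoding
{
    // Returns one code per input character. The first distinct character
    // gets code 1, the second distinct character code 2, and so on;
    // repeated characters reuse the code they were given first.
    public static int[] Encode(string text)
    {
        var codeByChar = new Dictionary<char, int>();
        var codes = new int[text.Length];
        for (int i = 0; i < text.Length; i++)
        {
            int code;
            if (!codeByChar.TryGetValue(text[i], out code))
            {
                code = codeByChar.Count + 1;
                codeByChar.Add(text[i], code);
            }
            codes[i] = code;
        }
        return codes;
    }
}
```

The resulting codes depend only on the order of first appearance, so the same code can stand for different characters in different fonts or files. Without extra mapping information the codes alone say nothing about the text.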

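Therefore, simply extracting the codes of the drawn text does not help much at all (see below for an example from the file at hand). But fonts in a PDF can carry additional information that allows an extractor to map the codes to Unicode values. This information may be

a ToUnicode map providing an immediate mapping code -> Unicode code point;
an Encoding providing a base encoding (e.g. WinAnsiEncoding) and differences from it in the form of glyph names; these names may be standard names or names only meaningful in the context of the font at hand;
ActualText entries for a structure element or marked-content sequence.

The PDF specification describes a method that uses the ToUnicode and Encoding information with standard names to extract text from a PDF, and presents ActualText as an alternative where applicable. The iTextSharp text extraction code implements the ToUnicode / Encoding method with standard names.

Standard names in this context in the PDF specification are character names taken from the Adobe standard Latin character set and the set of named characters in the Symbol font.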

In the file at hand:

Let's look at the Arabic text in the line written in Arial. The codes used for the glyphs here are:

01 02 03 04 05 01 02 06 07 01 08 02 06 07 01 09 05 0A 0B 01 08 02 06 07

This looks very much like an ad-hoc encoding as described above. Thus, these codes alone do not help at all.

So let's look at the ToUnicode map of the embedded Arial subset:

<01><01><0020>
<02><02><0645>
<03><03><062f>
<04><04><0631>
<08><08><002e>
<0c><0c><0028>
<0d><0d><0077>
<0e><0e><0069>
<0f><0f><0074>
<10><10><0068>
<11><11><0041>
<12><12><0072>
<13><13><0061>
<14><14><006c>
<15><15><0066>
<16><16><006f>
<17><17><006e>
<18><18><0029>

This maps 01 to 0020, 02 to 0645, 03 to 062f, 04 to 0631, 08 to 002e, etc. It does not map 05, 06, 07, etc. to anything, though.

So the ToUnicode map only helps for some codes.
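To make the gap concrete, the following sketch applies the ToUnicode entries listed above to the code sequence of the Arabic line. It uses plain .NET collections only; the class ToUnicodeDemo, its members, and the placeholder character are hypothetical and do not mirror how iTextSharp handles unmapped codes internally.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical illustration, not iTextSharp code.
public static class ToUnicodeDemo
{
    // ToUnicode entries of the embedded Arial subset (code -> Unicode code point).
    public static readonly Dictionary<int, char> ToUnicode = new Dictionary<int, char>
    {
        { 0x01, '\u0020' }, { 0x02, '\u0645' }, { 0x03, '\u062f' }, { 0x04, '\u0631' },
        { 0x08, '\u002e' }, { 0x0c, '\u0028' }, { 0x0d, '\u0077' }, { 0x0e, '\u0069' },
        { 0x0f, '\u0074' }, { 0x10, '\u0068' }, { 0x11, '\u0041' }, { 0x12, '\u0072' },
        { 0x13, '\u0061' }, { 0x14, '\u006c' }, { 0x15, '\u0066' }, { 0x16, '\u006f' },
        { 0x17, '\u006e' }, { 0x18, '\u0029' }
    };

    // Glyph codes of the Arabic text in the Arial line, in drawing order.
    public static readonly int[] ArabicLineCodes =
    {
        0x01, 0x02, 0x03, 0x04, 0x05, 0x01, 0x02, 0x06, 0x07, 0x01, 0x08, 0x02,
        0x06, 0x07, 0x01, 0x09, 0x05, 0x0A, 0x0B, 0x01, 0x08, 0x02, 0x06, 0x07
    };

    // Maps each code through ToUnicode; codes without an entry become the placeholder.
    public static string Decode(int[] codes, char placeholder)
    {
        var sb = new StringBuilder(codes.Length);
        foreach (int code in codes)
        {
            char c;
            sb.Append(ToUnicode.TryGetValue(code, out c) ? c : placeholder);
        }
        return sb.ToString();
    }

    // Counts the codes that have no ToUnicode entry.
    public static int CountUnmapped(int[] codes)
    {
        int count = 0;
        foreach (int code in codes)
        {
            if (!ToUnicode.ContainsKey(code)) count++;
        }
        return count;
    }
}
```

For the sequence above, ToUnicodeDemo.CountUnmapped(ToUnicodeDemo.ArabicLineCodes) finds that 11 of the 24 codes have no entry, so less than half of the Arabic glyphs can be resolved from the ToUnicode map alone.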

Now let's look at the associated encoding:

29 0 obj
<</Type/Encoding
  /BaseEncoding/WinAnsiEncoding
  /Differences[ 1
    /space/uni0645/uni062F/uni0631
    /uni0645.init/uni06440627.fina/uni0633.init/period
    /uni0647.fina/uni0644.medi/uni06A9.init/parenleft
    /w/i/t/h
    /A/r/a/l
    /f/o/n/parenright ]
>>
endobj

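The encoding is based on WinAnsiEncoding, but all codes of interest are remapped in Differences. There we find numerous standard glyph names (i.e. character names taken from the Adobe standard Latin character set and the set of named characters in the Symbol font) like space, period, w, i, t, etc.; but we also find several non-standard names like uni0645, uni06440627.fina, etc.

There appears to be a scheme behind these names: uni0645 represents the character at Unicode code point 0645, and uni06440627.fina most likely represents the characters at Unicode code points 0644 and 0627 in some order in some final form. Still, these names are non-standard for the purpose of text extraction according to the method presented by the PDF specification.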
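If one wanted to exploit that apparent naming scheme anyway, a heuristic along the following lines could turn such glyph names into characters. This is only a sketch under the assumption that the scheme really is "uni", then groups of four hex digits, then an optional dot-separated suffix. The class UniGlyphName is hypothetical, and this is not something the standard extraction method or iTextSharp does.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical heuristic, outside the standard PDF text extraction method.
public static class UniGlyphName
{
    // Returns the characters encoded in a name like "uni0645" or
    // "uni06440627.fina", or null if the name does not follow that pattern.
    // A suffix such as ".init" or ".fina" is dropped, so only the base
    // characters are returned, in the order they appear in the name.
    public static string Decode(string glyphName)
    {
        if (glyphName == null || !glyphName.StartsWith("uni", StringComparison.Ordinal))
            return null;

        int dot = glyphName.IndexOf('.');
        string hex = dot < 0 ? glyphName.Substring(3) : glyphName.Substring(3, dot - 3);
        if (hex.Length == 0 || hex.Length % 4 != 0)
            return null;

        var sb = new StringBuilder();
        for (int i = 0; i < hex.Length; i += 4)
        {
            int value;
            if (!int.TryParse(hex.Substring(i, 4), NumberStyles.AllowHexSpecifier,
                              CultureInfo.InvariantCulture, out value))
                return null;
            sb.Append((char)value);
        }
        return sb.ToString();
    }
}
```

With the Differences array shown earlier, UniGlyphName.Decode("uni0645.init") would yield the letter at 0645 and UniGlyphName.Decode("period") would yield null, leaving standard names to the regular lookup. Whether the character order inside a ligature name matches reading order remains a guess.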

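Furthermore, there are no ActualText entries in the file at all.

So the reason why only "م . م . مدد" is extracted is that only for these glyphs does the PDF contain proper information for the standard PDF text extraction method.

By the way, if you copy and paste from the file in Adobe Reader, you get a similar result, and Adobe Reader has a fairly good implementation of the standard text extraction method.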

TL;DR

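The sample file does not contain the information required to extract the text with the method described by the PDF specification, which is the method implemented by iTextSharp.

Answered 2024-04-20T07:59:20+08:00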
