Vitor Yudi Hansen: Extracting text from a PDF with Java

Tuesday, May 17, 2011

Extracting text from a PDF with Java

Below is a method that extracts the text from a PDF file. To use it, download the PDFBox library and add it to your project's classpath. Text extraction is one of the features this library provides.

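// Required imports. The package names assume PDFBox 1.x, the API these calls belong to.
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.util.PDFTextStripper;

// Returns the text of the PDF at the given path, null if the file cannot be
// opened, or a message starting with "ERRO:" if parsing fails.
public static String getConteudo(String arquivo) {

    // Open the file named by the parameter
    File f = new File(arquivo);
    FileInputStream is = null;
    try {
        is = new FileInputStream(f);
    } catch (IOException e) {
        System.out.println("ERRO: " + e.getMessage());
        return null;
    }

    COSDocument pdfDocument = null;

    try {
        // Parse the stream into PDFBox's low-level document model
        PDFParser parser = new PDFParser(is);
        parser.parse();
        pdfDocument = parser.getDocument();

        PDFTextStripper stripper = new PDFTextStripper();

        // The extracted text is returned here
        return stripper.getText(pdfDocument);

    } catch (IOException e) {
        return "ERRO: Can't open stream: " + e;
    } catch (Exception e) {
        return "ERRO: An error occurred while getting contents from PDF: " + e;
    } finally {
        // Release the document and the stream. Nothing is returned from here,
        // because a return inside finally would replace the text returned above.
        try {
            if (pdfDocument != null) {
                pdfDocument.close();
            }
        } catch (IOException e) {
            System.out.println("ERRO: Can't close pdf. " + e);
        }
        try {
            is.close();
        } catch (IOException e) {
            System.out.println("ERRO: Can't close stream. " + e);
        }
    }
}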
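
The method above releases its resources in a finally block. On Java 7 or later, the same guarantee can be written with try-with-resources for any resource that implements AutoCloseable. The following is only a minimal sketch of that pattern, not PDFBox code: CleanupSketch, TrackedResource and extract are hypothetical names standing in for the input stream and the parsed document, so the example runs with the standard library alone.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class CleanupSketch {

    /** Hypothetical stand-in for a closeable resource such as a stream or a parsed document. */
    static class TrackedResource implements AutoCloseable {
        private final String name;
        private final List<String> log;

        TrackedResource(String name, List<String> log) {
            this.name = name;
            this.log = log;
            log.add("open " + name);
        }

        /** Pretends to read text; declared to throw so callers must handle failure. */
        String read() throws IOException {
            return "text from " + name;
        }

        @Override
        public void close() {
            log.add("close " + name);
        }
    }

    /**
     * Returns the text, or null if reading fails. Both resources are closed
     * automatically, in reverse order of declaration, with no finally block.
     */
    static String extract(List<String> log) {
        try (TrackedResource stream = new TrackedResource("stream", log);
             TrackedResource document = new TrackedResource("document", log)) {
            return document.read();
        } catch (IOException e) {
            log.add("error " + e.getMessage());
            return null;
        }
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        System.out.println(extract(log)); // prints "text from document"
        System.out.println(log);          // shows the open/close order
    }
}
```

Because the resources are closed by the language rather than by hand, there is no place left for a stray return statement to overwrite the extracted text, and a failure while closing one resource does not prevent the other from being closed.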

Vitor Yudi Hansen