DOI Parsing (was

jefferis
Hi Miguel,

For DOI parsing I'm afraid what I've put together is really a dirty great
hack, but it works for me. You can take a look at the Perl here:

http://pastie.org/334440

The core DOI regexes are:

# First try a permissive prefix followed by a space or slash 'hinge'
if($page=~/doi[: ]+([0-9.]+[ \/][A-Z0-9.\-_]+)/im){
    $doicand=$1;
    # Normalise a whitespace hinge to the standard slash
    $doicand=~s/\s+/\//;
    $pmid=efetch($doicand);
} elsif($page=~/doi[: ]+(10\.[0-9]{4})[ \/0]([A-Z0-9.\-_]+)/im){
    # Be more restrictive about initial part but less about
    # actual DOI string - offer 3 alternatives for 'hinge'
    # including standard slash
    $doicand=$1."/".$2;
    $pmid=efetch($doicand);
}
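
For anyone who would rather experiment outside Perl, here is a rough Python rendering of the two branches above. It is only a sketch: the function name find_doi_candidate is made up for illustration, the PubMed lookup (efetch in the original) is left out, and the /m flag is dropped because neither pattern uses ^ or $.

```python
import re

# Permissive prefix (digits and dots), then a space or slash as the 'hinge'.
LOOSE = re.compile(r"doi[: ]+([0-9.]+[ /][A-Z0-9.\-_]+)", re.IGNORECASE)
# Stricter prefix (10.NNNN); the hinge may be a space, a slash, or a zero.
STRICT = re.compile(r"doi[: ]+(10\.[0-9]{4})[ /0]([A-Z0-9.\-_]+)", re.IGNORECASE)


def find_doi_candidate(page):
    """Return a DOI candidate found in page, or None (hypothetical helper)."""
    m = LOOSE.search(page)
    if m:
        # Same effect as s/\s+/\// : turn a whitespace hinge into a slash.
        return re.sub(r"\s+", "/", m.group(1), count=1)
    m = STRICT.search(page)
    if m:
        # Rebuild the DOI with the standard slash between prefix and suffix.
        return m.group(1) + "/" + m.group(2)
    return None
```

The second pattern only comes into play when the first one fails, for example when the slash has been mangled into a zero, as in "doi:10.10380nature05453".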

However, to speed things up, the script looks first at the file name, then at
the PDF metadata, and finally at the text (extracted with pdftotext), searching
for DOIs that give enough information to pull up the record from PubMed. These
days it works fine with all my journals (I'm a neuroscientist).
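
The cheap-to-expensive lookup order described above could be sketched roughly as follows. This is a Python illustration, not the actual script: the name first_doi, the (label, callable) source format, and the simplified DOI pattern are all assumptions. Each source is a function that is only called when the earlier ones found nothing, so the slow text extraction runs last and only if needed.

```python
import re

# Assumption: a simplified DOI pattern, not the regexes from the script.
DOI_RE = re.compile(r"10\.[0-9]{4,}/[A-Z0-9.\-_]+", re.IGNORECASE)


def first_doi(sources):
    """Try each (label, get_text) pair in order; stop at the first DOI.

    get_text is a zero-argument callable, so later (slower) sources are
    never evaluated once an earlier one has produced a match.
    """
    for label, get_text in sources:
        text = get_text() or ""
        m = DOI_RE.search(text)
        if m:
            return label, m.group(0)
    return None
```

In practice the three callables would return the file name, the PDF metadata, and the pdftotext output, in that order.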

Best,

Greg.

PS Let me know if you would like the full app (just a wrapper script to
handle drag and drop of PDFs + pdftotext pinched from inside Bibdesk).



_______________________________________________
Bibdesk-users mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/bibdesk-users

Re: DOI Parsing (was

Miguel Ortiz Lombardía
Hi Greg,

Thank you for your e-mail. I see that your code is very ad hoc, like
mine. In fact, the two are quite similar. It doesn't directly solve all my
problems, but you gave me two good ideas:
1. To look in the metadata first
2. To search by PII as well (DOIs don't work when they include
parentheses, as in Elsevier journals, and it looks easy to convert a
DOI into a PII)
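
As a rough illustration of point 2, here is a Python sketch of one way the conversion might go. It assumes the part of the DOI after the slash is a formatted PII, such as S0022-2836(02)00123-4, and that the unformatted PII is obtained by dropping the hyphens and parentheses. That seems to hold for many Elsevier DOIs but certainly not for every publisher, and the name doi_to_pii and the example DOI are made up.

```python
import re


def doi_to_pii(doi):
    """Return an unformatted PII from a DOI, or None (hypothetical helper)."""
    _prefix, _, suffix = doi.partition("/")
    # Assumption: a formatted PII starts with S or B, followed only by
    # digits, X, hyphens and parentheses.
    if not re.fullmatch(r"[SB][0-9X()\-]+", suffix, re.IGNORECASE):
        return None
    # Drop the punctuation to get the unformatted form.
    return re.sub(r"[^0-9A-Za-z]", "", suffix).upper()
```

With these assumptions, doi_to_pii("10.1016/S0022-2836(02)00123-4") gives "S0022283602001234", which contains no parentheses to trip up the search.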

Thanks again!


Miguel

On 9 Dec 2008, at 00:15, Gregory Jefferis wrote:

> Hi Miguel,
>
> For DOI parsing I'm afraid what I've put together is really a dirty great
> hack, but it works for me.  You can take a look at the perl here:
>
> http://pastie.org/334440
>
> The core DOI regexes are:
>
> if($page=~/doi[: ]+([0-9.]+[ \/][A-Z0-9.\-_]+)/im){
>    $doicand=$1;
>    $doicand=~s/\s+/\//;
>    $pmid=efetch($doicand);
> } elsif($page=~/doi[: ]+(10\.[0-9]{4})[ \/0]([A-Z0-9.\-_]+)/im){
>    # Be more restrictive about initial part but less about
>    # actual DOI string - offer 3 alternatives for 'hinge'
>    # including standard slash
>    $doicand=$1."/".$2;
>    $pmid=efetch($doicand);
> }
>
> However to speed things up, the script first looks at the file name, the pdf
> metadata and eventually the text (using pdftotext) looking for dois to
> identify enough information to pull up the record from PubMed. These days it
> works fine with all my journals (I'm a neuroscientist).
>
> Best,
>
> Greg.
>
> PS Let me know if you would like the full app (just a wrapper script to
> handle drag and drop of PDFs + pdftotext pinched from inside Bibdesk).

--
Miguel Ortiz Lombardía
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!                  NEW ADDRESS                    !!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Architecture et Fonction des Macromolécules Biologiques
UMR6098, CNRS, Université Aix-Marseille I & II
Case 932
163 Avenue de Luminy
13288 Marseille cedex 9
France
Tel : +33(0) 491 82 55 93
Fax: +33(0) 491 26 67 20
e-mail: [hidden email]
Web: http://www.pangea.org/mol/spip.php?rubrique2

