Help with using regexes to parse files

Dear all,
I would be grateful for help with the following problem: I am trying to create a regular expression that matches the AC and ID lines, or more generally any Swiss-Prot line type (e.g., GN, DE, OX, …), so that I can extract information from files in this format.

A sample entry from a file with Swiss-Prot lines is shown below:

            **ID** IPI00003881.5 IPI; PRT; 415 AA.
            **AC** IPI00003881;
            DT 01-OCT-2001 (IPI Human rel. 2.00, Created)
            DT 06-OCT-2005 (IPI Human rel. 3.11, Last sequence update)
            **DE** SIMILAR TO HETEROGENEOUS NUCLEAR RIBONUCLEOPROTEIN H.
            OS Homo sapiens (Human).
            OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            OC Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
            **OX** NCBI_TaxID=9606;
            CC -!- GENE_LOCATION: Chr. 10:43201071-43224620:-1.
            DR UniProtKB/Swiss-Prot; P52597; HNRPF_HUMAN; -.
            DR Vega; OTTHUMP00000019482; OTTHUMG00000018029; M.
            DR Vega; OTTHUMP00000043413; OTTHUMG00000018029; -.
            DR Vega; OTTHUMP00000043414; OTTHUMG00000018029; -.
            DR ENSEMBL_HAVANA; ENSP00000348345; ENSG00000169813; -.
            DR ENSEMBL_HAVANA; ENSP00000349573; ENSG00000169813; -.
            DR ENSEMBL_HAVANA; ENSP00000363572; ENSG00000169813; -.
            DR REFSEQ_REVIEWED; NP_004957; GI:4826760; -.
            DR UniProtKB/TrEMBL; Q5T0N2; Q5T0N2_HUMAN; -.
            DR UniProtKB/TrEMBL; Q8NI96; Q8NI96_HUMAN; -.
            DR UniProtKB/TrEMBL; Q96AU2; Q96AU2_HUMAN; -.
            DR ENSEMBL; ENSP00000338477; ENSG00000169813; -.
            DR ENSEMBL; ENSP00000348345; ENSG00000169813; -.
            DR H-InvDB; HIT000003838; HIX0008779; -.
            DR H-InvDB; HIT000030409; HIX0008779; -.
            DR H-InvDB; HIT000031821; HIX0008779; -.
            DR H-InvDB; HIT000037199; HIX0008779; -.
            DR H-InvDB; HIT000037659; HIX0008779; -.
            DR UniParc; UPI0000000C5C; -; -.
            DR HGNC; 5039; HNRPF; -.
            DR Entrez Gene; 3185; HNRPF; -.
            DR UniGene; Hs.808; -; -.
            DR CCDS; CCDS7204.1; -; -.
            DR ReAlSplice protein; SL0000062; hnRNPF; factor involved in alternative splicing.
            DR trome; HTR002991; -; -.
            DR RZPD; Hs.808; -; Clones and other research material.
            DR CleanEx; HS_HNRPF; -; -.
            DR InterPro; IPR012677; a_b_plait_nuc_bd.
            DR InterPro; IPR000504; RNP1_RNA_bd.
            DR InterPro; IPR012996; Znf_CHHC.
            DR Pfam; PF00076; RRM_1; 3.
            DR Pfam; PF08080; zf-RNPHF; 1.
            DR SMART; SM00360; RRM; 3.
            DR PROSITE; PS50102; RRM; 2.
            DR GENE3D; G3D.3.30.70.330; Nucl_bd_a/b_plat; 3.
            SQ SEQUENCE 415 AA; 45672 MW; D14E170631FB1F31 CRC64;
            MMLGPEGGEG FVVKLRGLPW SCSVEDVQNF LSDCTIHDGA AGVHFIYTRE GRQSGEAFVE LGSEDDVKMA LKKDRESMGH RYIEVFKSHR TEMDWVLKHS GPNSADSAND GFVRLRGLPF GCTKEEIVQF FSGLEIVPNG ITLPVDPEGK ITGEAFVQFA SQELAEKALG KHKERIGHRY IEVFKSSQEE VRSYSDPPLK FMSVQRPGPY DRPGTARRYI GIVKQAGLER MRPGAYSTGY GGYEEYSGLS DGYGFTTDLF GRDLSYCLSG MYDHRYGDSE FTVQSTTGHC VHMRGLPYKA TENDIYNFFS PLNPVRVHIE IGPDGRVTGE ADVEFATHEE AVAAMSKDRA NMQHRYIELF LNSTTGASNG AYSSQVMQGM GVSAAQATYS GLESQSVSGC YGAGYSGQNS MGGYD
            //

I would be very grateful for any help.
I look forward to hearing from you.
Best,
mariaig


grep -E '^[[:space:]]*(ID|AC|DE|OX)[[:space:]]' /path/to/your/file.txt
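
If you need the values grouped per entry rather than just the matching lines, something along these lines might work in Python. It is only a sketch: the names LINE_RE, parse_entries and SAMPLE are made up for illustration, and it assumes that every line of interest starts with a two-letter code (optionally preceded by whitespace) and that each entry ends with a line containing only //.

```python
import re
from collections import defaultdict

# Two uppercase letters, whitespace, then the rest of the line.
# Sequence lines do not match: their first block is longer than two letters.
LINE_RE = re.compile(r"^\s*([A-Z]{2})\s+(\S.*)$")


def parse_entries(lines, wanted=("ID", "AC", "DE", "OX")):
    """Yield one dict per entry, mapping each wanted line code to its values."""
    entry = defaultdict(list)
    for line in lines:
        if line.strip() == "//":  # assumed end-of-entry marker
            if entry:
                yield dict(entry)
            entry = defaultdict(list)
            continue
        match = LINE_RE.match(line)
        if match and match.group(1) in wanted:
            entry[match.group(1)].append(match.group(2).rstrip())
    if entry:  # last entry without a closing //
        yield dict(entry)


# A shortened copy of the posted entry, used only to try the function out.
SAMPLE = """\
ID   IPI00003881.5 IPI; PRT; 415 AA.
AC   IPI00003881;
DE   SIMILAR TO HETEROGENEOUS NUCLEAR RIBONUCLEOPROTEIN H.
OX   NCBI_TaxID=9606;
DR   HGNC; 5039; HNRPF; -.
SQ   SEQUENCE 415 AA; 45672 MW; D14E170631FB1F31 CRC64;
     MMLGPEGGEG FVVKLRGLPW SCSVEDVQNF
//
"""

for parsed in parse_entries(SAMPLE.splitlines()):
    for code, values in parsed.items():
        print(code, values)
```

For a real file, pass the open file object instead of SAMPLE.splitlines(), and change the wanted tuple (e.g., add "GN") to pick other line types.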

Good luck.

On 07/29/2011 02:56 PM, mariaig wrote:
>
> Dear all, I would be grateful if you could help me with the following
> problem so as to extract information from the following sample file by
> trying to create a regular expression to match different AC/ID or more
> generally any Swiss-Prot lines (e.g., GN, DE, OX, …).

