Dear all,
I would be grateful if you could help me with the following problem: I am trying to write a regular expression that extracts information from the sample file below by matching the AC and ID lines or, more generally, any Swiss-Prot line type (e.g., GN, DE, OX, …).
A sample entry from a file with Swiss-Prot lines is shown below:
**ID** IPI00003881.5 IPI; PRT; 415 AA.
**AC** IPI00003881;
DT 01-OCT-2001 (IPI Human rel. 2.00, Created)
DT 06-OCT-2005 (IPI Human rel. 3.11, Last sequence update)
**DE** SIMILAR TO HETEROGENEOUS NUCLEAR RIBONUCLEOPROTEIN H.
OS Homo sapiens (Human).
OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
**OX** NCBI_TaxID=9606;
CC -!- GENE_LOCATION: Chr. 10:43201071-43224620:-1.
DR UniProtKB/Swiss-Prot; P52597; HNRPF_HUMAN; -.
DR Vega; OTTHUMP00000019482; OTTHUMG00000018029; M.
DR Vega; OTTHUMP00000043413; OTTHUMG00000018029; -.
DR Vega; OTTHUMP00000043414; OTTHUMG00000018029; -.
DR ENSEMBL_HAVANA; ENSP00000348345; ENSG00000169813; -.
DR ENSEMBL_HAVANA; ENSP00000349573; ENSG00000169813; -.
DR ENSEMBL_HAVANA; ENSP00000363572; ENSG00000169813; -.
DR REFSEQ_REVIEWED; NP_004957; GI:4826760; -.
DR UniProtKB/TrEMBL; Q5T0N2; Q5T0N2_HUMAN; -.
DR UniProtKB/TrEMBL; Q8NI96; Q8NI96_HUMAN; -.
DR UniProtKB/TrEMBL; Q96AU2; Q96AU2_HUMAN; -.
DR ENSEMBL; ENSP00000338477; ENSG00000169813; -.
DR ENSEMBL; ENSP00000348345; ENSG00000169813; -.
DR H-InvDB; HIT000003838; HIX0008779; -.
DR H-InvDB; HIT000030409; HIX0008779; -.
DR H-InvDB; HIT000031821; HIX0008779; -.
DR H-InvDB; HIT000037199; HIX0008779; -.
DR H-InvDB; HIT000037659; HIX0008779; -.
DR UniParc; UPI0000000C5C; -; -.
DR HGNC; 5039; HNRPF; -.
DR Entrez Gene; 3185; HNRPF; -.
DR UniGene; Hs.808; -; -.
DR CCDS; CCDS7204.1; -; -.
DR ReAlSplice protein; SL0000062; hnRNPF; factor involved in alternative splicing.
DR trome; HTR002991; -; -.
DR RZPD; Hs.808; -; Clones and other research material.
DR CleanEx; HS_HNRPF; -; -.
DR InterPro; IPR012677; a_b_plait_nuc_bd.
DR InterPro; IPR000504; RNP1_RNA_bd.
DR InterPro; IPR012996; Znf_CHHC.
DR Pfam; PF00076; RRM_1; 3.
DR Pfam; PF08080; zf-RNPHF; 1.
DR SMART; SM00360; RRM; 3.
DR PROSITE; PS50102; RRM; 2.
DR GENE3D; G3D.3.30.70.330; Nucl_bd_a/b_plat; 3.
SQ SEQUENCE 415 AA; 45672 MW; D14E170631FB1F31 CRC64;
MMLGPEGGEG FVVKLRGLPW SCSVEDVQNF LSDCTIHDGA AGVHFIYTRE GRQSGEAFVE LGSEDDVKMA LKKDRESMGH RYIEVFKSHR TEMDWVLKHS GPNSADSAND GFVRLRGLPF GCTKEEIVQF FSGLEIVPNG ITLPVDPEGK ITGEAFVQFA SQELAEKALG KHKERIGHRY IEVFKSSQEE VRSYSDPPLK FMSVQRPGPY DRPGTARRYI GIVKQAGLER MRPGAYSTGY GGYEEYSGLS DGYGFTTDLF GRDLSYCLSG MYDHRYGDSE FTVQSTTGHC VHMRGLPYKA TENDIYNFFS PLNPVRVHIE IGPDGRVTGE ADVEFATHEE AVAAMSKDRA NMQHRYIELF LNSTTGASNG AYSSQVMQGM GVSAAQATYS GLESQSVSGC YGAGYSGQNS MGGYD
//
I would be very grateful for any help.
I look forward to hearing from you.
Best,
mariaig
On 07/29/2011 02:56 PM, mariaig wrote:
>
> Dear all, I would be grateful if you could help me with the following
> problem so as to extract information from the following sample file by
> trying to create a regular expression to match different AC/ID or more
> generally any Swiss-Prot lines (e.g., GN, DE, OX, …).
>
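One possible starting point is sketched below in Python; the original post does not name a language, so that choice is an assumption. The sketch relies on each line beginning with its two-letter line code followed by whitespace, with any bold markers around the code already stripped. The names `LINE_RE` and `parse_entry` are illustrative only, and the sample is a shortened copy of the entry above.

```python
import re
from collections import defaultdict

# Two uppercase letters at the start of a line, then whitespace, then the content.
# Sequence rows do not match: their first block is ten letters with no early space.
LINE_RE = re.compile(r"^([A-Z]{2})\s+(.*\S)\s*$")


def parse_entry(text):
    """Group the lines of one entry by their two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):  # end-of-entry marker
            break
        match = LINE_RE.match(line)
        if match:
            code, value = match.groups()
            fields[code].append(value)
    return fields


# Shortened copy of the sample entry, for illustration only.
sample = """\
ID   IPI00003881.5 IPI; PRT; 415 AA.
AC   IPI00003881;
DE   SIMILAR TO HETEROGENEOUS NUCLEAR RIBONUCLEOPROTEIN H.
OX   NCBI_TaxID=9606;
DR   UniProtKB/Swiss-Prot; P52597; HNRPF_HUMAN; -.
DR   HGNC; 5039; HNRPF; -.
//
"""

fields = parse_entry(sample)

# AC lines hold one or more accessions, each terminated by a semicolon.
accessions = [ac for line in fields["AC"] for ac in re.findall(r"[^;\s]+", line)]

print(fields["ID"][0].split()[0])  # IPI00003881.5
print(accessions)                  # ['IPI00003881']
print(len(fields["DR"]))           # 2
```

Capturing the line code in a group means one pattern covers every line type (GN, DE, OX, and so on), and the per-type details, such as splitting AC lines on semicolons, can be handled afterwards. For larger files a dedicated parser such as Biopython's `Bio.SwissProt` module may be more robust, although whether it accepts IPI files has not been verified here.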