openSUSE Forums > Archives > SF Archives > ARCHIVES - Programming & Scripting » Using Grep On Openoffice Files:

#1 duke (Guest), 08-Aug-2007, 07:12

Hi
I am in the process of learning scripting and doing things on the command line in Konsole.
I was trying to use grep on some OpenOffice .odt files that I have, but with no success.
I did some research and found that the reason seems to be that the files are formatted in some way, and that is why I can't use grep on them. Is there a way to use grep on these files without having to convert them to .txt files? If I convert the OpenOffice .odt files to .txt files, then I can use grep to search them.
Thanks

#2 Garvan (Guest), 08-Aug-2007, 07:44

An OpenOffice file (something.odt) is a zip archive containing the collection of files that make up the document. One of these is an XML file called content.xml, which contains the actual text of your document.

If you wish, you could write a script that unzips the archive and searches the content.xml file. Is that what you want?

Garvan
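
As a minimal sketch of that idea (not a script from this thread): Python's standard zipfile module can read content.xml straight out of the .odt archive, so nothing has to be unzipped to disk. The function name odt_contains is made up for illustration, and the sketch assumes content.xml is UTF-8 encoded.

```python
import zipfile


def odt_contains(odt_path, term):
    """Return True if `term` occurs in the raw content.xml of the .odt file.

    The search runs on the raw XML, so it can also match tag or attribute
    names, and it will miss text that is split across formatting tags.
    """
    with zipfile.ZipFile(odt_path) as archive:
        # Assumption: content.xml is UTF-8 encoded.
        xml_text = archive.read("content.xml").decode("utf-8")
    return term in xml_text
```

For example, odt_contains("letter.odt", "invoice") would report whether the word appears anywhere in the document's XML (letter.odt is a hypothetical file name).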

#3 duke (Guest), 08-Aug-2007, 08:43

Hi
Well, yes, that in a way answers a few questions about why I can't use grep on OpenOffice .odt files. I see you mention XML files: can grep be used to search them, and if so, how?
Are XML files a kind of web file? Remember, I am new to all of this and still in the process of learning.
Thanks

#4 duke (Guest), 10-Aug-2007, 04:07

The problem has been solved: you can't use grep directly on OpenOffice .odt files; you have to convert them to .txt first.

#5 AndrewTheArt (Guest), 18-Aug-2007, 19:14

I was looking into this and actually wrote a shell script to unzip the ODT and search the correct file. However, there is one problem: the file that contains the actual text is one long line, not separate lines. When grep finds an occurrence of your word, it just returns that entire line every single time. It sucks...

Here's how far I got.

Code:
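#!/bin/bash
# Usage: pass the document name without the .odt extension.
mv "$1.odt" "$1.zip"
unzip "$1.zip" -d "$1"
# If the extraction directory is missing, restore the file name and stop,
# so the cleanup below never runs in the wrong directory.
cd "$1" || { mv "$1.zip" "$1.odt"; exit 1; }
echo "Type in a search term."
read -r search
grep -- "$search" content.xml
cd ..
rm -r "$1"
mv "$1.zip" "$1.odt"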
Grep is just doing its job, really. I'm sure there are some command-line switches that adapt grep for this purpose, but I'm not exactly excited about learning them.

Maybe somebody with more grep experience than me would know how to change the default line delimiter (from a newline to something else)?

This

Code:
<text:p text:style-name="Standard">This is a simple test.</text:p><text:p text:style-name="Standard">I hope it works.</text:p><text:p text:style-name="Standard">Woohoo!</text:p></office:text></office:body></office:document-content>
Is equivalent to this

Code:
This is a simple test.
I hope it works.
Woohoo!
The delimiter needs to be </text:p> for most documents, because it essentially represents a line break for our purposes (but grep doesn't know this).
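
One way to get that per-paragraph behaviour without changing grep is to do the splitting in a small script. The following is a rough Python sketch, not part of the script above: it splits the XML at each closing </text:p> tag, strips the remaining tags with a simple regular expression, and keeps the paragraphs that contain the search term. The function name is invented here, and real documents (headings in other elements, XML entities such as &amp;) would need more care.

```python
import re

# Matches any XML tag, e.g. <text:p text:style-name="Standard"> or </office:body>.
TAG = re.compile(r"<[^>]+>")


def grep_paragraphs(xml_text, term):
    """Return the plain text of each paragraph that contains `term`."""
    matches = []
    # Treat the closing paragraph tag as the line delimiter.
    for chunk in xml_text.split("</text:p>"):
        text = TAG.sub("", chunk).strip()
        if term in text:
            matches.append(text)
    return matches
```

Run on the one-line XML sample above, a search for "hope" should return only the second paragraph rather than the whole line.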

#6 ken_yap (Guest), 18-Aug-2007, 19:23

You could feed the XML through tidy first, which will reformat it:

tidy -q -xml < $i | grep "$search"
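
If tidy is not installed, a similar reformatting can be approximated with Python's standard library. This is only a sketch of the same idea (pretty-print first, then search line by line), not a description of what tidy does internally. The function name is made up, and the input has to be well-formed XML with its namespace prefixes declared, as a complete content.xml is.

```python
import xml.dom.minidom


def grep_pretty_xml(xml_text, term):
    """Pretty-print the XML, then return the lines that contain `term`."""
    # toprettyxml puts each element on its own indented line.
    pretty = xml.dom.minidom.parseString(xml_text).toprettyxml(indent="  ")
    return [line.strip() for line in pretty.splitlines() if term in line]
```

Each returned line still carries its surrounding tags, much like the tidy output piped through grep.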

#7 AndrewTheArt (Guest), 18-Aug-2007, 19:27

I was personally trying to avoid any dependencies beyond the basics, but I guess it's a necessity to reformat the XML.

So, with Ken's help, I've designed a relatively crude ODT search...

Code:
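#!/bin/bash
# Usage: SearchODT.sh NAME "search term"   (NAME without the .odt extension)

mv "$1.odt" "$1.zip"
unzip -qq "$1.zip" -d "$1"
# If the extraction directory is missing, restore the file name and stop.
cd "$1" || { mv "$1.zip" "$1.odt"; exit 1; }
echo "Instances of $2"
tidy -q -xml < content.xml | grep -- "$2"
cd ..
rm -r "$1"
mv "$1.zip" "$1.odt"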
Command line usage is

Code:
sh SearchODT.sh ODTFileName "Search term goes here"
(Assuming you name your script SearchODT.sh)
(Do not put the .odt at the end of the filename...)

It has some minor formatting issues, but it basically gets the job done. Further refinement could come later.

#8 Garvan (Guest), 18-Aug-2007, 20:37

I have a shell script and Python script combination that I found in LinuxFormat for searching and replacing in an OpenOffice file. It changes the quote formatting for when you cut and paste from documents not created in OpenOffice. Might as well post it:

Code:
#!/bin/bash
# Script to change quotes in Open Office
cd ~/bin/odtscript
TMPDIR=/tmp/ODFfixit.$(date +%y%m%d.%H%M%S).$$
if rm -rf "$TMPDIR" && mkdir "$TMPDIR"; then
    : # be happy
else
    echo >&2 "Can't (re)create $TMPDIR; aborting"
    exit 1
fi

OLDFILE=$1
NEWFILE=$2

if [[ $# -eq 2 ]] &&
    touch "$NEWFILE" && rm -f "$NEWFILE" &&
    unzip -q "$OLDFILE" -d "$TMPDIR"; then
    : # All good
else
    echo >&2 "Usage: $0 OLDFILE NEWFILE"
    rm -rf "$TMPDIR"
    exit 1
fi

# List of the files inside the archive, one name per line
F=$(unzip -l "$OLDFILE" | sed -n '/:[0-9][0-9]/s|^.*:.. *||p')
if echo "$F" | grep -q '^content\.xml$'; then
    : # good
else
    echo >&2 "content.xml not in $OLDFILE; aborting"
    exit 1
fi

mv "$TMPDIR/content.xml" "$TMPDIR/OLDcontent.xml"


if ./fixit.py "$TMPDIR/OLDcontent.xml" > "$TMPDIR/content.xml"; then
    : # worked
else
    echo >&2 "fixit.py failed in $TMPDIR; aborting"
    exit 1
fi

# $F is left unquoted on purpose so that it splits into one argument per file
if (cd "$TMPDIR"; zip -q - $F) | cat > "$NEWFILE"; then
    # worked?
    rm -rf "$TMPDIR"
else
    # something bad
    echo >&2 "zip failed in $TMPDIR on $F"
fi
Code:
#!/usr/bin/python -tt
import xml.dom.minidom
import sys
import re

DEBUG = 0

def dprint(what):
    if DEBUG == 0: return
    sys.stderr.write(what.encode('ascii','replace') + '\n')

emDash =u'\u2014'
enDash =u'\u2013'
sDquote=u'\u201c'
eDquote=u'\u201d'
sSquote=u'\u2018'
eSquote=u'\u2019'

sDpat = re.compile(r'(\A|(?<=\s))"(?=\S)',re.U)
eDpat = re.compile(r'("\Z)|("(?=\s))', re.U)
sSpat = re.compile(r"(\A|(?<=\s))'(?=\S)", re.U)
eSpat = re.compile(r"(?<=\S)'", re.U)

def fixdata(td, depth):
    dprint("depth=%d: childNode: %s" %(depth, td.data))

    td.data = td.data.replace('--', emDash)
    td.data = td.data.replace(enDash, emDash)
    td.data = sDpat.sub(sDquote, td.data)
    td.data = eDpat.sub(eDquote, td.data)
    td.data = sSpat.sub(sSquote, td.data)
    td.data = eSpat.sub(eSquote, td.data)

def handle_xml_tree(aNode, depth):
    if aNode.hasChildNodes():
        for kid in aNode.childNodes:
            handle_xml_tree(kid, depth+1)
    else:
        if 'data' in dir(aNode):
            fixdata(aNode, depth)

def doit(argv):
    doc = xml.dom.minidom.parse(argv[1])
    handle_xml_tree(doc, 0)
    sys.stdout.write(doc.toxml('utf-8'))

if __name__ == "__main__":
    doit(sys.argv)
For an explanation of the code you might search for archives of old LinuxFormat magazines.
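
To see what the regular expressions in the Python script do, here is a small stand-alone sketch that applies the same substitutions to a plain string. It is written for Python 3 (the script above appears to target Python 2, since it writes encoded bytes straight to sys.stdout), and the function name fix_quotes is made up for this illustration.

```python
import re

EM_DASH = "\u2014"
EN_DASH = "\u2013"

# Same patterns as in the script above.
OPEN_DOUBLE = re.compile(r'(\A|(?<=\s))"(?=\S)')   # " at start or after whitespace
CLOSE_DOUBLE = re.compile(r'("\Z)|("(?=\s))')      # " at end or before whitespace
OPEN_SINGLE = re.compile(r"(\A|(?<=\s))'(?=\S)")   # ' at start or after whitespace
CLOSE_SINGLE = re.compile(r"(?<=\S)'")             # ' after a non-space character


def fix_quotes(text):
    """Replace '--' and straight quotes with typographic characters."""
    text = text.replace("--", EM_DASH)
    text = text.replace(EN_DASH, EM_DASH)
    text = OPEN_DOUBLE.sub("\u201c", text)
    text = CLOSE_DOUBLE.sub("\u201d", text)
    text = OPEN_SINGLE.sub("\u2018", text)
    text = CLOSE_SINGLE.sub("\u2019", text)
    return text
```

The order matters: opening quotes are replaced first, so the final pattern only sees the remaining straight single quotes, which are apostrophes or closing quotes. A closing double quote followed directly by punctuation is left unchanged by these patterns.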

Garvan
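
The unzip, rewrite and re-zip steps of the shell script can also be sketched with Python's zipfile module alone. This is a rough illustration under stated assumptions, not the LinuxFormat method: rewrite_content and its transform argument are hypothetical names, and old_path and new_path must be different files.

```python
import zipfile


def rewrite_content(old_path, new_path, transform):
    """Copy an .odt archive, passing content.xml through `transform`.

    `transform` takes the content.xml bytes and returns the new bytes.
    Every other member is copied unchanged, in the original order.
    """
    with zipfile.ZipFile(old_path) as src, zipfile.ZipFile(new_path, "w") as dst:
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename == "content.xml":
                data = transform(data)
            # Passing the ZipInfo keeps each member's name and compression type.
            dst.writestr(info, data)
```

A call such as rewrite_content("old.odt", "new.odt", my_fix) would then play the role of the shell wrapper, with my_fix standing in for fixit.py (all three names are placeholders).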

#9 AndrewTheArt (Guest), 18-Aug-2007, 21:11

Looks good, but in terms of code compactness, mine might just win
Although that one looks a lot more robust.
 
