| ARCHIVES - Programming & Scripting A place to discuss website design, programming, shell scripts, etc |
|
|||
|
Hi
I am in the process of learning scripting and doing things from the command line in Konsole. I was trying to use grep on some OpenOffice .odt files that I have, but with no success. I did some research and found out that the reason is that the files are formatted, and that is why I can't use grep on them. Is there a way of using grep on those kinds of files without having to convert them to .txt files? If I convert the OpenOffice .odt files to .txt files, then I can use grep to search in them. Thanks
|
|||
|
An OpenOffice file (something.odt) is a zip file containing the collection of files that make up the document. One of these is an XML file called content.xml, which contains the actual text of your document.
If you wish, you could write a script that unzips it and searches the content.xml file. Is that what you want? Garvan
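To make the "zip file with content.xml inside" idea concrete, here is a minimal Python 3 sketch. The helper names (`list_members`, `read_content_xml`, `make_sample_odt`) are made up for illustration, and the sample archive is only a stand-in built in memory, not a complete valid ODF file. It also assumes content.xml is UTF-8 encoded.

```python
import io
import zipfile


def list_members(odt):
    """Return the names of the files stored inside the .odt archive."""
    with zipfile.ZipFile(odt) as archive:
        return archive.namelist()


def read_content_xml(odt):
    """Return content.xml as text (assumes UTF-8 encoding)."""
    with zipfile.ZipFile(odt) as archive:
        return archive.read("content.xml").decode("utf-8")


def make_sample_odt():
    """Build a tiny stand-in archive in memory (hypothetical test data)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("mimetype", "application/vnd.oasis.opendocument.text")
        archive.writestr("content.xml", "<text:p>This is a simple test.</text:p>")
    buffer.seek(0)
    return buffer


print(list_members(make_sample_odt()))
print(read_content_xml(make_sample_odt()))
```

Both helpers accept either a path such as "something.odt" or an open file object, because `zipfile.ZipFile` takes both.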
|
|||
|
Quote:
Well, yes, that in a way answers a few questions about why I can't use grep on OpenOffice .odt files. I see you mention XML files. What about using grep to search them, and if that is possible, how? Are XML files a kind of web file? Remember, I am new to all of this and still in the process of learning. Thanks
|
|||
|
Quote:
The problem has been solved: you can't use grep on OpenOffice .odt files directly; you have to convert them to .txt first.
|
|||
|
I was looking into this and actually wrote a bash script to unzip the ODT and search the correct file. However, there is one problem: the file that contains the actual text is only one long line, not separate lines. When grep finds an occurrence of your word, it just returns the entire line every single time. It sucks...
Here's how far I got. Code:
#!/bin/bash
# Usage: pass the document name without the .odt extension.
mv "$1.odt" "$1.zip"        # temporarily rename the document
unzip "$1.zip" -d "$1"      # unpack into a directory named after it
cd "$1"
echo "Type in a search term."
read -r search
grep "$search" content.xml
cd ..
rm -r "$1"                  # clean up the unpacked files
mv "$1.zip" "$1.odt"        # restore the original name
Maybe somebody with more grep experience than me would know how to change the default delimiter (from a newline to something else)? This Code:
<text:p text:style-name="Standard">This is a simple test.</text:p><text:p text:style-name="Standard">I hope it works.</text:p><text:p text:style-name="Standard">Woohoo!</text:p></office:text></office:body></office:document-content>
would ideally be treated as Code:
This is a simple test.
I hope it works.
Woohoo!
The delimiter would be </text:p> for most documents, because it essentially represents a carriage return for our purposes (but grep doesn't know this).
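The delimiter idea can be sketched in Python 3 without touching grep at all: split the one long line on the closing paragraph tag, strip the remaining tags, and search the pieces. This is only a rough sketch. The function names are invented for the example, the regex tag stripping is crude, and it ignores headings and empty self-closing paragraphs.

```python
import re
from xml.sax.saxutils import unescape

# The sample XML from the post above, as one long line.
SAMPLE = (
    '<text:p text:style-name="Standard">This is a simple test.</text:p>'
    '<text:p text:style-name="Standard">I hope it works.</text:p>'
    '<text:p text:style-name="Standard">Woohoo!</text:p>'
    '</office:text></office:body></office:document-content>'
)


def paragraph_lines(xml_text):
    """Treat </text:p> as the line delimiter and return plain-text lines."""
    lines = []
    for chunk in xml_text.split("</text:p>"):
        text = re.sub(r"<[^>]+>", "", chunk)   # drop any remaining tags
        text = unescape(text).strip()          # turn &amp; etc. back into characters
        if text:
            lines.append(text)
    return lines


def grep_paragraphs(xml_text, term):
    """Return only the paragraphs that contain the search term."""
    return [line for line in paragraph_lines(xml_text) if term in line]


for line in grep_paragraphs(SAMPLE, "hope"):
    print(line)
```

With the sample above, only the matching paragraph is printed instead of the whole document.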
|
|
|||
|
You could feed the XML through tidy first, which will reformat it:
tidy -q -xml < "$i" | grep "$search"
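If tidy is not installed, a similar reformatting step can be approximated with Python 3's standard library. The sketch below uses `xml.dom.minidom` to pretty-print the XML so each paragraph lands on its own line, then filters the lines. It is not a drop-in replacement for tidy. The function names and the sample document are assumptions for illustration, and paragraphs with nested formatting elements will be spread over several lines.

```python
import xml.dom.minidom

# Hypothetical sample; the namespace declarations are needed for parsing.
SAMPLE = (
    '<office:document-content'
    ' xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"'
    ' xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">'
    '<office:body><office:text>'
    '<text:p text:style-name="Standard">This is a simple test.</text:p>'
    '<text:p text:style-name="Standard">I hope it works.</text:p>'
    '<text:p text:style-name="Standard">Woohoo!</text:p>'
    '</office:text></office:body></office:document-content>'
)


def pretty_lines(xml_text):
    """Re-indent the XML and return its non-empty lines, stripped."""
    pretty = xml.dom.minidom.parseString(xml_text).toprettyxml(indent="  ")
    return [line.strip() for line in pretty.splitlines() if line.strip()]


def grep_pretty(xml_text, term):
    """Return the reformatted lines that contain the search term."""
    return [line for line in pretty_lines(xml_text) if term in line]


for line in grep_pretty(SAMPLE, "hope"):
    print(line)
```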
|
|||
|
I was personally trying to avoid any dependencies beyond the basics, but I guess it's a necessity to reformat the XML.
So, with Ken's help, I've designed a relatively crude ODT search... Code:
#!/bin/bash
# Usage: sh SearchODT.sh ODTFileName "search term"
mv "$1.odt" "$1.zip"            # temporarily rename the document
unzip -qq "$1.zip" -d "$1"      # unpack quietly into a directory named after it
cd "$1"
echo "Instances of $2"
tidy -q -xml < content.xml | grep "$2"
cd ..
rm -r "$1"                      # clean up the unpacked files
mv "$1.zip" "$1.odt"            # restore the original name
Code:
sh SearchODT.sh ODTFileName "Search term goes here"
(Do not put the .odt at the end of the filename...) It has some minor formatting issues, but it basically gets the job done. Further refinement could come later.
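For comparison, here is a Python 3 sketch of the same kind of search that never renames or unpacks the document on disk: it reads content.xml straight out of the archive and walks the paragraph elements with an XML parser. The names `search_odt`, `node_text` and `make_sample_odt` are hypothetical, and the sample archive is a minimal stand-in rather than a real document. Treat it as a starting point under those assumptions.

```python
import io
import zipfile
import xml.dom.minidom

# Namespace of text:p elements in OpenDocument files.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def node_text(node):
    """Collect all text below a node, ignoring the markup around it."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(node_text(child) for child in node.childNodes)


def search_odt(odt, term):
    """Return the text of every paragraph in the document containing term."""
    with zipfile.ZipFile(odt) as archive:
        doc = xml.dom.minidom.parseString(archive.read("content.xml"))
    hits = []
    for para in doc.getElementsByTagNameNS(TEXT_NS, "p"):
        text = node_text(para)
        if term in text:
            hits.append(text)
    return hits


def make_sample_odt():
    """Build a minimal stand-in archive in memory (hypothetical test data)."""
    content = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<office:document-content'
        ' xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"'
        ' xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">'
        '<office:body><office:text>'
        '<text:p text:style-name="Standard">This is a simple test.</text:p>'
        '<text:p text:style-name="Standard">I hope it works.</text:p>'
        '<text:p text:style-name="Standard">Woohoo!</text:p>'
        '</office:text></office:body></office:document-content>'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("content.xml", content)
    buffer.seek(0)
    return buffer


print(search_odt(make_sample_odt(), "hope"))
```

Passing a path such as "ODTFileName.odt" instead of the in-memory sample should work the same way, since `zipfile.ZipFile` accepts both.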
|
|||
|
I have a bash script and Python script combination that I found in LinuxFormat for searching and replacing in an OpenOffice file. It changes quote formatting for when you cut and paste documents not created in OpenOffice. Might as well post it:
Code:
#!/bin/bash
# Script to change quotes in Open Office
cd ~/bin/odtscript
TMPDIR=/tmp/ODFfixit.$(date +%y%m%d.%H%M%S).$$
if rm -rf "$TMPDIR" && mkdir "$TMPDIR"; then
    : # be happy
else
    echo >&2 "Can't (re)create $TMPDIR; aborting"
    exit 1
fi
OLDFILE=$1
NEWFILE=$2
if [[ $# -eq 2 ]] &&
    touch "$NEWFILE" && rm -f "$NEWFILE" &&
    unzip -q "$OLDFILE" -d "$TMPDIR"; then
    : # All good
else
    echo >&2 "Usage: $0 OLDFILE NEWFILE"
    rm -rf "$TMPDIR"
    exit 1
fi
# List the files in the archive so they can be zipped again in the same order
F=$(unzip -l "$OLDFILE" | sed -n '/:[0-9][0-9]/s|^.*:.. *||p')
if echo "$F" | grep -q '^content\.xml$'; then
    : # good
else
    echo >&2 "content.xml not in $OLDFILE; aborting"
    exit 1
fi
mv "$TMPDIR/content.xml" "$TMPDIR/OLDcontent.xml"
if ./fixit.py "$TMPDIR/OLDcontent.xml" > "$TMPDIR/content.xml"; then
    : # worked
else
    echo >&2 "fixit.py failed in $TMPDIR; aborting"
    exit 1
fi
# $F is left unquoted on purpose so each file name becomes its own argument
if (cd "$TMPDIR"; zip -q - $F) | cat > "$NEWFILE"; then
    # worked?
    rm -rf "$TMPDIR"
else
    # something bad
    echo >&2 "zip failed in $TMPDIR on $F"
fi
Code:
#!/usr/bin/python -tt
import xml.dom.minidom
import sys
import re

DEBUG = 0

def dprint(what):
    if DEBUG == 0: return
    sys.stderr.write(what.encode('ascii','replace') + '\n')

emDash = u'\u2014'
enDash = u'\u2013'
sDquote = u'\u201c'
eDquote = u'\u201d'
sSquote = u'\u2018'
eSquote = u'\u2019'

sDpat = re.compile(r'(\A|(?<=\s))"(?=\S)', re.U)
eDpat = re.compile(r'("\Z)|("(?=\s))', re.U)
sSpat = re.compile(r"(\A|(?<=\s))'(?=\S)", re.U)
eSpat = re.compile(r"(?<=\S)'", re.U)

def fixdata(td, depth):
    dprint("depth=%d: childNode: %s" % (depth, td.data))

    td.data = td.data.replace('--', emDash)
    td.data = td.data.replace(enDash, emDash)
    td.data = sDpat.sub(sDquote, td.data)
    td.data = eDpat.sub(eDquote, td.data)
    td.data = sSpat.sub(sSquote, td.data)
    td.data = eSpat.sub(eSquote, td.data)

def handle_xml_tree(aNode, depth):
    if aNode.hasChildNodes():
        for kid in aNode.childNodes:
            handle_xml_tree(kid, depth+1)
    else:
        if 'data' in dir(aNode):
            fixdata(aNode, depth)

def doit(argv):
    doc = xml.dom.minidom.parse(argv[1])
    handle_xml_tree(doc, 0)
    sys.stdout.write(doc.toxml('utf-8'))

if __name__ == "__main__":
    doit(sys.argv)
Garvan
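To see what the substitutions in fixit.py do without going through a whole document, here is a small Python 3 sketch that applies the same regular expressions to a plain string. The `fix_quotes` wrapper and the sample sentence are additions for illustration only and are not part of the LinuxFormat script.

```python
import re

EM_DASH, EN_DASH = "\u2014", "\u2013"
OPEN_DOUBLE, CLOSE_DOUBLE = "\u201c", "\u201d"
OPEN_SINGLE, CLOSE_SINGLE = "\u2018", "\u2019"

# Same patterns as in fixit.py.
START_DOUBLE = re.compile(r'(\A|(?<=\s))"(?=\S)')   # " at start or after whitespace
END_DOUBLE = re.compile(r'("\Z)|("(?=\s))')         # " at end or before whitespace
START_SINGLE = re.compile(r"(\A|(?<=\s))'(?=\S)")   # ' at start or after whitespace
END_SINGLE = re.compile(r"(?<=\S)'")                # ' after a non-space (also apostrophes)


def fix_quotes(text):
    """Apply the dash and quote substitutions in the same order as fixdata()."""
    text = text.replace("--", EM_DASH).replace(EN_DASH, EM_DASH)
    text = START_DOUBLE.sub(OPEN_DOUBLE, text)
    text = END_DOUBLE.sub(CLOSE_DOUBLE, text)
    text = START_SINGLE.sub(OPEN_SINGLE, text)
    text = END_SINGLE.sub(CLOSE_SINGLE, text)
    return text


print(fix_quotes("\"It's done\" -- she said 'ok'."))
```

The order matters: opening quotes are replaced first, so whatever straight single quotes remain after a non-space character are treated as closing quotes or apostrophes.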