]Shakes.[/A]" for Shakespeare and what" name="description" /> ]Shakes.[/A]" for Shakespeare and what" />

Searching TXT files in Python


I am a new programmer and we are working on a Graduate English project where we are trying to parse a gigantic dictionary text file (500 MB). The file is set up with html-like tags. I have 179 author tags eg. "[A>]Shakes.[/A]" for Shakespeare and what I need to do is find each occurrence of every tag and then write that tag and what follows on the line until I get to "[/W]".

My problem is that readlines() gives me a memory error (I am assuming because the file is so large) and I have been able to find matches (but only once) and not been able to get it to look past the first match. Any help that anyone could give would be greatly appreciated.

There are no new lines in the text file which I think is causing the problem. This problem has been solved. I thought I would include the code that worked:

with open('/Users/Desktop/Poetrylist.txt','w') as output_file: with open('/Users/Desktop/2e.txt','r') as open_file: the_whole_file = open_file.read() start_position = 0 while True: start_position = the_whole_file.find('<A>', start_position) if start_position < 0: break start_position += 3 end_position = the_whole_file.find('</W>', start_position) output_file.write(the_whole_file[start_position:end_position]) output_file.write("\n") start_position = end_position + 4


I don't know regular expressions well, but you can solve this problem without them, using the string method find() and line slicing.

answer = '' with open('yourFile.txt','r') as open_file, open('output_file','w') as output_file: for each_line in open_file: if each_line.find('[A>]'): start_position = each_line.find('[A>]') start_position = start_position + 3 end_position = each_line[start_position:].find('[/W]') answer = each_line[start_position:end_position] + '\n' output_file.write(answer)

Let me explain what is happening:

<ol><li>Create an empty 'list' using = []. This will hold your answers. </li> <li>Use the with... statement. This allows you to open your file as an alias (I chose open_file). This ensures automatic closing of your file whether or not your program runs correctly.</li> <li>We use the 'for line in file:' idiom to tackle the file one line at a time. The 'line' variable can be named anything (i.e. for x in file, for pizza in file) and will always contain each line as a string. When it gets to the end of the file, it automatically stops.</li> <li>the 'if each_line.find('[A>]'):' statement simply tests if the starting tag is in that line. If it is not, none of the indented code that follows will run, and the loop will re-start, moving to the next line.</li> <li>We use string slicing, where we can cut out the part of the string we want. What we do is search for the first tag by position (which we already know is in this line), then search for the stop tag by position. Once we have those, we can simply cut out the part we want.</li> <li>I buffed up the position in two ways. 1 I added 3 to the start position so it would skip over the [A>] - thus instead of giving '[A>] THIS IS MY STRING...' it just gives 'THIS IS MY STRING...' I then searched for the end position by looking for its first occurence AFTER the [A>] tag, inc ase the [/W] tag occurrs more than once each line. </li> <li>We set the answer to the string slice, and a new line character ('\n') so each string appears on its own line. We use the output method .write('stringToWrite') to write each string, one at a time. </li> </ol>


After opening the file, iterate through the lines like this:

input_file = open('huge_file.txt', 'r') for input_line in input_file: # process the line however you need - consider learning some basic regular expressions

This will allow you to easily process the file by reading it in line by line as needed rather than loading it all into memory at once


You're getting a memory error with readlines() because given the filesize you're likely reading in more data than your memory can reasonably handle. Since this file is an XML file, you should be able to read through it iterparse(), which will parse the XML lazily without taking up excess memory. Here's some code I used to parse Wikipedia dumps:

for event, elem in parser: if event == 'start' and root == None: root = elem elif event == 'end' and elem.tag == namespace + 'title': page_title = elem.text #This clears bits of the tree we no longer use. elem.clear() elif event == 'end' and elem.tag == namespace + 'text': page_text = elem.text #Clear bits of the tree we no longer use elem.clear() #Now lets grab all of the outgoing links and store them in a list key_vals = [] #Eliminate duplicate outgoing links. key_vals = set(key_vals) key_vals = list(key_vals) count += 1 if count % 1000 == 0: print str(count) + ' records processed.' elif event == 'end' and elem.tag == namespace + 'page': root.clear()

Here's roughly how it works:


We create parser to progress through the document.

</li> <li>

As we loop through each element of the document, we look for elements with the tag you are looking for (in your example it was 'A').

</li> <li>

We store that data and process it. Any element we are done processing we clear, because as we go through the document it remains in memory, so we want to remove anything we no longer need.

</li> </ol>


You should look into a tool called "Grep". You can give it a pattern to match and a file, and it will print out occurences in the file and line numbers, if you want. Very useful and probably can be interfaced with Python.


Instead of parsing the file by hand why not parse it as <a href="http://docs.python.org/library/xml.etree.elementtree.html" rel="nofollow">XML</a> to have better control of the data? You mentioned that the data is HTML-like so I assume it is parseable as an XML document.


Please, test the following code:

import re regx = re.compile('<A>.+?</A>.*?<W>.*?</W>') with open('/Users/Desktop/2e.txt','rb') as open_file,\ open('/Users/Desktop/Poetrylist.txt','wb') as output_file: remain = '' while True: chunk = open_file.read(65536) # 65536 == 16 x 16 x 16 x 16 if not chunk: break output_file.writelines( mat.group() + '\n' for mat in regx.finditer(remain + chunk) ) remain = chunk[mat.end(0)-len(remain):]

I couldn't test it because I have no file to test on.


  • Multiple languages in one HTML page
  • Getting double value from the quotient of integers
  • In a django custom field, what does the SubfieldBase metaclass do?
  • HALF_PTR Windows data type
  • Dependable views in Ember
  • How can I determine which routines MATLAB uses to solve a sparse matrix?
  • Meteor.. accounts- password— Create account on client without login
  • Can XOR be expressed using SKI combinators?
  • (Play 2.5) How do you define json format for type alias of an Option?
  • Is it safe to drop the -webkit vendor prefix from the css3 border-radius yet?
  • F#: In which memory area is the continuation stored: stack or heap?
  • Avoid registering duplicate broadcast receivers in Android
  • Java making confirming exit
  • .NET video play library which allows to change the playback rate?
  • Most efficient way to move table rows from one table to another
  • How to 'create temp table as select' in Slick?
  • how to set variables in a php include file?
  • MySQL Order by column = x, column asc?
  • Jackson Parser: ignore deserializing for type mismatch
  • How to define and use opencv mat of user type
  • How to use remove-erase idiom for removing empty vectors in a vector?
  • Extracting HTML between tags
  • Repeat a vertical line on every page in Report Builder / SSRS
  • Disabling Alt-F4 on a Win Forms NotifyIcon
  • Why is an OPTIONS request sent to the server?
  • Nant, Vault & Windows Integrated Authentication
  • Why HTML5 Canvas with a larger size stretch a drawn line?
  • TFS: Get latest causes slow project reloading
  • Controls, properties, events and timers running in design time
  • vba code to select only visible cells in specific column except heading
  • Weird JavaScript statement, what does it mean?
  • How do I rollback to a specific git commit
  • Comma separated Values
  • Error creating VM instance in Google Compute Engine
  • Hits per day in Google Big Query
  • how does django model after text[] in postgresql [duplicate]
  • Turn off referential integrity in Derby? is it possible?
  • Add sale price programmatically to product variations
  • Unable to use reactive element in my shiny app
  • How do I use LINQ to get all the Items that have a particular SubItem?