I am a new programmer and we are working on a Graduate English project where we are trying to parse a gigantic dictionary text file (500 MB). The file is set up with html-like tags. I have 179 author tags eg. "[A>]Shakes.[/A]" for Shakespeare and what I need to do is find each occurrence of every tag and then write that tag and what follows on the line until I get to "[/W]".
My problem is that readlines() gives me a memory error (I am assuming because the file is so large) and I have been able to find matches (but only once) and not been able to get it to look past the first match. Any help that anyone could give would be greatly appreciated.
There are no new lines in the text file which I think is causing the problem. This problem has been solved. I thought I would include the code that worked:
with open('/Users/Desktop/Poetrylist.txt','w') as output_file: with open('/Users/Desktop/2e.txt','r') as open_file: the_whole_file = open_file.read() start_position = 0 while True: start_position = the_whole_file.find('<A>', start_position) if start_position < 0: break start_position += 3 end_position = the_whole_file.find('</W>', start_position) output_file.write(the_whole_file[start_position:end_position]) output_file.write("\n") start_position = end_position + 4Answer1:
I don't know regular expressions well, but you can solve this problem without them, using the string method find() and line slicing.
answer = '' with open('yourFile.txt','r') as open_file, open('output_file','w') as output_file: for each_line in open_file: if each_line.find('[A>]'): start_position = each_line.find('[A>]') start_position = start_position + 3 end_position = each_line[start_position:].find('[/W]') answer = each_line[start_position:end_position] + '\n' output_file.write(answer)
Let me explain what is happening:<ol><li>Create an empty 'list' using = . This will hold your answers. </li> <li>Use the with... statement. This allows you to open your file as an alias (I chose open_file). This ensures automatic closing of your file whether or not your program runs correctly.</li> <li>We use the 'for line in file:' idiom to tackle the file one line at a time. The 'line' variable can be named anything (i.e. for x in file, for pizza in file) and will always contain each line as a string. When it gets to the end of the file, it automatically stops.</li> <li>the 'if each_line.find('[A>]'):' statement simply tests if the starting tag is in that line. If it is not, none of the indented code that follows will run, and the loop will re-start, moving to the next line.</li> <li>We use string slicing, where we can cut out the part of the string we want. What we do is search for the first tag by position (which we already know is in this line), then search for the stop tag by position. Once we have those, we can simply cut out the part we want.</li> <li>I buffed up the position in two ways. 1 I added 3 to the start position so it would skip over the [A>] - thus instead of giving '[A>] THIS IS MY STRING...' it just gives 'THIS IS MY STRING...' I then searched for the end position by looking for its first occurence AFTER the [A>] tag, inc ase the [/W] tag occurrs more than once each line. </li> <li>We set the answer to the string slice, and a new line character ('\n') so each string appears on its own line. We use the output method .write('stringToWrite') to write each string, one at a time. </li> </ol>Answer2:
After opening the file, iterate through the lines like this:
input_file = open('huge_file.txt', 'r') for input_line in input_file: # process the line however you need - consider learning some basic regular expressions
This will allow you to easily process the file by reading it in line by line as needed rather than loading it all into memory at onceAnswer3:
You're getting a memory error with readlines() because given the filesize you're likely reading in more data than your memory can reasonably handle. Since this file is an XML file, you should be able to read through it iterparse(), which will parse the XML lazily without taking up excess memory. Here's some code I used to parse Wikipedia dumps:
for event, elem in parser: if event == 'start' and root == None: root = elem elif event == 'end' and elem.tag == namespace + 'title': page_title = elem.text #This clears bits of the tree we no longer use. elem.clear() elif event == 'end' and elem.tag == namespace + 'text': page_text = elem.text #Clear bits of the tree we no longer use elem.clear() #Now lets grab all of the outgoing links and store them in a list key_vals =  #Eliminate duplicate outgoing links. key_vals = set(key_vals) key_vals = list(key_vals) count += 1 if count % 1000 == 0: print str(count) + ' records processed.' elif event == 'end' and elem.tag == namespace + 'page': root.clear()
Here's roughly how it works:<ol><li>
We create parser to progress through the document.</li> <li>
As we loop through each element of the document, we look for elements with the tag you are looking for (in your example it was 'A').</li> <li>
We store that data and process it. Any element we are done processing we clear, because as we go through the document it remains in memory, so we want to remove anything we no longer need.</li> </ol>Answer4:
You should look into a tool called "Grep". You can give it a pattern to match and a file, and it will print out occurences in the file and line numbers, if you want. Very useful and probably can be interfaced with Python.Answer5:
Instead of parsing the file by hand why not parse it as <a href="http://docs.python.org/library/xml.etree.elementtree.html" rel="nofollow">XML</a> to have better control of the data? You mentioned that the data is HTML-like so I assume it is parseable as an XML document.Answer6:
Please, test the following code:
import re regx = re.compile('<A>.+?</A>.*?<W>.*?</W>') with open('/Users/Desktop/2e.txt','rb') as open_file,\ open('/Users/Desktop/Poetrylist.txt','wb') as output_file: remain = '' while True: chunk = open_file.read(65536) # 65536 == 16 x 16 x 16 x 16 if not chunk: break output_file.writelines( mat.group() + '\n' for mat in regx.finditer(remain + chunk) ) remain = chunk[mat.end(0)-len(remain):]
I couldn't test it because I have no file to test on.