
Question:
I have 2 big text files like the following small examples. There are 2 files (major and minor). Both files have 4 columns. In the major file the difference between the 2nd and 3rd columns is 10000; in the minor file the difference between the 2nd and 3rd columns is 32 or 31, or a number close to 31 but not much higher.
small example of major file:
chr4 530000 540000 0.0
chr4 540000 550000 1719.0
chr4 550000 560000 0.0
small example of minor file:
chr4 295577 295608 12
chr4 323326 323357 10
chr4 548873 548904 32
chr4 548873 548904 20
chr4 549047 549078 32
chr4 549047 549078 20
chr4 549137 549168 32
chr4 549137 549168 20
chr4 549181 549212 32
chr4 549181 549212 20
chr4 549269 549300 22
chr4 549269 549300 381
chr4 549269 549300 67
chr4 549269 549300 89
chr4 549269 549300 95
chr4 549269 549300 124
chr4 549269 549300 149
chr4 549269 549300 87
chr4 549269 549300 33
chr4 549269 549300 65
chr4 549269 549300 68
chr4 549269 549300 190
chr4 549269 549300 20
chr4 549355 549386 32
chr4 549355 549386 20
chr4 549443 549474 16
chr4 705810 705841 10
chr4 846893 846924 28
I want to make a new text file with 7 columns, like the following.
expected output:
chr4 548873 548904 32 chr4 540000 550000
chr4 548873 548904 20 chr4 540000 550000
chr4 549047 549078 32 chr4 540000 550000
chr4 549047 549078 20 chr4 540000 550000
chr4 549137 549168 32 chr4 540000 550000
chr4 549137 549168 20 chr4 540000 550000
chr4 549181 549212 32 chr4 540000 550000
chr4 549181 549212 20 chr4 540000 550000
chr4 549269 549300 22 chr4 540000 550000
chr4 549269 549300 381 chr4 540000 550000
chr4 549269 549300 67 chr4 540000 550000
chr4 549269 549300 89 chr4 540000 550000
chr4 549269 549300 95 chr4 540000 550000
chr4 549269 549300 124 chr4 540000 550000
chr4 549269 549300 149 chr4 540000 550000
chr4 549269 549300 87 chr4 540000 550000
chr4 549269 549300 33 chr4 540000 550000
chr4 549269 549300 65 chr4 540000 550000
chr4 549269 549300 68 chr4 540000 550000
chr4 549269 549300 190 chr4 540000 550000
chr4 549269 549300 20 chr4 540000 550000
chr4 549355 549386 32 chr4 540000 550000
chr4 549355 549386 20 chr4 540000 550000
chr4 549443 549474 16 chr4 540000 550000
The first 4 columns are from the minor file and the last 3 columns are from the major file.
Looking at the expected output, the numbers in the 2nd and 3rd columns (from the minor file) are within the range given by columns 6 and 7 of the same row (from the major file), and the 1st column is equal to the 5th column (these are the 1st columns of the minor and major files, respectively).
In other words, I want to find the rows in the minor file whose 1st column is equal to the 1st column of a row in the major file, and whose 2nd and 3rd columns both fall within the range of the 2nd and 3rd columns of that major row. So there are 3 conditions for every row in the minor file to be eligible for the output file, and the last 3 columns come from the major row that matches the minor row.
I am trying to do that in Python and wrote the following code, but it does not return what I expected:
major = open("major.txt", 'rb')
minor = open("minor.txt", 'rb')
major_list = []
minor_list = []
for m in major:
    major_list.append(m)
for n in minor:
    minor_list.append(n)
final = []
for i in minor_list:
    for j in major_list
    if minor_list[i] == major_list[j] and minor_list[i+1] <= major_list[j+1] and minor_list[i+2] >= major_list[j+2]:
        final.append(i)
with open('output.txt', 'w') as f:
    for item in final:
        f.write("%s\n" % item)
Answer1: You should do something like this:
final = []
for i, j in zip(minor_list, major_list):
    # append() takes a single argument, so store each pair as a tuple
    final.append((i, j))
Answer2: Maybe it's a typo in your code. I can see that you're missing a tab at your if minor_list[i] line (and the colon after for j in major_list):
final = []
for i in minor_list:
    for j in major_list
    if minor_list[i] == major_list[j] and minor_list[i+1] <= major_list[j+1] and minor_list[i+2] >= major_list[j+2]:
        final.append(i)
should be
final = []
for i in minor_list:
    for j in major_list:
        if minor_list[i] == major_list[j] and minor_list[i+1] <= major_list[j+1] and minor_list[i+2] >= major_list[j+2]:
            final.append(i)
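Indentation alone may not be enough, though: in these loops i and j are whole lines rather than indexes, so minor_list[i] would fail. Below is a minimal sketch of the three conditions from the question. It assumes whitespace-separated columns and reads "in range" as the minor interval lying entirely inside the major one; the helper names parse_line and contained_rows are hypothetical, not from the original code.

```python
def parse_line(line):
    """Split a whitespace-separated row into (chrom, start, end, value)."""
    chrom, start, end, value = line.split()
    return chrom, int(start), int(end), value


def contained_rows(minor_lines, major_lines):
    """Return minor rows that lie inside a major interval on the same chromosome.

    Each result holds the 4 minor columns followed by the first 3 major columns.
    """
    majors = [parse_line(line) for line in major_lines if line.strip()]
    result = []
    for line in minor_lines:
        if not line.strip():
            continue  # skip blank lines
        chrom, start, end, value = parse_line(line)
        for m_chrom, m_start, m_end, _ in majors:
            # The three conditions: same chromosome, start and end inside the major range
            if chrom == m_chrom and m_start <= start and end <= m_end:
                result.append(
                    f"{chrom} {start} {end} {value} {m_chrom} {m_start} {m_end}"
                )
    return result


# Small demo using rows taken from the question
major_sample = ["chr4 530000 540000 0.0", "chr4 540000 550000 1719.0"]
minor_sample = [
    "chr4 295577 295608 12",
    "chr4 548873 548904 32",
    "chr4 549443 549474 16",
]
demo = contained_rows(minor_sample, major_sample)
for row in demo:
    print(row)
```

With real files, the same function could be given two file objects opened in text mode (for example open("minor.txt") and open("major.txt")), and the returned rows written to output.txt one per line.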
Answer3: Do you HAVE to use Python for this? If you install bedtools, this can be done from a bash shell with the following line:
bedtools intersect -wa -wb -a minor.bed -b major.bed > intersected_file.bed
A few bioinformatics tools are Linux/Mac-only, so if you're going to be doing any amount of bioinformatics, it's worth learning how to script in the shell.
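If Python is required after all and the files are large, comparing every minor row against every major row can get slow. One possible shortcut is sketched below, under the assumption that every major interval starts on a multiple of 10000 and is exactly 10000 wide, as in the sample: index the major rows by (chromosome, start) and look each minor row up directly. The names BIN_SIZE, build_bin_index and lookup are hypothetical.

```python
BIN_SIZE = 10000  # assumption: major intervals are aligned, fixed-width bins


def build_bin_index(major_lines):
    """Map (chrom, start) of each major row to its end coordinate."""
    index = {}
    for line in major_lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        index[(fields[0], int(fields[1]))] = int(fields[2])
    return index


def lookup(minor_line, index):
    """Return the 7-column output row for a minor row, or None if no bin contains it.

    At most one bin is reported per minor row (the one its start falls into).
    """
    chrom, start, end, value = minor_line.split()
    # Round the start down to the bin boundary to find the candidate major row
    bin_start = (int(start) // BIN_SIZE) * BIN_SIZE
    bin_end = index.get((chrom, bin_start))
    if bin_end is not None and int(end) <= bin_end:
        return f"{chrom} {start} {end} {value} {chrom} {bin_start} {bin_end}"
    return None


# Small demo using rows taken from the question
index = build_bin_index([
    "chr4 530000 540000 0.0",
    "chr4 540000 550000 1719.0",
    "chr4 550000 560000 0.0",
])
hit = lookup("chr4 549269 549300 381", index)
miss = lookup("chr4 295577 295608 12", index)
print(hit)
print(miss)
```

This does one dictionary lookup per minor row instead of scanning the whole major file, but it only works while the fixed-bin assumption holds; for irregular intervals a general tool such as bedtools is the safer choice.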