Split 10GB text file 1) with a minimum size of the output files of 40MB and 2) after a specific string (</record>)
I have a big text file (10 GB, .xml) that contains over one million tags like this: <record>text</record>. I split it into parts in order to work with it, but to automate my workflow every part has to end with a specific tag, </record>, and every part has to be at least about 40 MB in size.
Do you have an idea of the average line length? And does it vary a lot (looking at it on a large scale)?
– Jacob Vlijm
Nov 21 '14 at 13:26
Line lengths start at 9 characters and the maximum is about 120 characters. All the records in this .xml file are similar to this example:
– Michael Käfer
Nov 21 '14 at 15:38
<record xmlns="http://www.loc.gov/MARC21/slim" type="Authority"> <leader>02379nz a2200517o 4500</leader> <controlfield tag="001">100169600</controlfield> <controlfield tag="003">DE-101</controlfield> <controlfield tag="005">20080405154152.0</controlfield> <controlfield tag="008">900615n||aznnnabbn | aaa |c</controlfield> <datafield tag="024" ind1="7" ind2=" "> <subfield code="a">http://d-nb.info/gnd/100169600</subfield> <subfield code="2">uri</subfield> </datafield> </record>
– Michael Käfer
Nov 21 '14 at 15:39
One more question: when cut into pieces, is the last line of a "division" exactly "</record>", or does the last line merely contain "</record>"? The first would make it much easier; otherwise a line would have to be split across two sections, which is a handicap when dealing with huge files.
– Jacob Vlijm
Nov 21 '14 at 20:26
</record> is the last line of every division (no characters before or after it). Do you have a solution?
– Michael Käfer
Nov 22 '14 at 15:23
1 Answer
The script below slices a (large) file into pieces. I didn't use the split command, since the content of your file has to be "rounded" to whole records. You can set the size of the slices in the head section of the script.
The procedure
Difficulties
Because the script has to deal with huge files, Python's read() or readlines() cannot be used; they would attempt to load the whole file into memory at once, which would certainly choke your system. At the same time the divisions have to be "rounded" to whole records, so the script somehow has to identify or "read" the file's content.
What seems to be the only option is to use:
with open(file) as src:
    for line in src:
which reads the file line by line.
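As a minimal, self-contained sketch of this streaming pattern (the path and the tag are taken from the question; adjust the path to your file), counting the records this way keeps memory use flat no matter how big the file is:

# minimal sketch: count the records without loading the whole file into memory
count = 0
with open("/path/to/big/file.xml") as src:
    for line in src:                    # the file is read one line at a time
        if "</record>" in line:
            count += 1
print("records:", count)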
Approach
In the script I chose a two-step approach:
- Analyze the file (size, number of slices, number of lines, number of records, records per section), then create a list of sections or "markers" (by line index).
- Read the file again, now allocating the lines to the separate slice files.
Appending the lines to the separate slices (files) one by one seems inefficient, but of everything I tried it turned out to be the most efficient, fastest and least memory-consuming option.
How I tested
I created an xml file of a little over 10 GB, filled with records like your example, and set the slice size to 45 mb. On my not-so-recent system (Pentium Dual-Core CPU E6700 @ 3.20GHz × 2), the script's analysis produced the following:
analyzing file...
checking file size...
file size: 10767 mb
calculating number of slices...
239 slices of 45 mb
checking number of lines...
number of lines: 246236399
checking number of records...
number of records: 22386000
calculating number records per section ...
records per section: 93665
Then it started creating slices of 45 mb, taking approximately 25-27 seconds per slice.
creating slice 1
creating slice 2
creating slice 3
creating slice 4
creating slice 5
and so on...
The processor was occupied for about 45-50 % during the process, and the script used ~850-880 MB of my 4 GB of memory. The computer remained reasonably usable throughout. The whole procedure took about an hour and a half; on a more recent system it should take substantially less time.
The script
#!/usr/bin/env python3
import os
#---
file = "/path/to/big/file.xml"
out_dir = "/path/to/save/slices"
size_ofslices = 45  # in mb
identifying_string = "</record>"
#---
line_number = -1
records = [0]

# analyzing file -------------------------------------------
print("analyzing file...\n")
# size in mb
print("checking file size...")
size = int(os.stat(file).st_size/1000000)
print("file size:", size, "mb")
# number of sections
print("calculating number of slices...")
sections = int(size/size_ofslices)
print(sections, "slices of", size_ofslices, "mb")
# misc. data
print("checking number of lines...")
with open(file) as src:
    for line in src:
        line_number = line_number+1
        if identifying_string in line:
            records.append(line_number)
# last index (number of lines -1)
ns_oflines = line_number
print("number of lines:", ns_oflines)
# number of records
print("checking number of records...")
ns_records = len(records)-1
print("number of records:", ns_records)
# records per section
print("calculating number records per section ...")
ns_recpersection = int(ns_records/sections)
print("records per section:", ns_recpersection)

# preparing data -------------------------------------------
# dividing records (indexes of) into slices
rec_markers = [i for i in range(ns_records) if i % ns_recpersection == 0]+[ns_records]
# dividing lines (indexes of) into slices
line_markers = [records[i] for i in rec_markers]
# make the last section run until the last line of the file
line_markers[-1] = ns_oflines; line_markers.pop(-2)

# creating sections ----------------------------------------
sl = 1
line_number = 0
curr_marker = line_markers[sl]
outfile = out_dir+"/"+"slice_"+str(sl)+".txt"

def writeline(outfile, line):
    # append a single line to the current slice file
    with open(outfile, "a") as out:
        out.write(line)

with open(file) as src:
    print("creating slice", sl)
    for line in src:
        if line_number <= curr_marker:
            writeline(outfile, line)
        else:
            sl = sl+1
            curr_marker = line_markers[sl]
            outfile = out_dir+"/"+"slice_"+str(sl)+".txt"
            print("creating slice", sl)
            writeline(outfile, line)
        line_number = line_number+1
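The "preparing data" step above is easiest to see with a toy run using made-up numbers (the real values come from the analysis pass):

# toy illustration of the marker logic, with made-up numbers
records = [0, 3, 7, 11, 15, 19, 23]    # line index of every line containing </record> (plus the 0 placeholder)
ns_records = len(records) - 1           # 6 records
ns_recpersection = 2                    # records that should go into each slice
rec_markers = [i for i in range(ns_records) if i % ns_recpersection == 0] + [ns_records]
line_markers = [records[i] for i in rec_markers]
print(line_markers)                     # [0, 7, 15, 23]: each slice ends on one of these lines
# the script then replaces the last marker with the last line index of the file and drops the
# marker before it, so the final slice runs to the end of the file instead of being undersized

Every slice therefore ends exactly on a line containing </record>, which is what makes the output usable for further processing.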
How to use
Copy the script into an empty file, set the path to your "big file", the path to the directory to save the slices, and the size of the slices. Save it as slice.py, make it executable, and run it with the command:
/path/to/slice.py
Notes
- The big file's size should exceed the slice size by at least a few times; the bigger the difference, the more reliable the size of the (output) slices will be.
- The assumption is that the average record size (seen in the bigger picture) is roughly the same throughout the file. Given the huge amount of data, that should be an acceptable assumption, but you'll have to check it by looking at whether the slice sizes differ a lot (a quick check like the sketch below can help).
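If you want to verify the result afterwards (as suggested in the last note), a small sketch like the one below reports each slice's size and whether its last non-empty line is </record>; the directory is an assumption, point it at the out_dir you used:

# verification sketch (directory is an assumption; adjust to your out_dir)
import glob
import os

for slice_file in glob.glob("/path/to/save/slices/slice_*.txt"):
    size_mb = os.path.getsize(slice_file) / 1000000
    last = ""
    with open(slice_file) as f:
        for line in f:                  # stream the slice, so memory use stays low
            if line.strip():
                last = line.strip()
    status = "ok" if last == "</record>" else "WARNING: does not end with </record>"
    print(slice_file, round(size_mb, 1), "mb,", status)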
Wow. I studied your answer excitedly. My system is fully occupied until tomorrow; then I will try it and report back, of course. Thank you so much!
– Michael Käfer
Nov 25 '14 at 12:43
This is exactly what I was looking for. The result is perfect! On a slightly slower system, the script took 101 minutes to run. Thank you very, very much!!
– Michael Käfer
Nov 26 '14 at 19:54
@mischa004 Glad it works! It was fun working on it.
– Jacob Vlijm
Nov 26 '14 at 19:55