Split 10 GB text file 1) with a minimum output file size of 40 MB and 2) after a specific string (</record>)












I have a big text file (10 GB, .xml, containing over 1 million tags like this: <record>text</record>) which I split into parts in order to work with it. But to be able to automate my workflow, every part has to end with a specific tag, </record>, and every part also has to be at least about 40 MB in size.










Tag: split

asked Nov 21 '14 at 11:35 by Michael Käfer, edited Jan 30 at 19:04

























  • Do you have an idea of the average line length? And does it vary a lot (looking at it on a large scale)?

    – Jacob Vlijm
    Nov 21 '14 at 13:26











  • I found line lengths starting from 9 characters; the maximum is about 120 characters. All the records in this .xml file are similar to this example:

    – Michael Käfer
    Nov 21 '14 at 15:38











  • <record xmlns="http://www.loc.gov/MARC21/slim" type="Authority"> <leader>02379nz a2200517o 4500</leader> <controlfield tag="001">100169600</controlfield> <controlfield tag="003">DE-101</controlfield> <controlfield tag="005">20080405154152.0</controlfield> <controlfield tag="008">900615n||aznnnabbn | aaa |c</controlfield> <datafield tag="024" ind1="7" ind2=" "> <subfield code="a">http://d-nb.info/gnd/100169600</subfield> <subfield code="2">uri</subfield> </datafield> </record>

    – Michael Käfer
    Nov 21 '14 at 15:39













  • One more question: is the last line of a "division" (when cut into pieces) exactly "</record>", or does the last line of a section merely contain "</record>"? The first would make it much easier; otherwise a line would have to be split across two sections, which is a handicap when dealing with huge files.

    – Jacob Vlijm
    Nov 21 '14 at 20:26













  • </record> is the last line of every division (no characters before or after it). Do you have a solution?

    – Michael Käfer
    Nov 22 '14 at 15:23







































1 Answer


















The script below cuts a (large) file into slices. I didn't use the split command, since the content of your file has to be "rounded off" at whole records. You can set the size of the slices in the head section of the script.



The procedure



Difficulties

Because the script has to deal with huge files, Python's read() or readlines() cannot be used; the script would attempt to load the whole file into memory at once, which would certainly choke your system. At the same time, the divisions have to be made by "rounding off" sections at a whole record, so the script still has to be able to identify and "read" the file's content.



What seems to be the only option is to use:



with open(file) as src:
    for line in src:


which reads the file line by line.



Approach

In the script I chose a two-step approach:




  1. Analyze the file (size, number of slices, number of lines, number of records, records per section), then create a list of section "markers" (by line index).

  2. Read the file again, now allocating the lines to the separate slice files.


Appending the lines to the separate slices (files) one by one seems inefficient, but of everything I tried it turned out to be the fastest and least memory-consuming option. A sketch of a single-pass alternative is shown below for comparison.
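For comparison only — this is a minimal sketch of a single-pass alternative, not the method benchmarked here: it keeps the current slice file open, counts what has been written, and starts a new slice at the first </record> line after a size threshold. The paths and the 40 MB threshold are placeholders you would have to adapt.

#!/usr/bin/env python3
# Hypothetical single-pass variant: roll over to a new slice at the first
# </record> line once at least min_size characters have been written
# (the character count approximates bytes for this largely ASCII data).
import os

src_file = "/path/to/big/file.xml"    # placeholder
out_dir = "/path/to/save/slices"      # placeholder
min_size = 40 * 1000 * 1000           # ~40 MB minimum per slice
end_tag = "</record>"

slice_no, written, out = 1, 0, None
with open(src_file) as src:
    for line in src:
        if out is None:
            # open the next slice lazily, so no empty trailing file is created
            out = open(os.path.join(out_dir, "slice_%d.txt" % slice_no), "w")
            written = 0
        out.write(line)
        written += len(line)
        # only cut at a record boundary, and only after the size threshold
        if written >= min_size and line.strip() == end_tag:
            out.close()
            out = None
            slice_no += 1
    if out is not None:
        out.close()

With this variant every slice except possibly the last one is at least min_size, whereas the two-pass script below aims at slices of roughly equal size.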



How I tested

I created an XML file of a little over 10 GB, filled with records like your example, and set the slice size to 45 MB. On my not-so-recent system (Pentium Dual-Core CPU E6700 @ 3.20 GHz × 2), the script's analysis produced the following:



analyzing file...

checking file size...
file size: 10767 mb
calculating number of slices...
239 slices of 45 mb
checking number of lines...
number of lines: 246236399
checking number of records...
number of records: 22386000
calculating number records per section ...
records per section: 93665


It then started creating slices of 45 MB, taking approximately 25-27 seconds per slice.



creating slice 1
creating slice 2
creating slice 3
creating slice 4
creating slice 5


and so on...



The processor was 45-50% occupied during the process, and the script used ~850-880 MB of my 4 GB of memory. The computer remained reasonably usable throughout.



The whole procedure took an hour and a half. On a more recent system it should take substantially less time.



The script



#!/usr/bin/env python3

import os

#--- set these
file = "/path/to/big/file.xml"
out_dir = "/path/to/save/slices"
size_ofslices = 45  # in MB
identifying_string = "</record>"
#---

line_number = -1
records = [0]

# analyzing file -------------------------------------------

print("analyzing file...\n")
# size in MB
print("checking file size...")
size = int(os.stat(file).st_size/1000000)
print("file size:", size, "mb")
# number of sections
print("calculating number of slices...")
sections = int(size/size_ofslices)
print(sections, "slices of", size_ofslices, "mb")
# misc. data
print("checking number of lines...")
with open(file) as src:
    for line in src:
        line_number = line_number+1
        if identifying_string in line:
            records.append(line_number)
# last index (number of lines - 1)
ns_oflines = line_number
print("number of lines:", ns_oflines)
# number of records
print("checking number of records...")
ns_records = len(records)-1
print("number of records:", ns_records)
# records per section
print("calculating number records per section ...")
ns_recpersection = int(ns_records/sections)
print("records per section:", ns_recpersection)

# preparing data -------------------------------------------

# record indexes at which a new slice starts
rec_markers = [i for i in range(ns_records) if i % ns_recpersection == 0]+[ns_records]
# corresponding line indexes
line_markers = [records[i] for i in rec_markers]
# let the last section run until the last line of the file
line_markers[-1] = ns_oflines; line_markers.pop(-2)

# creating sections ----------------------------------------

sl = 1
line_number = 0

curr_marker = line_markers[sl]
outfile = out_dir+"/"+"slice_"+str(sl)+".txt"

def writeline(outfile, line):
    # append a single line to the current slice file
    with open(outfile, "a") as out:
        out.write(line)

with open(file) as src:
    print("creating slice", sl)
    for line in src:
        if line_number <= curr_marker:
            writeline(outfile, line)
        else:
            # passed the marker: move on to the next slice
            sl = sl+1
            curr_marker = line_markers[sl]
            outfile = out_dir+"/"+"slice_"+str(sl)+".txt"
            print("creating slice", sl)
            writeline(outfile, line)
        line_number = line_number+1


How to use



Copy the script into an empty file and, in its head section, set the path to your big file, the directory to save the slices in, and the size of the slices. Save it as slice.py, make it executable (chmod +x slice.py), and run it with the command:



/path/to/slice.py


Notes




  • The big file's size should exceed the slice size by at least a few times. The bigger the difference, the more reliable the size of the (output) slices will be.

  • The script assumes that the average record size (seen in the bigger picture) is about the same throughout the file. Given the huge amount of data, that should be an acceptable assumption, but you'll have to check it by looking for big differences in the sizes of the resulting slices (a small check is sketched below).
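To do that check, here is a minimal sketch (assuming the slices were written into out_dir as slice_<n>.txt, as the script above does) that prints the size of each slice:

#!/usr/bin/env python3
# Print the size of every slice file in out_dir (lexicographic order).
import glob
import os

out_dir = "/path/to/save/slices"   # same directory as set in the script above

for path in sorted(glob.glob(os.path.join(out_dir, "slice_*.txt"))):
    print(path, round(os.path.getsize(path) / 1000000, 1), "MB")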






answered Nov 24 '14 at 9:33 by Jacob Vlijm, edited Nov 26 '14 at 23:13


























  • Wow. I studied your answer excitedly. My system is fully occupied until tomorrow; then I will try it and report back, of course. Thank you so much!

    – Michael Käfer
    Nov 25 '14 at 12:43











  • This is exactly what I was looking for. The result is perfect! On a slightly slower system, the script took 101 minutes to execute. Thank you very, very much!!

    – Michael Käfer
    Nov 26 '14 at 19:54











  • @mischa004 Glad it works! It was fun working on it.

    – Jacob Vlijm
    Nov 26 '14 at 19:55












