Split 10GB text file 1) with a minimum size of the output files of 40MB and 2) after a specific string (</record>)
I have a big text file (10 GB, .xml) that contains over one million tags like this: <record>text</record>. I split it into parts in order to work with it, but to automate my workflow every part has to end with a specific tag, </record>, and every part has to be at least about 40 MB in size.
Do you have an idea of the average line length? And does it vary a lot (looking at it on a large scale)?
– Jacob Vlijm
Nov 21 '14 at 13:26
Line lengths start at 9 characters and the maximum is about 120 characters. All the records in this .xml file are similar to this example:
– Michael Käfer
Nov 21 '14 at 15:38
<record xmlns="http://www.loc.gov/MARC21/slim" type="Authority"> <leader>02379nz a2200517o 4500</leader> <controlfield tag="001">100169600</controlfield> <controlfield tag="003">DE-101</controlfield> <controlfield tag="005">20080405154152.0</controlfield> <controlfield tag="008">900615n||aznnnabbn | aaa |c</controlfield> <datafield tag="024" ind1="7" ind2=" "> <subfield code="a">http://d-nb.info/gnd/100169600</subfield> <subfield code="2">uri</subfield> </datafield> </record>
– Michael Käfer
Nov 21 '14 at 15:39
One more question: when cut into pieces, is the last line of a "division" exactly "</record>", or does the last line merely contain "</record>"? The first would make it much easier; otherwise a line would have to be split across two sections, which is a handicap when dealing with huge files.
– Jacob Vlijm
Nov 21 '14 at 20:26
</record> is the last line of every division (no characters before or after it). Do you have a solution?
– Michael Käfer
Nov 22 '14 at 15:23
1 Answer
The script below slices a (large) file into pieces. I didn't use the split command, since the content of your file has to be "rounded" to whole records. You can set the size of the slices in the head section of the script.
The procedure
Difficulties
Because the script has to deal with huge files, Python's read() or readlines() cannot be used; they would attempt to load the whole file into memory at once, which would certainly choke your system. At the same time the divisions have to be "rounded" to whole records, so the script somehow has to identify or "read" the file's content.
What seems to be the only option is to use:
with open(file) as src:
    for line in src:
which reads the file line by line.
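As a minimal, self-contained sketch of this streaming pattern (the path and the tag are taken from the question; adjust the path to your file), counting the records this way keeps memory use flat no matter how big the file is:

# minimal sketch: count the records without loading the whole file into memory
count = 0
with open("/path/to/big/file.xml") as src:
    for line in src:                    # the file is read one line at a time
        if "</record>" in line:
            count += 1
print("records:", count)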
Approach
In the script I chose a two-step approach:
- Analyze the file (size, number of slices, number of lines, number of records, records per section), then create a list of sections or "markers" (by line index).
- Read the file again, now allocating the lines to the separate slice files.
Appending the lines to the separate slices (files) one by one seems inefficient, but of everything I tried it turned out to be the most efficient, fastest and least memory-consuming option.
How I tested
I created an xml file of a little over 10 GB, filled with records like your example, and set the slice size to 45 mb. On my not-so-recent system (Pentium Dual-Core CPU E6700 @ 3.20GHz × 2), the script's analysis produced the following:
analyzing file...
checking file size...
file size: 10767 mb
calculating number of slices...
239 slices of 45 mb
checking number of lines...
number of lines: 246236399
checking number of records...
number of records: 22386000
calculating number records per section ...
records per section: 93665
Then it started creating slices of 45 mb, taking approximately 25-27 seconds per slice.
creating slice 1
creating slice 2
creating slice 3
creating slice 4
creating slice 5
and so on...
The processor was occupied for about 45-50 % during the process, and the script used ~850-880 MB of my 4 GB of memory. The computer remained reasonably usable throughout. The whole procedure took about an hour and a half; on a more recent system it should take substantially less time.
The script
#!/usr/bin/env python3
import os
#---
file = "/path/to/big/file.xml"
out_dir = "/path/to/save/slices"
size_ofslices = 45  # in mb
identifying_string = "</record>"
#---
line_number = -1
records = [0]

# analyzing file -------------------------------------------
print("analyzing file...\n")
# size in mb
print("checking file size...")
size = int(os.stat(file).st_size/1000000)
print("file size:", size, "mb")
# number of sections
print("calculating number of slices...")
sections = int(size/size_ofslices)
print(sections, "slices of", size_ofslices, "mb")
# misc. data
print("checking number of lines...")
with open(file) as src:
    for line in src:
        line_number = line_number+1
        if identifying_string in line:
            records.append(line_number)
# last index (number of lines -1)
ns_oflines = line_number
print("number of lines:", ns_oflines)
# number of records
print("checking number of records...")
ns_records = len(records)-1
print("number of records:", ns_records)
# records per section
print("calculating number records per section ...")
ns_recpersection = int(ns_records/sections)
print("records per section:", ns_recpersection)

# preparing data -------------------------------------------
# dividing records (indexes of) into slices
rec_markers = [i for i in range(ns_records) if i % ns_recpersection == 0]+[ns_records]
# dividing lines (indexes of) into slices
line_markers = [records[i] for i in rec_markers]
# make the last section run until the last line of the file
line_markers[-1] = ns_oflines; line_markers.pop(-2)

# creating sections ----------------------------------------
sl = 1
line_number = 0
curr_marker = line_markers[sl]
outfile = out_dir+"/"+"slice_"+str(sl)+".txt"

def writeline(outfile, line):
    # append a single line to the current slice file
    with open(outfile, "a") as out:
        out.write(line)

with open(file) as src:
    print("creating slice", sl)
    for line in src:
        if line_number <= curr_marker:
            writeline(outfile, line)
        else:
            sl = sl+1
            curr_marker = line_markers[sl]
            outfile = out_dir+"/"+"slice_"+str(sl)+".txt"
            print("creating slice", sl)
            writeline(outfile, line)
        line_number = line_number+1
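The "preparing data" step above is easiest to see with a toy run using made-up numbers (the real values come from the analysis pass):

# toy illustration of the marker logic, with made-up numbers
records = [0, 3, 7, 11, 15, 19, 23]    # line index of every line containing </record> (plus the 0 placeholder)
ns_records = len(records) - 1           # 6 records
ns_recpersection = 2                    # records that should go into each slice
rec_markers = [i for i in range(ns_records) if i % ns_recpersection == 0] + [ns_records]
line_markers = [records[i] for i in rec_markers]
print(line_markers)                     # [0, 7, 15, 23]: each slice ends on one of these lines
# the script then replaces the last marker with the last line index of the file and drops the
# marker before it, so the final slice runs to the end of the file instead of being undersized

Every slice therefore ends exactly on a line containing </record>, which is what makes the output usable for further processing.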
How to use
Copy the script into an empty file, set the path to your "big file", the path to the directory to save the slices, and the size of the slices. Save it as slice.py, make it executable, and run it with the command:
/path/to/slice.py
Notes
- The big file's size should exceed the slice size by at least a few times; the bigger the difference, the more reliable the size of the (output) slices will be.
- The assumption is that the average record size (seen in the bigger picture) is roughly the same throughout the file. Given the huge amount of data, that should be an acceptable assumption, but you'll have to check it by looking at whether the slice sizes differ a lot (a quick check like the sketch below can help).
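If you want to verify the result afterwards (as suggested in the last note), a small sketch like the one below reports each slice's size and whether its last non-empty line is </record>; the directory is an assumption, point it at the out_dir you used:

# verification sketch (directory is an assumption; adjust to your out_dir)
import glob
import os

for slice_file in glob.glob("/path/to/save/slices/slice_*.txt"):
    size_mb = os.path.getsize(slice_file) / 1000000
    last = ""
    with open(slice_file) as f:
        for line in f:                  # stream the slice, so memory use stays low
            if line.strip():
                last = line.strip()
    status = "ok" if last == "</record>" else "WARNING: does not end with </record>"
    print(slice_file, round(size_mb, 1), "mb,", status)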
Wow. I studied your answer excitedly. My system is fully occupied until tomorrow; then I will try it and report back, of course. Thank you so much!
– Michael Käfer
Nov 25 '14 at 12:43
This is exactly what I was looking for. The result is perfect! On a slightly slower system, the script took 101 minutes to run. Thank you very, very much!!
– Michael Käfer
Nov 26 '14 at 19:54
@mischa004 Glad it works! It was fun working on it.
– Jacob Vlijm
Nov 26 '14 at 19:55