Sed script crashing on big file
I have a shell script which is in essence a sed script with some checks. The goal of the script is to convert the header of a file from:
&FCI
NORB=280,
NELEC=78,
MS2=0,
UHF=.FALSE.,
ORBSYM=1,1,1,1,1,1,1,1,<...>
&END
1.48971678130072078261E+01 1 1 1 1
-1.91501428271686324756E+00 1 1 2 1
4.38796949990802698238E+00 1 1 2 2
to
&FCI NORB=280, NELEC=78, MS2=0, UHF=.FALSE.,
ORBSYM=1,1,1,1,1,1,1,1,<...>
ISYM=1,
/
1.48971678130072078261E+01 1 1 1 1
-1.91501428271686324756E+00 1 1 2 1
4.38796949990802698238E+00 1 1 2 2
This is the script:
#!/bin/bash
# $1 : FCIDUMP file to convert from "new format" to "old format"
if [ ${#} -ne 1 ]
then
    echo "Syntaxis: fcidump_new2old FCIDUMPFILE" 1>&2
    exit 1
fi
if egrep '&FCI ([a-zA-Z2 ]*=[0-9 ]*,){2,}' ${1} > /dev/null
then
    echo "The provided file is already in old FCIDUMP format." 1>&2
    exit 2
fi
sed '
1,20 {
 :a; N; $!ba
 s/\(=[^,]*,\)\n/\1 /g
 s/\(&FCI\)\n/\1 /
 s/ORBSYM/\n&/g
 s/&END/ISYM=1,\n\//
}' -i "${1}"
exit 0
This script works for "small" files, but now I encountered a file of approximately 9 GB, and the script crashes with the "super clear error message":
script.sh: line 24: 406089 Killed sed '
1,20 {
 :a; N; $!ba
 s/\(=[^,]*,\)\n/\1 /g
 s/\(&FCI\)\n/\1 /
 s/ORBSYM/\n&/g
 s/&END/ISYM=1,\n\//
}' -i "${1}"
How can I make this sed script really look only at the header, so that it can handle such big files? The ugly hardcoded "20" is there, by the way, because I do not know anything better.
Extra info:
After trying some things I saw that strange files were produced: sedexG4Lg, sedQ5olGZ, sedXVma1Y, sed21enyi, sednzenBn, sedqCeeey, sedzIWMUi. All were empty except sednzenBn, which was like the input file but only about half of it.
Discarding the -i flag and redirecting the output to another file gives an empty file.
command-line text-processing sed
It could be that your stack size isn't large enough to handle a file that size in sed. See: gnu.org/software/sed/manual/html_node/Limitations.html. To view your stack size run ulimit -s, or to see all limits run ulimit -a.
– Terrance
Dec 14 at 21:27
edited Dec 15 at 1:33
muru
asked Dec 14 at 20:43
Josja
2 Answers
General method
- You can split each file into a header and a second file with the data lines
- Then you can easily edit a header separately with your current sed command
- Finally you can concatenate the header and the file with the data lines.
Light-weight tools to manage huge files
- You can use head and tail to create a head file and a data file. You can use cat to concatenate the modified head file and the data file. See: Efficient way to print lines from a massive file using awk, sed, or something else?
- Another method is to use split.
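The bullet points above can be sketched end to end on a tiny stand-in file (all the /tmp file names and the sed edit here are placeholders, not the real conversion):

```shell
#!/bin/sh
set -e

# Tiny stand-in for an FCIDUMP file: 3 header lines, 1 data line
printf '&FCI\nNORB=2,\n&END\n1.0 1 1 1 1\n' > /tmp/demo.in

# 1. Locate the last line of the header
end=$(grep -m 1 -n '&END' /tmp/demo.in | cut -d: -f1)

# 2. Split: a small header file and a (potentially huge) data file
head -n "$end" /tmp/demo.in > /tmp/demo.header
tail -n +"$((end + 1))" /tmp/demo.in > /tmp/demo.data

# 3. Edit only the header (a stand-in edit), then concatenate
sed -i 's/&END/\//' /tmp/demo.header
cat /tmp/demo.header /tmp/demo.data > /tmp/demo.out
cat /tmp/demo.out
```

The point is that sed only ever sees the small header file; tail and cat stream the data part without loading it into memory.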
Test
I tested with your header and a file with 1080000000 numbered lines (size 19 GiB), 1080000007 lines in total, and it worked; the output file (with 1080000004 lines) was written in 5 minutes on my old HP xw8400 workstation (including typing the command to start the shellscript).
$ ls -lh --time-style=full-iso huge*
-rw-r--r-- 1 sudodus sudodus 19G 2018-12-15 19:50:45.278328120 +0100 huge.in
-rw-r--r-- 1 sudodus sudodus 19G 2018-12-15 19:55:46.808798456 +0100 huge.out
The big write operations were between the system partition on an SSD and a data partition on an HDD.
Shellscript
You need enough free space in the file system where you have /tmp for the huge temporary 'data' file: more than 9 GB, according to your original question.
$ LANG=C df -h /tmp
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 106G 32G 69G 32% /
This may seem an awkward way to do things, but it works for huge files without crashing the tools. You may have to store the temporary 'data' file somewhere else, for example on an external drive (but that will probably be slower).
#!/bin/bash
# $1 : FCIDUMP file to convert from "new format" to "old format"
if [ $# -ne 2 ]
then
    echo "Syntaxis: $0 fcidumpfile oldstylefile" 1>&2
    echo "Example:  $0 file.in file.out" 1>&2
    exit 1
fi
if [ "$1" == "$2" ]
then
    echo "The names of the input file and output file must differ"
    exit 2
fi
endheader="$(grep -m 1 -n '&END' "$1" | cut -d: -f1)"
if [ "$endheader" == "" ]
then
    echo "Bad input file: the end marker of the header was not found"
    exit 3
fi
#echo "endheader=$endheader"
< "$1" head -n "$endheader" > /tmp/header
#cat /tmp/header
if egrep '&FCI ([a-zA-Z2 ]*=[0-9 ]*,){2,}' /tmp/header > /dev/null
then
    echo "The provided file is already in old FCIDUMP format." 1>&2
    exit 4
fi
# run sed in-place on /tmp/header
sed '
{
 :a; N; $!ba
 s/\(=[^,]*,\)\n/\1 /g
 s/\(&FCI\)\n/\1 /
 s/ORBSYM/\n&/g
 s/&END/ISYM=1,\n\//
}' -i /tmp/header
if [ $? -ne 0 ]
then
    echo "Failed to convert the header format in /tmp/header"
    exit 5
fi
< "$1" tail -n +$(($endheader+1)) > /tmp/tailer
if [ $? -ne 0 ]
then
    echo "Failed to create the 'data' file /tmp/tailer"
    exit 6
fi
#echo "---"
#cat /tmp/tailer
#echo "---"
cat /tmp/header /tmp/tailer > "$2"
exit 0
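To check what the header conversion does in isolation, the sed program can be run on a sample header alone. This is a sketch: the sample header is shortened from the question, and the sed escapes are written out explicitly.

```shell
#!/bin/sh
set -e

# Sample "new format" header from the question (ORBSYM list shortened)
printf '&FCI\nNORB=280,\nNELEC=78,\nMS2=0,\nUHF=.FALSE.,\nORBSYM=1,1,1,1,\n&END\n' > /tmp/sample.header

# Slurp the header into one pattern space, join "key=value," lines onto
# the &FCI line, break before ORBSYM, and turn &END into "ISYM=1," + "/"
sed '
{
 :a; N; $!ba
 s/\(=[^,]*,\)\n/\1 /g
 s/\(&FCI\)\n/\1 /
 s/ORBSYM/\n&/g
 s/&END/ISYM=1,\n\//
}' -i /tmp/sample.header

cat /tmp/sample.header
```

The output should be the old-format header: the &FCI line with the key=value pairs joined, ORBSYM on its own line, then ISYM=1, and the closing /.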
Thanks for the answer. Concerning the option to use 'split', I also discovered 'csplit', which is more flexible than 'split'.
– Josja
Dec 16 at 21:56
sed is probably NOT the best tool for this; investigate perl. However, you could restate the problem as:
Extract the Old Header from the giant data file, into a file of its own.
Adjust the extracted Old Header, to make it the New Header.
Replace the Old Header with the New Header in the giant data file.
endheader="$(grep -m 1 -n '&END' "$1" | cut -d: -f1)"
head -n "$endheader" "$1" >/tmp/header
trap "/bin/rm -f /tmp/header" EXIT
# do the sed stuff to /tmp/header, I assume it does what you want
sed '
{
 :a; N; $!ba
 s/\(=[^,]*,\)\n/\1 /g
 s/\(&FCI\)\n/\1 /
 s/ORBSYM/\n&/g
 s/&END/ISYM=1,\n\//
}' -i /tmp/header
# Then combine the new header with the rest of the giant data file,
# using `ed` (see `man ed; info Ed`) and a here-document
ed "$1" <<EndOfEd
1,${endheader}d
0r /tmp/header
wq
EndOfEd
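The ed step can be rehearsed on a throwaway file first (the file names here are made up): delete the old header lines, read the new header in at line 0, then write and quit. Note that ed commands, unlike vi's, take no leading colon.

```shell
#!/bin/sh
set -e

# Throwaway file with a two-line "old header" plus data, and a new header
printf 'OLD1\nOLD2\nDATA\n' > /tmp/ed_demo.txt
printf 'NEWHEADER\n'        > /tmp/ed_demo.header

# 1,2d deletes the old header; 0r reads the new header in before line 1;
# wq writes the file back and quits. -s suppresses ed's byte-count chatter.
ed -s /tmp/ed_demo.txt <<'EndOfEd'
1,2d
0r /tmp/ed_demo.header
wq
EndOfEd

cat /tmp/ed_demo.txt
```

Quoting the here-document delimiter ('EndOfEd') keeps the shell from expanding anything inside; in the real script the delimiter is unquoted so that ${endheader} is substituted.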
edited Dec 15 at 19:21
answered Dec 14 at 22:07
sudodus
edited yesterday
answered Dec 14 at 22:00
waltinator