Work flow automation with bash script?


Re: Work flow automation with bash script?

Post by globetrotterdk » 27. Feb 2012, 09:56

Shador wrote: Actually you can merge the two steps like this:
Code:
lynx -dump -force-html some_document.html | sed -n 's/^ *[0-9]*\. //p' | fgrep "mailto:" | sed -e 's/\n/, /g' > some_document.txt

That doesn't work for me. The file gets created, but this part has no effect for some reason:
Code:
sed -e 's/\n/, /g'
The output still has "mailto:" and the e-mail addresses in a column:
mailto:abc@humanrights.dk
mailto:def@bees.com
mailto:ghi@mail.dk
Military justice is to justice what military music is to music. - Groucho Marx

Re: Work flow automation with bash script?

Post by gapan » 27. Feb 2012, 10:02

sed won't work like that. It reads its input one line at a time and strips the trailing newline before applying the expression, so s/\n/, /g never sees a newline to match. You can use tr instead:
Code:
tr "\n" ", "
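
One caveat with the tr suggestion: tr maps single characters to single characters, so (at least with GNU tr) only the comma from ", " is used, no space is inserted, and the final newline becomes a trailing comma. A minimal sketch of an alternative, assuming the standard paste and sed utilities; the function name join_lines is made up for illustration:

```shell
#!/bin/sh
# join_lines (hypothetical helper): read lines on stdin and print them on a
# single line separated by ", ".
join_lines() {
    # paste -s joins all input lines into one; -d, puts a comma between them
    # and leaves no trailing separator. sed then widens each "," to ", ".
    paste -sd, - | sed 's/,/, /g'
}

# Sample addresses taken from the post above.
printf '%s\n' abc@humanrights.dk def@bees.com ghi@mail.dk | join_lines
```

This relies on the addresses themselves containing no commas, which holds for the sample data in this thread.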

Re: Work flow automation with bash script?

Post by Shador » 27. Feb 2012, 10:46

gapan wrote: sed won't work like that. It reads its input one line at a time and strips the trailing newline before applying the expression, so s/\n/, /g never sees a newline to match. You can use tr instead:
Code:
tr "\n" ", "

Yes, you're right. I didn't look closely enough. For sed it is:
Code:
sed -e ':a;N;$!ba;s/\n/, /g'
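
For anyone puzzling over that one-liner, here is the same logic spelled out with one command per -e, as a sketch rather than a definitive reading. The function name join_with_sed is invented for this example:

```shell
#!/bin/sh
# join_with_sed (hypothetical helper): same logic as ':a;N;$!ba;s/\n/, /g'.
#   :a          define a label named "a"
#   N           append the next input line to the pattern space, keeping the
#               newline between the two lines
#   $!ba        if this is not the last line, branch back to label "a"
#   s/\n/, /g   once all lines are collected, replace every embedded newline
join_with_sed() {
    sed -e ':a' \
        -e 'N' \
        -e '$!ba' \
        -e 's/\n/, /g'
}

printf '%s\n' abc@humanrights.dk def@bees.com ghi@mail.dk | join_with_sed
```

The loop pulls the whole input into the pattern space before substituting, which is why the newlines become visible to the s command here.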

Re: Work flow automation with bash script?

Post by mimosa » 27. Feb 2012, 12:43

The stripping of the "mailto:" prefix also doesn't seem to be working.

It would be quite easy to write a small sub-script that extracted well-formed email addresses from any file format reasonably close to text, and then glued them together with commas. This would be more robust and capable of accommodating changes in the organisation's workflow (such as if they stop using Google docs).
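
A rough shell sketch of that idea, assuming a grep that supports -o (GNU and BSD grep do). The function name extract_emails and the sample input are invented, and the pattern is deliberately loose rather than a full address validator:

```shell
#!/bin/sh
# extract_emails (hypothetical helper): print every address-like string found
# on stdin, de-duplicated and joined with ", " on a single line.
extract_emails() {
    # -o prints only the matched text; -E enables extended regular expressions.
    # ":" is not in the character class, so a "mailto:" prefix is left behind.
    grep -oE '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}' |
        sort -u |
        paste -sd, - |
        sed 's/,/, /g'
}

# Invented sample: one plain address in a table cell, one mailto link.
printf '%s\n' '<td>abc@humanrights.dk</td>' \
    '<a href="mailto:def@bees.com">def@bees.com</a>' | extract_emails
```

Because it matches on the shape of an address rather than on the surrounding markup, the same function should cope with plain text, HTML, or CSV input.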

Re: Work flow automation with bash script?

Post by gapan » 27. Feb 2012, 14:25

To remove the "mailto:" part, you can run another sed, just after the fgrep "mailto:".
Code:
sed "s/^mailto://"

The ^ anchors the match to the start of the line. It might not make any difference here, but it's not doing any harm either.
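
Putting the pieces from this thread together, the whole chain might look like the sketch below. It assumes the link references in the lynx dump appear as numbered lines such as "   1. mailto:abc@humanrights.dk"; the function name clean_links and the sample data are made up:

```shell
#!/bin/sh
# clean_links (hypothetical helper): turn the numbered link list of a lynx
# dump (read on stdin) into one comma-separated line of addresses.
clean_links() {
    sed -n 's/^ *[0-9]*\. //p' |   # keep numbered reference lines, drop the number
        grep -F 'mailto:' |        # same as fgrep: keep only the mailto links
        sed 's/^mailto://' |       # strip the "mailto:" prefix
        paste -sd, - |             # join the lines with commas
        sed 's/,/, /g'             # widen "," to ", "
}

# Invented sample shaped like the References section of a lynx dump.
printf '%s\n' 'References' \
    '   1. mailto:abc@humanrights.dk' \
    '   2. http://example.org/' \
    '   3. mailto:def@bees.com' | clean_links
```

In real use, the output of the lynx -dump command from earlier in the thread would be piped into clean_links and redirected to a file.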

Re: Work flow automation with bash script?

Post by mimosa » 28. Feb 2012, 00:48

I expect you've solved your problem by now :)

Just for fun, though, here's a Python script I've written to extract the email addresses from their surrounding padding and stick them back together again. From the command line, you would do:

./bcc.py some_document.some_format


You should find a file bcc.txt with the cleaned up emails in the directory you called the script from. In this case, your shell script would go something like:
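Code:
#!/bin/sh
# Work in the directory where bcc.txt should end up
cd /home/globetrotter/path/to/directory || exit 1
# Download the document (arguments left elided, as in the original post)
google docs get ... [whatever] some.document
# Extract the addresses; bcc.py must be executable and in $PATH
bcc.py some.document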


I haven't tested it much because I don't have a convenient sample. I should also stress that this probably isn't the best Python style, as I'm just starting out with Python. If you want to try it out, put it somewhere in your $PATH (such as /usr/local/bin) and make it executable. :)

http://pastebin.com/kvHSg3DQ

Re: Work flow automation with bash script?

Post by globetrotterdk » 3. Mar 2012, 23:35

Thanks for the postings. I have to admit that this is a bit over my head. I have been trying to do some research on the issue, but to make matters worse, I have found out that the woman in charge of maintaining the spreadsheet with the member data can't figure out how to take an e-mail address in a cell and convert it to a "mailto:" hyperlink in Google Docs. I have sent her the necessary documentation and explained how to do it in practice, but it hasn't helped:
http://support.google.com/docs/bin/answer.py?hl=en&answer=44660
Code:
=hyperlink("ab@jura.dk")
This means that I have the immediate problem of extracting those e-mail addresses that aren't formatted as "mailto:" hyperlinks from the lynx HTML dump. I extracted what I thought were all of the e-mail addresses, only to find out afterwards that half of them aren't formatted as "mailto:" hyperlinks. This has to be a quick and dirty solution due to time constraints: I still have to send out the outstanding invitations to the annual general conference. Any ideas?

Re: Work flow automation with bash script?

Post by mimosa » 4. Mar 2012, 00:07

Did you try my Python script? It's designed to be quite general in that it picks out anything that looks like an email address from surrounding material and then throws the latter away, so it's not limited to the problem as you originally described it. Depending on what that material is, a little tweaking might be needed, or maybe another of the formats the console tool allows you to download in will work better.

Am I right in thinking that all the addresses are currently surrounded by quotation marks?

Re: Work flow automation with bash script?

Post by globetrotterdk » 4. Mar 2012, 08:00

mimosa wrote: Did you try my Python script? It's designed to be quite general in that it picks out anything that looks like an email address from surrounding material and then throws the latter away, so it's not limited to the problem as you originally described it. Depending on what that material is, a little tweaking might be needed, or maybe another of the formats the console tool allows you to download in will work better.

Am I right in thinking that all the addresses are currently surrounded by quotation marks?

Hi mimosa. I haven't tried your Python script yet. I was unsure about two things:
1) Whether the script works on the lynx dump file.
2) How the script determines where the e-mail addresses are in the file.

I am unsure as to how the e-mail addresses are surrounded. When I open the "dump" file in Firefox, everything seems to be in tables. I have tried opening the file in other editors (Nano, Geany, Vim), but all I get are lines that start like this:
Code:
<!DOCTYPE html>
<html><head><title>database</title>

Re: Work flow automation with bash script?

Post by gapan » 4. Mar 2012, 09:10

It would help a lot if you posted part of that dump. You can edit the contact details before posting, so that the real ones don't get published here.
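
One possible way to produce such an excerpt, sketched under the assumption of GNU or BSD grep and sed: print a few characters either side of every "@" and swap the address itself for a placeholder. The function name masked_excerpt and the sample row are invented; names and other personal details in the surrounding text are not masked, so the result still needs checking by eye before posting.

```shell
#!/bin/sh
# masked_excerpt (hypothetical helper): for each "@" on stdin, print up to 15
# characters either side of it, with the address replaced by a placeholder.
masked_excerpt() {
    grep -o '.\{0,15\}@.\{0,15\}' |
        sed -E 's/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+/user@example.invalid/g'
}

# Invented sample row; a real call would be: masked_excerpt < some_document.html
printf '%s\n' '<tr><td>Jane</td><td>abc@humanrights.dk</td></tr>' | masked_excerpt
```

The short context window also shows whether an address sits inside a mailto: link, a bare table cell, or quotation marks.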
